
Data Issues in the Life Sciences

Anne E. Thessen*, David J. Patterson

Data Conservancy (Life Sciences)

Marine Biological Laboratory, Woods Hole, Massachusetts 02543

*Author for contact, athessen@mbl.edu


EXECUTIVE SUMMARY
1. The expansion of the Life Sciences with a data-centric 'Big New Biology' will provide opportunities to reveal previously obscure truths about the living world and will provide scientists with more resources to address large, challenging questions.

2. This change will require a major investment in infrastructure, changes to current data practices, and an appropriately trained community of informaticians.

3. Changes that will improve the reuse of data within the Life Sciences include:
a. Incentives to improve researcher readiness to share data,
b. Extensions and merger (integration) of metadata standards and ontologies that can be used to organize data, to improve the discovery of data, and to enable machine reasoning on data,
c. More technical support and infrastructure to support people, projects, institutions, and programs nationally and internationally,
d. Persistent registries and repositories structured to curate data and to enhance the suitability of data for reuse,
e. Aggressive implementation of semantic web technologies,
f. New university courses that combine Computer Sciences with Life Sciences in the context of informatics.

4. The Life Sciences are very heterogeneous in terms of data cultures. No single approach to change will suit the needs of all of the Life Sciences. The most successful strategies are likely to be those that address needs in the context of sub-disciplines. The International Nucleotide Sequence Database Collaboration that serves molecular biology with GenBank, EMBL and DDBJ provides a good model of a domain-specific solution to the challenges that will accompany a transformation to data-centricity.

5. We do not yet have a detailed understanding of the spectrum and nature of data cultures in the Life Sciences, nor is there a common understanding of the nature of data, nor is there a widespread understanding among biologists of the nature and benefits of a more data-centric discipline. Considerable effort is required to communicate the benefits of new data-centric dimensions within the Life Sciences, and to promote participation in the transformation.


INTRODUCTION
The urgent need to understand complex, global phenomena and the emergence of improved data management technologies are driving an agenda in the Life Sciences to enhance data-driven discovery (National Academy of Sciences 2009). This development demands new approaches to sharing and querying existing data (Hey et al. 2009, Kelling et al. 2009). This document addresses some of the more proximate issues that scientists will face as they progress towards this 'Big New Biology'.

Data-driven discovery refers to the discovery of scientific insights and hypothesis generation through the novel management and analysis of pre-existing data. It relies on accessing and reusing data which will most likely have been generated to address other scientific problems, in contrast to the more familiar process of acquiring new data to address questions. While still hypothesis-based, data-driven discovery differs in character from scientific inquiry based on laboratory experiments or on new field observations. Data-driven discovery requires a large virtual pool of data across a wide spectrum of the Life Sciences. The availability of such a pool will allow biology to join the other 'Big' (= data-centric) sciences such as Astronomy and high-energy particle physics (Hey et al. 2009). Access to a pool will invite 'New' logic, strategies and tools (a 'macroscope') to discover trends, associations, discontinuities, and exceptions that reveal aspects of the underlying biology which are unlikely to emerge from more reductionist approaches (de Rosnay 1975, Ausubel 2009, National Academy of Sciences 2009, Patterson et al. 2010, Sirovich et al. 2010). The pool, and the resources from which it is macerated, may reveal factors, other than the properties that are intrinsic to biology, that shape knowledge. Insights into sociological trends which improve acuity or introduce distortions lead to a richer understanding of 'scientific certainty' (Evans & Foster 2011).

The emergence of a data-centric Big New Biology requires a stable cyberinfrastructure (National Science Foundation 2003, 2006, Burton & Treloar 2009, European Science Foundation 2006, URL 1). Registries and repositories must grow to meet the challenges of making data discoverable and accessible. The emerging Knowledge Organization System (Morris 2010) will need elements that will aggregate disparate data sets. Such elements require flexible and evolving schemas that define categories of data across the Life Sciences, and ontologies to intelligently link vocabularies and to model existing knowledge. Semantic web technologies are needed to achieve flexibility of reuse. Enhanced user interfaces with organizational, analytical and visualization tools will be needed to allow scientists to interact with the data and associated infrastructure. Most existing environments for data management are limited in scope. The Big New Biology requires a new mesh of biological, computer and information sciences, as well as changes to current cultures, to achieve data-centric bridges among the subdisciplines of the Life Sciences.


The best current example of a data-centric environment within the Life Sciences is provided by the International Nucleotide Sequence Database Collaboration, a group of linked repositories (GenBank, EMBL and DDBJ) for genetic sequences that receives data from willing scientists around the world (Strasser 2008). Data are freely available for reuse. Many new tools have appeared to take advantage of the data and to create new knowledge from them. The community and publishers actively endorse this environment. Many more environments similar to this, but serving other subdisciplines, will be needed if a Big New Biology, focused on aggregating and querying existing data in novel ways, is to emerge. A task of the Life Sciences component of the Data Conservancy is to address the sociological and technical issues that need to accompany the expansion of the Life Sciences as a Big New Biology.

What are the Life Sciences?


The Life Sciences investigate all aspects of living systems, past and present. The phenomena that are studied can endure from nanoseconds to billions of years, and involve events that occur on a physical scale that extends from molecules to ecosystems. More than 2 million species of life have been described, one-tenth or less of the number that currently exists (Raup 1992, Chapman 2009). Events in each evolutionary lineage are unpredictable and there is no dominant organizing principle for biological data. This has led to different styles and methods of capturing data about nature, and to a diversity of data cultures. A large proportion of Life Sciences data is not in a sharable form (Heidorn 2008).

Not all aspects of the Life Sciences are considered here. Subdisciplines that address the interactions of organisms with the physical world, such as geochemistry and ecology, are not included because they are covered by the GEOBON initiative (Scholes et al. 2008) and, within the DataNet program, by the Data Observation Network for Earth (DataONE) (Reichman et al. 2011; URL 2). Medical sciences, while based on living organisms, are also excluded because there is already a robust informatics community, with well-developed schema management systems such as UMLS or SNOMED (URL 3, URL 4), a rich array of environments addressing the management of data relating to human health (such as DICOM, URL 5), dedicated courses and journals (URL 6), software (URL 7), and communities addressing particular challenges (e.g. URL 8, URL 9; Bilimoria et al. 2003). We exclude the agricultural and food sciences because of their applied focus and because they too have an emerging infrastructure (Maurer & Tochterman 2010).


Academically, the Life Sciences have splintered into thousands of subdisciplines (Fig. 1), also collectively referred to as Biology. As a science, Biology has to describe and understand phenomena that exhibit considerable variation (the phenomena are noisy), and in which the diversity of components and processes at all scales contributes to a massively complex array of interactions.

Figure 1: A word cloud constructed from the names of some subdisciplines of the Life Sciences illustrates the scope of the discipline.

Within individual organisms, the biology that is expressed is determined in part by genetic makeup. How the genes are expressed is determined by which alleles are present and by other genes, the condition of the organism, interactions with other organisms, and by environmental conditions. Mutations can lead to changes in genetic makeup within the lifetime of an organism or in subsequent generations. The evolutionary events in any one of the millions of evolutionary lineages may constrain options, but cannot tell us whether stasis, transformation, extinction, or speciation will happen next. There is a capacity for change in ways that cannot be predicted from the study of parts; this defining property of biology is referred to as emergence (Mayr 2004). As a result of inherent variation and complexity across a wide spectrum of scales, many parts of biology cannot be explained through general rules or through an exclusive focus on details. Biology is unlike those sciences in which the identification and cataloguing of components (such as the periodic table) and the discovery of the rules of interaction (such as Newtonian mechanics) explain large swathes of the discipline. We represent biology as being contained within an envelope (Fig. 2), one axis of which extends from the shortest events known in life (the sub-nanosecond phenomena associated with electron and ion movements) to the most enduring processes (the evolutionary processes that began about 3.5 billion years ago and have continued to the present day). The other axis extends from the smallest biological objects (bioactive ions) to the largest (the complete biosphere).


Levels of organization within biology


One means of categorizing the Life Sciences and seeking common aspects relating to data issues is to group phenomena into 'levels of organization'. Those levels extend from molecular processes to ecosystem-level phenomena. Each level engages those below it and influences those above it. Each level requires appropriate instruments and associated data cultures. Molecular tools are becoming increasingly influential at all levels. The subdivisions are arbitrary, and individual phenomena can affect aspects at many other levels. A single mutational change in human haemoglobin is the source of sickle-cell anemia, but also provides sufficient protection against parasitism by malaria to lead to large-scale changes in the cultural development of human populations in Africa (Fleming et al. 1979).

Figure 2: Biological phenomena extend from molecular events that involve sub-ångström-sized objects and last for fragments of milliseconds, to events that extend across the globe and have endured for 3 billion years or more of Earth's history. The part of the envelope occupied by expressed biological phenomena is shown in a lighter shade. Overlain are rectangles to indicate whether data about the underlying phenomena are acquired through individual experience (yellow), use of tools (green), or by instruments (blue and red), or, at the extreme, whether knowledge is assembled largely by inference (purple).

Typical 'levels of organization' are:

1. Molecular biology and biochemistry. These subdisciplines address the structure, interactions and roles of molecules and their components. This domain includes molecular genetics, biochemistry, metabolic pathways, and cell physiology. While some molecules can be more than 10 mm long (such as DNA), molecular phenomena are usually expressed at the sub-nanometer level, and in short to very short (sub-nanosecond) time scales. Data are typically collected using instruments such as sequencers, spectrometers and chromatographs. Data are often born digital and sharing is common at this level. Major repositories exist for molecular data (Table 1).

2. Cellular. All organisms (arguably except viruses) are made of cells, and cell biology addresses components, events, and processes that occur within or via cells. Such events happen at scales mostly around 1-10 microns (10^-6 meters) but may occur in cells that are a meter or so in length or may involve sub-ångström (10^-10 meters) components. Events are usually measured in seconds to minutes, but can extend to periods of hours (more rarely to years). Generally, cellular processes that are not part of molecular biology and biochemistry are studied through tools such as microscopes. There are no major internationally acclaimed repositories of data about cell biology, but there are smaller cell image repositories (URL 10). There are no widespread traditions of data sharing outside publications and inter-scientist social interactions.
Repository | Type of Life Sciences Data | Location
AlgaeBase | algae names and references | http://www.algaebase.org/
ArrayExpress | microarray | http://www.ebi.ac.uk/arrayexpress/
Australia National Data Service | general research data | http://www.ands.org.au/
ConceptWiki | concepts | http://conceptwiki.org/index.php/Main%20Page
CSIRO | fisheries catch | http://www.marine.csiro.au/datacentre/
Data.gov | natural resources data | http://www.data.gov/
Diptera database | Dipteran information | http://www.sel.barc.usda.gov/diptera/biosys.htm
EMAGE | gene expression | http://www.emouseatlas.org/emage/
ENA | gene sequences | http://www.ebi.ac.uk/ena/
Ensembl | genomes | http://uswest.ensembl.org/index.html
EUNIS | biodiversity | http://eunis.eea.europa.eu
Euregene | renal genome | http://www.euregene.org/
Eurexpress | transcriptome | http://www.eurexpress.org/ee/
EURODEER | movement of roe deer | http://sites.google.com/site/eurodeerproject/home
FishBase | fish information | http://www.fishbase.org/
FlyBase | Drosophila genes and genomes | http://flybase.org
GBIF | occurrences | http://www.gbif.org/
GenBank | gene sequences | http://www.ncbi.nlm.nih.gov/genbank/
GEO | microarray | http://www.ncbi.nlm.nih.gov/geo/
GNI | names | http://gni.globalnames.org/
INBIO | Costa Rican biodiversity | http://www.inbio.ac.cr/es/default.html
INSPIRE | spatial | http://inspire.jrc.ec.europa.eu/index.cfm
KEGG | genes | http://www.genome.jp/kegg/
Life Sciences Data Archive (NASA) | effects of space on humans | http://lsda.jsc.nasa.gov/
MassBank | mass spectra | http://www.massbank.jp/index.html?lang=en
MGI | mouse | http://www.informatics.jax.org/
MorphBank | images | http://www.morphbank.net/
OBIS | occurrences | http://www.iobis.org/
OMIM | human genes and phenotypes | http://www.ncbi.nlm.nih.gov/omim
PDB | molecule structure | http://www.pdb.org/pdb/home/home.do
PRIDE | proteomics | http://www.ebi.ac.uk/pride/
PubMed | citations | http://www.ncbi.nlm.nih.gov/pubmed/
Stanford Microarray Database | microarray | http://smd.stanford.edu/
TAIR | Arabidopsis molecular biology | http://www.arabidopsis.org/
Taxon Concept | species descriptions | http://taxonconcept.org
TOPP | animal tagging | http://www.topp.org/topp_census
TreeBase | phylogenetic trees | http://www.treebase.org/
TROPICOS | plant specimens | http://www.tropicos.org/
UniProt | protein sequence and function | http://www.uniprot.org/
WILDSPACE | life history information | http://wildspace.ec.gc.ca/more-e.html
WRAM | wireless remote animal monitoring | http://www-wram.slu.se/

Table 1. Examples of repositories for Life Sciences data.

3. Tissue-level events are coordinated interactions involving many cells within multicellular organisms and usually relate to processes that take minutes or hours (but may be as short as fractions of seconds or extend to years); they include developmental and physiological phenomena and are accessible by direct observation, through microscopes, and by instruments. There are no widespread traditions of data sharing outside publications and inter-scientist social interactions.

4. Organismal phenomena include behavior, growth, development, and appearance. The temporal and spatial dimensions depend on what kinds of organisms are under consideration: individual bacteria may be less than 1 micron in size and may undergo an entire life cycle in hours; large organisms may extend to tens of meters and have life cycles that extend to thousands of years. Most of the data are collected by direct observation and are communicated through a narrative. Data are often held in small sets and can include complex data objects which cannot be directly digitized, such as specimens.

5. Populations are interacting collections of individuals of one (or more symbiotic) species. Populations, such as herds of wildebeest or schools of mackerel, inhabit a patch in space and time that is influenced by the size of the organisms, their diffusivity, their interactions with the physical world and other species, and by their evolutionary history. Patches may extend across a scale of millimeters to thousands of kilometers, and over periods that involve many generations, from days to millennia. Populations may have a discrete genetic identity achieved through inbreeding. Phenomena are studied through disciplines such as ecology and genetics. Non-genetic data are mostly obtained through direct observation and disseminated through narrative publication.

6. Species-level phenomena occur within groups of organisms that have a sufficiently distinctive (genetic) identity to be treated as a species. A species may include one or more populations. Areas of activity include appearance (taxonomy), change (evolution), distribution (ecology), and loss (extinction). Processes can occur over millions of years. Distributional aspects may extend up to pan-global scales. Species-level phenomena can be observed or have to be inferred, and are typically described through narrative. There are approximately 1.9 million extant and 0.3 million extinct species currently described (Chapman 2009, Raup 1992), with 20,000 or so new species being described each year (SOS report 2010).

7. Ecological or ecosystem aspects address interactions within and among communities and with the physical world. The disciplines explore issues relating to abundance, patchy distributions in space and time, and roles within food webs, energy and nutrient flows, biogeochemistry, etc. Depending on the sizes of the organisms involved, the subject is addressed at scales that extend from sub-millimeter (for microbes) to the full extent of the Earth's surface. Ecological phenomena extend from minutes through to the full history of Earth. At the more extensive range, data may be derived from satellites and other remote sensing devices. Many existing or emerging databases exist for environmental data (Reichman et al. 2011). Understanding of some aspects may be inferred.

Different means of acquiring data


Data are acquired in different ways in different regions of the envelope (Fig. 2). We refer to data that are acquired by direct human observation (Fig. 2, yellow box) as first-person data. These data are limited to objects and processes ranging from about 1 mm to 10 km in size, and that endure from about a second to a decade. We can gather data from a more extended range through the use of tools: smaller objects can be detected with microscopes (Fig. 2, green box). The concept of tools that we use to extend our capacity to observe grades into the concept of instruments, devices to which we defer responsibility for data collection. Narratives deriving from and including first-person scientific data have been compiled in our literature for about 250 years. Included are some cellular and tissue phenomena, most population, organismic, and species-level data, and some ecological phenomena. First-person data are often very selective and do not fairly represent the world from which they are drawn. Most first-person data are held in many, small sets (i.e. they make up the long tail of biological data, Heidorn 2008). While typically 'small science' in nature, some participatory environments (such as eBird, URL 11) are bringing together the efforts of tens of thousands of observers. There is a strong tradition of transforming data into knowledge by use of the narrative, and then discarding the data.

Instruments are devices that acquire data on our behalf when the phenomena are either too small, too big, too short or too long to be observed directly. Events too short or physically too small to be observed directly include molecular and biochemical phenomena, such as molecular genetics, ion movements by molecular pumps, and metabolic processes such as photosynthesis (Fig. 2, blue box). Instrument data tend to be born digital and are often associated with experiments. Some areas, especially molecular biology, have good repositories and a more sophisticated culture of data organization and sharing relative to other areas of the Life Sciences. Data on long-term or extensive processes (Fig. 2, red box) are acquired through instruments such as monitoring platforms and satellites. Data from these sources are also captured and preserved as electronic files. Some phenomena, such as evolutionary processes and the various transformations of the biosphere that extend to billions of years, are informed through observations of fossils, geology, and geochemistry (Fig. 2, purple box). Understanding often relies on small fragments of information. There is a culture of preserving specimens and samples that inform those areas.


SOCIOLOGICAL ISSUES
As the study of human social behavior, sociology includes the behavior and practices of scientists. We refer to the sociological factors that determine the destiny of data as data cultures. If we are to promote a shift to a Big New Biology, we need to understand current practices and their diversity, which elements favor that transformation, and which will hinder it (Evans and Foster 2011).

Data cultures
The phrase data culture refers to the explicit and implicit (Evans & Foster 2011) data practices and expectations of the relevant scientific community. Data cultures relate to the social conventions of acquisition, curation, preservation, sharing, and reuse of data. While there is no published, detailed survey of data cultures in the Life Sciences, there have been sufficient studies to confirm that there is no single data culture for the Life Sciences (Norris et al. 2008, Gargouri et al. 2010, Key Perspectives Ltd. 2010). This is unsurprising given the scope and scale of the Life Sciences. The cultures range from the field biologist whose data are captured in short-lived notebooks to the molecular biologist whose data are born digital in near-terabyte quantities and are widely shared through global data repositories. If the goal is to make data digital, standardized and openly accessible in a reusable format, then the current data cultures provide starting points which determine the cultural and technical changes that are needed before that vision can be realized. We do not know how many cultures there are, nor whether cultures are discrete or form a rich continuum. We discuss below some factors that influence or define data cultures in the Life Sciences.

1. What are data?
The term data is not used consistently. For some the term is limited to raw data; for others the term widens to include any kind of information, or indeed any process that leads to insights. We seek to limit the term to discriminate what is neutral, objective, and largely independent of context or observer. It is this class of 'raw data' we refer to here as 'data'. As data become constrained, filtered and selected, they acquire or are even assigned a meaning in the context of what they apply to. This process, coupled with others, transforms data into information (Ackoff 1989; Fig. 3). Knowledge is comprised of those elements of information that are universally accepted. Wisdom is the application of knowledge (Morris 2010).

Figure 3: Data are neutral, objective, and largely independent of context or observer. Raw data are analyzed, filtered, and given meaning within a context - that is, they become information. Information that is universally agreed is knowledge, and the agreed composite of knowledge and its application is what we regard as wisdom.

2. Contextual categorization of the data
The context in which biological data are acquired or generated is important to understanding how those data can be appropriately reused. The context may be formed by observer interpretation, because of the tools or instruments used, or may be imposed because data are gathered in an experimental (unnatural) setting. In addition, individuals and technologies are selective and capture a limited subset of all available data, and data are affected by the choice of instrument and analytical processes. Context can be represented through the application of appropriate metadata. We categorize the following broad types of data based on context.

A. Observational data relate to an object or event actually or potentially witnessed by an agent. An agent may be a person, team, project, or initiative, and may call upon tools and instruments. Key metadata will identify the agent and specify date, location, and contexts such as experimental conditions (if relevant) or the equipment that was used. Within the Life Sciences, the metadata should include taxon names, the basis for the identification and/or pointers to reference (voucher) material.

1. Descriptive data are non-experimental data collected through observations of nature. Ideally, descriptive data can be reduced to values about a specified aspect of a taxon, system, or process. Each value will be unique, having been made at one place, at one time, by one agent. Observations may be confirmed but not replicated, such that it is important to preserve these data. Preservation often does not occur, as data of this type are often discarded after completion of the research narrative. A formal framework for descriptive data has been developed in the context of the OBOE project (Madin et al. 2007a). Descriptive data can be collected by instruments or by individuals (i.e. are first-person data). First-person data may not completely represent the world. Mistakes can be made, such as misidentification of taxa (MacLeod et al. 2010). Researchers may be selective about the data they seek to gather, either intentionally or unintentionally, such that data sets have limited applicability. For example, counts of bird species can be biased because noisy birds are more likely to be seen and counted than quiet birds (Bibby et al. 2000). GBIF, the Global Biodiversity Information Facility (URL 12, data accessed Feb 2011), contains data on more than 250,000 occurrences of birds worldwide, but only on 8,962 nematode occurrences despite nematode abundances of tens of millions of individuals within a square-meter column of soil (Lal 2006). Some individuals may discard data that are not in keeping with their expectations. Few or no raw data may be recorded, such that the information may only be available in an interpreted form, for example as drawings rather than photographs.

The acquisition of first-person data is rate-limited, constrained by the number of observers. The rate of collecting first-person data tends to plateau much sooner than that of instrument-derived data. Data born digital, such as molecular data, continue to show geometric rates of growth, as indicated by the growth of GenBank (Fig. 4a), now custodian of over 100 billion bases and over 12 petabytes of data. In comparison, the description of new species, a process that is still dominated by the narrative approach, has leveled out at about 20,000 species per year (Fig. 4b, SOS report 2010).

Biological phenomena that endure for less time, or are outside physical scales that we can easily register, are accessed through instruments (Fig. 2). Molecular sequencing devices are the source of vast amounts of comparative information about biology across many levels of organization. New high-throughput machines can sequence billions of bases in days, with single machines generating terabytes of raw data in that time (Doctorow 2008), while projects such as the 1000 Genomes Project generate almost 100 TB of raw data per week (Röhm & Blakeley 2009). Large-scale instrumentation, such as satellites, collects data from large swaths of the globe. The NASA SeaWiFS satellite can gather information from the entire globe in two days (Hooker & McClain 2000).

Figure 4: (a) Growth of sequence information in GenBank (1982 to present) vs (b) the number of new species descriptions per year (of fish) from 1750 to the present, distinguishing original descriptions, currently valid names, and names that remain valid and unchanged (the series begins with Linnaeus 1758). Molecular data show continuing growth, whereas new taxonomic insights that depend on first-person observations have plateaued for about 150 years.

2. Experimental data are obtained when a scientist changes or constrains the conditions under which the expression of a phenomenon occurs. Experiments can be conducted across a broad range of scales - from electrophysiological investigations of sub-millisecond processes within cells (Bunin et al. 2005) to manipulations of oceanic ecosystems (Coale et al. 2004).

The intent is to dissect the elements of the phenomenon by changing conditions to uncover causal relationships, or to identify variant and invariant elements of biological processes. The experimental paradigm characterized much of the research in the Life Sciences in the 20th century. The paradigm assumes that it will uncover robust underlying phenomena such that the experiment, if repeated, will produce the same results. Given the variation that is inherent in biology, it does not follow that the outcomes of experiments can ever be perfectly replicated. Raw data are contextualized by the experimental framework, and may have limited or no value in other contexts. It is important for metadata to include information about the source and storage of material before the experiment, experimental conditions, equipment, controls and treatments.

B. Processed data are obtained through a reworking, recombination or analysis of raw data. There are two primary types.

1. Computed data result from a reworking of data to make them more meaningful or to normalize them. In ecology, information about the productivity of an ecosystem is important, but productivity and the extent of the ecosystem are rarely measured directly. Rather, they are computed using information or data from other sources to generate measurements of the amount of carbon or mass that is generated per unit area per unit time. While computed data may be held in the same regard as raw data, choices or errors in formulae or algorithms may diminish or invalidate the data created. Raw data and information on how computed data were derived (provenance) are important for reproducibility. The metadata should provide this information. It is expected that computed data will grow as the virtual data pool expands.

2. Simulation data are generated by combining mathematical or computational models with raw data. Often models seek to make predictions of processes, such as the future distribution of cane toads in Australia under various climatic projections. The proximity of predictions to subsequent observations is used to test the concepts on which the model is based and to improve the model and our associated understanding of biology. Metadata differ dramatically from those of other data types in that the date of the run, initial conditions of the model, resolution of the model output, time step, etc. are important. Rerunning the model may require preservation of initial conditions, model software, and even the operating system (URL 13). Simulation data become less useful as they age and can become a storage burden. A schematic sketch of these categories and the metadata they call for follows below.
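The categories above differ mainly in the metadata they require: observational data need an agent, a date, a place, a taxon and (for experiments) the conditions imposed, while processed and simulation data need provenance such as the source data, the method and the run parameters. The sketch below is only an illustration of how those requirements might be written down; the class and field names are assumptions for this example and are not part of the Data Conservancy framework or of any existing standard.

```python
# Illustrative sketch only: the classes and field names are assumed for this example,
# not an implemented Data Conservancy schema or community standard.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ObservationalRecord:
    """Observational data: tied to an agent, a place, a time and a taxon."""
    agent: str                     # person, team, project or initiative
    date: str                      # when the observation or experiment took place
    location: str                  # where it took place
    taxon_name: str                # organism concerned
    identification_basis: str      # e.g. a pointer to reference (voucher) material
    experimental_conditions: Optional[str] = None  # filled in only for experimental data

@dataclass
class ProcessedRecord:
    """Processed data (computed or simulated): provenance matters most."""
    derived_from: list                              # identifiers of the raw data used
    method: str                                     # formula, algorithm or model applied
    parameters: dict = field(default_factory=dict)  # e.g. initial conditions, time step

# A descriptive (non-experimental) observation and a simulation run, respectively:
sighting = ObservationalRecord("A. Observer", "2011-02-14", "Woods Hole, MA",
                               "Homarus americanus", "photograph")
model_run = ProcessedRecord(derived_from=["sighting-001"], method="distribution model",
                            parameters={"run_date": "2011-03-01", "time_step": "1 day"})
```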

These categories of data can be used as part of the framework for managing Life Sciences data within the Data Conservancy.

3. Data readiness
Readying data to be contributed to a shared pool often involves a series of steps or stages that relate to the capture, digitization, access, discoverability, structure, and mobility of data. The situation with molecular data, achieved by the International Nucleotide Sequence Database Collaboration comprising the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and the NCBI GenBank in the USA, is exemplary. Molecular data tend to be born digital, and are submitted in standard formats to centralized repositories in which they are freely available for reuse in a standard form. Yet, set in the context of the Rogers adoption curve (Fig. 5; Rogers 1983), and as suggested by Figure 6, the Life Sciences, generally, are closer to the 'early adopters' stage of transition to open access than other sciences. It is still unusual for data created by individuals or small groups to be made ready and openly available for sharing (Davis 2009).

Figure 5: Rogers' adoption curve describes the acceptance of a new technology, from innovators and early adopters through the early and late majorities to laggards. The Life Sciences are still in the 'early adopters' phase for accepting principles of data readiness.

Figure 6: Relative percentage improvement in citation associated with an article being published as open access, for disciplines including physics, sociology, psychology, law, management, education, business, health science, political science, economics and biology. The increase is less for biology than for other disciplines. Redrawn from Harnad (2010, URL 24).

The Long Tail in Biology

Observational and processed data have characteristics that vary widely in terms of quantity, digital status, and openness, which together make up overall availability (Heidorn 2008, Key Perspectives Ltd. 2010). The distribution of data packages can be represented as a hollow curve (Fig. 7). To the left are a small but growing number of agents producing data in large packages at a high rate. That end of the spectrum includes high-throughput biology such as remote monitoring programs or molecular analyses of natural communities (Sogin et al. 2006). Data are often collected via instruments such as sequencing machines which, as indicated above, can produce terabytes of data in a matter of hours. A major challenge at this extreme end of the spectrum is that the amounts of data produced often exceed the ability of the hardware to serve data to remote clients (Kahn 2011) and the ability of software to manipulate them (Doctorow 2008). As a result, centralized analysis of large, unprocessed federated data files has to be relinquished in favor of distributed analysis of smaller, processed data files.

Figure 7: The long tail of biology. Data quantity, digitization and openness can be described using a hyperbolic (hollow) curve, with a small number of providers providing large quantities of data, and a large number of individuals providing small quantities of data.

To the right of the hollow curve is the long tail of biology, which reflects the many providers with small amounts of data (<1 GB; Heidorn 2008, Science staff editorial 2011, Reichman et al. 2011). This 'small science' end of the spectrum includes much of the 'first-person' data, generated by scientists who often do not retain or share data in structured, electronic formats, and also includes large numbers of citizen scientists. For some of the agents, the highest order of data structure is the spreadsheet, and in some cases data may not even be in digital form. As a result, readiness for a virtual data pool is poor. The value of the data may be high because the data relate to unrepeatable observations and are hard to replace, if they are replaceable at all (Heidorn 2008). It is thought that the bulk of intellectual output lies in the long tail (Heidorn 2008). Less than 10% of the long tail has been organized into repositories (Reichman et al. 2011, Science staff editorial 2011).

Digitization
Those areas of biology that are instrument-based generate data in a digital form (i.e. data are born digital) and researchers often operate within an infrastructure that favors data sharing and data reuse. While data may be digital and structured, they may be saved in non-standard formats, some of which are proprietary. Nonetheless, having data in a digital form is a prelude to mobilizing, normalizing, standardizing, and sharing data. Digital technologies increasingly penetrate into areas of first-person data with various computer-based recording devices or web-based environments, such as ScratchPads (URL 14), LifeDesks (URL 15) and the Biofinity Project (URL 16), that transform data into standards-compliant, sharable formats. Mobile phone technologies offer the means of recording biodiversity data that are both automatically date/time-stamped and georeferenced, such as WildObs (URL 17), BlueBill (URL 18), CyberTracker (URL 19) and EpiCollect (Aanensen et al. 2009).
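As an illustration of the kind of standards-compliant, sharable record such tools aim to produce, the sketch below shows a single time-stamped, georeferenced sighting. The field names are hypothetical placeholders for whatever community metadata standard a given tool actually adopts.

```python
# Hypothetical field names for a single mobile-recorded sighting; a real tool would
# map these onto the metadata standard used by the repository it submits to.
import json
from datetime import datetime, timezone

sighting = {
    "recorded_by": "citizen-scientist-042",                # the observing agent
    "taxon_name": "Danaus plexippus",                      # what was seen
    "identification_basis": "photograph",                  # evidence for the identification
    "event_time": datetime.now(timezone.utc).isoformat(),  # automatic time stamp
    "latitude": 41.526,                                    # automatic georeference
    "longitude": -70.673,
    "device": "mobile phone app",
}

# Serialized this way, the record is digital, structured and ready to submit to an
# aggregator, rather than remaining in a notebook or an idiosyncratic spreadsheet.
print(json.dumps(sighting, indent=2))
```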


Much biological legacy material has yet to be digitized. Many laboratory records, especially those of scientists at the ends of their professional careers, are in formats such as notebooks, punch-cards, or 5" discs that are hard to transform. An estimated 80% (2.5 billion) of museum and herbarium specimens await cataloguing and digitizing (Vollmar et al. 2010, Schofield et al. 2010). The massive shortfall has led to the view that digitization is a key bottleneck to mobilizing biodiversity data (Vollmar et al. 2010). The primary impediment is lack of funding (Beaman et al. 2007, Vollmar et al. 2010, Science staff editorial 2011). Given that the typical cost of metadata capture from non-standard sources, such as labels, card indices and ledgers, is about 10 minutes per item, the global cost to digitize all available data can be estimated as thousands of staff years (Heidorn & Wei 2008). That has prompted digitization agendas, such as the GBIF Global Strategy and Action Plan for digitization of natural history collections, the US National Science Foundation Advancing Digitization of Biological Collections (ADBC) program (URL 20), and the Belgian Biodiversity Platform Digitization program (URL 21). The Biodiversity Heritage Library, a growing international consortium of libraries, aims to make upwards of 500 million pages of text about natural history available, but at this time only about 10% has been digitized. Digitization agendas need to address data quality control (e.g. URL 22), especially with digital libraries, as Optical Character Recognition software performs poorly with older documents.

Openness
A desire for open access to scientific data is highlighted in the Berlin Declaration (2003). The open philosophy underpins some supportive tools, infrastructure (OAI-PMH, URL 23) and the cyberinfrastructure agenda (National Science Foundation 2003, 2006, European Science Foundation 2006). To be open, data must be Internet-discoverable and accessible. We are not aware of any study on data openness in the Life Sciences. It is, however, evident that data on the left side of the hollow curve (Fig. 7) are often well indexed and open to use. Larger repositories, such as molecular repositories and ocean data repositories (e.g. WOCE, PANGEA, OBIS, Data Conservancy, GenBank, EMBL and DDBJ), typically adopt open-access principles. Larger initiatives may provide an elaborate infrastructure with associated informatics tools and support personnel who promote data discoverability, standardization, sharing, and reuse. In addition to these, there may be thousands of molecular databases (Key Perspectives Ltd. 2010). In contrast, data in the long tail are often not easy to discover and may not be openly available. Pressure to decrease data openness comes from those areas of the Life Sciences with relevance to human well-being and the economy; from publishers who regard access to data as a revenue line; from individuals seeking to promote their careers ('Will you be scooped or will you be famous?' is a quote that sums up a widespread attitude, Gleditsch et al. 2003); and from institutes which promote an argument that they own the data of the staff employed by them. Scientists who make data open have greater impact (Harnad 2008), although the Life Sciences is one of the discipline areas least responsive to open data (Fig. 6, Norris et al. 2008).
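To give a feel for how the per-item digitization estimate quoted above (about 10 minutes per item) scales into staff effort, the sketch below works through the arithmetic. The collection size and the number of working hours per staff year are assumptions chosen for illustration, not figures taken from this report.

```python
# Illustrative arithmetic only: collection size and hours per staff year are assumed values.
def digitization_staff_years(n_items, minutes_per_item=10, hours_per_staff_year=1800):
    """Convert a per-item metadata capture time into total staff years of effort."""
    total_hours = n_items * minutes_per_item / 60
    return total_hours / hours_per_staff_year

# A hypothetical collection of one million labels, card indices and ledger entries:
print(round(digitization_staff_years(1_000_000), 1))  # prints 92.6 (staff years)
```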

The agents
The term 'agent' refers to individuals, groups or organizations that help shape data cultures.

Scientists
As major producers and consumers of Life Sciences data, scientists are important participants in Big New Biology. Within the US there are almost 100,000 biologists (excluding agriculture and health sciences) working outside of academia (URL 25). An estimate for the number within academia can be calculated from data on the approximately 2,500 colleges and universities (URL 26) that employ almost 300,000 academics in science and engineering, 40% of whom work in the Life Sciences (URL 27). US research and development endeavors account for approximately one-third of the global effort (URL 28, URL 29), such that changing data practices will directly or indirectly affect as many as 200,000 life scientists in the US and about half a million professionals worldwide.

As personal computers and Internet access have become essential for biological research (Stein 2008), scientists' views and practices of data sharing have changed. Biologists are increasingly publishing data through repositories like GenBank (URL 30) or their own web sites, participating in collaborative environments such as those that allow data to be annotated (e.g. EcoliWiki, URL 31, or DNA Subway for genome annotation, URL 32), or capturing field data using digital data recording services such as those provided by Artportalen (URL 33) or eBird (URL 11). An increasing number of databases are providing web services for moving data and new tools for visualizing data (e.g. GeoPhyloBuilder, URL 34, Kidd & Liu 2008). Data processing and management pipelines such as Kepler and VisTrails (URL 110, 122) are emerging. Yet, for these changes to dominate across the breadth of the discipline and influence the full life cycle of the data, researchers must feel comfortable with the design and performance of software systems (Stein 2008). There must be good dialog between biologists and computer programmers for new tools to emerge or for existing tools to be used effectively (Lee et al. 2006). Increasingly, biologists will need to be trained in computer and information science (Stein 2008) and to include archiving machine-readable data as part of their normal workflow (Whitlock 2011). Computer scientists need to develop sensitivity to biology and biological thinking if they are to provide tools that delight life scientists. Scientists, especially those associated with 'small science', will need to be more engaged in mobilizing data than at present (Froese et al. 2003, Heidorn 2008, Costello 2009, Smith 2009). When asked, many scientists do share specific data sets with close colleagues (Science staff editorial 2011), yet they seem to be insufficiently incentivized to mobilize their data openly, despite the fact that papers with openly available data gain more citations (Piwowar et al. 2007). Often biologists perceive the risks of making data available to outweigh the rewards (Porter and Callahan 1994, Key Perspectives Ltd. 2010).

Publishers
Publishers of scientific journals are increasingly involved in data management (Whitlock et al. 2011). Publishers may provide the same services for data that they provide for manuscripts (i.e. peer review, citability, etc.; Vision 2010). This model will incur charges for access to data, which many scientists feel is inappropriate (Key Perspectives Ltd. 2010). Another model would be to charge for access to metadata. Thomson Reuters BIOSIS indexes more than half a million Life Sciences abstracts yearly (URL 35), compiling metadata such as organism names and Enzyme Commission numbers. That metadata can be used to discover data sets, and Thomson Reuters charges for access to the discovery services. Some journals require deposition of data as a condition of publication. An example of this is the Joint Data Archiving Policy (JDAP, URL 39). JDAP has grown from its original consortium of evolution and ecology journals to include more than a dozen journals (Vision 2010). Dryad (URL 40; White et al. 2008), GenBank (URL 30; Bilofsky & Christian 1988), Protein Data Bank (URL 41; Berman et al. 2006) and TAIR (URL 42; Rhee et al. 2003) are examples of repositories that benefit from deposition requirements from publishers. Until relatively recently, publishers have controlled the dissemination of the narrative. Some publishers retain control over their articles while others have moved to an open-access model. An example of open access in use for Life Sciences journals is the Pensoft model, which allows for XML submission and workflows (URL 36, URL 37) in publications such as ZooKeys, PhytoKeys, and PLOS Biology (URL 123). Some publishers are experimenting with enhanced publication to allow researchers to share datasets, illustrations and audio files.

Funding Agencies
Funding agencies worldwide have been called upon to financially support informatics research and the creation of tools and digital libraries that will underpin the shift towards a Big New Biology paradigm (Hey et al. 2009, National Academy of Sciences 2009). Data cost money, and the reuse of data represents a higher return on each research dollar spent. In recognition of the importance of data sharing to their investment, funding agencies are increasingly imposing data-sharing requirements on their researchers (Table 2). However, many funding agencies, especially outside the US and Europe, do not have data policies or plans to make data available. Of those that do, many require scientists to submit data management plans as a part of their proposals.

The plans (discussed below) are designed to explain where data will be deposited, under what terms data may be accessed, and what standards will be used. Many agencies believe in open access to data at the end of a project and have specific timelines for data release. They often acknowledge that the data provider should have a period of exclusive right of first use of data.

Governments
The realization of a Big New Biology will require significant investment in and reorganization of technical and human infrastructure, the creation of new agencies, new policies and implementation frameworks, as well as national and transnational coordination, the scale of which will require governmental and intergovernmental participation. Issues that require high-level attention are illustrated by the OECD report that established GBIF (OECD 1999). GBIF now has 56 national participants and influences national agendas. Especially relevant is the commitment to data sharing in its Suwon declaration (URL 43). This underscores the importance of data sharing to science, conservation and sustainability. INSDC, which collates the sharing of molecular data via the US-based NCBI GenBank, the European EMBL, and the Japanese DDBJ, offers another example of international informatics initiatives in the Life Sciences (URL 44). It is at this level that it is appropriate to address the economics of an informatics infrastructure. In establishing GBIF, the OECD compared the cost of the molecular informatics infrastructure (millions of dollars) against the benefits to pharmaceutical, health and agricultural businesses worth billions of dollars (OECD 1999). The costs of international cooperation on biodiversity informatics must be set against the assessed economic value of the world's natural capital, which is in the tens of trillions (millions of millions) of dollars (Costanza et al. 1997, TEEB 2010). The OECD also estimates that the costs of sustaining infrastructure will be 25% of the costs of generating raw data. Yet, an allocation of as little as 5% of research funding could provide billions of dollars for data preservation (Schofield et al. 2010). Several countries have established governmental digital data environments, including the data.gov environments (URL 45, URL 46, URL 47), or more specialist agencies such as Conabio in Mexico (URL 48), ABRS, ERIN and ALA in Australia (URL 49, URL 50, URL 51), ITIS in the US (URL 52) or the European Environment Agency (URL 53).

Universities
With in excess of 20,000 universities (and institutions modeled on universities) worldwide (Webometrics Ranking of World Universities; URL 54), employing an estimated 5-10 million academics and associated researchers, universities are arguably the largest research and development initiative worldwide. Collectively they are a major source of new data and, given their international communal character, will be major customers of the data pool.


The support, infrastructure and services that universities provide will be a major determinant of the flow and fate of data. There are signs that in some environments, universities are federating data activities. An example is provided by the SURF foundation (URL 38), which seeks to unite Dutch research institutes through the application of new technologies. It currently emphasizes five disciplines; the Life Sciences are not included. Universities may or may not regard themselves as owners (having IP rights) of data and so may regulate access to data generated in-house or as part of collaborative projects. They may or may not have policies that require the retention of data. US universities are bound to comply with the requirements of OMB Circular A-110 (Uniform Administrative Requirements for Grants and Agreements with Institutions of Higher Education, Hospitals, and Other Non-Profit Organizations, URL 55). This specifies that financial records, supporting documents, statistics, and all other records produced in connection with a financial award, including laboratory data and primary data, are to be retained by the institution for a specified period. OMB A-110 also states that "The Federal awarding agency(ies) reserve a royalty-free, nonexclusive and irrevocable right to reproduce, publish, or otherwise use the work for Federal purposes, and to authorize others to do so." Many leading universities have explicit institute-wide data policies that focus on administrative data and administrative agendas rather than on promoting the use of data for academic purposes (e.g. "(This) University must retain research data in sufficient detail and for an adequate period of time to enable appropriate responses to questions about accuracy, authenticity, primacy and compliance with laws and regulations governing the conduct of the research", URL 56). Policies typically refer to a finite obligation, usually in the range of 3 to 7 years. Universities will play a significant role in educating staff and students as to the value of data. They will be the focus of reshaping the skill base on which the Big New Biology will rely (Doom et al. 2002). New trans-discipline curricula will ensure that biologists gain informatics skills and that computer scientists develop sensitivity to the challenges and needs in Biology.

Museums and Herbaria
Museums and herbaria play a special role within the Life Sciences. Along with libraries, they have a mandate for the long-term preservation of materials. Those materials include several billion specimens of plants, animals and fossils collected by biologists over 3 centuries (Chapman 2005a, OECD 1999, Vollmar et al. 2010). Those collections provide invaluable information as to the changing distributions of species, provide access to extinct species, and inform research into defining species. They have special value for some of the phenomena that motivate the agenda for Big New Biology, such as the distribution of invasive species, the consequences of deforestation, and so on. Chapman (2005a) provides an exhaustive treatment of the potential and actual value of primary biodiversity records.


Funding agency | Country | Policy location
Gordon and Betty Moore Foundation | US | http://moore.org/docs/GBMF_Data%20Sharing%20Philosophy%20and%20Plan.pdf
Genome Canada | Canada | www.genomecanada.ca/medias/PDF/EN/DataReleaseandResourceSharingPolicy.pdf
National Institutes of Health | US | http://grants.nih.gov/grants/policy/data_sharing/
Biotechnology and Biological Sciences Research Council | UK | www.bbsrc.ac.uk/publications/policy/data_sharing_policy.html
Natural Environment Research Council | UK | www.nerc.ac.uk/research/sites/data/policy.asp
Wellcome Trust | UK | www.welcome.ac.uk/About-us/Policy/Policy-and-positionstatements/WTX035043.htm
Department of Energy | US | http://genomicsgtl.energy.gov/datasharing
Chinese Academy of Sciences | China | http://english.cas.cn/
Australian Research Council | Australia | http://www.arc.gov.au/default.htm
National Science Foundation | US | http://www.nsf.gov/bfa/dias/policy/dmp.jsp
Austrian Science Fund | Austria | http://www.fwf.ac.at/en/public_relations/oai/index.html
NASA | US | http://science.nasa.gov/earth-science/earth-sciencedata/data-information-policy/
NOAA | US | http://www.ncdc.noaa.gov/oa/about/open-access-climatedata-policy.pdf
Council for Scientific and Industrial Research | India | http://rdpp.csir.res.in/csir_acsir/Home.aspx
North Pacific Research Board | US | http://www.nprb.org/projects/metadata.html
Japan Science and Technology Agency | Japan | http://www.jst.go.jp/EN/index.html
National Research Foundation | South Africa | http://www.nrf.ac.za/

Table 2: List of funding agencies and characteristics of their data policies. For each agency the table records whether its policy requires a data management plan, data deposit, standards compliance, attribution, a local archive, open-source tools, QA/QC, confidentiality and IPR/licensing provisions, and metadata deposit, and whether the agency provides data for free and provides free access to the resulting publication. Notes attached to individual policies include: data must be made available no later than the publication date or the date a patent is filed, whichever comes first; the policy applies to projects requesting more than $500,000, with data released no later than acceptance for publication of the main findings from the final data set; data release no later than publication or within 3 years of generation, with researchers expected to ensure data availability for 10 years after completion of the project; data must be made available within 2 years of the end of data collection; data can be embargoed for 2 years; deposit of protocols, raw data and other relevant materials is required no later than 3 months after publication; deposit is required as a condition of further funding; data must be available no more than 2 years after the end of the project; data must be transferred to NPRB by the end of the project; and some agencies have no policy or had a plan still being developed in 2010.

Citizen Scientists
Citizen scientists are non-professionals who participate in scientific activities. Citizen science covers many subjects, but the appealing richness of nature, its accessibility, and our reliance on natural resources ensure that biology attracts an especially high participation by the citizenry (Silvertown 2009). The academic skills of citizen scientists cover a massive spectrum, from those with casual interests in nature or science to individuals who publish in the scientific literature. There are tens of millions of birders in the US (Kerlinger 1993), a number that translates to more than 100 million worldwide. The number of recreational fishermen in marine waters approaches that of birdwatchers (Arlinghaus & Cooke 2009, Cisneros-Montemayor & Sumaila 2010), and an estimated 500 million people have livelihoods attached to fishing (URL 57). That suggests that the potential citizen scientist community exceeds 1 billion people. This remarkable pool can be called upon to add sightings (the occurrence of a given species at a particular location at a particular time) which can be used to monitor the changing distributions and abundances of endemic and invasive species. The Swedish Artportalen (URL 33) has in 10 years compiled more than 26 million sightings, at a rate of about 10,000 per day, illustrating the irreplaceable role of the citizen scientist. Several mobile phone apps exist that allow naturalists to record species occurrences in the field (BirdsEye from eBird, URL 58, and Observer from WildObs, URL 59). Data on occurrences, or on the first occurrences of flowering or the appearance of migratory species, can be called on to test scientific hypotheses as to the impact of climate change on the biosphere. Citizen scientists are significant monitors of endangered species, providing the first evidence that some presumed-extinct species, such as the coelacanth (URL 60), Wollemi pine (URL 61), ivory-billed woodpecker (URL 62), Lord Howe Island stick insect (URL 63) and mountain pygmy possum (URL 64), are still with us.

Repositories
A repository provides services offered to a community for the management and dissemination of data including, ideally, protection of the integrity of the data, long-term preservation, access and migration to new technologies (Lynch 2003). Most repositories handle a specific data type at a particular granularity. Thousands of repositories already exist for managing Life Sciences data and hold tens of millions of items (Table 1; see Jones et al. 2006 and URL 65 for more). However, it is estimated that less than 1% of ecology data is captured in this way (Reichman et al. 2011). Repositories range in functionality from basic databases that store data to collaborative databases that incorporate analysis functions (WRAM, Wireless Remote Animal Monitoring, URL 66). The pathways in and out can determine whether or not a repository is populated and whether data within the repository are reused (Wren & Bateman 2008).


Many repositories are difficult to access or are not maintained (Wren & Bateman 2008). Failure of a repository can result from policy shifts, funding instability, management issues, or technical failures (Lynch 2003). Such failures can undermine acceptance of digital scholarly work by the community at large. As data repositories become more important over time, they must be trusted to provide high quality services reliably (Schofield et al. 2010, Klump 2011). The trustworthiness of archives can be assessed using criteria catalogues (Klump 2011) available from organizations like the Digital Curation Center (Innocenti et al. 2007) and the International Standards Organization (ISO 2000). The Center for Research Libraries has assembled a list of ten principles for data repositories that addresses administrative and technical concerns (URL 67).

Data Policies
A data policy is a list of philosophical statements and procedures that describe the beliefs and regulations of an agent concerning the production, sharing and reuse of data. Some sources for policies can be found in Table 2 and at Biosharing.org (URL 68). Discipline-based efforts to accommodate data-sharing problems and funder-mandated protocols have led to a piecemeal array of policies (Field et al. 2009). Researchers may feel that their ability to comply is limited by inadequate funding, time, software, hardware, expertise or personnel. Many data-sharing policies are ignored or even resisted (Savage & Vickers 2009). An impediment to sharing data across the Life Sciences is the absence of an over-arching framework for data-sharing. Generalized policies that can be applied widely within the Life Sciences can be used to promote desirable trends of community participation, infrastructures, tools and repositories that achieve consistency and distribute costs. The development of such policies is as much of a challenge as the technical aspects of data management. General policies need to be extendable to suit the needs of the subdisciplines or agents involved. Field et al. (2009) recommend the following steps in developing data policy.
1. Identify science driver(s) necessitating a formal data policy for a particular community
2. Create a working body to bring the data policy to fruition
3. Conduct an initial poll of researcher and funder priorities with respect to data policy development
4. Identify the full range of stakeholders
5. Research current policies and draw from them and the literature
6. Draft a straw man document and define key aspects of the policy
a. Scope of policy
b. Applicability


c. Funding levels required
7. Subject the straw man to internal and external rounds of consultation followed by iterative improvement
8. Obtain formal sign-off or endorsement of a final draft by the organization, post the final draft onto an appropriate public website, and publicize it
9. Set into motion support for the policy
10. Monitor compliance and enforce the policy
11. Extend the policy to cover subareas of science/data as required
12. Evolve or deprecate the policy as required
The following list of key issues (Step 6 above) was developed during a Data Conservancy (Life Sciences) workshop held in the summer of 2010 that was attended by computer, life and information scientists from academia, government and the private sector. The issues are:
1. Scientists have the right to first use of data they produce
2. Data providers should have a choice of licenses
3. Repositories must provide access to data they hold
4. Scientists must receive attribution when data they produce are used
5. Open formats should be used
6. Existing standards should be used where available
7. Tools and formats used should be free and open-source
8. Data should be collected with interoperability in mind
9. Some data are valuable and should be preserved over time
The above can be embellished to create more granular data policies that meet the needs of subdomains. Once the scope and purpose are agreed, the key issues can be embedded in a Data Policy template (Box 1) and distributed to interested parties. As data policies become effective they help to define the data cultures.


1 Overview
Title of plan; Author; Date; Revision; etc. Project name; Award information; Funding Agencies; etc.; reference to main proposal
2 Expected Data
2.1 Data
What data get created by the project, and in what form? What raw data are generated, and what processed data are generated? What data are expected to be managed by the project for sharing and later archiving? Who is expected to use the (shared) data?
2.2 Data Formats
What data formats will be used for the data generated? What tools will be required to read the data?
2.3 Data Generation / Acquisition
How are the data generated and how are they acquired? What quality control/standards are applied to data generation, acquisition and storage? When are data generated? What is the frequency and rate of data generation?
2.4 Software
What software does the project create? What will be managed and what won't be managed? Will software be archived? Will software be made available for sharing? Will there be any licensing, and if so what?
2.5 Documentation and Metadata
What data and metadata standards will be employed? How will metadata be generated (automatically, manually, or both)? How will metadata be stored and managed? How will unique identifiers be managed? What naming schemes will be used? (Data Dictionaries/Taxonomies/Ontologies?)
3 Data Storage and Preservation
3.1 Storage and Backup During the Project
Who is responsible for the stored data? Who is responsible for data backups? What digital and non-digital data will be stored? Where will the data be stored and backed up, and what policies will be in place? What will be the access controls on stored data? What are the backup procedures for data generated?
3.2 Data Capacity / Volume
Volumes of data and rates of creation and ingestion?
3.3 Security
Are there any data with specific security issues? How will security be enforced in the system?
3.4 Operational Storage Post-Project Completion
How will data be stored after the project has been completed? What mechanisms, policies, agreements, etc. will be used to manage data after the project has been completed?
3.5 Long Term Archiving and Preservation
What data will be archived? Where will data be archived? Who will manage and administer the archive? What metadata will be required? What will be the access controls? What will be the retention and disposition policies?
3.6 Roles and Responsibilities
Who makes decisions regarding the overall data management (e.g., PI)? Who makes decisions regarding day-to-day data management (e.g., PI)? What is the role and responsibility of the organization that preserves the data?
4 Data Retention
How long will each type of data be kept, and why? When will data be made available for sharing? Are there any data embargoes, and if so what? When will the data be made public? What is the archival lifecycle and retention policy for archived data?
4.1 Operational Data
Who will be responsible for the data in the near term following project completion?
4.2 Archival Data
Who will be responsible for the data for long-term archiving (beyond the most active use of the data)?
5 Data-Sharing and Dissemination
What data will be shared? When will data be shared? What restrictions are there on subsequent data use? How will the data be made available? What metadata will be generated to ensure the data are accessible?
5.1 Stakeholders
Who will data be made available to? What data will be made available to which stakeholders?
5.2 Privacy and Confidentiality
Are there any data with privacy issues? Are there any data relating to human subjects, and what policies need to be adhered to? How will any such privacy requirements be enforced?
5.3 Ownership, Copyright and IP
Are any of your data copyrightable (i.e. non-factual in nature)? If so, who holds that copyright (e.g. PI, university, funder)?
5.4 Third Party Data
Are any of the data owned by someone else? What are the conditions of use, sharing and dissemination?
5.5 Legal and Regulatory
Describe any other legal and/or regulatory constraints on sharing and dissemination of data.
5.6 Re-Use
What is the policy on re-use of the data, citations, and production of derivatives?
5.7 Ethical Requirements
Does this work involve human subjects, and if so what policies and procedures must be adhered to? What other ethical requirements are in place for the data generated?

Box 1. Elements of a data management plan based on the "Data Conservancy" template (URL 69).
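The template above is a list of questions, but plans captured in a machine-readable form can also be registered, indexed and compared. The following is a minimal, hypothetical sketch in Python of how the top-level sections of Box 1 might be recorded as a structured document; the field names and values are illustrative only and are not part of the Data Conservancy template.

```python
import json

# Hypothetical, minimal machine-readable skeleton mirroring the top-level
# sections of the Box 1 template. Field names and values are illustrative only.
data_management_plan = {
    "overview": {"title": "Example plan", "author": "A. Researcher", "date": "2011-03-01"},
    "expected_data": {"data_types": ["raw counts", "processed growth rates"],
                      "formats": ["csv"], "metadata_standard": "EML"},
    "storage_and_preservation": {"backup": "nightly, off-site",
                                 "archive": "domain repository"},
    "data_retention": {"embargo_months": 12, "retention": "10 years"},
    "sharing_and_dissemination": {"license": "CC0", "access": "open after embargo"},
}

# Serializing the plan makes it easy to register, index, or compare plans.
print(json.dumps(data_management_plan, indent=2))
```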


TECHNOLOGICAL ISSUES

The second array of challenges that will need to be addressed as we move towards a Big New Biology comprises the technical issues that affect the accessibility and reuse of data.

Making data accessible


The effective reuse of data requires that an array of conditions (Fig. 8) be optimized.

Figure 8: A Big New Biology can only emerge with a framework that optimizes reuse. Ideally, data should be in forms that can flow from source into a common pool and can flow back out to consumers, be subject to quality control, or be enhanced through analysis to rejoin the pool as processed data.

Data are digital. Digitization is a prerequisite for data mobility. As noted earlier, considerable amounts of relevant data are not yet in a digital format. Non-digital formats include notes, books, photographs, micrographs, papers, and specimens. Digital metadata about non-digital materials have value as they make the data discoverable and increase incentives for digitization.

Data are structured. Digital data may be unstructured (e.g. in the form of free text or an image) or they may be structured into categories that are represented consecutively or periodically through the use of a template, spreadsheet or database. The simple structure of a spreadsheet allows records to be represented as rows. Each record contains data in categories defined by metadata (headers) at the top of each column. Data occur within the cells formed by the intersection of a row and a column. A source may mix structured and unstructured data, such as when fields include free-form text, images, or atomic data. Unstructured data, such as the legacy data to be found in the estimated 500 million pages of text, can be improved through annotation with metadata, and this can be achieved by curators or by applying tools (such as natural language processing tools) that discover elements that can be treated as metadata.

Data are normalized. Normalization brings information contained within different structures to the same format (or structure). Normalization may be as simple as consistently using one type of unit. Placing data within a template is a common first step to normalization. Normalization is a prerequisite for aggregating data. When data are structured and normalized, they can be mobilized in simple formats (tab-delimited or comma-delimited text files) or can be transformed into other structures to meet agreed upon standards.
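As a minimal illustration of the structuring and normalization steps described above, the short Python sketch below takes records whose lengths were reported in mixed units and rewrites them into a single structured, normalized form; the field names and unit choices are illustrative assumptions, not part of any particular standard.

```python
import csv
import io

# Hypothetical raw records with heterogeneous units (illustrative only).
raw_records = [
    {"specimen": "A1", "tail_length": "5.3", "unit": "cm"},
    {"specimen": "B7", "tail_length": "48", "unit": "mm"},
    {"specimen": "C2", "tail_length": "0.061", "unit": "m"},
]

TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}  # conversion factors to millimetres

def normalize(record):
    """Return the record with tail length expressed in a single unit (mm)."""
    value_mm = float(record["tail_length"]) * TO_MM[record["unit"]]
    return {"specimen": record["specimen"], "tail_length_mm": round(value_mm, 2)}

# Write the normalized records as a simple comma-delimited table.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["specimen", "tail_length_mm"])
writer.writeheader()
for rec in raw_records:
    writer.writerow(normalize(rec))
print(buffer.getvalue())
```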


DiGIR is an early example of a data transformation tool (URL 70). More contemporary tools, such as TAPIR or the IPT from GBIF (URL 71), can output data in an array of normalized forms.

Data are standardized. Standardization indicates compliance with a widely accepted mode of normalizing. Standards provide terms that define data and relationships among categories of data. Two basic types of standards that are indispensable for the management of biological data are metadata and ontologies.

Metadata are terms that define data (data about data) in ways that may serve different purposes, such as helping people to find data of relevance (discovery; Michener 2006) or to bring data together (federation). Metadata standards articulate how data should be named and structured, thus reducing the heterogeneity of terms. Standards mandate the types of metadata that are appropriate for different types of observations. Sets of metadata terms agreed upon by a community are referred to as controlled vocabularies, one of the most extensive bearing on the Life Sciences being the Ecological Metadata Language (EML; Fegraus et al. 2005). By articulating what metadata should be applied and how they should be formatted, standards introduce the consistency that is needed for interoperability and the context for machine reasoning. For example, a marine bacterial RNA sequence collected from the environment ideally might be accompanied by metadata on location (latitude, longitude, depth), environmental parameters, collection metadata (collection event, date of collection, sampling device), and an identifier for the bacterium. Without such metadata, the scope of possible queries is much reduced. Examples of minimum reporting requirements have been established by the MIBBI project (Taylor et al. 2008). Numerous metadata guides are available within the Life Sciences (Table 3). Software programs are available to assist in the collection and organization of metadata (such as Morpho, URL 72, Higgins et al. 2002; Metacat, URL 73, Jones et al. 2002; MERMAid, URL 74).

An ontology is a formal statement of relationships among concepts (represented by metadata terms) that allows for the discovery of data through relationships. Ontologies may use formal descriptive languages to define relationships within systems of metadata. Ontologies are regarded as having great promise (Madin et al. 2007b): "An ontology makes explicit knowledge that is usually diffusely embedded in notebooks, textbooks and journals or just held in academic memories, and therefore represents a formalization of the current state of a field. If ontologies are properly curated over the longer term, they will come to be seen as modern-day (albeit terse) textbooks providing online and up-to-date biological expertise for their area. In another sense, they will provide the common standards needed for producing a strong biological framework for integrating data sets. Ontologies therefore provide the formal basis for an integrative approach to biology that complements the traditional deductive methodology" (Bard & Rhee 2004).
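To make the marine example above concrete, the following Python sketch writes one sequence record together with the kinds of metadata listed in the text (location, collection event, sampling device and an identifier); the field names and values are illustrative and are not drawn from EML or any other specific standard in Table 3.

```python
import csv

# One environmental sequence record annotated with the metadata categories
# mentioned in the text. Field names are illustrative, not a formal standard.
record = {
    "sequence_id": "SEQ-000123",
    "organism_identifier": "uncultured marine bacterium",
    "latitude": 41.52,
    "longitude": -70.67,
    "depth_m": 15,
    "collection_event": "CRUISE-2010-07-STN4",
    "collection_date": "2010-07-14",
    "sampling_device": "Niskin bottle",
    "temperature_c": 18.2,
}

# Writing the record with explicit headers keeps data and metadata together,
# which is what makes later discovery and federation possible.
with open("sequence_records.csv", "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=list(record))
    writer.writeheader()
    writer.writerow(record)
```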

Standard | Type | Location
ABCD | Schema | http://www.bgbm.org/TDWG/CODATA/Schema/default.htm
Bioontology and BIRN | Ontology Repository | http://www.bioontology.org/ and http://www.birncommunity.org/
Cardiac Electrophysiology Ontology | Ontology | http://bioportal.bioontology.org/ontologies/39038
CMECS (Coastal and marine ecological classification standard) | Vocabulary | http://www.csc.noaa.gov/benthic/cmecs/cmecs_d
Comparative Data Analysis | Ontology | http://sourceforge.net/apps/mediawiki/cdao/index.php?title=Main_Page
Darwin Core | Schema | http://wiki.tdwg.org/twiki/bin/view/DarwinCore/
Dublin Core | Metadata | http://dublincore.org/
Ecological Metadata Language | Metadata | http://knb.ecoinformatics.org/software/eml/
Environment Ontology | Ontology | http://www.environmentontology.org/
Evolution ontology | Ontology | http://code.google.com/p/evolution-ontology/
Experimental Factor Ontology | Ontology | http://www.ebi.ac.uk/efo/
Federal Geographic Data Committee | Metadata Standards | http://www.fgdc.gov/
Fungal Anatomy | Ontology | http://www.yeastgenome.org/fungi/fungal_anatomy_ontology/
Gene Ontology | Ontology | http://www.geneontology.org/
Homology ontology | Ontology | http://bioportal.bioontology.org/ontologies/42117
Hymenoptera Anatomy Ontology | Ontology | -
HUPO | Standards Body | http://www.psidev.info/index.php?q=node/159
Infectious Disease ontology | Ontology | http://www.infectiousdiseaseontology.org/Home.html
International Standards Organization | Standards Body | http://www.iso.org
Marine Metadata Interoperability | Metadata | http://marinemetadata.org/
Microbiological Common Language | Vocabulary | Verslyppe et al. 2010
Miriam | Metadata | http://www.ebi.ac.uk/miriam/main/datatypes/
National Biodiversity Information Infrastructure | Metadata | http://www.nbii.gov/portal/community/Communities/NBII_Home/
Ontology of Microbial Phenotypes | Ontology | http://sourceforge.net/projects/microphenotypes/
Open Biological and Biomedical Ontologies | Ontology Repository | http://www.obofoundry.org/
Phenotype Quality Ontology | Ontology | http://obofoundry.org/wiki/index.php/PATO:Main_Page
Plant Ontology | Ontology | http://www.plantontology.org/
SDD | Schema | http://wiki.tdwg.org/twiki/bin/view/SDD/Version1dot1
Species Profile Model | Schema | http://wiki.tdwg.org/SPM
TaxonConcept | Ontology | http://lod.taxonconcept.org/ontology/txn.owl
Taxonomic Concept Schema | Schema | http://www.tdwg.org/activities/tnc/tcs-schema-repository/
TDWG | Standards Body | http://www.bgbm.org/TDWG/acc/Referenc.htm
Teleost Anatomy Ontology | Ontology | https://www.phenoscape.org/wiki/Teleost_Anatomy_Ontology

Table 3: Examples of standards and their locations

Ontologies are part of 'Knowledge Organization Systems'. Those relating to biodiversity have been discussed by Morris (2010). Ontologies contribute to the semantic annotation of data and the artificial intelligence it enables. As an example, a simple search for information about the bird, robin, seeks to match some or all of the character string r-o-b-i-n to character strings in the text of a data object or of its annotations. The system cannot discriminate among data on American robins, European robins, Robin Reliant cars, Robin Wright Penn, or Robin the boy-superhero. However, if the query for 'robin' can be placed in the context of an ontology, such as one that declares that the robin in question is a member of the Turdidae, an informed computer could use this to return only relevant results. In addition to more precise searching, ontological structures allow the computer to perform inference, a form of artificial intelligence.

For example, an ontology that establishes that Turdidae is_a bird and wing is part_of a bird allows the inference that an American robin has wings and that data may be discoverable. Larger interconnected ontologies allow the assembly of more complex inferences. Many ontological structures are available for use in the Life Sciences (Table 3). Some, such as the observational (URL 75, URL 76, URL 77) and taxonomic ontologies (below), have broad applicability, the first within the field of ecoinformatics and the second within biodiversity informatics. Users can adopt existing structures or create their own using an ontology editor such as Protégé (URL 78) or OBO-Edit (URL 79). The search engines Swoogle (URL 80) and Sindice (URL 81) search over 10,000 ontologies and can return a list of those that contain a term of interest. Services such as these help users to determine whether an existing ontology will meet their needs. Often, a user may need to use parts of existing ontologies or merge several ontologies into a single new one. Defining relationships between terms in different ontologies can be accomplished through the use of automated alignment tools such as SAMBO and KitAMO (Lambrix & Tan 2008). The development and integration of ontologies is best carried out using formal languages (such as OWL, URL 82) and by individuals versed in their logical foundations.

Standards in the biodiversity sciences are well served by the Biodiversity Information Standards (TDWG) organization, initiated in 1985 (URL 83). TDWG has been a prime mover in developing organizational frameworks for biodiversity information. GBIF has also been a source of standards innovation and development. Their intent is to provide a common framework for federating data and therefore for data reuse. Unfortunately, there may be competing systems of standards, and not all aspects of biology have established standards. Various efforts are under way to create broad-scope ontologies (URL 84, URL 85, URL 86). The promise of ontologies is as yet not fully realized, as "The semantic web is littered with ontologies lacking ... data" (Joel Sachs, pers. comm.). Next-generation tools and interfaces hopefully will be better fitted for use by general biologists.

The most extensive system of potential metadata for the Life Sciences is the latinized binomial names (such as Homo sapiens) introduced for species in the mid-18th century by Linnaeus. They have been used since then to annotate virtually every statement about any of our current catalog of 2.2 million living and extinct forms of life. Inevitably, they will be replaced by molecular identifiers, but at this time they are well suited to form the basis of a names-based cyberinfrastructure for Biology (Patterson et al. 2008, 2010). This approach has been used for life-wide data-organization projects such as the Encyclopedia of Life (URL 87). Placement of names within hierarchical classifications offers ontological frameworks to organize the names. The conversion of names into a formal ontology has been explored through projects such as ETHAN (URL 88). Our current understanding of biodiversity and the system of names is maintained by a specialist group of 5,000-10,000 professional taxonomists worldwide (Hopkins & Freckleton 2002), who generally are unaware of the informatics potential of names as a near universal indexing system for biological data.
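To illustrate the is_a / part_of reasoning described above for the American robin, here is a minimal, self-contained Python sketch of transitive inference over a toy ontology; the tiny set of relations is invented for illustration and is not taken from any of the ontologies in Table 3.

```python
# A toy ontology: each fact is a (subject, relation, object) triple.
# The relations and terms are illustrative only.
facts = {
    ("American robin", "is_a", "Turdidae"),
    ("Turdidae", "is_a", "bird"),
    ("wing", "part_of", "bird"),
}

def is_a_closure(term, facts):
    """Return every class the term belongs to, following is_a links transitively."""
    classes, frontier = set(), {term}
    while frontier:
        current = frontier.pop()
        for subject, relation, obj in facts:
            if relation == "is_a" and subject == current and obj not in classes:
                classes.add(obj)
                frontier.add(obj)
    return classes

def has_part(term, part, facts):
    """Infer whether 'term' has 'part' because some class it belongs to has it."""
    classes = is_a_closure(term, facts) | {term}
    return any(rel == "part_of" and s == part and o in classes for s, rel, o in facts)

# The machine can now infer a fact that was never stated directly.
print(has_part("American robin", "wing", facts))  # True
```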

The Global Names Architecture is a new global initiative that links names databases and associated services to deliver names-based services to end users (Patterson et al. 2010).

Data are atomized. Atomization refers to the reduction of data to minimal semantic units. In such a form, data may exist as numerical values of variables (e.g. length of tail: 5.3 cm), binary statements (e.g. chloroplasts: absent), or associations with metadata terms from agreed-upon vocabularies (e.g. "has lodicules of lower floret of pedicellate spikelet of tassel", Zea mays ontology ID ZEA:0015118, URL 89). Atomized data on the same subject can be brought together if the data are classified in a standard way. Atomization is necessary for most types of analysis of data from one or more datasets. Atomized data stand in contrast to complex data such as images or large bodies of text. Data centers can foster atomization by providing services that transform data sets. Many older data centers capture data as files (or packages of files), and the responsibility for extraction of data atoms falls to the user. This can be time consuming if there is no universal format for files, suggesting that, in the future, atomization needs to occur at or near the source of raw data, becoming part of the responsibilities of the author of the data, the software in which data are logged, or data centers.

Data are published. Projects participating in a Big New Biology will increasingly make data visible and accessible (i.e. published). Scientists may publish data by displaying them in unstructured or structured formats on local, project, or institutional web sites. The scientists may take no responsibility for shifting the data to a central repository. In science generally, over three-quarters of published data are held in local repositories (Science staff editorial 2011). Local archives can provide few guarantees of persistence (see 'Data are archived' below), and in such environments the responsibilities for discovery of data, negotiations with copyright holders, and acquisition of data rest with the consumer. This is time-consuming and unlikely to be done on a large scale. Publication is better served through the use of central, domain-specific repositories because they are more likely to persist, provide better services, and offer the framework around which third parties develop value-adding services. The molecular data environment of the INSDC consortium is a good example of this model. Only a small fraction of data are deposited in such environments (less than 10% across the science community generally; Science staff editorial 2011), with costs and the absence of an organizational framework (metadata and archiving environments) being cited as reasons. There are repositories for heterogeneous datasets (such as oceanographic databases, URL 90, URL 91, URL 92), but increasingly it will be more rewarding to publish via repositories that provide the services that will facilitate reuse. Such services will include data standardization, quality control, and atomization. Given the desire to intercept the data life cycle as close to source as possible, repositories or their agents can develop data capture tools that are ideally provided with services (APIs) to export data to the central repositories.

Publication of atomized data is essential for large scale data reuse. Data must be able to move from one computer to another in an intelligent way. Scientific initiatives can add RSS feeds, web services, and APIs (Application Programming Interfaces) to their web sites to broadcast new data and to respond to requests for data. An API facilitates interaction between computers in the same way that a user interface facilitates interactions between humans and computers. These additions incur overhead and are probably best served through community repositories. Without such services, data may need to be 'screen scraped' from the web site, a process that is usually costly (because the solution for each site will differ) and, at worst, may require manual copying.

Data are archived. It is preferable that data, once published, are persistent. Projects, initiatives and host institutions have little incentive to preserve data for the long term as the process incurs a cost, and repositories that emerge within projects may have limited life spans (e.g. OBIS, URL 93). Central repositories that are not dependent on short-term funding are better positioned to archive data, making them persistent. The three global molecular databases that make up the International Nucleotide Sequence Database Collaboration provide an excellent example of how domain-specific repositories may operate. Because they are not funded through short-term projects, and because they mirror each other, such repositories guarantee the persistence of data, and empower scientists to develop projects that involve substantial analyses of shared data (Tittensor et al. 2010). Persistence can be assisted by components (libraries and museums) that specialize in the preservation of artifacts or by governmental intervention (the US-based National Institutes of Health support GenBank). An alternative solution to persistence is an effective business model that allows a data-center to be sustained by income from services that it sells, or by providing essential services that ensure support from the community of users. Examples of

commercial models include the Chemical Abstracts Service of the American Chemical Society (URL 94) or Thomson Reuters' Zoological Record (URL 95).

Data are free and open. Open Access, the principle of providing unconstrained access to information on the web, improves the uptake, usage, application and impact of research output (Harnad 2008). Open Access has been applied widely to the process of publication, where it is seen as an alternative to the model in which publishers act as gatekeepers. Open Access has been applied less to data, and while this extension is natural, it is not straightforward (Vision 2010). Attitudes about sharing data freely within the Life Sciences vary broadly. In sub-disciplines like genomics, data sharing is the norm, with some researchers sharing their data immediately via blogs or wikis. Communities that value data sharing may have no formal recognition for such activities nor supportive technical infrastructure. Other communities have a strong sense of data ownership and are antagonistic to open data sharing. Researchers in these communities expect to be directly involved in any further analysis of their data.


Databanks for these communities often require registration and/or a fee to gain access. Some data may be regarded as too sensitive to be made fully accessible (Key Perspectives Ltd. 2010). Web-accessible Life Sciences data are acquired through four main routes (Key Perspectives Ltd. 2010):
1. Website of the journal in which the data are published. This is typically in pdf form, which is not ideal for reuse.
2. Website of an individual researcher or group. The quality is often good, but the data often do not comply with standards; data at the sites can be hard to find and navigate.
3. Web-based databases maintained by individuals or groups. These are often funded on a short-term basis. Typically they contain data from a project or from colleagues and collaborators. This is probably the most abundant type of database. Coverage is far from comprehensive, and data may not comply with standards.
4. Public databanks, with the molecular databases being the best examples, but there are also major repositories for records in the context of geospatial metadata (URL 12, 94).

Data are trusted. Once data are accessed, consumers may reveal errors and/or omissions. Biological data can be very 'dirty', especially if they were acquired without the expectation that they would be shared later. Any data cleaning procedures should be documented to aid the consumer in assessing whether the source is 'suitable for their purpose' (Chapman 2005b). The creation of 'quality loops' allows comments to flow back to source, where data can be annotated or modified, and returned to users for renewed vetting. Webhooks (URL 96) offer a mechanism to exploit APIs to have comments returned to source. Any editing of data can lead to the undesirable outcome that variant forms of the same data may co-exist. To some extent, 'versioning' of data sets can be used to discriminate between modified datasets. Users can cite the version they called upon for their analyses.

Data are attributed. Scientists gain credit in part through attribution. The permanent

association of identifiers with data offers a means of linking attribution to the data and of tracking reuse. The association of authors' names with data motivates contributions (or lack of credit demotivates them). Attribution favors the development of quality loops to correct errors or otherwise comment on the data. Special care is needed when attributing data resulting from the combination of one or more existing sets so that all intellectual investment is properly credited. Dryad, a JDAP partner, provides data citations through the use of DataCite DOIs with an unrestrictive Creative Commons Zero license, thus promoting clear citation and reuse of data (Vision 2010).

Data can be manipulated. A value of having large amounts of data available on the web is that it allows users to explore, in addition to search for, data. Data exploration can be used to vet datasets, check a hunch (hypothesis), or simply indulge basic curiosity.

A desirable component of data-environment pools is a set of tools that draw data together, analyze them, or visualize them. Off-the-shelf software packages such as Microsoft Excel, which are easy to use, are unlikely to meet all of the challenges of research. Exploratory systems that are flexible include: Humboldt (Kobilarov & Dickinson 2008), which operates like a faceted filter for Linked Data; Parallax, which accesses data in Freebase and has the ability to interact with data on multiple web pages at once (Huynh & Karger 2008); and Microsoft Pivot (URL 97), which allows a user to interact with large amounts of data from multiple Internet sources. Visualizations have the capacity to reveal patterns, discontinuities and exceptions that can inform us as to underlying biological processes, the appropriateness of data sets, or the consistency of experimental protocols. Visualizations can be used to display the results of analyses of large data sets. Many Life Sciences data sets can be drawn together and visualized using their geospatial elements, as in LifeMapper (URL 98). Through visualizations we may help address the challenge stated by Fox and Hendler (2011) that "... many of the major scientific problems facing our world are becoming critically linked to the interdependence and interrelatedness of data from multiple instruments, fields and sources." The absence of effective visualization is creating a bottleneck within data-intensive sciences (Fox and Hendler 2011). Solutions range from relatively simple low-end visualizations (as wonderfully catalogued in URL 99) to high-end tools designed for the data deluge, which may themselves call on graphics and visualization standards to be pipelined into rich, complex, and flexible aids.

Data are registered and discoverable. Registries index data resources to alert potential users to their availability. Search engines, the normal indexers of web-accessible materials, are not good at revealing database contents; only about half of the open data in repositories are indexed by search engines (McCown et al. 2006). Discovery is made possible by the addition of coarse-grained discovery metadata. Registry functions need to add and expose discovery metadata to make datasets more visible. As an example, GBIF provides registry-level services for biodiversity data (URL 100). Registries that cover software (URL 101, URL 102) or web services (URL 103) are valuable in promoting awareness of tools for data capture, conversion and processing. Successful domain repositories, such as GenBank, have well-structured and detailed metadata that enable detailed search and enhanced discoverability. In the absence of such registries, researchers turn to peers, publications or the thousands of minor data sets available via the Internet. Under these circumstances, it is hard to know when, or if, all relevant data are found. There is a need for a broad-spectrum registry and indexing service (like a Google for data) where researchers can post pointers to their own data, search for desired data and have a means to quickly preview the results. Examples of this exist in Europe with OpenDOAR (URL 104) and in India with the Database of Biological Databases (URL 105), each with thousands of listings. Semantic annotation of data greatly increases discoverability, and is discussed below.
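As a sketch of the kind of broad-spectrum, machine-queryable registry called for above, the Python fragment below asks a registry for datasets about a taxon and previews the hits. The endpoint, parameters and response fields are hypothetical; real registries such as GBIF or OpenDOAR each have their own interfaces.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical registry endpoint and parameters (for illustration only).
REGISTRY_URL = "https://registry.example.org/api/datasets"

def find_datasets(taxon, limit=5):
    """Query a (hypothetical) data registry and return brief dataset previews."""
    query = urllib.parse.urlencode({"taxon": taxon, "limit": limit})
    with urllib.request.urlopen(f"{REGISTRY_URL}?{query}") as response:
        results = json.load(response)
    # Assume each hit carries minimal discovery metadata: title, license, url.
    return [(hit["title"], hit["license"], hit["url"]) for hit in results.get("hits", [])]

if __name__ == "__main__":
    for title, license_, url in find_datasets("Turdus migratorius"):
        print(f"{title} [{license_}] -> {url}")
```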

Data Conservancy: Data Issues in the Life Sciences (March 2011)

34

Reusing Biological Data


With the exception of molecular data, there is not a well-developed tradition of repurposing open biological data. One reason is the dominance of the narrative tradition in biology. Traditionally, narrative biologists collect a tiny fraction of the available data, and do so selectively to emphasize representative, exemplary or outlying observations. They interpret data as published conclusions, in which significant credit is gained for original and confirmed intuitive leaps. In this narrative approach, science is

assembled (in part) through the retention, improvement, or rejection of the stories. A key aspect of the narrative approach is that the data upon which the stories are based are rarely preserved or even recorded. Instead, value has traditionally been associated with manuscript publication. In most sub-domains of biology, biological data are not held in digital form, nor have investments been made in an infrastructure for proper curation. Transferring data from manuscript (paper or pdf) and filing cabinet to structured digital files, web-accessible databases, and semantic-web-enabled repositories will require a change in culture and a significant additional investment in data management.

A second factor that deters reuse of data is the complexity of the subject and the fact that biological data often only make sense in the context of many parameters, conditions and terms expressed over broad temporal and spatial scales. If data are to be reused to address complex questions, they need to be intensively annotated with consistent metadata, and these are generally lacking (Jones et al. 2006). The challenges in achieving consistency of data and metadata can be illustrated with a conceptually simple parameter such as growth rate.

1. Ambiguity of terms - Growth rate can refer to two different concepts: an increase in the number of individuals over time or an increase in the size or mass of individuals. The precise meaning is rarely explicitly defined within data sets. Data can only be shared if the meaning is disambiguated.

2. Heterogeneity of metadata - Growth rates can be measured either in experimental or in 'field' contexts. The rate of growth can be influenced by many factors, such as ambient temperature, location, available food, life cycle stage, competitors, and so on. A considerable body of

metadata, entered in a consistent way, is necessary if a user wishes to rely on data that are strictly comparable.

3. Derived data - Growth rates are rarely measured directly. Raw data may be collected as a change in mass or in numbers of individuals over time, and growth rate is calculated from such data. There are many formulae to calculate growth rate, including:

(ln x2 - ln x1)/(t2 - t1)
birth - death
dN/dt = rN
P(t) = 1/(1 + e^-t)

Raw data in isolation, such as the number of organisms present at a given time, are meaningless without the other measurements used to make the calculations. Metadata describing the experimental method for determining growth rate, the specific calculations, and statistical information about those calculations must also be captured.

4. Heterogeneity of units - Growth rate data can be reported using multiple units, from pounds per year to h-1. Within reason, units must be interoperable to promote data aggregation and reuse.

In summary, the diversity of data practices presents major challenges to the reuse of data. The absence of an infrastructure that can cope with this heterogeneity hinders the application of computational solutions to broad biological problems. The costs of adding metadata and ontologies, of normalizing and standardizing data, and of extracting data from the narrative will be considerable, and the task unfamiliar to many. The discipline remains in need of dialog to determine the most cost-effective ways to integrate our past efforts, and to align our current efforts with the vision of a data-intensive future.
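The derived-data and unit problems above can be made concrete with a short Python sketch that computes a specific growth rate from two abundance measurements using the first formula listed, normalizes time to days, and keeps the methodological metadata with the result; the field names and values are illustrative only.

```python
import math

# Raw observations: abundance (cells per mL) at two times, reported in hours.
x1, t1_hours = 1.0e4, 0.0
x2, t2_hours = 8.0e4, 72.0

# Normalize the time unit before deriving anything (heterogeneity of units).
t1_days, t2_days = t1_hours / 24.0, t2_hours / 24.0

# Specific growth rate: (ln x2 - ln x1)/(t2 - t1), here in units of day^-1.
growth_rate = (math.log(x2) - math.log(x1)) / (t2_days - t1_days)

# Keep the calculation and its context with the derived value, so the datum
# remains interpretable when reused (illustrative metadata fields).
result = {
    "growth_rate_per_day": round(growth_rate, 3),
    "formula": "(ln x2 - ln x1)/(t2 - t1)",
    "measurement": "change in cell abundance",
    "context": "laboratory culture, 20 C",
}
print(result)
```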

The Semantic Web and Big New Biology


The semantic web has many definitions, but here we think of it as a technical framework that promotes automated sharing and re-use of data across disciplines (Campbell & MacNeill 2010). A semantic infrastructure will permit machine-mediated answers to more complex queries than at present (Stein 2008). The semantic approach has advantages of being flexible, evolvable, and additive. The foundations for automated reasoning lie in the annotation of data with agreed metadata, linked through a network of ontologies, and queried using conventions (languages) such as RDF, OWL, SKOS and SPARQL (Campbell & MacNeill 2010). The mass of appropriately annotated data that can be accessed through the Internet is referred to as LOD (Linked Open Data). Through common metadata, the data can be linked to form a Linked Open Data Cloud (Fig. 9).

Berners-Lee has promoted four guidelines for linked data (Berners-Lee 2006):
1. The use of a standard system of URIs as names for things
2. The use of HTTP URIs so that the names can be looked up and the data accessed
3. When a URI is looked up, it should return useful information using standards (RDF, SPARQL)
4. Include links to other URIs so that they can discover more things
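Guidelines 2 and 3 amount to making identifiers dereferenceable over HTTP. The Python sketch below requests an RDF representation of a resource via content negotiation; the URI is hypothetical and the Accept header value is one common choice, not a requirement of the guidelines.

```python
import urllib.request

# Hypothetical HTTP URI naming a thing (guideline 2). Dereferencing it with an
# Accept header asks the server for a machine-readable (RDF) representation
# rather than an HTML page (guideline 3).
uri = "http://example.org/taxon/Turdus_migratorius"

request = urllib.request.Request(uri, headers={"Accept": "text/turtle"})
with urllib.request.urlopen(request) as response:
    rdf_document = response.read().decode("utf-8")

# The returned RDF should itself contain links to further URIs (guideline 4).
print(rdf_document[:500])
```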


A URI (Uniform Resource Identifier) is a type of persistent identifier (see below) made up of a string of characters that unambiguously (at least in an ideal world, see Booth 2010 for discussion) represents data or metadata and can be used by machines to access the data. Different data-sets can be linked if they share the same URIs. For example, several marine data-sets could be linked by using the same URIs for an investigator or a sampling event. The classes of terms most likely to serve the needs of the Life Sciences are georeferences (which can link data from the same location held in different repositories); names of taxa (the common denominator of the majority of statements about biodiversity); identities of people, which can be interconnected through devices such as FOAF (Friend-of-a-Friend) to find collaborators and relevant data; and the scientific literature, which can be linked through devices such as DOIs to show citation trends, influential publications, etc. (Fig. 10).

Figure 9: Linked open data cloud diagram, by Richard Cyganiak and Anja Jentzsch http://lod-cloud.net/. The circles represent sources of data and the arrows show how they are connected. DBpedia (a central hub) features the contents of Wikipedia in a structured form.

RDF is a language that defines relationships between things. Relationships in RDF are usually made in three parts (often called triples): Entity:Attribute:Value. The Entity refers to, for example, the organisms, parts or collectives of organisms to which the statement refers. The Attribute defines what the statement is about, and the Value provides the datum. A machine-readable form in RDF may be a statement that American robin:has_color:red. Each term is ideally stringently defined by controlled vocabularies and ontologies, and each part within the triple is represented as a URI. The Value can be a URI or a literal, the actual value. An advantage of RDF is that it allows datasets to be merged, for example TaxonConcept and Wikipedia (URL 106).

Figure 10: Four classes of terms can provide the means to link data relating to the Life Sciences. They are references to location, names of organisms, people, and publications.

A goal of the Linking Open Data project is to promote a data commons by registering sets in RDF. As of September 2010, the project had grown to 25 billion triples and 395 million RDF links (Fig. 9). The EU project, Linking Open Data 2, received €6.5 million to expand Linked Data by building tools and developing standards (URL 107). Transformation of data from printed narrative or spreadsheet to semantic-web formats is a large challenge. Based on existing ontologies, there is enough information to create 10^14 triples in biomedicine alone (Mons & Velterop 2009). At the time of writing, this quantity far exceeds the capacity of any system to process the information.

While the Life Sciences stand to benefit greatly from the advantages of linked data (Reichman et al. 2011), the current structure lacks mechanisms for ensuring quality, provenance and attribution. Provenance is especially important for Life Sciences data, and several software packages currently exist for tracking it (such as Kepler, URL 108; Taverna, URL 109; VisTrails, URL 110). Bechhofer et al. (2010) advocate the use of Research Objects (ROs) as a mechanism for describing the aggregation and investigation of semantic resources, with the capacity to capture the additional value necessary to make the semantic web work for science. Provenance of ROs would also satisfy recent calls for open science, where not only data but also methods and analyses are open (Reichman et al. 2011).

Semanticization enables nanopublication, a form of publication that extends traditional narrative publication (Groth et al. 2010) and allows attribution to be associated with the semantic web (Mons & Velterop 2009). Nanopublications relate to the publication of triples. A uniquely identifiable triple is a statement. A triple with a statement for a subject is called an annotation, and a set of annotations that refer to the same statement is called a nanopublication. The annotations add the attribution and context to the statement. The concept is not widely accepted.
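The robin statement above can be written as actual triples. The following sketch uses the rdflib Python package (an assumption; any RDF toolkit would do) to express the statement with URIs, merge it with a second small graph, and query the combined graph; the example.org namespaces are placeholders, not real vocabularies.

```python
from rdflib import Graph, Literal, Namespace

# Placeholder namespaces standing in for real controlled vocabularies.
EX = Namespace("http://example.org/terms/")
TAXON = Namespace("http://example.org/taxon/")

# Graph 1: the 'American robin has_color red' statement, with URIs for
# the Entity and Attribute and a literal for the Value.
g1 = Graph()
robin = TAXON["Turdus_migratorius"]
g1.add((robin, EX.has_color, Literal("red")))

# Graph 2: an occurrence record from another source that reuses the same URI.
g2 = Graph()
g2.add((robin, EX.observed_at, Literal("Woods Hole, MA")))

# Because both graphs use the same URI for the taxon, they merge cleanly.
merged = Graph()
for triple in list(g1) + list(g2):
    merged.add(triple)

# A SPARQL query over the merged graph returns facts from both sources.
query = """
SELECT ?p ?o WHERE { <http://example.org/taxon/Turdus_migratorius> ?p ?o . }
"""
for predicate, value in merged.query(query):
    print(predicate, value)
```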

Persistent or Globally Unique Identifiers (GUIDs) are used to distinguish individual data elements (Richards et al. 2011). The attachment of a globally unique persistent identifier to data can be used to declare their provenance (source). It allows an author to be identified and to gain credit (attribution). It also provides the mechanism through which questions about data can be returned to source and the record confirmed, corrected or rejected. Identifiers can be used to establish versions and to identify data that are to be deprecated (Van de Sompel et al. 2010). In a semantic world, web-resolvable identifiers can be used to represent the Subject, Attribute and even the Value of a triple, making each element of the triple unambiguous.

There are several desirable properties of globally unique identifiers. The first is that every datum has a single identifier. Secondly, the identifier should be resolvable, such that users, once they have a GUID, can view the data. An extension of this is that the GUID is dereferenceable, which means that the identifier can be converted into the data to which it refers. Once applied, a GUID should be persistent, but equally, there should be the capacity to version data. Any spreadsheet or database can include identifiers. Scientists have learned the value of a system that allows them to distinguish all records so that they can revisit data if required. To serve its role, an identifier needs to be made unique in its context, whether by instituting a system of incrementing identifiers, or by linking the identifier to a source file, and/or date, and/or author, and/or event. As we move into a semantic world, the principle extends to a requirement that the identifiers of all records be unique. Such an extension requires a universal system of unique identifiers that can be applied to any and all data. Unfortunately, as is often the case in biology, there is no single system of unique identifiers.

There are four major types of GUIDs. The first is simply a very large alphanumeric identifier, such as 5fabfc40-0c3d-11e0-81e0-0800200c9a66, a Universally Unique Identifier or UUID. On-line environments can be called upon to provide such numbers (URL 111). They are designed to be globally unique without requiring a central registry. Unless a UUID is assigned at the source and then respected, more than one UUID may become associated with a record; this is an unresolved problem. UUIDs lack any inherent property that allows them to be dereferenced. However, they can be included within other identifiers that are resolvable, or agencies can be used to provide a resolving service. A UUID can be resolved within the context of the semantic web by including it within pointers to web-visible resources.

A second type of identifier is the PURL (Persistent Uniform Resource Locator, URL 112). Some organizations (primarily the Online Computer Library Center or OCLC, which supports WorldCat, the world's largest library catalog) use PURLs that point indirectly to resources. Even if data move, the associated PURLs can remain persistent and simply activate a redirect to the new location of the data. This reduces instability.


A third category is Digital Object Identifiers (DOIs), a system of identifiers that is especially popular in publishing. Their use in identifying forward and backward citations, a service provided by companies such as CrossRef (URL 113), is a good example of the value that comes with a standard system of referencing articles. DOIs need to be bought and, as they contain no inherent resolvable elements, need to be supported by DOI-resolution services. DOIs are currently not widely applied to data, but there are exceptions such as the Pangaea database, a central archive of the World Data-Center for Marine Environmental Sciences (URL 114, URL 115), and the Dryad database (Whitlock 2011). The concept of using DOIs has appeal both as a resolvable identifier for datasets (Thorisson 2009) and for data elements (URL 116). Because of their use in the publishing world, the citation of DOIs has the potential to direct recognizable credit to scientists who make data available.

The last system, LSIDs (Life Science Identifiers), uses Internet addresses but avoids the problem of broken links by using Uniform Resource Names (URNs), a special category of URIs that identify a resource but not a particular web address. Examples of LSIDs for Emiliania huxleyi are:
urn:lsid:algaebase.org:taxname:47515
urn:lsid:marinespecies.org:taxname:115104
urn:lsid:catalogueoflife.org:taxon:ec6ab24e-29c1-102b-9a4a-00304854f820:ac2008

Note that the last includes a UUID and a version number. LSIDs are endorsed by TDWG (URL 117) and GBIF (URL 118). Clearly, one problem with GUIDs is the need to narrow the spectrum to a single system. Other issues relate to having the identifiers applied close to the source, managing GUIDs, and updating them as a component of quality loops. The responsibility to create GUIDs should not fall to individual scientists, because of the added burdens and the instability of the research base, and because of the risks of duplication and poor quality control. GUID management is best deferred to registries, repositories, or institutions. DataONE is an example of a federated data repository that provides unique identifiers for every data object (Reichman et al. 2011).
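As a small illustration of the identifier types above, the Python sketch below mints a UUID with the standard library and splits one of the LSIDs quoted above into its parts; the component labels follow the structure visible in the examples rather than a formal LSID specification.

```python
import uuid

# Mint a new Universally Unique Identifier; no central registry is involved,
# which is both the appeal and the weakness discussed in the text.
new_guid = uuid.uuid4()
print("UUID:", new_guid)

# Split an LSID (from the examples above) into its components. The labels
# used here are informal, based on the visible structure of the string.
lsid = "urn:lsid:marinespecies.org:taxname:115104"
parts = lsid.split(":")
labels = ["scheme", "urn_namespace", "authority", "namespace", "identifier", "revision"]
print(dict(zip(labels, parts)))
```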

DISCUSSION & CONCLUSIONS


A Big New Biology holds much promise as a means of integrating biological knowledge into the processes of addressing some large proximate challenges. Macroscopic tools will enable the discovery of hidden features and better descriptions of relationships within the massive complexity of the biosphere. Yet progress towards the vision varies enormously, from the successes of high-throughput biology to the lack of activity within small-science biology. Considerable effort is needed to catalog current practices and to define the sociological transformations that will be required. If the transformation is to be purposeful, then it will need general oversight, discipline-specific reviews, and a description of the

actual and desirable components of the 'Knowledge Organizational System' for Biology and their relationships. Some obvious challenges relate to standards and associated ontologies, incentivizing participation, and assembling an appropriate infrastructure and skill base.

Standards and Ontologies


Data standards bring order to the virtual data pool on which a Big New Biology will rely. While complex and finely grained metadata are needed for analyses and for the world of Linked Open Data, the first challenge is to improve the discoverability of data. This process has traditionally been supported by word of mouth at conferences or in publications. With standards, registries can enable users to find data sets containing information about taxa, parameters, times, processes, or places of interest. If metadata are absent or incomplete, then the data sets cannot be discovered or reused and cannot contribute to Big New Biology. The costs of adding consistent metadata will be considerable.

Automated data discovery, aggregation and analysis require more comprehensive standards than those currently available for many of the Life Sciences. Instead of a comprehensive system of standards, there is a piecemeal system of metadata, vocabularies, thesauri, ontologies, and data transfer schemas that overlap, compete, and have gaps. Greatest progress is being made outside the Life Sciences (such as in georeferencing), or in high-investment areas where data are born digital (such as genomics, Taylor et al. 2008). Given the richness of biodiversity and its interactions, a comprehensive system of standards will necessarily be extremely complex. This creates a tension: whether to promote the comprehensive annotation of data, with a significant overhead that deters participation, or to pursue a more palatable minimalistic annotation that can set a grander process in motion. As the commitment to standards is not widespread, the minimalistic approach is more likely to gain traction. The perspective that "The semantic web is littered with ontologies lacking ... data", noted above, also warns us against starting with complex structures. However, the assembly of metadata and their inter-relationships will need a framework that is designed to allow initial discipline-specific standards to become more finely grained and to allow the parts to merge into a dynamic grand schema. The world of linked open data provides a good model for this but, given that few data are appropriately annotated, it has yet to realize its potential.

Two organizational frameworks for data are as yet under-exploited. The first is the system of georeferencing that is in use in rich applications in the earth sciences, cartography, and so on. Information on the occurrences of species, which has been and is being collected in vast quantities by a myriad of citizen scientists, is compiled in central databases such as GBIF and OBIS. Its potential is well illustrated by some large-scale applications such as the impressive charting of bird migrations (Marris 2010), meta-analyses of oceanic biota (Webb et al. 2010), and web sites that emphasize locally relevant biota (URL 119). Less well developed, but arguably with more potential for many subdisciplines of the Life Sciences, is the

transformation of taxonomic and phylogenetic knowledge into an information management system that uses Latin names and molecular identifiers as metadata and classifications and phylogenies as ontological frameworks for the metadata (Patterson et al. 2010).

Incentives
Despite widespread calls for scientists to make data more widely available, this has yet to happen for many subdisciplines within the Life Sciences (Dittert et al. 2001, Harnad 2008, Mandavilli 2011). A current impediment to data sharing is that the benefits derived are often greater for the consumer than the producer (Porter and Callahan 1994). Only about 10% of data make their way to open repositories (Savage and Vickers 2009, Science staff editorial 2011). The reasons are the lack of resources,

infrastructure, and incentives for sharing. Sociological, financial, legal and technical barriers must be surmounted for communities to become directly involved in populating and maintaining data pools, a requisite for success and scalability. In surveys (Froese et al. 2003, Kohnke et al. 2005, RIN 2008, Costello 2009), scientists give the following reasons not to share data.
1. Intellectual property: A scientist's funding and professional recognition rely on receiving proper credit for work done. Until scientists receive credit for data publication, there will be little motivation to redirect effort from rewarding activities (such as exploring nature or writing papers) towards data mobilization. This problem can be solved with an infrastructure capable of creating citations for data, tracking data use, or temporarily delaying public access to data (Froese et al. 2003).
2. Legal/confidentiality issues: Some data cannot legally be shared, such as data concerning people (Guttmacher et al. 2009), endangered species (Froese et al. 2003), or proprietary information, or because of claims by employers or funders that they hold copyright over the data. The infrastructure must have mechanisms to protect necessary confidentiality in respect of sensitive data. Some data can be anonymised, and in the case of endangered taxa, protection can be accomplished by fuzzing data, such as by pooling data so that exact locations or identities are obscured (Froese et al. 2003).
3. Misuse or misinterpretation: A fear of misuse or misinterpretation, which, once in the literature, cannot be unpublished, is a deterrent to sharing data. This is not a new problem, but it will increase as data producers lose control and no longer act as 'gate-keepers'. Part of the solution lies in developing stringent metadata and format standards such that data are released only when there are sufficient metadata to ensure that all users understand the context and limitations of the data. Until such time, disclaimers can alert consumers about inappropriate reuse (Froese et al. 2003).


4. Exposure of mistakes: Publication can expose problems with data, such as errors in data encoding, metadata, or analysis, or the selective use of data to emphasize particular arguments. Such issues are easily addressed: given the noisy and rich nature of biology, there can be no such thing as a 'perfect data set'; all are incomplete. Errors or gaps uncovered by subsequent users can be dealt with openly and honestly, thereby enhancing the body of scientific data.

5. Sustainability: Project-based data repositories run a risk of being abandoned at the end of the funding cycle. This increases doubts within the scientific community that data curation activities are a good use of resources. Yet it is often cheaper to curate data properly than to gather them again (Heidorn 2008), and some data, such as data on past distributions of species, are irreplaceable and thus priceless. From an economic perspective, persistent discipline-specific repositories (such as the molecular databases) are very attractive.

There are considerable academic benefits from engaging with repositories. Scientists who share data often report increased book and/or photograph sales, increased web site hits, and higher visibility for their projects (Froese et al. 2003). There is greater citation impact for open-access articles (Gargouri et al. 2010). In larger consortia, scientists (such as those studying phylogenetic relationships) who pool data are able to answer questions they could not address if they were limited to the data that they themselves generated. Some publishers are incentivizing early data-sharing by granting an embargo to the data producers (Kaye et al. 2009) to alleviate fears of being scooped (Reichman et al. 2011). An emphasis on carrots such as these may be a much more effective means of promoting data-sharing than sticks in the form of funding agency requirements (Kaye et al. 2009; Table 2).
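As a concrete illustration of the data 'fuzzing' mentioned in point 2 above, the sketch below rounds coordinates to a coarser grid and pools records into counts per cell. The localities are invented, and a real policy would choose the grid size to suit the taxon and the threat.

```python
# A minimal sketch of "fuzzing" sensitive occurrence data (hypothetical localities).
# Rounding latitude/longitude to one decimal place puts each record in a cell of
# roughly 10 km, and pooling reports only counts per cell rather than exact sites.
from collections import Counter

def fuzz_coordinates(lat, lon, decimals=1):
    """Coarsen coordinates before public release to obscure exact localities."""
    return round(lat, decimals), round(lon, decimals)

records = [(41.5265, -70.6731), (41.5281, -70.6749), (41.6402, -70.8901)]  # invented
pooled = Counter(fuzz_coordinates(lat, lon) for lat, lon in records)
print(pooled)  # e.g. Counter({(41.5, -70.7): 2, (41.6, -70.9): 1})
```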
Infrastructure
In addition to the challenge of incentivizing scientists in the direction of data-sharing, the infrastructure for a Big New Biology is incomplete. Funding agencies, like the National Science Foundation in the US, require projects to have plans for data management - a requirement that presumes data persistence. The infrastructure needed to guarantee persistence will require an investment well beyond the usual 3-5 year funding cycle, into multi-decadal periods, and coordination that has international dimensions.

Figure 11: Technical infrastructure needed for a Big New Biology to fully emerge. The figure arranges elements in layers, from Life Science Challenges (climate change, invasive species, resource use; data-driven discovery; education/outreach), through Tools and Services (data registration and sharing, provenance and sustainability, data portals, APIs, markup languages, metadata, ontologies, triples, logic and reasoning, the semantic web, data models, interoperability, bioinformatics web services), to Technology (network and bandwidth, grid computing, storage and computation, high-performance computing). Based on Sinha et al. 2010.

The infrastructure requires tools to capture data, policies, data standards, data identifiers, registration of discovery-level metadata, and APIs to share data (Fig. 11). There is as yet no index of data-sharing services, nor a framework in which such elements could be integrated. There is little assessment of which elements of data plans will lead to the persistence of data or their reuse. In the absence of these elements, principal investigators are left to make their own policies, use their own systems, and finance the processes. The results will likely remain heterogeneous, thereby limiting interoperability, and/or will be inefficiently implemented, and/or will provide no guarantee of persistence. At this time, research scientists need to be supported by data managers and data archivists. Institutional libraries and museums are well placed to shift their agendas to include the preservation of digital artifacts and so may fill this gap, providing institutional, regional or discipline-based services.

A growing technical barrier to data sharing, the lack of bandwidth to distribute data, is illustrated by problems experienced in high-throughput molecular biology. This problem can be expected to extend to other disciplines. Proposed solutions include Bio-Mirror (URL 120), which consists of several servers holding the same data, and the Tranche Project (URL 121), which shares repository functions across servers; the latter has a high administrative overhead. Another solution is a peer-to-peer sharing system such as BitTorrent (Langille & Eisen 2010), which overcomes potential bandwidth problems by sharing data sets without a central repository. The BitTorrent protocol is being modified to serve scientific data as BioTorrents (Langille & Eisen 2010). Users of BioTorrents can benefit from lower bandwidth use, faster transfer times and data publication. Given that bandwidth problems will persist as symptoms of the interplay between rates of data generation and the limitations of the infrastructure to make data freely available to all, we may expect to see a growth of specialist centers that offer analysis, visualization, and data transformation services on behalf of users.
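To ground the idea of registering discovery-level metadata, the sketch below packages a minimal description of a data set, with a checksum for fixity and a locally minted identifier, as the kind of record that could be submitted to a registry. The field names, file content, and workflow are hypothetical; real services (for example GBIF's IPT or Dryad) define their own schemas and submission routes.

```python
# A minimal sketch of assembling discovery-level metadata for deposit in a registry.
# Field names and values are hypothetical; they echo the discovery facets discussed
# above (taxa, parameters, times, places) plus fixity and identifier information.
import hashlib
import json
import uuid

data = b"date,species,count\n2009-06-01,Calanus finmarchicus,143\n"  # invented data set

record = {
    "identifier": str(uuid.uuid4()),             # locally minted; a DOI could be assigned on deposit
    "title": "Weekly plankton counts, Woods Hole (hypothetical)",
    "taxa": ["Calanus finmarchicus"],
    "parameters": ["abundance"],
    "place": "Woods Hole, MA",
    "timeRange": "2009/2010",
    "sha256": hashlib.sha256(data).hexdigest(),  # fixity check for later verification
}

# This JSON record is what a data-sharing API would receive; the endpoint is notional.
print(json.dumps(record, indent=2))
```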

Conclusion
A large pool of open and well-organized data will create new opportunities for research in biology. The full potential can only be achieved with investment from scientists, governments, institutions, publishers, and software developers. Areas in need of urgent attention include incentives, the elements of knowledge management systems (especially metadata, ontologies, and Linked Open Data), registries, and repositories. Small-science elements can coalesce through better use of georeferencing and of names-based ontological frameworks. Training environments such as universities need to create a new cadre of scientists trained in both computer sciences and biology. A much better understanding of existing data cultures is needed to develop appropriate solutions that will transform and align the traditions of each community with a common goal. None of this will come quickly or cheaply, but these transformations are needed if we are to deparochialize the Life Sciences.

ACKNOWLEDGMENTS
The authors would like to thank Dmitry Mozzherin, David Shorthouse, Nathan Wilson, Jane Maienschein, Peter DeVries, Holly Miller, and members of the Data Conservancy Life Sciences Advisory group for valuable comments.


REFERENCES
Aanensen, D.M., D.M. Huntley, E.J. Feil, F. al-Own and B.G. Spratt. 2009. EpiCollect: Linking smartphones to web applications for epidemiology, ecology and community data collection. PLoSONE 4:e6968. Ackoff, R. 1989. From data to wisdom. Journal of Applied Systems Analysis 16:3-9. Arlinghaus, R. and S.J. Cooke. 2009. Recreational fisheries: socioeconomic importance, conservation issues and management changes. In: Adams, B. [Ed.] Recreational Hunting, Conservation, and Rural Livelihoods: Science and Practice. Blackwell, Oxford. Ausubel, J.H. 2009. A botanical macroscope. Proc. Natl. Acad. Sci. 106:12569. Bard, J.B.L. and S.Y. Rhee. 2004. Ontologies in biology: design, applications and future challenges. Nature Reviews Genetics 5:213-222. Beaman, R., J.A. Macklin, M.J. Donoghue and J. ,Hanken. 2007. Overcoming the digitization bottleneck in natural history collections: A summary report on a workshop held 7-9 September 2006 at Harvard University. http://www.etaxonomy.org/wiki/images/b/b3/Harvard_data_capture_wkshp_rpt_2006.pdf . Bechhofer, S., J. Ainsworth, J. Bhagat, I. Buchan, P. Couch, D. Cruickshank, D. De Roure, M. Delderfield, I. Dunlop, M. Gamble, C. Goble, D. Michaelides, P. Missier, S. Owen, D. Newman and S. Sufi. 2010. Why linked data is not enough for scientists. In: 6th IEEE e-Science conference. http://eprints.ecs.soton.ac.uk/21587/5/research-objects-final.pdf Berlin Declaration on Access to Knowledge in the Sciences http://www.zim.mpg.de/openaccess-berlin/berlin_declaration.pdf. and Humanities. 2003.

Berman, H., K. Henrick, H. Nakamura and J.L. Markley. 2006. The worldwide protein data bank (wwPDB): ensuring a single uniform archive of PDB data. Nucleic Acids Research 35:D301D303. Berners-Lee, T.Y. Chen, L. Chilton, D. Connolly, R. Dhanaraj, J. Hollenbach, A. Lerer and D. Sheets. 2006. Tabulator: exploring and analyzing linked data on the semantic web. Proceedings of the 3rd International Semantic Web User Interaction Workshop (SWUI0), Athens, Georgia. Bibby, C.J., N.D. Burgess, D.A. Hill and S.H. Mustoe. 2000. Bird Census Techniques 2nd Ed. Elsevier, London, 303pp. Bilimoria, K.Y., A.K. Stewart, D.P. Winchester and C.Y. Ko. 2003. The National Cancer Database: A powerful initiative to improve cancer care in the United States. Annals Surgical Oncology 15:683-690. Bilofsky, H.S. and B. Christian. 1988. The GenBank genetic sequence data bank. Nucleic Acids Research 16:1861-1863. Booth, D. 2010. Resource identity and semantic extensions: making sense of ambiguity. Semantic Technology Conference, San Francisco, USA http://dbooth.org/2010/ambiguity/ Bunin, V.D., O.V. Ignatov, O.I. Gulii, A.G. Voloshin, L.A. Dykman, D. ONeil and D. Ivnitskii. 2005. Investigation of electrophysical properties of Listeria monocytogenes cells during the interaction with monoclonal antibodies. Biofizika 50:316-321. Burton A. and A. Treloar. 2009. Designing for discovery and re-use: the ANDS data-sharing verbs approach to service decomposition. International Journal of Digital Curation 4:44-56.


Campbell, L.M. and S. MacNeill. 2010. The semantic web, linked and open data: a briefing paper. JISC cetis. http://wiki.cetis.ac.uk/images/1/1a/The_Semantic_Web.pdf. Chapman, A.D. 2005a. Uses of primary species-occurrence data, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. Chapman, A.D. 2005b. Principles of data quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. Chapman, A.D. 2009. Numbers of Living Species in Australia and the World, 2nd edition. Australian Biological Resources Study, Australia. Cisneros-Montemayor, A.M. and U.R. Sumaila. 2010. A global estimate of benefits from ecosystem based marine recreation: Potential impacts and implications for management. Journal of Bioeconomics 12:245-268. doi: 10.1007/s10818-010-9092-7. Coale, K. H., K.S. Johnson, F.P. Chavez, K.O. Buesseler, R.T. Barber, M.A. Brzezinski, W.P. Cochlan, F.J. Millero, P.G. Falkowski, J.E. Bauer, R.H. Wanninkhof, R.M. Kudela, M.A. Altabet, B.E. Hales, T. Takahashi, M.R. Landry, R.R. Bidigare, X. Wang, Z. Chase, P.G. Strutton, G.E. Friederich, M.Y. Gorbunov, V.P. Lance, A.K. Hilting, M.R. Hiscock, M. Demarest, W.T. Hiscock, K.F. Sullivan, S.J. Tanner, R.M. Gordon, C.N. Hunter, V.A. Elrod, S.E. Fitzwater, J.L. Jones, S. Tozzi, M. Koblizek, A.E. Roberts, J. Herndon, J. Brewster, N. Ladizinsky, G. Smith, D. Cooper, D. Timothy, S.L. Brown, K.E. Selph, C.C. Sheridan, B.S. Twining and Z.I. Johnson. 2004. Southern Ocean iron enrichment experiment: carbon cycling in high- and low-Si waters. Science 304:408-414. Costanza, R., R. D'arge, R. de Groot, S. Farber, M. Grasso, B. Hannon, K. Limburg, S. Naeem, R.V. O'Neill, J. Paruelo, R.G. Raskin, P. Sutton and M. van den Belt. 1997. The value of the world's ecosystem services and natural capital. Nature 387:253-260. Costello, M. 2009. Motivating online publication of data. BioScience 59:418-426. Davis, P.M. 2009. Author-choice open access publishing in the biological and medical literature: a citation analysis. Journal American Society Information Science and Technology 60:3-8. De Rosnay, J. 1975. Le macroscope: vers une vision globale. Seuil, Paris. Dittert, N., M. Diepenbroek and H. Grobe. 2001. Scientific data must be made available to all. Nature 14: 393 Doctorow, C. 2008. Big data: welcome to the petacentre. Nature 455: 16-21. Doom, T., M. Raymer, D. Krane, and O. Garcia. 2002. A proposed undergraduate bioinformatics curriculum for computer scientists. ACM SIGCSE Technical Symposium on Computer Science Education (33rd, Covington, KY) pp. 78-81. ESF (European Science Foundation) 2006 Press Release: A cyberinfrastructure network for Europe http://www.esf.org/media-centre/press-releases/ext-singlenews.html?tx_ttnews[tt_news]=129&cHash=98c6548070c4afa002061d23560e8f96. Evans, J.A. and J.G. Foster. 2011. Metaknowledge. Science 331: 721-725. Fergraus, E.H., S. Andelman, M.B. Jones and M. Schildhauer. 2005. Maximizing the value of ecological data with structured metadata: an introduction to Ecological Metadata Language (EML) and principles for metadata creation. Bulletin of the Ecological Society of America 86:158-168.


Field, D. S.A. Sansone, A. Collis, T. Booth, P. Dukes, S.K. Gregurick, K. Kennedy, P. Kolar, E. Kolker, M. Maxon, S. Millard, A.M. Mugabushaka, N. Perrin, J.E. Remacle, K. Remington, P. RoccaSerra, C.F. Taylor, M. Thorley, B. Tiwari and J. Wilbanks. 2009. Omics data-sharing. Science 326:234-236. Fleming, A.F., J. Storey, L. Molineaux, E. Iroko and E.D. Attai. 1979. Abnormal haemoglobins in the Sudan savanna of Nigeria. I. Prevalence of haemoglobins and relationships between sickle cell trait, malaria and survival. Ann. Trop. Med. Parasitol. 73:161-172. Fox, P. and J. Hendler. 2011. Changing the equation on scientific data visualization. Science 331:705-708 Froese, R., D. Lloris and S. Opitz. 2003. Scientific data in the public domain. ACP-EU Fisheries Research Report 14:267-271. Gargouri, Y., C. Hajjen, V. Larivire, Y. Gingras, L. Carr, T. Brody and S. Harnad. 2010. Self-selected or mandated, open access increases citation impact for higher quality research. PLoS ONE 5:e13636. Gleditsch, N.P. and H. Strand. 2003. Posting your data: will you be scooped or will you be famous? International Study Perspectives 4:89-97. Groth, P., A. Gibson and J. Velterop. 2010. The anatomy of a nanopublication. Information Services & Use 30:51-56. Guttmacher, A.E., E.G. Nabel and F.S. Collins. 2009. Why data-sharing policies matter. PNAS 106:16894. Harnad, S. 2008. Waking OAs Slumbering Giant: The University's mandate to mandate open access. New Review of Information Networking 14:51-68. Heidorn, P.B. 2008. Shedding light on the dark data in the long tail of science. Library Trends 57:280299. Heidorn, P.B. and Q. Wei. 2008. Automatic metadata extraction from museum specimen labels Proc. Intl Conf. on Dublin Core and Metadata Applications pp. 57-68. Hey, T., S. Tansley and K. Tolle. 2009. The Fourth Paradigm. Microsoft Research. Redmond, WA, USA, 252 pp. Higgins, D., C. Berkley and M.B. Jones. 2002. Managing heterogeneous ecological data using Morpho. 14th Int. Conference on scientific and satistical Database Management (SSDBM02), p.69. Hooker, S.B. and C.R. McClain. 2000. The calibration and validation of SeaWiFS data. Progress in Oceanography 45:427-465. Hopkins, G.W. and R.P. Freckleton. 2002. Declines in the numbers of amateur and professional taxonomists: implications for conservation. Animal Conservation 5:245-249. Huynh, D.F. and D.R. Karger. 2009. Parallax and companion: set-based browsing for the data web. In: Proceedings of WWW 09. Innocenti, P., A. McHugh, S. Ross, R. Ruusalepp. 2007. Digital Curation Centre (DCC) and DigitalPreservationEurope (DPE) audit toolkit: DRAMBORA. International Conference on Digital Preservation (iPRES), Beijing. ISO. 2000. ISO 9000:2000: quality management systems fundamentals and vocabulary. Standard, International Organization for Standardization (ISO), Geneva, Switzerland. Jones, M.B., C. Berkley, J. Bojilova and M. Schilhauer. 2002. Managing scientific metadata. Internet Computing IEEE 5:59-68.


Jones, M.B., M.P. Schildhauer, O.J. Reichman and S. Bowers. 2006. The new bioinformatics: integrating ecological data from the gene to the biosphere. Annu. Rev. Ecol. Evol. Syst. 37:519-544. Kahn, S.D. 2011. On the future of genomic data. Science, 331: 728-729. Kaye, J., C. Heeney, N. Hawkins, J. de Vries and P. Boddington. 2009. Data sharing in genomics reshaping scientific practice. Nature Reviews Genetics 10:331-335. Kelling, S., W.M. Hochachka, D. Fink, M. Riedewald, R. Caruana, G. Ballard and G. Hooker. 2009. Data-intensive science: a new paradigm for biodiversity studies. BioScience 59:613-619. Kerlinger, P. 1993. Birding Economics and Birder Demographics Studies Conservation Tools. In: Finch, D. and P. Stangel [Eds.] Proc. Status and Managem. of Neotrop. Migr. Birds. Rocky Mountains Forest and Range Experimental Station, Fort Collins, CO. USDA Forestry Service General Technical Report RM-229. pp. 3238. Key Perspectives Ltd. 2010. Data Dimensions: disciplinary differences in research data-sharing, reuse and long term viability. DCC Scarp Synthesis Report. ISSN 1759-586X. Kidd, D.M. and X. Liu. 2008. GEOPHYLOBUILDER 1.0: an ARCGIS extension for creating geophylogenies. Molecular Ecology Resources 8:88-91. Klump, J. 2011. Criteria for the trustworthiness of data-centres. D-Lib Magazine vol. 17. Kobilarov, G. and I. Dickinson. 2008. Humboldt: exploring linked data. In: Proceedings of the WWW 08 Workshop on Linked Data on the Web. Kohnke, D., M.J. Costello, J. Crease, J. Folack, R. Martinez Guingla and Y. Michida. 2005. Review of the International Oceanographic Data and Information Exchange (IODE). Intergovernmental Oceanographic Commission (IOC) IOC/IODE-XVIII/18. Lal, R. 2006. Encyclopedia of Soil Science Volume 2, 2nd Edition. Taylor & Francis Group. Lambrix, P. and H. Tan. 2008. Ontology alignment and merging. In: Burger, A., D. Davidson and R. Baldock [Eds.] Anatomy Ontologies for Bioinformatics: Principles and Practice. Springer. pp 134-149. Langille, M.G.I. and J.A. Eisen. 2010. BioTorrents: A file sharing service for scientific data. PLoS ONE 5(4):e10071. Lee, C.P., P. Dourish and G. Mark. 2006. The human infrastructure of cyberinfrastructure. Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work. doi:10.1145/1180875.1180950. Lynch, C.A. 2003. Institutional repositories: essential infrastructure for scholarship in the digital age. Libraries and the Academy 3:327-336. MacLeod, N., M. Benfield and P. Culverhouse. 2010. Time to automate identification. Nature 467:154155. Madin, J., S. Bowers, M. Schildhauer, S. Krivov, D. Pennington and F. Villa. 2007a. An ontology for describing and synthesizing obervation data. Ecological Informatics 2:279-296. Madin, J.S., S. Bowers, S.M. Schildhauer and M.B. Jones. 2007b. Advancing ecological research with ontologies. TREE 23:159-168. Mandavilli, A. 2011. Trial by twitter. Nature 469: 286-287. Marris, E. 2010. Supercomputing for the http://www.nature.com/news/2010/100811/full/466807a.html birds. Nature 466:807.


Maurer, L. and K. Tochterman. 2010. Information and communication technologies for biodiversity conservation and agriculture. Shaker Verlag, Aachen. Mayr, E. 2004. What makes biology unique? Cambridge University Press. McCown, F., X. Liu, M.L. Nelson, M. Zubair. 2006. Search engine coverage of the OAI-PMH Corpus. IEEE Internet Computing, 10: 66-73 doi:10.1109/MIC.2006.41 Michener, W.K. 2006. Meta-information concepts for ecological data management. Ecological Informatics 1:3-7. Mons, B. and J. Velterop. 2009. Nano-publication in the e-science era. In: Workshop on Semantic Web Applications in Scientific Discourse (SWASD 2009), Washington DC, USA. Morris, R. 2010. GBIFKOS Draft White Paper v 2010_11-25-0400. http://community.gbif.org/pg/file/BMorris/read/10694/gbifkos-draft-white-paper-v2010_11250400.

National Academy of Sciences. 2009. A New Biology for the 21st Century. ISBN-13: 978-0-309-14488-9. 112 pp. Norris, M., C. Oppenheim and F. Rowland. 2008. The citation advantage of open access articles. Journal of the American Society of Information Science and Technology 59: 1963-1972. NSF (National Science Foundation). 2003. Revolutionizing science and engineering through cyberinfrastructure: report of the national science foundation blue-ribbon advisory panel on cyberinfrastructure http://www.nsf.gov/od/oci/reports/atkins.pdf 84 pp. NSF (National Science Foundation). 2006. NSFs Cyberinfrastructure Vision for 21st Century Discovery ver 5.0. NSF Cyberinfrastructure Council http://www.nsf.gov/od/oci/ci_v5.pdf 32pp. OECD. 1999. Final Report of the megascience forum working group on biological informatics. OECD, Paris. Patterson D.J., S. Faulwetter and A. Shipunov. 2008. Principles for a names-based cyberinfrastructure to serve all of biology, In: Minelli, A., L. Bonato and G. Fusco [Eds.] Updating the Linnaean Heritage: Names as Tools for Thinking About Plants and Animals. pp 153-163. Patterson, D.J., J. Cooper, P.M. Kirk, R.L. Pyle and D.P. Remsen. 2010. Names are key to the big new biology. TREE 25:686-691. Piwowar, H.A., R.S. Day and D.B. Fridsma. 2007. Sharing detailed research data is associated with increased citation rate. PLoS ONE 3:e308. Porter, J.H. and J.T. Callahan. 1994. Circumventing a dilemma: historical approaches to data-sharing in ecological research. In: Michener, W.K., J.W. Brunt and S.G. Stafford [Eds.] Environmental Information Management and Analysis: Ecosystem to Global Scales. Taylor & Francis Ltd, London. pp. 193-202. Raup, D. 1992. Extinction: Bad Genes or Bad Luck? Norton and Co., New York. Reichman, O.J., M.B. Jones, and M.P. Schildauer. 2011. Challenges and opportunities to open data in ecology. Science 331:703-705. Rhee, S.Y., W. Beavis, T.Z. Berardini, G. Chen, D. Dixon, A. Doyle, M. Carcia-Hernandez, E. Huala, G. Lander, M. Montoya, N. Miller, L.A. Mueller, S. Mundodi, L. Reiser, J. Tacklind, D.C. Weems, Y. Wu, I. Xu, D. Yoo, J. Yoon and P. Zhang. 2003. The Arabidopsis information resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Research 31:224-228. Data Conservancy: Data Issues in the Life Sciences (March 2011) 50

Richards, K., R. White, N. Nicolson and R. Pyle. 2011. A beginners' guide to persistent identifiers. GBIF, Copenhagen. RIN (Research Information Network). 2008. To share or not to share: publication and quality assurance of research data outputs. A report commissioned by the Research Information Network. http://www.rin.ac.uk/data-publication. Rogers, E.M. 1983. Diffusion of innovations. 3rd Edition. New York: Free Press. Rhm, U. and J.A. Blakeley. 2009. Data Management for High-Throughput Genomics. CIDR 2009, Fourth Biennial Conference on Innovative Data Systems Research. Savage, C.J. and A.J. Vickers. 2009. Empirical study of data-sharing by authors publishing in PLoS journals. PLoS ONE 4: e7078. Schofield, P.N., J. Eppig, E. Huala, M. Hrabe de Angelis, M. Harvey, D. Davidson, T. Weaver, S. Brown, D. Smedley, N. Rosenthal, K. Schughart, V. Aidinis, G. Tocchini-Valentini and J.M. Hancock. 2010. Sustaining the data and bioresource commons. Science 330:592-593. Scholes, R.J., G. M. Mace, W. Turner, G.N. Geller, N. Jrgens, A. Larigauderie, D. Muchoney, B.A. Walther, and H.A. Mooney. 2008. Towards a global biodiversity observing system. Science 321: 1044-1045. Science staff editorial. 2011. Challenges and opportunities. Science 331, 692-693. Silvertown, J. 2009. A new dawn for citizen science. Trends in Ecology and Evolution, 24: 467-471. Sinha, A.K., Z. Malik, A. Rezgui, C.G. Barnes, K. Lin, G. Heiken, W.A. Thomas, L.C. Gundersen, R. Raskin, I. Jackson, P. Fox, D. McGuinness, D. Seber, H. Zimmerman. 2010. Geoinformatics: transforming data to knowledge for geosciences. GSA Today 20:4-10. Sirovich, L., M.Y. Stoeckle and Y. Zhang. 2010. Structural Analysis of Biodiversity. PLoS ONE 5:e9266. doi:10.1371/journal.pone.0009266. Smith, V.S. 2009. Data publication: towards a database of everything. BMC Research Notes 2:113. Sogin, M.L., H.G. Morrison, J.A. Huber, D. Mark Welch, S.M. Huse, P.R. Neal, J.M. Arrieta and G.J. Herndl. 2006. Microbial diversity in the deep sea and the underexplored rare biosphere. PNAS 103: 12115-12120. SOS report 2010. International Institute for species exploration: www.species.asu.edu/2010 Stein, L.D. 2008. Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges. Nature Reviews Genetics 9:678-688. Strasser, B. J. 2008. GenBank - Natural history in the 21st century. Science, 322: 537-538. Taylor, C.F., D. Field, S.A. Sansone, J. Aerts, R. Apweiler, M. Ashburner, C.A. Ball, P.A. Binz, M. Bogue, T. Booth, A. Brazma, R.R. Brinkman, A.M. Clark, E.W. Deutsch, O. Fiehn, J. Fostel, P. Ghazal, F. Gibson, T. Gray, F. Frimes, J.M. Hancock, N.W. Hardy, H. Hermjakob, R.K. Julian Jr., M. Kane, C. Kettner, C. Kinsinger, E. Kolker, M. Kuiper, N. Le Novre, J. Leebens-Mack, S.E. Lewis, P. Lord, A.M. Mallon, N. Marthandan, H. Masuya, R. McNally, A. Mehrle, N. Morrison, S. Orchard, J. Quackenbush, J.M. Reecy, D.G. Robertson, P. Rocca-Serra, H. Rodriguez, H. Rosenfelder, J. Santoyo-Lopez, R.H. Scheuermann, D. Schober, B. Smith, J. Snape, C.J. Stoeckert Jr., K. Tipton, P. Sterk, A. Untergasser, J. Vandesompele and S. Wiemann. 2008. Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature Biotechnology 26:889-896.


TEEB. 2010. The economics of ecosystems and biodiversity: Mainstreaming the economics of nature: A synthesis of the approach, conclusions and recommendations of TEEB. United Nations Environment Program. Thorisson, G.A. 2009. Accreditation and attribution in data-sharing. Nature Biotechnology 27:984-985. Tittensor, D.P., C. Mora, W. Jetz, H.K. Lotze, D. Ricard, E. van den Berghe and B. Worm. 2010. Global patterns and predictors of marine biodiversity across taxa Nature 466:10981101. Van de Sompel, H., R. Sanderson, M.L. Nelson, L.L. Balakireva, H. Shankar and S. Ainsworth. 2010. An HTTP-based versioning mechanism for linked data. In: Bizer, C., T. Heath, T. Berners-Lee and M. Hausenblas [Eds.] Proceedings of the Linked Data on the Web Workshoop (LDOW2010) Raleigh, North Carolina, USA. Vision, T.J. 2010. Open data and the social contract of scientific publishing. BioScience 60:330-330. Vollmar, A., J. Macklin and L.S. Ford. 2010. Natural history specimen digitization: challenges and concerns. Biodiversity Informatics 7:93-112. Webb, T.J., E. Vanden Berghe and R. ODor. 2010. Biodiversitys big wet secret: The global distribution of marine biological records reveals chronic under-exploration of the deep pelagic ocean. PLoS ONE 5:e10223. White, H.C., S. Carrier, A. Thompson, J. Greenberg and R. Scherle. 2008. The dryad data repository: a singapore framework metadata architecture in a DSpace environment. Proc. Intl. conf. on Dublin Core and Metadata Applications 157-162. Whitlock, M.C. 2011. Data archiving in ecology and evolution: best practices. TREE 26: 61-65. Whitlock, M.C., M.A. McPeek, M.D. Rausher, L. Rieseberg and A.J. Moore. 2010. Data archiving. The American Naturalist 175:145-146. Wren, J. and A. Bateman. 2008. Databases, data tombs and dust in the wind. Bioinformatics 24:21272128.

URL REFERENCES
URL 1: www.gloriad.org
URL 2: www.dataone.org/
URL 3: http://www.nlm.nih.gov/research/umls/
URL 4: http://www.ihtsdo.org/snomed-ct/
URL 5: http://www.eu-egee.org/fileadmin/documents/UseCases/MedicalDataManagement.html
URL 6: http://www.informatics-review.com/journals/index.html
URL 7: http://freemedsoftware.org/
URL 8: http://www.alzforum.org/
URL 9: http://www.facs.org/cancer/ncdb/index.html
URL 10: http://www.stemcellresources.org/library_images.html
URL 11: www://ebird.org
URL 12: www.gbif.org
URL 13: http://video.google.com/videoplay?docid=4000153761832846346#
URL 14: http://scratchpads.eu/
URL 15: www.lifedesks.org
URL 16: http://biofinity.unl.edu
URL 17: http://wildobs.com/about/iphone
URL 18: http://bluebill.tidalwave.it/mobile/

URL 19: http://www.cybertracker.org/ URL 20: http://www.nsf.gov/pubs/2010/nsf10603/nsf10603.htm URL 21: http://www.biodiversity.be/files/5/8/9/589.pdf URL 22: http://wwold.gbif.org/prog/digit/data_quality/DataCleaning URL 23: http://www.openarchives.org/ URL 24: http://www.slideshare.net/oaod2010/oa-oa-self-archiving-oa-publishing-and-data-archiving URL 25: http://www.bls.gov/oco/ocos047.htm URL 26: http://www.globalcomputing.com/american-universities.htm URL 27: http://www.nsf.gov/statistics/seind10/c5/c5h.htm URL 28: http://www.nsf.gov/statistics/nsb1003/ URL 29: http://www.nsf.gov/nsb/publications/pub_summ.jsp?ods_key=nsb1003 URL 30: http://www.ncbi.nlm.nih.gov/genbank/ URL 31: http://ecoliwiki.net/colipedia/index.php/Welcome_to_EcoliWiki URL 32: http://dnasubway.iplantcollaborative.org/ URL 33: http://www.artportalen.se/default.asp URL 34: https://www.nescent.org/sites/evoviz/GeoPhyloBuilder URL 35: http://thomsonreuters.com/content/science/pdf/BIOSIS_Factsheet.pdf URL 36: http://www.pensoft.net/journals.php URL 37: http://www.plosbiology.org/static/help.action#xmlContent URL 38: http://www.surffoundation.nl/en/actueel/Pages/Researchersenhancetheirpublications.aspx URL 39: http://datadryad.org/jdap URL 40: http://datadryad.org/ URL 41: http://www.wwpdb.org URL 42: http://www.arabidopsis.org/ URL 43: http://www2.gbif.org/SignedSUWONdeclaration_small.pdf URL 44: http://www.insdc.org/policy.html URL 45: http://www.data.gov/ URL 46: http://data.australia.gov.au/ URL 47: data.gov.uk URL 48: http://www.conabio.gob.mx/ URL 49: http://www.environment.gov.au/biodiversity/abrs/ URL 50: http://www.environment.gov.au/erin/ URL 51: http://www.ala.org.au/ URL 52: http://www.itis.gov/ URL 53: http://www.eea.europa.eu/data-and-maps URL 54: http://www.webometrics.info/methodology.html URL 55: http://www.whitehouse.gov/omb/circulars_a110 URL 56: ora.ra.cwru.edu/University_Policy_On_Custody_Of_Research_Data.pdf URL 57: ftp://ftp.fao.org/FI/brochure/climate_change/policy_brief.pdf URL 58: http://www.getbirdseye.com/ URL 59: http://wildobs.com/about/observer URL 60: http://www.extinctanimal.com/the_coelacanth.htm URL 61: http://www.wolganvalley.com/pdf/wolgan-valley/en/media-centre/factsheets/Wolgan%20Valley%20Wollemi%20Pine%20Fact%20Sheet.pdf?1=6 URL 62: http://www.cryptomundo.com/cryptozoo-news/ibw-rainsong/ URL 63: http://www.kidcyber.com.au/topics/Lordhowestick.htm URL 64: http://animaldiversity.ummz.umich.edu/site/accounts/information/Burramys_parvus.html URL 65: repository66.org URL 66: www-wram.slu.se URL 67: http://www.crl.edu/archiving-preservation/digital-archives/metrics-assessing-andcertifying/core-re Data Conservancy: Data Issues in the Life Sciences (March 2011) 53

URL 68: http://otter.oerc.ox.ac.uk/biosharing/ URL 69: http://dataconservancy.org/sites/default/files/NSFDataManagementPlanTemplatev1.1.pdf URL 70: http://digir.sourceforge.net/ URL 71: http://ipt.gbif.org/ URL 72: http://knb.ecoinformatics.org/morphoportal.jsp URL 73: http://knb.ecoinformatics.org/software/metacat/ URL 74: http://www.ncddc.noaa.gov/metadataresource/metadata-tools URL 75: http://marinemetadata.org/references/oboeontology URL 76: http://www.nceas.ucsb.edu/ecoinfo URL 77: https://sonet.ecoinformatics.org/ URL 78: http://protege.stanford.edu/ URL 79: http://oboedit.org/ URL 80: http://swoogle.umbc.edu/ URL 81: http://sindice.com/ URL 82: http://www.w3.org/TR/owl-ref/ URL 83: http://www.nhm.ac.uk/hosted_sites/tdwg/first_minutes.pdf URL 84: http://www.loa-cnr.it/index.html URL 85: http://www.tonesproject.org/ URL 86: http://www.geneontology.org/ URL 87: http://www.eol.org/ URL 88: http://spire.umbc.edu/ont/ethan.php URL 89: http://bioportal.bioontology.org/visualize/3294 URL 90: http://woce.nodc.noaa.gov/wdiu/ URL 91: www.nodc.noaa.gov/ URL 92: http://www.ices.dk/ocean/ URL 93: http://www.iobis.org/ URL 94: www.cas.org/ URL 95: http://thomsonreuters.com/products_services/science/science_products/a-z/zoological_record/ URL 96: http://iphylo.blogspot.com/2011/02/web-hooks-and-openurl-making-databases.html URL 97: http://www.getpivot.com/ URL 98: http://www.lifemapper.org/ URL 99: http://www.visual-literacy.org/periodic_table/periodic_table.html URL 100: http://www.gbif.org/informatics/standards-and-tools/integrating-data/resource-discovery/ URL 101: http://en.bio-soft.net/geshi.html URL 102: http://www.equisetites.de/palbot/software/software.html URL 103: www.biocatalogue.org URL 104: http://www.opendoar.org/ URL 105: http://www.biodbs.info/ URL 106: http://www.slideshare.net/pjdwi/biodiversity-informatics-on-the-semantic-web URL 107: http://lod2.eu/Welcome.html URL 108: https://kepler-project.org/ URL 109: http://www.taverna.org.uk/ URL 110: http://www.vistrails.org/index.php/File:Eva.png URL 111: http://www.famkruithof.net/uuid/uuidgen URL 112: http://purl.oclc.org/docs/index.html URL 113: http://www.crossref.org/ URL 114: http://www.awi.de/index.php?id=3086&L=0 URL 115: http://www.wdc-mare.org/ URL 116: http://www.crossref.org/CrossTech/2010/03/dois_and_linked_data_some_conc.html URL 117: http://wiki.tdwg.org/GUID/


URL 118: http://www.gbif.org/communications/news-and-events/showsingle/article/call-for-review-ofgbif-lsid-guid-task-group-report/ URL 119: http://zipcodezoo.com/ URL 120: http://www.bio-mirror.net/ URL 121: https://trancheproject.org/ URL 122: https://kepler-project.org/ URL 123: http://www.plosbiology.org/

