Empowering Industrial Research With Shared Biomedical Vocabularies

Drug Discovery Today Volume 00, Number 00 September 2011
REVIEWS

Lee Harland1,10, Christopher Larminie2, Susanna-Assunta Sansone3, Sorana Popa4, M. Scott Marshall5, Michael Braxenthaler6, Michael Cantor7, Wendy Filsell8, Mark J. Forster9, Enoch Huang10, Andreas Matern11, Mark Musen12, Jasmin Saric13, Ted Slater14, Jabe Wilson15, Nick Lynch16, John Wise17 and Ian Dix18
1 Connected Discovery Ltd., 27 Old Gloucester Street, London WC1N 3AX, UK
2 GlaxoSmithKline, Computational Biology, 2F157 Gunnels Wood Road, Stevenage, Hertfordshire SG1 2NY, UK
3 Standards and Data Sharing Infrastructure Team, e-Research Centre, University of Oxford, 7 Keble Rd, Oxford OX1 3QG, UK
4 Knowledge Management and Information Science, R&D Information, AstraZeneca R&D Mölndal, 431 83 Mölndal, Sweden
5 Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, The Netherlands
6 Pharma Research and Early Development, Hoffmann-LaRoche Inc., 340 Kingsland St, Nutley, NJ 07110, USA
7 Pfizer Worldwide Research and Development, 235 E 42nd ST, MS 150/5/60N, New York, NY 10017, USA
8 Unilever R&D, Colworth Science Park, Sharnbrook, Bedfordshire MK44 1LQ, UK
9 Syngenta R&D Information Systems, International Research Centre, Jealott's Hill, Berkshire RG42 6EX, UK
10 Pfizer Worldwide Research and Development, 35 Cambridge Park Drive, Cambridge, MA 02140, USA
11 Thomson Reuters Life Sciences, 22 Thomson Place, Boston, MA 02210, USA
12 Stanford University, 251 Campus Drive, Stanford, CA 94305-5479, USA
13 Scientific Information Centre, Boehringer Ingelheim Pharma GmbH & Co. KG, 88397 Biberach, Germany
14 Merck Sharp & Dohme Corp., 33 Avenue Louis Pasteur, Boston, MA 02115-5727, USA
15 Science & Technology, Corporate Markets, Elsevier Pharma and Biotech Group, Elsevier, 32 Jamestown Road, London NW1 7BY, UK
16 AstraZeneca UK, Alderley Park, Macclesfield SK10 4TG, UK
17 The Pistoia Alliance (see footnote 1)
18 Knowledge Management & Information Science, R&D Information, AstraZeneca, 26F17 Mereside, Alderley Park, Macclesfield SK10 4TG, UK
The life science industries (including pharmaceuticals, agrochemicals and consumer goods) are exploring new business models for research and development that focus on external partnerships. In parallel, there is a desire to make better use of data obtained from sources such as human clinical samples to inform and support early research programmes. Success in both areas depends upon the successful integration of heterogeneous data from multiple providers and scientific domains, something that is already a major challenge within the industry. This issue is exacerbated by the absence of agreed standards that unambiguously identify the entities, processes and observations within experimental results. In this article we highlight the risks to future productivity that are associated with incomplete biological and chemical vocabularies and suggest a new model to address this long-standing issue.
Corresponding author: Harland, L. (lee@connecteddiscovery.com)
Footnote 1: http://pistoiaalliance.org
1359-6446/06/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.drudis.2011.09.013
www.drugdiscoverytoday.com
Please cite this article in press as: L. Harland, et al., Empowering industrial research with shared biomedical vocabularies, Drug Discov Today (2011), doi:10.1016/j.drudis.2011.09.013
Reviews INFORMATICS
DRUDIS-901; No of Pages 8
Introduction
Commercial life science organizations are evolving; they are exploring new mechanisms to adjust to well-documented economic and productivity challenges. At the same time, thanks to the rapid technological advances within biology they are facing an explosion in the volume and complexity of available data. Efficient management, processing and application of internal and external data are vital to research and development productivity [1,2]. Yet, an integrated view across experiments, literature
and databases presupposes that the entities of interest, such as molecules, compounds, cells, observations and even people can be identified and recorded unambiguously. In information systems, identity can be asserted through the use of reference vocabularies in the form of lists, taxonomies and ontologies (Box 1). While these various structures support different use cases, they all provide a mechanism to define the things within the data unequivocally [3]. Unfortunately, coverage of biomedical and chemical concepts is patchy at best, with many scientific domains devoid of representation. Even where specific vocabularies exist, they have often been developed for specific purposes, and are unable to support different applications. For example, a thesaurus is useful for text-mining, but may be poor for classification tasks. This compounds the problem, requiring industry informaticians to source or develop multiple vocabulary variants within one scientific domain. Human disease is a good example of this, being represented by the International Classification of Diseases (http://www.who.int/whosis/icd10), Medical Subject Headings (http://www.nlm.nih.gov/mesh), National Cancer Institute Thesaurus (http://ncit.nci.nih.gov/ncitbrowser), Systematized Nomenclature of Medicine-Clinical Terms (http://www.ihtsdo.org), Human Disease Ontology (http://do-wiki.nubic.northwestern.edu) and many other proprietary resources. A further consequence is a combinatorial explosion of cross-referencing required to align the same entity across each source. Alternatively, technical or legal restrictions could prevent the cross-referencing of proprietary vocabularies, resulting in incomplete integration. In this article, we consider the impact that this chaotic terminological landscape has on our ability to provide effective information support to industrial research.
We present arguments for increased industry participation in developing and sustaining these foundational resources and propose one potential path forward.
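The cross-referencing burden described above can be made concrete with simple counting. The following is a minimal sketch, not a model taken from the article: it assumes that every pair of vocabularies needs its own mapping, and compares that with aligning each vocabulary once to a single shared reference. The shortened vocabulary labels, the function names and the "shared reference" hub are illustrative assumptions.

```python
from itertools import combinations

# Short labels for the five public disease vocabularies named in the text.
DISEASE_VOCABULARIES = [
    "ICD-10",
    "MeSH",
    "NCI Thesaurus",
    "SNOMED CT",
    "Human Disease Ontology",
]


def pairwise_mappings(vocabularies):
    """Every unordered pair of vocabularies, assuming each pair needs a mapping."""
    return list(combinations(vocabularies, 2))


def hub_mappings(vocabularies, hub="shared reference"):
    """Mappings needed if each vocabulary is aligned once to a hypothetical hub."""
    return [(name, hub) for name in vocabularies]


if __name__ == "__main__":
    n = len(DISEASE_VOCABULARIES)
    # Pairwise alignment grows as n * (n - 1) / 2; hub alignment grows as n.
    print(n, "vocabularies")
    print(len(pairwise_mappings(DISEASE_VOCABULARIES)), "pairwise mappings")
    print(len(hub_mappings(DISEASE_VOCABULARIES)), "mappings to a shared reference")
```

Under these assumptions the five vocabularies need ten pairwise mappings but only five mappings to a shared reference, and the gap widens with every proprietary resource added.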
BOX 1
Defining vocabulary

Agreeing on a common definition of the words term, lexicon, dictionary, vocabulary, taxonomy and ontology can in itself be difficult. Here, we use the word vocabulary to refer to a terminological resource that provides the identification and definition of entities (also known as concepts) within a scientific domain. This includes taxonomies, ontologies and terminologies as defined by Smith et al. [35] and glossaries, dictionaries, lexicons, thesauri, taxonomies and ontologies as defined by Vanopstal et al. [36]. Each of these vocabulary forms fits different use cases and applications, but all provide an ability to unambiguously identify biomedical entities. The various important elements are described below.

Basic identity (based on requirements defined by Laibe and Le Novère [37])
Domain: Which area(s) of biomedical science are covered.
Identifier: A unique, permanent, standards-compliant and noneditable string of characters that can be used to represent the entity. Ideally the identifier should be free from any semantics itself.
Preferred term: A human readable name for an individual entity.
Definition: A human readable, succinct definition of the real-world entity that this entry in the vocabulary represents.

Organizational structure
Dictionaries: Lists of entities with basic identity criteria.
Thesauri: Entities with lists of natural language synonyms particularly suited to search and text-mining applications.
Taxonomies: Provide a parent/child hierarchy that might form an important part of the identity and definition of an entity (e.g. Asthma is a child of Respiratory Tract Disease).
Ontology: Contains entities and their relationships to each other. Most expressive knowledge representation, relating entities to each other through multiple associations (for example 'is a', 'part of', 'found in' and 'interacts with').

Linguistic features
Synonyms: Per entity, a set of other words (and ideally, language origin) that are used to describe the same entity within natural language.
Homonyms, antonyms, part of speech fragments: Assist in dealing with ambiguity, particularly in term recognition in text-mining and search applications.

Other important elements
Provenance: The origin of the concept (source, author, date added/modified, among others).
Cross-references: Assertions that describe the relationship of concepts between individual vocabularies. Crucial when integrating data indexed with different vocabularies.
Domain-specific features: Additional elements which are important within certain specific areas. An example of this is a molecular structure representation (such as an InChI [38]) for entities within vocabularies of drugs and other small molecules.

Origin and use of industry vocabularies

The vocabularies used in industry originate from three main sources. Many are based on, or incorporate elements of those developed by the academic/nonprofit sector, of which over two hundred are listed at the National Center for Biomedical Ontology (NCBO: http://bioportal.bioontology.org). Others are sourced from commercial suppliers, having either been developed specifically for the industry customer or as part of a larger product. Internal vocabulary groups within industry provide the remainder, augmenting existing resources and filling gaps as necessary. Generally, these industry-tuned vocabularies do not make their way into the public domain nor are they shared beyond one company. Although notable exceptions exist (such as the transition of the Bayer Code for identifying crop plants and pests: http://pp1.eppo.org), these are few and far between. Furthermore, although some public resources receive ad hoc funding or intellectual contribution from industry, at present this is neither coordinated nor part of any long-term strategy. Because all companies are working with the same fundamental science and as such are constructing and maintaining similar vocabularies, there is clearly replication of effort [4].

Perhaps the most fundamental role for vocabularies within industry concerns the storage and retrieval of data and information in electronic repositories. In particular, results from major experiments, animal studies or human clinical samples represent significant resource and financial investments. For both efficiency and indeed regulatory reasons, it is essential that these data can be found by those who require them. Searching repositories can be made more efficient by using thesauri that mitigate the problem of multiple names for a single biomedical entity. For example, a thesaurus ensures that data concerning the protein carboxypeptidase B2 can be found, no matter which of its many synonyms (CPB2; carboxypeptidase U, CPU; plasma carboxypeptidase B, pCPB; thrombin-activatable fibrinolysis inhibitor, TAFI) is actually used. Alternatively, vocabularies can be used to restrict scientists to specific, controlled terms within data entry software. Obtaining compliance is straightforward when the user is presented with a defined list (typically in a drop-down menu), but is much more difficult where they are entering text freely. However, systems that make the selection of vocabulary terms within spreadsheets and documents as intuitive as spell-checking should improve this, and are now becoming a reality [5,6] (see also Microsoft News Center: http://www.microsoft.com/presspass/press/2009/mar09/03-11MSCreativeCommonsPR.mspx). Furthermore, over the next few years the concept of humans sharing the process of documenting methods and results with the machines performing the experiments might also become routine practice [7]. In such scenarios it will be clearly advantageous for both parties to use a common language.

In an information-rich industry, scientists need to go beyond simple search requests and ask more detailed questions. Table 1 outlines some common information tasks, all of which require a fundamental ability to join information from multiple sources. Crucially this includes internal and external systems, and for the latter, both free and commercial content, highlighting the need for universally accessible vocabularies.
Taxonomies and ontologies further aid the interpretation of integrated data by providing the ability to filter results to scientifically useful groupings, such as inflammatory diseases, antirheumatoid agents or G-protein coupled receptors. Vocabularies are also crucial to more advanced tasks, such as systems modeling in important areas, such as neural circuitry [8], carcinoma classification [9] and drug toxicity [10]. There is a similar vocabulary dependency in common data mining approaches that provide insights into off-target interactions for drugs [11], new therapeutic opportunities [12], high-throughput screening data [13] and others [14]. Finally, they also power the many successful text-mining techniques that provide surveillance across biomedical literature, patents, regulatory documents and tweets that would otherwise be impossible to monitor fully [15,16].
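Two of the uses described in this section, finding data regardless of which synonym was used and filtering results to a scientifically useful grouping, can be sketched in a few lines. This is a toy illustration under stated assumptions rather than a description of any real system: the `Entry` record borrows a few of the elements listed in Box 1, the `EX:` identifiers are invented, and the small disease hierarchy is deliberately simplified to a single parent per entry.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entry:
    """One vocabulary entry, using a few of the elements listed in Box 1."""
    identifier: str                 # opaque string, carries no meaning itself
    preferred_term: str
    synonyms: List[str] = field(default_factory=list)
    parent: Optional[str] = None    # identifier of the parent entry, if any


# Identifiers are invented for this sketch; they are not real accessions.
VOCABULARY = [
    Entry("EX:0001", "carboxypeptidase B2",
          ["CPB2", "carboxypeptidase U", "CPU", "plasma carboxypeptidase B",
           "pCPB", "thrombin-activatable fibrinolysis inhibitor", "TAFI"]),
    Entry("EX:0100", "disease"),
    Entry("EX:0110", "inflammatory disease", parent="EX:0100"),
    Entry("EX:0111", "rheumatoid arthritis", ["RA"], parent="EX:0110"),
    Entry("EX:0120", "neoplasm", parent="EX:0100"),
    Entry("EX:0121", "melanoma", parent="EX:0120"),
]


def build_name_index(entries):
    """Map every preferred term and synonym (case-insensitively) to an identifier."""
    index = {}
    for entry in entries:
        for name in [entry.preferred_term] + entry.synonyms:
            index[name.casefold()] = entry.identifier
    return index


def resolve(name, index):
    """Return the identifier for a name, or None if the name is unknown."""
    return index.get(name.strip().casefold())


def ancestors(identifier, entries_by_id):
    """Walk parent links upwards and return the ancestor identifiers in order."""
    found = []
    current = entries_by_id[identifier].parent
    while current is not None and current not in found:
        found.append(current)
        current = entries_by_id[current].parent
    return found


def in_group(identifier, group_id, entries_by_id):
    """True if the entry is the group itself or sits anywhere beneath it."""
    return identifier == group_id or group_id in ancestors(identifier, entries_by_id)


INDEX = build_name_index(VOCABULARY)
BY_ID = {entry.identifier: entry for entry in VOCABULARY}

if __name__ == "__main__":
    # Different names for the same protein resolve to one identifier.
    print(resolve("TAFI", INDEX) == resolve("carboxypeptidase U", INDEX))
    # Rheumatoid arthritis falls under the inflammatory disease grouping.
    print(in_group(resolve("RA", INDEX), resolve("inflammatory disease", INDEX), BY_ID))
```

A flat name index like this cannot handle homonyms (a string such as "CPU" also has unrelated meanings), which is one reason Box 1 lists homonym information and provenance alongside synonyms.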
Emerging challenges
The negative impact of partial and missing vocabularies on industrial research is not a new issue [1]. However, in the current, rapidly evolving environment, new scientific, business and technical indicators suggest this problem will become even more acute. Within human health, there is an increased interest in using clinical data to drive and augment basic research, especially when combined with in vitro and animal model studies [17]. A good example of the direction in which many are headed is provided by a recent biomarker study from Genentech (http://www.gene.com/gene/index.jsp) that focused on samples taken from over 3000 patients with rheumatoid arthritis [18]. Multiple genetic, gene expression, cell population and protein marker studies were performed by several contract research organizations (CROs) and subsequently integrated for the analysis. The authors describe how their efforts were hindered by a lack of vocabulary standards, with significant laborious, manual intervention required to match up ethnicity, study regions, drugs and drug types across the results.

Because the relationship between large organizations and CROs is evolving from a customer-provider to a research partner model [19], it is likely that these phenomena will arise more frequently. Indeed, the blurring of lines between internal and external teams is apparent in many business strategies, such as industry-academic partnerships, precompetitive initiatives, product in-licensing and open innovation [20]. It is also true that the depth of biomedical knowledge required to take some innovative products to market might be beyond that which any one company can build alone [21]. Thus, the future would appear to be a highly dynamic system with different, transient external partnerships evolving as projects progress. It is self-evident that these highly networked and fluid business models will be significantly hindered by inefficient data exchange and analysis. Indeed, Vargas et al. [22] have highlighted how data standardization issues have already become a problem in major industry-academia collaborations. Similarly, a recent Harvard Business Review article identified this area as one requiring substantial improvement, concluding that: 'A common standard for sharing drug asset data would unleash tremendous innovation' (The HBR List: Breakthrough Ideas for 2010: https://archive.harvardbusiness.org/cla/web/pl/product.seam?c=2275&i=2277&cs=7b9e2623ca9d337e9e6dd0e21012b011).

From a technical perspective, the drive for integration has led many industry and academic informaticians to explore Semantic Web technology [14]. This approach holds promise in addressing major information challenges by combining data integration with powerful querying and inferencing capabilities. The World Wide Web Consortium Health Care and Life Sciences Interest Group (http://www.w3.org/2001/sw/hcls) and others have developed several industry-relevant use cases that identify scientific relationships previously hidden in existing data [23–26]. Yet integrating and inferencing using the Semantic Web is completely dependent upon the proper identification of the biomedical entities in these data [2,27]. Thus, there is an intrinsic link between the availability of good vocabularies and the future success of this technology within industry.

A new approach
The traditional individual company approach to vocabulary provision has, at best, provided limited support of the overall information needs of industry scientists. However, the ever-increasing volume and complexity of preclinical and translational data suggest that this path is unsustainable and cannot meet the levels of coverage now needed. Furthermore, individual standards add little to the ability to integrate other sources of data, being useful only after laborious (and often ambiguous) cross-referencing exercises. We must realize a new environment where project teams are unhindered in using data from whichever partners they choose to engage with and any electronic systems they wish to interrogate. Only open, shared standards that are available to all information producers and consumers are able to fulfill this need. Thus, we propose that industry develops a new strategy in this area, based
TABLE 1
Information strategies for typical R&D questions

Business area | Scientific question | Approach and sources | Vocabularies required
Understanding disease biology | Which proteins are in the EGFR pathway? | Integrate public and commercial pathway databases with text-mined and other results | Gene/protein; pathway
Understanding drug mechanisms | What mechanisms of action have been tested for the treatment of melanomas? | Query internal portfolio systems, drug databases, literature and patents for targets, pathways and bioprocesses connected to general melanoma and specific subtypes of the disease | Gene/protein; target; compound; drug; drug class; disease; mechanism of action; intracellular process; physiological process; study type
Chemistry design and synthesis | Which internal and external compounds are strong, specific antagonists of the CCR1 receptor? | Query internal and external pharmacology databases in addition to text-mined data from public literature | Target; compound; drug; pharmacological assay
Drug repurposing | Can any of our compounds be repurposed for additional indications? | Integrate competitor intelligence databases, text-mining results, omics analyses and more to identify novel compound-disease or target-disease associations | Gene/protein; target; pathway; disease; mechanism of action; biological process; physiological process; symptom
Drug repurposing, safety monitoring | Which of our drugs affect behavior in animal assays? | Take drug portfolio and query internal database and public literature for behavioral signals for candidate and launched compounds | Compound; drug; animal model; behavioral endpoint
Differentiation, safety mitigation | What adverse events for launched anti-inflammatory drugs have been observed in model organisms? | Query drug event databases, both internal and external and integrate the results | Compound; drug; drug class; disease; symptom; species; adverse event; toxicity endpoint; toxicity assay
Pharmacovigilance | Can we monitor relevant resources for safety indications for a set of drugs? | Query literature, clinical trial and drug event databases, both internal and external and integrate the results | Drug; drug class; symptom; adverse event; toxicity endpoint
Strategic collaborations | Who is an external expert on pancreatitis? | Mine/integrate literature, grant applications, patents, conference abstracts to identify key opinion leaders | People; institute; city; country; disease
Competitor intelligence | How does our internal immunomodulation portfolio compare to the competition? | Integrate drug, target and mechanism data across many different diseases where immunomodulation is important. Normalize company names based on mergers/acquisitions. Optionally cluster targets by pathway or mechanism. | Gene/protein; pathway; target; disease; company; mechanism of action; intracellular process; physiological process; pharmaceutical type; clinical trial; clinical outcome
Competitor intelligence, strategic investment | What new trends are emerging in a particular research area? | Analyze literature, patent, grant and conference information to identify biomedical concepts | General biomedical dictionaries; institution; pharmaceutical type; biotechnology concept

EGFR: epidermal growth factor receptor; CCR1: chemokine receptor type 1.
on the precompetitive development of open research vocabularies. Such an approach could provide many benefits, including:
(i) Cost savings in vocabulary development by sharing the work.
(ii) Less redundancy, greater coverage and more concepts for same effort.
(iii) Wider body of experts to draw on giving broader scientific representation.
(iv) Opportunity to cover multiple languages and integrate non-English information.
(v) Proactively maintained, rather than ad hoc patches when gaps are found.
(vi) Increased efficiency, more time exploiting, less time plumbing.
(vii) Better analytical capabilities, better results and greater business impact.

Of course, collaboration can be non-trivial and initial projects might take considerable time and effort to deliver as the participants establish the most effective way to operate. Additionally, different organizations might have irreconcilable views on the construction and primary application of any specific vocabulary [3,28]. Consequently, not all internal vocabularies should be candidates for externalization, and examples that are highly tuned to one company's specific need would not be relevant. However, even these efforts would benefit by being built upon a backbone of common open concepts, providing better connections to the wider information network.

Given the vast landscape of biomedicine, it should be possible to identify many major areas of common need, such as open cell/tissue hierarchies and inter-relationships, catalogs of animal models (and relationships to human biology), pathophysiological processes and disease phenotypes. Although agreement on aims and scope should be possible, there might still be differences of opinion regarding vocabulary content, either in the entities themselves or their relationships to one another. For example, participants might disagree on what constitutes valid symptoms for a disease or the inclusion of ambiguous synonyms in thesauri. However, this problem is relatively simple to mitigate by correct recording of provenance within each vocabulary entry, enabling consumers to include or exclude elements from different contributors as they see fit.

It is important to recognize that the information and informatics infrastructure varies between life science companies and the ability for each member to benefit will progress at differing rates. This might add to the complexity and cost of any solution to ensure that it is workable for all participants. Alternatively, those companies with a more advanced infrastructure could identify ways to donate these systems and experiences into the public domain. Ultimately, no one will want to be at a disadvantage and hence building awareness of the advantages of collaboration is crucial to gain enough support and resources to ensure continued momentum. Although these concerns are applicable to many precompetitive initiatives, they do not seem to have had too great a negative impact thus far. However, they could influence individual decisions whether to engage or refrain from certain vocabulary projects.
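The provenance-based mitigation mentioned above, where consumers include or exclude elements according to who contributed them, might look like the following minimal sketch. The `Assertion` record, the `select` function and the contributor labels are hypothetical names introduced for illustration; the article does not prescribe a data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """A single statement about an entity, tagged with who contributed it."""
    entity: str
    kind: str         # for example "synonym" or "symptom"
    value: str
    contributor: str  # provenance: the organization that supplied the statement


# Contributor names are placeholders, not real organizations.
ASSERTIONS = [
    Assertion("carboxypeptidase B2", "synonym", "TAFI", "contributor A"),
    Assertion("carboxypeptidase B2", "synonym", "CPU", "contributor B"),
    Assertion("carboxypeptidase B2", "synonym", "pCPB", "contributor C"),
]


def select(assertions, include=None, exclude=frozenset()):
    """Keep assertions by provenance.

    include: if given, only contributors in this set are kept.
    exclude: contributors in this set are always dropped.
    """
    kept = []
    for assertion in assertions:
        if assertion.contributor in exclude:
            continue
        if include is not None and assertion.contributor not in include:
            continue
        kept.append(assertion)
    return kept


if __name__ == "__main__":
    # A consumer who regards "CPU" as too ambiguous drops that contributor's input.
    for assertion in select(ASSERTIONS, exclude={"contributor B"}):
        print(assertion.value, "from", assertion.contributor)
```

Because the shared vocabulary keeps every contribution and each consumer filters at read time, disagreements over content need not block agreement on the common resource.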
Economics
Although the strategic partnerships suggested above might support this culture shift, investment from industry will still be required because these collaborations will require time and money. Does partnership make economic sense? Here it is helpful to look at a real example, such as the Medical Dictionary for Regulatory Activities (MedDRA) vocabulary, used primarily within pharmacovigilance and developed by the International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH). MedDRA access, maintenance and support are funded through tiered subscriptions, with a maximum rate of US $62,850/year for companies with an annual revenue >US $5 billion, but much lower for smaller businesses and free to nonprofit and regulatory authorities (MedDRA: MSSO: http://www.meddramsso.com/public_subscription_rates.asp). The subscription provides an organization with access to over 670,000 highly annotated concepts, translated into multiple languages, in addition to meetings, training and support. Contrast this with an internal approach; the equivalent amount would likely not even cover one employee year, and could never reach the levels of support offered by the group of international subject experts, vocabulary experts, IT experts, and leadership from the ICH MedDRA Management Board. Clearly this joint maintenance model is far more beneficial from a financial standpoint than one in which each company individually carries the maintenance cost of its own in-house medical vocabulary. Of course, MedDRA is only one specific example that has a relatively large number of subscribers and we cannot extrapolate these figures to every vocabulary. However, they do illustrate that industry has already developed one model of operation in this space and that it currently provides significantly extended capabilities at a reasonable cost.

External partnerships
The benefits of open, public data standards for industry align well with strategies put forward in this area for public science [21,22,29–31], highlighting potential synergies between the two domains. Such partnerships would offer a chance to kick-start cross-industry projects by leveraging existing experience and resources, and providing a neutral territory in which to collaborate. Furthermore, there are several public organizations and systems that could provide some of the core capabilities that will be required (Table 2). Clearly, there will be challenges in adapting the mechanisms and conventions that are employed by the nonprofit sector to the requirements of industry. Box 2 highlights some of the major issues that would need to be addressed, but given that industry is a major consumer of public research, there is much to be gained by exploring the possibility of greater alignment. Part of this strategy must include dialog with funding agencies concerning the long-term support for shared development and infrastructure. This will require adaptable solutions that take advantage of multinational grants, industry funding and in-kind contributions. Crucially, these initiatives should not be seen as simply industry projects, but rather opportunities for industry and academia to solve a common problem. Although industry can (and should) engage much more in public vocabulary infrastructure, it cannot be the sole mechanism for funding this work. Long-term support of co-ordination groups, such as the Open Biomedical Ontologies (OBO) Foundry, cross-national support for projects involving the US-based NCBO and mechanisms to integrate and co-ordinate with the other organizations, including those listed in Table 2,
TABLE 2
Potential nonprofit partners for industry vocabulary management

Name | Description | Further information
National Center for Biomedical Ontology (NCBO) | Provides tools and services for vocabulary access, management and annotation | http://www.bioontology.org
Concept Wiki and Concept Web Alliance | Provides community annotation and vocabulary management | http://www.conceptwiki.org
Miriam Registry and Identifiers.org | Provides a system for identifying and cataloguing scientific concepts | http://www.ebi.ac.uk/miriam, http://www.identifiers.org
Shared Names | Provides persistent identifiers for biomedical concepts | http://www.sharednames.org
Okkam | Provides EU funded infrastructure for systematic entity identification | http://www.okkam.org
Open Biological and Biomedical Ontologies (OBO) Foundry | Provides a home for many core vocabularies. Promotes best practice, alignment and development standards | http://www.obofoundry.org
BioSharing | Community of journals, funders and standardization efforts; standards-centered catalogs of data sharing resources | http://www.biosharing.org
Pistoia Alliance | Not-for-profit, precompetitive alliance to promote industry standards | http://www.pistoiaalliance.org
SAGE Commons | Not-for-profit, precompetitive alliance for building computational disease models that require vocabularies | http://www.sagebase.org/commons
Innovative Medicines Initiative | Public-private pharmaceutical research and development initiative sharing many information management challenges | http://www.imi.europa.eu
Elixir | European initiative to create long-term life science research and translational medicine informatics and information infrastructure | http://www.elixir-europe.org
National Center for Advancing Translational Sciences (NCATS) | New NIH-funded institute to bridge the translation gap in the precompetitive space. Shared data and information challenges with industry | [17]
Clinical and Translational Science Awards Consortium | Network of clinical and translational research centers including information systems in the USA. Shared data and information challenges | https://www.commonfund.nih.gov/ctsa
Clinical Data Interchange Standards Consortium (CDISC) | Provides standards to support clinical research data interoperability. Potential to collaborate on relevant terminologies | http://www.cdisc.org
International Health Terminology Standards Development Organization | Not-for-profit association that develops and promotes use of the SNOMED-CT vocabulary for health information exchange | http://www.ihtsdo.org
International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) | Brings together the regulatory authorities and pharmaceutical industry of Europe, Japan and the US to discuss scientific and technical aspects of drug registration. Developer of the MedDRA vocabulary | http://www.ich.org
World Wide Web Health Care and Life Sciences Interest Group | Develops and supports the use of Semantic Web technologies for health care and life science | http://www.w3.org/2001/sw/hcls
BOX 2
The missing pieces

Although commercial organizations and cross-industry groups collaborate on some open, public vocabularies, the majority are led by nonprofit consortia. To develop resources and infrastructure of value to all parties, the following challenges will need to be addressed:
Where are living vocabularies to be hosted? How exactly will the vocabularies be served?
How will unique identifiers be assigned? And who will perform cross-referencing?
Where do industry vocabularies that do not have an existing public domain owner reside? And who is responsible for them?
What provision for change management will be supported? Who is authorized to make changes? How will updates be validated?
Can the right technical frameworks be put in place to capture provenance and manage data integrity (prevent malicious damage, accidental loss)?
What will be the schedule for release of new versions of the vocabulary?
How might multiple companies reconcile overlapping vocabularies? Who will provide advice and how?
How will multi-lingual requirements be managed? This will be important for those companies that require data and document indexing from offices in distributed geographical locations.
Even in the most buoyant economic climate, industry could not provide an active participant on each vocabulary in the public domain. How will relationships between those which are funded and those which are not be affected?
How do collaborative models cope with changing priorities? What is of interest today might not be in vogue tomorrow, meaning a rapid withdrawal of funding or intellectual input.
What expectations can industry have on publicly funded groups to provide such services? And how do we agree fair funding commitments?
will all be important. We believe that public-private projects provide a rich environment to explore this issue and, indeed, have already begun generating some important learning in areas such as toxicology (OpenTox [10]), pharmacology (OpenPhacts: http://openphacts.org) and disease knowledge and semantic publishing (Pistoia Alliance: http://www.pistoiaalliance.org/workinggroups/sesl.html and [32]).
In addition to industry and nonprofit groups, the role of commercial content suppliers, aggregators and publishers must be considered. In many instances, these organizations are supportive of open data standards and actively promote their adoption and use [33,34]. It is also true that they have developed and might sell their own proprietary thesauri, taxonomies and ontologies. An industry vision around open vocabularies has implications for those providers that have already made significant investment in this area. Considering this perspective, two important points should be made. Firstly, to fully leverage the value in commercial data we must markedly improve our ability to integrate and interrogate it, a capability that is dependent upon data standards. As customers, industry needs to clearly articulate the anticipated business benefits to content providers and to partner with them to achieve this in a way that provides value to both parties. Secondly, we are not calling for universal access to commercial data, but rather that the concepts within that data are identified using standards that present few barriers to widespread adoption. Providers are still able to create proprietary assertions and databases, but in the knowledge that the cells, symptoms, processes and pathologies within them will fully connect with their customers' other data. Finally, the lack of an extensive community of smaller commercial vocabulary providers suggests that building industry-relevant vocabularies to order might not be an area of high economic value on its own. Perhaps smaller companies could identify entrepreneurial approaches that combine collaborative vocabulary services with additional value-added (i.e. revenue-generating) opportunities. This could include agreeing to host and maintain a particular vocabulary that is relevant to their core business on behalf of the community. In return, they would not only obtain wider input on a resource of utility to themselves, but potentially also develop a wider network of potential customers. Major content providers could also have such a role here, assisting in an area in which they already have a great deal of experience.
Concluding remarks
The availability of high-quality biomedical vocabularies is an often-overlooked but crucial component to future success in life science research. Our aim is to raise awareness of this issue across the industry and to develop the necessary support, participation and funding required to address it. Specifically, we propose the initiation of pilot studies to explore and validate the hypothesis that shared vocabularies will be mutually beneficial and cost effective. It should not be difficult to identify some core areas of need, agree the scope and provide quantitative judgements on the ability to align and develop vocabularies across multiple companies. Such projects should enable us to identify key bottlenecks in addition to building the economic models to judge whether such an approach will be financially viable in the long term. Secondly, we advocate that vocabulary standards become an intrinsic element within industry software, whether they are document repositories, electronic laboratory notebooks or intelligence systems. Ensuring that the designers of these applications consider how they will identify concepts in a way that facilitates integration will provide significant future benefits. Finally, industry should actively engage with many of the external groups listed in Table 2, using constructs such as the Pistoia Alliance and Innovative Medicines Initiative (IMI) to supply the legal and logistical frameworks for collaboration. By providing drive, leadership and a vision in this area, industry has an opportunity to begin to tackle a long-standing issue and enable the science needed to deliver future products.
Acknowledgements
This topic was first discussed at a Pistoia Alliance-sponsored meeting on industry vocabulary strategies. We thank the following participants for valuable contributions which provided the substrate for this perspective: Michael Ashburner, Susanna Lewis, Alan Ruttenberg, Barry Smith (OBO Foundry); Johanna McEntyre, Dominic Clark, Chris Taylor (EMBL-EBI); Philippe Rocca-Serra (University of Oxford); Gordon Baxter (Biowisdom); Douglas Bassett (Ingenuity); Ashley George, Raymond Grimaila (GSK); Tim Shay (Symyx); Bette Brunelle (Dialog); Paolo Ciccarese (Medicognos); Olivier Bodenreider (National Library of Medicine); Andrej Bugrim (GeneGo). We also thank Therese Vachon (Novartis), Hilary Vass (AstraZeneca) and Phoebe Roberts (Pfizer) for suggestions and corrections. We thank Anna Zhao-Wong (MSSO) for information regarding MedDRA.
References

1 Searls, D. (2005) Data integration: challenges for drug discovery. Nat. Rev. Drug Discov. 4, 45-58
2 Slater, T. et al. (2008) Beyond data integration. Drug Discov. Today 13, 584-589
3 Cimino, J.J. (1998) Desiderata for controlled medical vocabularies in the twenty-first century. Methods Inf. Med. 394-403
4 Barnes, M.R. et al. (2009) Lowering industry firewalls: pre-competitive informatics initiatives in drug discovery. Nat. Rev. Drug Discov. 18
5 Rocca-Serra, P. et al. (2010) ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level. Bioinformatics 26, 2354-2356
6 Wolstencroft, K. et al. (2011) RightField: embedding ontology annotation in spreadsheets. Bioinformatics 27, 2021-2022
7 Qi, D. et al. (2010) An ontology for description of drug discovery investigations. J. Integr. Bioinform. 7, 113
8 Rubin, D.L. et al. (2009) Computational neuroanatomy: ontology-based representation of neural components and connectivity. BMC Bioinformatics 10 (Suppl. 2), S3
9 Kumar, A. et al. (2005) An ontology for carcinoma classification for clinical bioinformatics. Stud. Health Technol. Inform. 116, 635-640
10 Hardy, B. et al. (2010) Collaborative development of predictive toxicology applications. J. Cheminform. 2, 7
11 Campillos, M. et al. (2008) Drug target identification using side-effect similarity. Science 321, 263-266
12 Campbell, S.J. et al. (2010) Visualising the drug target landscape. Drug Discov. Today 15, 315
13 Schürer, S.C. et al. (2011) BioAssay ontology annotations facilitate cross-analysis of diverse high-throughput screening data sets. J. Biomol. Screen. 16, 415-426
14 Chen, H. and Xie, G. (2010) The use of web ontology languages and other semantic web tools in drug discovery. Expert Opin. Drug Discov. 5, 413-423
15 Agarwal, P. and Searls, D.B. (2008) Literature mining in support of drug discovery. Brief Bioinform. 9, 479-492
16 Agarwal, P. and Searls, D.B. (2009) Can literature analysis identify innovation drivers in drug discovery? Nat. Rev. Drug Discov. 8, 865-878
17 Collins, F.S. (2011) Reengineering translational science: the time is right. Sci. Transl. Med. 3, 16
18 Sorani, M.D. et al. (2010) Clinical and biological data integration for biomarker discovery. Drug Discov. Today 15, 741-748
19 Hutchins, S. et al. (2011) Open partnering of integrated drug discovery: continuing evolution of the pharmaceutical model. Drug Discov. Today 16, 281-283
20 Hunter, J. and Stephens, S. (2010) Is open innovation the way forward for big pharma? Nat. Rev. Drug Discov. 9, 87-88
21 Friend, S.H. (2010) The need for precompetitive integrative bionetwork disease model building. Clin. Pharmacol. Ther. 87, 536-539
22 Vargas, G. et al. (2010) Arguments against precompetitive collaboration. Clin. Pharmacol. Ther. 87, 527-529
23 Sahoo, S.S. et al. (2008) An ontology-driven semantic mashup of gene and biological pathway information: application to the domain of nicotine dependence. J. Biomed. Inform. 41, 752-765
24 Gudivada, R.C. et al. (2008) Identifying disease-causal genes using semantic web-based representation of integrated genomic and phenomic knowledge. J. Biomed. Inform. 41, 717-729
25 Chen, B. et al. (2010) Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics 11, 255
26 Jentzsch, A. et al. (2009) Enabling tailored therapeutics with linked data. In Proceedings of the WWW2009 Workshop on Linked Data on the Web pp. 16
27 Heath, T. and Bizer, C., eds (2010) Linked Data: Evolving the Web into a Global Data Space, Morgan & Claypool
28 Goble, C. and Wroe, C. (2004) The montagues and the capulets. Comp. Funct. Genom. 5, 623-632
29 Field, D. et al. (2009) Omics data sharing. Science 326, 234-236
30 Quackenbush, J. (2009) Data reporting standards: making the things we use better. Genome Med. 1, 111
31 Schofield, P.N. et al. (2010) Sustaining the data and bioresource commons. Science 330, 592-593
32 Mons, B. et al. (2011) The value of data. Nat. Genet. 43, 281-283
33 (2009) Editorial. Changing the face of scientific publishing. Integr. Biol. 1, 293
34 (2010) Editorial. Putting data to work. Nat. Chem. Biol. 6, 783
35 Smith, B. et al. (2006) Towards a reference terminology for ontology research and development in the biomedical domain. In Proceedings of KR-MED 2006 pp. 57-66
36 Vanopstal, K. et al. (2009) Vocabularies and retrieval tools in biomedicine: disentangling the terminological knot. J. Med. Syst. 10.1007/s10916-009-9389-z
37 Laibe, C. and Le Novère, N. (2007) MIRIAM resources: tools to generate and resolve robust cross-references in systems biology. BMC Syst. Biol. 1, 58
38 McNaught, A. (2006) The IUPAC international chemical identifier: InChI - a new standard for molecular informatics. Chem. Int. 28, 12-14
