REVIEWS
1 Connected Discovery Ltd., 27 Old Gloucester Street, London WC1N 3AX, UK
2 GlaxoSmithKline, Computational Biology, 2F157 Gunnels Wood Road, Stevenage, Hertfordshire SG1 2NY, UK
3 Standards and Data Sharing Infrastructure Team, e-Research Centre, University of Oxford, 7 Keble Rd, Oxford OX1 3QG, UK
4 Knowledge Management and Information Science, R&D Information, AstraZeneca R&D Mölndal, 431 83 Mölndal, Sweden
5 Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, The Netherlands
6 Pharma Research and Early Development, Hoffmann-LaRoche Inc., 340 Kingsland St, Nutley, NJ 07110, USA
7 Pfizer Worldwide Research and Development, 235 E 42nd ST, MS 150/5/60N, New York, NY 10017, USA
8 Unilever R&D, Colworth Science Park, Sharnbrook, Bedfordshire MK44 1LQ, UK
9 Syngenta R&D Information Systems, International Research Centre, Jealott's Hill, Berkshire RG42 6EX, UK
10 Pfizer Worldwide Research and Development, 35 Cambridge Park Drive, Cambridge, MA 02140, USA
11 Thomson Reuters Life Sciences, 22 Thomson Place, Boston, MA 02210, USA
12 Stanford University, Stanford University, 251 Campus Drive, Stanford, CA 94305-5479, USA
13 Scientific Information Centre, Boehringer Ingelheim Pharma GmbH & Co. KG, 88397 Biberach, Germany
14 Merck Sharp & Dohme Corp., 33 Avenue Louis Pasteur, Boston, MA 02115-5727, USA
15 Science & Technology, Corporate Markets, Elsevier Pharma and Biotech Group, Elsevier, 32 Jamestown Road, London NW1 7BY, UK
16 AstraZeneca UK, Alderley Park, Macclesfield SK10 4TG, UK
17 The Pistoia Alliance (see footnote 1)
18 Knowledge Management & Information Science, R&D Information, AstraZeneca, 26F17 Mereside, Alderley Park, Macclesfield SK10 4TG, UK
The life science industries (including pharmaceuticals, agrochemicals and consumer goods) are exploring new business models for research and development that focus on external partnerships. In parallel, there is a desire to make better use of data obtained from sources such as human clinical samples to inform and support early research programmes. Success in both areas depends upon the successful integration of heterogeneous data from multiple providers and scientific domains, something that is already a major challenge within the industry. This issue is exacerbated by the absence of agreed standards that unambiguously identify the entities, processes and observations within experimental results. In this article we highlight the risks to future productivity that are associated with incomplete biological and chemical vocabularies and suggest a new model to address this long-standing issue.
Corresponding author: Harland, L. (lee@connecteddiscovery.com)
Footnote 1: http://pistoiaalliance.org
1359-6446/06/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.drudis.2011.09.013
www.drugdiscoverytoday.com Please cite this article in press as: L. Harland, et al., Empowering industrial research with shared biomedical vocabularies, Drug Discov Today (2011), doi:10.1016/j.drudis.2011.09.013
Reviews INFORMATICS
DRUDIS-901; No of Pages 8 REVIEWS Drug Discovery Today Volume 00, Number 00 September 2011
Introduction
Commercial life science organizations are evolving; they are exploring new mechanisms to adjust to well-documented economic and productivity challenges. At the same time, thanks to the rapid technological advances within biology they are facing an explosion in the volume and complexity of available data. Efficient management, processing and application of internal and external data are vital to research and development productivity [1,2]. Yet, an integrated view across experiments, literature and databases presupposes that the entities of interest, such as molecules, compounds, cells, observations and even people, can be identified and recorded unambiguously. In information systems, identity can be asserted through the use of reference vocabularies in the form of lists, taxonomies and ontologies (Box 1). While these various structures support different use cases, they all provide a mechanism to define the things within the data unequivocally [3]. Unfortunately, coverage of biomedical and chemical concepts is patchy at best, with many scientific domains devoid of representation. Even where specific vocabularies exist, they have often been developed for specific purposes, and are unable to support different applications. For example, a thesaurus is useful for text-mining, but may be poor for classification tasks. This compounds the problem, requiring industry informaticians to source or develop multiple vocabulary variants within one scientific domain. Human disease is a good example of this, being represented by the International Classification of Diseases (http://www.who.int/whosis/icd10), Medical Subject Headings (http://www.nlm.nih.gov/mesh), National Cancer Institute Thesaurus (http://ncit.nci.nih.gov/ncitbrowser), Systematized Nomenclature of Medicine-Clinical Terms (http://www.ihtsdo.org), Human Disease Ontology (http://do-wiki.nubic.northwestern.edu) and many other proprietary resources. A further consequence is a combinatorial explosion of cross-referencing required to align the same entity across each source. Alternatively, technical or legal restrictions could prevent the cross-referencing of proprietary vocabularies, resulting in incomplete integration. In this article, we consider the impact that this chaotic terminological landscape has on our ability to provide effective information support to industrial research.
We present arguments for increased industry participation in developing and sustaining these foundational resources and propose one potential path forward.
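The growth in cross-referencing effort mentioned above can be made concrete with a small count. The following is only an illustrative sketch, under the simplifying assumption that every vocabulary is mapped directly to every other one; it compares that with the hypothetical alternative of mapping each vocabulary once to a single shared reference. The function names are ours and do not come from any existing tool.

```python
from math import comb


def pairwise_mappings(n_vocabularies: int) -> int:
    """Mapping tables needed if every vocabulary is aligned directly with every other."""
    return comb(n_vocabularies, 2)


def shared_reference_mappings(n_vocabularies: int) -> int:
    """Mapping tables needed if each vocabulary is aligned once to one shared reference."""
    return n_vocabularies


# Print the two counts side by side for a few vocabulary counts.
for n in (2, 5, 10, 20):
    print(n, pairwise_mappings(n), shared_reference_mappings(n))
```

For the five public disease resources named in the Introduction, direct alignment would already need ten pairwise mapping tables, before any proprietary resources are added.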
BOX 1
Defining vocabulary
Agreeing on a common definition of the words term, lexicon, dictionary, vocabulary, taxonomy and ontology can in itself be difficult. Here, we use the word vocabulary to refer to a terminological resource that provides the identification and definition of entities (also known as concepts) within a scientific domain. This includes taxonomies, ontologies and terminologies as defined by Smith et al. [35] and glossaries, dictionaries, lexicons, thesauri, taxonomies and ontologies as defined by Vanopstal et al. [36]. Each of these vocabulary forms fits different use cases and applications, but all provide an ability to unambiguously identify biomedical entities. The various important elements are described below.
Basic identity (based on requirements defined by Laibe and Le Novère [37])
Domain: Which area(s) of biomedical science are covered.
Identifier: A unique, permanent, standards-compliant and non-editable string of characters that can be used to represent the entity. Ideally the identifier should be free from any semantics itself.
Preferred term: A human-readable name for an individual entity.
Definition: A human-readable, succinct definition of the real-world entity that this entry in the vocabulary represents.
Organizational structure
Dictionaries: Lists of entities with basic identity criteria.
Thesauri: Entities with lists of natural language synonyms, particularly suited to search and text-mining applications.
Taxonomies: Provide a parent/child hierarchy that might form an important part of the identity and definition of an entity (e.g. Asthma is a child of Respiratory Tract Disease).
Ontologies: Contain entities and their relationships to each other. The most expressive knowledge representation, relating entities to each other through multiple associations (for example 'is a', 'part of', 'found in' and 'interacts with').
Linguistic features
Synonyms: Per entity, a set of other words (and ideally, language origin) that are used to describe the same entity within natural language.
Homonyms, antonyms, part of speech fragments: Assist in dealing with ambiguity, particularly in term recognition in text-mining and search applications.
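As a rough illustration of how the elements in Box 1 might fit together in software, the sketch below models a single vocabulary entry. It is an assumption-laden example rather than a proposed standard: the class name, field names, the identifier `EX:0000001`, the domain label and the placeholder definition are all hypothetical. Only the protein names are taken from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VocabularyEntry:
    """One entity in a vocabulary, following the elements listed in Box 1."""

    identifier: str                  # opaque, permanent ID with no embedded meaning
    domain: str                      # area of biomedical science covered
    preferred_term: str              # human-readable name
    definition: str                  # succinct human-readable definition
    synonyms: tuple[str, ...] = ()   # thesaurus feature
    parents: tuple[str, ...] = ()    # identifiers of parent entries (taxonomy feature)


# Hypothetical entry: identifier, domain label and definition are placeholders.
entry = VocabularyEntry(
    identifier="EX:0000001",
    domain="protein",
    preferred_term="carboxypeptidase B2",
    definition="Placeholder definition for illustration only.",
    synonyms=("CPB2", "carboxypeptidase U", "TAFI"),
)

print(entry.identifier, entry.preferred_term, len(entry.synonyms))
```

Declaring the entry as frozen reflects the Box 1 requirement that an identifier be permanent and non-editable once issued. A dictionary would use only the first four fields, a thesaurus adds `synonyms`, and a taxonomy adds `parents`.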
significant resource and financial investments. For both efficiency and indeed regulatory reasons, it is essential that these data can be found by those who require them. Searching repositories can be made more efficient by using thesauri that mitigate the problem of multiple names for a single biomedical entity. For example, a thesaurus ensures that data concerning the protein carboxypeptidase B2 can be found no matter which of its many synonyms (CPB2; carboxypeptidase U, CPU; plasma carboxypeptidase B, pCPB; thrombin-activatable fibrinolysis inhibitor, TAFI) is actually used. Alternatively, vocabularies can be used to restrict scientists to specific, controlled terms within data entry software. Obtaining compliance is straightforward when the user is presented with a defined list (typically in a drop-down menu), but is much more difficult where they are entering text freely. However, systems that make the selection of vocabulary terms within spreadsheets and documents as intuitive as spell-checking should improve this, and are now becoming a reality ([5,6]; see also Microsoft News Center: http://www.microsoft.com/presspass/press/2009/mar09/03-11MSCreativeCommonsPR.mspx). Furthermore, over the next few years the concept of humans sharing the process of documenting methods and results with the machines performing the experiments might also become routine practice [7]. In such scenarios it will be clearly advantageous for both parties to use a common language. In an information-rich industry, scientists need to go beyond simple search requests and ask more detailed questions. Table 1 outlines some common information tasks, all of which require a fundamental ability to join information from multiple sources. Crucially this includes internal and external systems, and for the latter, both free and commercial content, highlighting the need for universally accessible vocabularies.
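The carboxypeptidase B2 example above can be sketched as a thesaurus-backed search. This is a minimal illustration, not a description of any particular product: the identifier `EX:CPB2`, the record layout and the function names are hypothetical, and the only normalization assumed is case folding and whitespace collapsing.

```python
def normalize(term: str) -> str:
    """Case-fold and collapse whitespace so trivially different spellings match."""
    return " ".join(term.lower().split())


# Synonyms as listed in the article; the identifier is a hypothetical placeholder.
THESAURUS = {
    "EX:CPB2": [
        "carboxypeptidase B2", "CPB2", "carboxypeptidase U", "CPU",
        "plasma carboxypeptidase B", "pCPB",
        "thrombin-activatable fibrinolysis inhibitor", "TAFI",
    ],
}


def build_index(thesaurus):
    """Map each normalized name to the set of identifiers it may refer to."""
    index = {}
    for identifier, names in thesaurus.items():
        for name in names:
            index.setdefault(normalize(name), set()).add(identifier)
    return index


def find_records(query, records, index):
    """Return records whose entity name resolves to the same identifier as the query."""
    wanted = index.get(normalize(query), set())
    return [
        r for r in records
        if index.get(normalize(r["entity_name"]), set()) & wanted
    ]


# Hypothetical repository records, each tagged with whatever name its author used.
records = [
    {"id": 1, "entity_name": "TAFI"},
    {"id": 2, "entity_name": "Carboxypeptidase  U"},
    {"id": 3, "entity_name": "CCR1"},
]

index = build_index(THESAURUS)
print([r["id"] for r in find_records("cpb2", records, index)])
```

The index maps each name to a set of identifiers rather than a single one, because a short synonym such as CPU could equally belong to an unrelated entity. This is the homonym problem noted in Box 1.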
Taxonomies and ontologies further aid the interpretation of integrated data by providing the ability to filter results to scientifically useful groupings, such as inflammatory diseases, antirheumatoid agents or G-protein coupled receptors. Vocabularies are also crucial to more advanced tasks, such as systems modeling in important areas, such as neural circuitry [8], carcinoma classification [9] and drug toxicity [10]. There is a similar vocabulary dependency in common data mining approaches that provide insights into off-target interactions for drugs [11], new therapeutic opportunities [12], high-throughput screening data [13] and others [14]. Finally, they also power the many successful text-mining techniques that provide surveillance across biomedical literature, patents, regulatory documents and tweets that would otherwise be impossible to monitor fully [15,16].
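Filtering results to a grouping such as inflammatory diseases amounts to walking a parent/child hierarchy. The sketch below shows one way this could work, assuming a tiny, hypothetical taxonomy fragment and made-up result records; the hierarchy is illustrative only and is not taken from any of the vocabularies named in this article.

```python
# Hypothetical taxonomy fragment: each term maps to its direct parents.
PARENTS = {
    "asthma": ["respiratory tract disease"],
    "rheumatoid arthritis": ["inflammatory disease"],
    "respiratory tract disease": ["disease"],
    "inflammatory disease": ["disease"],
}


def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def filter_by_group(results, group, parents):
    """Keep results annotated with the group itself or with any of its descendants."""
    return [
        r for r in results
        if r["disease"] == group or group in ancestors(r["disease"], parents)
    ]


# Hypothetical integrated results, each annotated with a disease term.
results = [
    {"compound": "cmpd-1", "disease": "asthma"},
    {"compound": "cmpd-2", "disease": "rheumatoid arthritis"},
]

print([r["compound"] for r in filter_by_group(results, "inflammatory disease", PARENTS)])
```

The same traversal works when a term has several parents, which is common in ontologies. The filter only behaves correctly if every data source annotates its records with terms from the same hierarchy, which is the dependency the paragraph above describes.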
TABLE 1
Which internal and external compounds are strong, specific antagonists of the CCR1 receptor?
Can any of our compounds be repurposed for additional indications?
Drug repurposing
Which of our drugs affect behavior in animal assays?
What adverse events for launched anti-inflammatory drugs have been observed in model organisms?
Can we monitor relevant resources for safety indications for a set of drugs?
Who is an external expert on pancreatitis?
Pharmacovigilance
Strategic collaborations
Competitor intelligence
Gene/protein; target; pathway; disease; mechanism of action; biological process; physiological process; symptom
Compound; drug; animal model; behavioral endpoint
Compound; drug; drug class; disease; symptom; species; adverse event; toxicity endpoint; toxicity assay
Drug; drug class; symptom; adverse event; toxicity endpoint
People; institute; city; country; disease
Gene/protein; pathway; target; disease; company; mechanism of action; intracellular process; physiological process; pharmaceutical type; clinical trial; clinical outcome
Emerging challenges
The negative impact of partial and missing vocabularies on industrial research is not a new issue [1]. However, in the current, rapidly evolving environment, new scientific, business and technical indicators suggest that this problem will become even more acute. Within human health, there is an increased interest in using clinical data to drive and augment basic research, especially when combined with in vitro and animal model studies [17]. A good example of the direction in which many are headed is provided by a recent biomarker study from Genentech (http://www.gene.com/gene/index.jsp) that focused on samples taken from over 3000 patients with rheumatoid arthritis [18]. Multiple genetic, gene expression, cell population and protein marker studies were performed by several contract research organizations (CROs) and subsequently integrated for the analysis. The authors describe how their efforts were hindered by a lack of vocabulary standards, with significant, laborious manual intervention required to match up ethnicity, study regions, drugs and drug types across the results.
Because the relationship between large organizations and CROs is evolving from a customer-provider to a research partner model [19], it is likely that these phenomena will arise more frequently. Indeed, the blurring of lines between internal and external teams is apparent in many business strategies, such as industry-academic partnerships, precompetitive initiatives, product in-licensing and open innovation [20]. It is also true that the depth of biomedical knowledge required to take some innovative products to market might be beyond that which any one company can build alone [21]. Thus, the future would appear to be a highly dynamic system with different, transient external partnerships evolving as projects progress. It is self-evident that these highly networked and fluid business models will be significantly hindered by inefficient data exchange and analysis. Indeed, Vargas et al. [22] have highlighted how data standardization issues have already become a problem in major industry-academia collaborations. Similarly, a recent Harvard Business Review article identified this area as one requiring substantial improvement, concluding that 'A common standard for sharing drug asset data would unleash tremendous innovation' (The HBR List: Breakthrough Ideas for 2010: https://archive.harvardbusiness.org/cla/web/pl/product.seam?c=2275&i=2277&cs=7b9e2623ca9d337e9e6dd0e21012b011).
From a technical perspective, the drive for integration has led many industry and academic informaticians to explore Semantic Web technology [14]. This approach holds promise in addressing major information challenges by combining data integration with powerful querying and inferencing capabilities. The World Wide Web Consortium Health Care and Life Sciences Interest Group (http://www.w3.org/2001/sw/hcls) and others have developed several industry-relevant use cases that identify scientific relationships previously hidden in existing data [23-26]. Yet integrating and inferencing using the Semantic Web is completely dependent upon the proper identification of the biomedical entities in these data [2,27]. Thus, there is an intrinsic link between the availability of good vocabularies and the future success of this technology within industry.
A new approach
The traditional individual company approach to vocabulary provision has, at best, provided limited support of the overall information needs of industry scientists. However, the ever-increasing volume and complexity of preclinical and translational data suggests that this path is unsustainable and cannot meet the levels of coverage now needed. Furthermore, individual standards add little to the ability to integrate other sources of data, being useful only after laborious (and often ambiguous) cross-referencing exercises. We must realize a new environment where project teams are unhindered in using data from whichever partners they choose to engage with and any electronic systems they wish to interrogate. Only open, shared standards that are available to all information producers and consumers are able to fulfill this need. Thus, we propose that industry develops a new strategy in this area, based
on the precompetitive development of open research vocabularies. Such an approach could provide many benefits, including:
(i) Cost savings in vocabulary development by sharing the work.
(ii) Less redundancy, greater coverage and more concepts for the same effort.
(iii) A wider body of experts to draw on, giving broader scientific representation.
(iv) Opportunity to cover multiple languages and integrate non-English information.
(v) Proactive maintenance, rather than ad hoc patches when gaps are found.
(vi) Increased efficiency: more time exploiting, less time plumbing.
(vii) Better analytical capabilities, better results and greater business impact.
Of course, collaboration can be non-trivial and initial projects might take considerable time and effort to deliver as the participants establish the most effective way to operate. Additionally, different organizations might have irreconcilable views on the construction and primary application of any specific vocabulary [3,28]. Consequently, not all internal vocabularies should be candidates for externalization, and examples that are highly tuned to one company's specific need would not be relevant. However, even these efforts would benefit by being built upon a backbone of common open concepts, providing better connections to the wider information network. Given the vast landscape of biomedicine, it should be possible to identify many major areas of common need, such as open cell/tissue hierarchies and inter-relationships, catalogs of animal models (and relationships to human biology), pathophysiological processes and disease phenotypes. Although agreement on aims and scope should be possible, there might still be differences of opinion regarding vocabulary content, either in the entities themselves or their relationships to one another. For example, participants might disagree on what constitutes valid symptoms for a disease or the inclusion of ambiguous synonyms in thesauri.
However, this problem is relatively simple to mitigate by correct recording of provenance within each vocabulary entry, enabling consumers to include or exclude elements from different contributors as they see fit. It is important to recognize that the information and informatics infrastructure varies between life science companies and the ability for each member to benefit will progress at differing rates. This might add to the complexity and cost of any solution to ensure that it is workable for all participants. Alternatively, those companies with a more advanced infrastructure could identify ways to donate these systems and experiences into the public domain. Ultimately, no one will want to be at a disadvantage and hence building awareness of the advantages of collaboration is crucial to gain enough support and resources to ensure continued momentum. Although these concerns are applicable to many precompetitive initiatives, they do not seem to have had too great a negative impact thus far. However, they could influence individual decisions whether to engage or refrain from certain vocabulary projects.
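The provenance idea at the start of the paragraph above can be sketched as follows. This is a hypothetical illustration of one possible design, in which every element of a vocabulary entry records who contributed it; the class, the function, the contributor names and the identifier are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    entity_id: str      # identifier of the vocabulary entry
    kind: str           # e.g. "synonym" or "symptom"
    value: str
    contributor: str    # provenance: who supplied this element


def select(assertions, include=None, exclude=()):
    """Keep assertions from `include` (all contributors if None), minus `exclude`."""
    return [
        a for a in assertions
        if (include is None or a.contributor in include)
        and a.contributor not in exclude
    ]


# Hypothetical contributors and content, for illustration only.
assertions = [
    Assertion("EX:0000001", "synonym", "TAFI", "company_a"),
    Assertion("EX:0000001", "synonym", "CPU", "company_b"),
]

kept = select(assertions, exclude={"company_b"})
print([a.value for a in kept])
```

With provenance attached at this level, a consumer who considers a contributor's synonyms too ambiguous can drop them locally without the shared vocabulary having to settle the disagreement.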
External partnerships
The benefits of open, public data standards for industry align well with strategies put forward in this area for public science [21,22,29-31], highlighting potential synergies between the two domains. Such partnerships would offer a chance to kick-start cross-industry projects by leveraging existing experience and resources, and providing a neutral territory in which to collaborate. Furthermore, there are several public organizations and systems that could provide some of the core capabilities that will be required (Table 2). Clearly, there will be challenges in adapting the mechanisms and conventions that are employed by the nonprofit sector to the requirements of industry. Box 2 highlights some of the major issues that would need to be addressed, but given that industry is a major consumer of public research, there is much to be gained by exploring the possibility of greater alignment. Part of this strategy must include dialog with funding agencies concerning the long-term support for shared development and infrastructure. This will require adaptable solutions that take advantage of multinational grants, industry funding and in-kind contributions. Crucially, these initiatives should not be seen as simply industry projects, but rather opportunities for industry and academia to solve a common problem. Although industry can (and should) engage much more in public vocabulary infrastructure, it cannot be the sole mechanism for funding this work. Long-term support of co-ordination groups, such as the Open Biomedical Ontologies (OBO) Foundry, cross-national support for projects involving the US-based NCBO and mechanisms to integrate and co-ordinate with the other organizations, including those listed in Table 2, will all be important. We believe that public-private projects provide a rich environment to explore this issue and indeed, have already begun generating some important learning in areas, such as toxicology (OpenTox [10]), pharmacology (OpenPhacts: http://openphacts.org) and disease knowledge and semantic publishing (Pistoia Alliance: http://www.pistoiaalliance.org/workinggroups/sesl.html and [32]).
In addition to industry and nonprofit groups, the role of commercial content suppliers, aggregators and publishers must be considered. In many instances, these organizations are supportive of open data standards and actively promote their adoption and use [33,34]. It is also true that they have developed and might sell their own proprietary thesauri, taxonomies and ontologies. An industry vision around open vocabularies has implications for those providers that have already made significant investment in this area. Considering this perspective, two important points should be made. Firstly, to fully leverage the value in commercial data we must markedly improve our ability to integrate and interrogate it, a capability that is dependent upon data standards. As customers, industry needs to clearly articulate the anticipated business benefits to content providers and to partner with them to achieve this in a way that provides value to both parties. Secondly, we are not calling for universal access to commercial data, but rather that the concepts within that data are identified using standards that present few barriers to widespread adoption. Providers are still able to create proprietary assertions and databases, but in the knowledge that the cells, symptoms, processes and pathologies within them will fully connect with their customers' other data. Finally, the lack of an extensive community of smaller commercial vocabulary providers suggests that building industry-relevant vocabularies to order might not be an area of high economic value on its own. Perhaps smaller companies could identify entrepreneurial approaches that combine collaborative vocabulary services with additional value-added (i.e. revenue-generating) opportunities. This could include agreeing to host and maintain a particular vocabulary that is relevant to their core business on behalf of the community. In return, they would not only obtain wider input on a resource of utility to themselves, but potentially also develop a wider network of potential customers. Major content providers could also have such a role here, assisting in an area in which they already have a great deal of experience.
TABLE 2
Clinical and Translational Science Awards Consortium
Clinical Data Interchange Standards Consortium (CDISC)
International Health Terminology Standards Development Organization
International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH)
http://www.w3.org/2001/sw/hcls
BOX 2
Economics
Although the strategic partnerships suggested above might support this culture shift, investment from industry will still be required because these collaborations will require time and money. Does partnership make economic sense? Here it is helpful to look at a real example, such as the Medical Dictionary for Regulatory Activities (MedDRA) vocabulary, used primarily within pharmacovigilance and developed by the International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH). MedDRA access, maintenance and support are funded through tiered subscriptions, with a maximum rate of US $62,850/year for companies with an annual revenue >US $5 billion, but much lower for smaller businesses and free to nonprofit organizations and regulatory authorities (MedDRA: MSSO: http://www.meddramsso.com/public_subscription_rates.asp). The subscription provides an organization with access to over 670,000 highly annotated concepts, translated into multiple languages, in addition to meetings, training and support. Contrast this with an internal approach; the equivalent amount would likely not even cover one employee year, and could never reach the levels of support offered by the group of international subject experts, vocabulary experts, IT experts, and leadership from the ICH MedDRA Management Board. Clearly this joint maintenance model is far more beneficial from the financial standpoint than a situation in which each company individually carries the maintenance cost for its own in-house medical vocabulary. Of course, MedDRA is only one specific example that has a relatively large number of subscribers and we cannot extrapolate these figures to every vocabulary. However, they do illustrate that industry has already developed one model of operation in this space and that it currently provides significantly extended capabilities at a reasonable cost.
Concluding remarks
The availability of high-quality biomedical vocabularies is an often-overlooked but crucial component to future success in life science research. Our aim is to raise awareness of this issue across the industry and to develop the necessary support, participation and funding required to address it. Specifically, we propose the initiation of pilot studies to explore and validate the hypothesis that shared vocabularies will be mutually beneficial and cost effective. It should not be difficult to identify some core areas of need, agree the scope and provide quantitative judgements on the ability to align and develop vocabularies across multiple companies. Such projects should enable us to identify key bottlenecks in addition to building the economic models to judge whether such an approach will be financially viable in the long term. Secondly, we advocate that vocabulary standards become an intrinsic element within industry software, whether they are document repositories, electronic laboratory notebooks or intelligence systems. Ensuring that the designers of these applications consider how they will identify concepts in a way that facilitates integration will provide significant future benefits. Finally, industry should actively engage with many of the external groups listed in Table 2, using constructs such as the Pistoia Alliance and Innovative Medicines Initiative (IMI) to supply the legal and logistical frameworks for collaboration. By providing drive, leadership and a vision in this area, industry has an opportunity to begin to tackle a long-standing issue and enable the science needed to deliver future products.
Acknowledgements
This topic was first discussed at a Pistoia Alliance-sponsored meeting on industry vocabulary strategies. We thank the following participants for valuable contributions which provided the substrate for this perspective: Michael Ashburner, Susanna Lewis, Alan Ruttenberg, Barry Smith (OBO Foundry); Johanna McEntyre, Dominic Clark, Chris Taylor (EMBL-EBI); Philippe Rocca-Serra (University of Oxford); Gordon Baxter (Biowisdom); Douglas Bassett (Ingenuity); Ashley George, Raymond Grimaila (GSK); Tim Shay (Symyx); Bette Brunelle (Dialog); Paolo Ciccarese (Medicognos); Olivier Bodenreider (National Library of Medicine); Andrej Bugrim (GeneGo). We also thank Therese Vachon (Novartis), Hilary Vass (AstraZeneca) and Phoebe Roberts (Pfizer) for suggestions and corrections. We thank Anna Zhao-Wong (MSSO) for information regarding MedDRA.