INTRODUCTION

Biological databases are libraries of life sciences
information, collected from scientific experiments,
published literature, high-throughput experiment technology,
and computational analyses. They contain information from
research areas including genomics, proteomics,
metabolomics, microarray gene expression, and
phylogenetics. [1] Information contained in biological
databases includes gene function, structure, localization
(both cellular and chromosomal), clinical effects of
mutations, as well as similarities of biological sequences and
structures.
Relational database concepts from computer science and
information retrieval concepts from digital libraries are
important for understanding biological databases. Biological
database design, development, and long-term management are
a core area of the discipline of bioinformatics. [2] Data
contents include gene sequences, textual descriptions,
attributes and ontology classifications, citations, and tabular
data. These are often described as semi-structured data, and
can be represented as tables, key-delimited records, and
XML structures. Cross-references among databases are
common, using database accession numbers.
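The three representations mentioned above can be illustrated with a minimal Python sketch. The record, its field names, and its accession numbers are made up for illustration and do not come from any real database; the output layouts are simplified stand-ins, not the formats of a particular resource.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical record; field names and accession numbers are invented.
record = {
    "accession": "X00001",
    "description": "example gene sequence",
    "sequence": "ATGGCCTAA",
    "xref": "P00001",  # accession of a related entry in another database
}


def to_table_row(rec):
    """Render the record as a header line plus one tab-separated row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(rec), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerow(rec)
    return buf.getvalue()


def to_key_delimited(rec):
    """Render the record as KEY: value lines ended by a '//' terminator."""
    lines = [f"{key.upper()}: {value}" for key, value in rec.items()]
    return "\n".join(lines) + "\n//\n"


def to_xml(rec):
    """Render the record as a flat XML element with one child per field."""
    root = ET.Element("entry")
    for key, value in rec.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_table_row(record))
    print(to_key_delimited(record))
    print(to_xml(record))
```

The same four fields appear in each form; only the layout changes, which is why such data can move between tabular, flat-file, and XML storage.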

CONTENTS
• Overview

• Public Databases
○ 1) Primary sequence databases

○ 2) Meta-databases

○ 3) Genome Databases

○ 4) Genome Browsers

○ 5) Protein sequence databases

○ 6) Protein structure Databases

○ 7) Protein-protein interactions

○ 8) Metabolic pathway Databases

○ 9) Microarray databases

○ 10) Mathematical Model Databases

○ 11) PCR / Real time PCR primer Databases

○ 12) Specialized databases

OVERVIEW

Biological databases have become an important tool in
assisting scientists to understand and explain a host of
biological phenomena, from the structure of biomolecules
and their interactions, to the whole metabolism of organisms,
to the evolution of species. This
knowledge helps in the fight against diseases, assists
in the development of medications, and aids in discovering basic
relationships among species in the history of life.
Biological knowledge is distributed among many
different general and specialized databases. This sometimes
makes it difficult to ensure the consistency of information.
Biological databases cross-reference other databases by
accession number as one way of linking their related
knowledge together.
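A minimal sketch of cross-referencing by accession number, and of the consistency problem it can expose, might look like this in Python. Both databases, their field names, and all accession numbers are hypothetical.

```python
# Two hypothetical databases keyed by made-up accession numbers.
nucleotide_db = {
    "N0001": {"gene": "geneA", "protein_xref": "P0001"},
    "N0002": {"gene": "geneB", "protein_xref": "P0099"},  # target is missing
}
protein_db = {
    "P0001": {"name": "protein A"},
}


def resolve(accession):
    """Follow the cross-reference from a nucleotide entry to its protein entry.

    Returns None when the referenced accession is absent from protein_db.
    """
    entry = nucleotide_db[accession]
    return protein_db.get(entry["protein_xref"])


def dangling_references():
    """List nucleotide accessions whose cross-reference has no target."""
    return [
        acc
        for acc, entry in nucleotide_db.items()
        if entry["protein_xref"] not in protein_db
    ]


if __name__ == "__main__":
    print(resolve("N0001"))
    print(dangling_references())
```

A check like `dangling_references` is one simple way to detect inconsistencies between separately maintained databases.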
An important resource for finding biological databases is a
special yearly issue of the journal Nucleic Acids Research
(NAR). The Database Issue of NAR is freely available, and
categorizes many of the publicly available online databases
related to biology and bioinformatics.

PUBLIC DATABASES

Primary sequence databases
The International Nucleotide Sequence Database (INSD)
consists of the following databases.
1. DDBJ (DNA Data Bank of Japan)
2. EMBL Nucleotide DB (European Molecular Biology
Laboratory)
3. GenBank [1] (National Center for Biotechnology
Information)
These databanks represent the current knowledge about the
sequences of all organisms. They interchange the stored
information and are the source for many other databases.
Meta Databases
A meta-database can be considered a database of
databases, rather than any one integration project or
technology. Meta-databases collect data from different sources and
usually make them available in a new and more convenient
form, or with an emphasis on a particular disease or
organism.
1. Entrez [2] (National Center for Biotechnology
Information)
2. euGenes (Indiana University)
3. GeneCards (Weizmann Inst.)
4. SOURCE (Stanford University)
5. mGen - Contains four of the world's biggest databases
(GenBank, RefSeq, EMBL and DDBJ) with easy,
program-friendly gene extraction.
6. Bioinformatics Harvester [3] (Karlsruhe Institute of
Technology) - Integrates 26 major protein/gene
resources.
7. MetaBase [4] (KOBIC) - A user-contributed database of
biological databases.
Genome Databases

These databases collect organism genome sequences,
annotate and analyze them, and provide public access. Some
add curation of experimental literature to improve computed
annotations. These databases may hold the genomes of many
species, or the genome of a single model organism.
1. CAMERA Resource for microbial genomics and
metagenomics
2. Corn, the Maize Genetics and Genomics Database
3. Ensembl provides automatic annotation databases for
human, mouse, other vertebrate and eukaryote
genomes.
4. ERIC (Enteropathogen Resource Integration Center)
Curated database containing annotated genome data for
five enteropathogens: Escherichia coli, Shigella,
Salmonella, Yersinia enterocolitica, and Y. pestis.
5. FlyBase, genome of the model organism Drosophila
melanogaster
6. MGI Mouse Genome (Jackson Lab.)
7. JGI Genomes of the DOE-Joint Genome Institute
provides databases of many eukaryote and microbial
genomes.
8. National Microbial Pathogen Data Resource. A
manually curated database of annotated genome data
for the pathogens Campylobacter, Chlamydia,
Chlamydophila, Haemophilus, Listeria, Mycoplasma,
Neisseria, Staphylococcus, Streptococcus, Treponema,
Ureaplasma, and Vibrio.
9. Saccharomyces Genome Database, genome of the yeast
model organism.
10. Viral Bioinformatics Resource Center Curated
database containing annotated genome data for eleven
virus families.
11. The SEED platform for microbial genome analysis
includes all complete microbial genomes, and most
partial genomes. The platform is used to annotate
microbial genomes using subsystems.
12. WormBase, genome of the model organism
Caenorhabditis elegans
13. Zebrafish Information Network, genome of this fish
model organism.
Genome Browsers
Genome Browsers enable researchers to visualize and
browse entire genomes (most have many complete
genomes) with annotated data including gene prediction and
structure, proteins, expression, regulation, variation,
comparative analysis, etc. Annotated data is usually from
multiple diverse sources.
1. Integrated Microbial Genomes (IMG) system by the
DOE-Joint Genome Institute
2. UCSC Genome Bioinformatics Genome Browser and
Tools (UCSC)
3. Ensembl The Ensembl Genome Browser (Sanger
Institute and EBI)
4. GBrowse The GMOD GBrowse Project
5. Pathway Tools Genome Browser
6. X:Map A genome browser that shows Affymetrix Exon
Microarray hit locations alongside the gene, transcript
and exon data, using the Google Maps API
7. Viral Genome Organizer (VGO) A genome browser
providing visualization and analysis tools for annotated
whole genomes from the eleven virus families in the
VBRC (Viral Bioinformatics Resource Center)
databases
8. Apollo Genome Annotation Curation Tool A
cross-platform, Java-based standalone genome viewer with
enterprise-level functionality and customizations. The
standard for many model organism databases.
9. SEED viewer for visualizing and interrogating the
SEED database of complete microbial genomes
Protein Sequence Databases
1. UniProt[5] Universal Protein Resource (UniProt
Consortium: EBI, Expasy, PIR)
2. PIR Protein Information Resource (Georgetown
University Medical Center (GUMC))
3. Swiss-Prot[6] Protein Knowledgebase (Swiss Institute
of Bioinformatics)
4. PEDANT Protein Extraction, Description and ANalysis
Tool (Forschungszentrum f. Umwelt & Gesundheit)
5. PROSITE Database of Protein Families and Domains

6. DIP Database of Interacting Proteins (Univ. of
California)
7. Pfam Protein families database of alignments and
HMMs (Sanger Institute)
8. ProDom Comprehensive set of Protein Domain
Families (INRA/CNRS)
9. SignalP 3.0 Server for signal peptide prediction
(including cleavage site prediction), based on artificial
neural networks and HMMs
10. SUPERFAMILY Library of HMMs representing
superfamilies and database of (superfamily and family)
annotations for all completely sequenced organisms
11. Annotation Clearing House a project from the National
Microbial Pathogen Data Resource
Protein Structure Databases
1. Protein Data Bank[7] (PDB) (Research Collaboratory
for Structural Bioinformatics (RCSB))
2. CATH Protein Structure Classification

3. SCOP Structural Classification of Proteins

4. SWISS-MODEL Server and Repository for Protein
Structure Models
5. ModBase Database of Comparative Protein Structure
Models (Sali Lab, UCSF)
Protein-Protein Interactions
1. BioGRID [8] A General Repository for Interaction
Datasets (Samuel Lunenfeld Research Institute)
2. STRING A database of known and
predicted protein-protein interactions (EMBL)
3. DIP Database of Interacting Proteins
4. BIND Biomolecular Interaction Network Database
Metabolic Pathway Databases
1. BioCyc Database Collection including EcoCyc and
MetaCyc
2. KEGG PATHWAY Database[9] (Univ. of Kyoto)
3. MANET database [10] (University of Illinois)
4. Reactome[11] (Cold Spring Harbor Laboratory, EBI,
Gene Ontology Consortium)
Microarray databases
1. ArrayExpress (European Bioinformatics Institute)
2. Gene Expression Omnibus (National Center for
Biotechnology Information)
3. maxd (Univ. of Manchester)
4. SMD (Stanford University)
5. GPX (Scottish Centre for Genomic Technology and
Informatics)
Mathematical Model Databases
1. CellML
2. Biomodels Database
PCR / Real time PCR primer Databases

1. PathoOligoDB: A free QPCR oligo database for
pathogens
Specialized Databases
A biological database is a large, organized body of
persistent data, usually associated with
computerized software designed to update, query,
and retrieve components of the data stored within the
system. A simple database might be a single file
containing many records, each of which includes the
same set of information. For example, a record
associated with a nucleotide sequence database
typically contains information such as contact name;
the input sequence with a description of the type of
molecule; the scientific name of the source organism
from which it was isolated; and, often, literature
citations associated with the sequence.
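The idea of a simple database as a single file of records that all carry the same set of fields can be sketched in Python as follows. The class name, field names, and one-JSON-object-per-line layout are assumptions chosen for illustration, not the format of any real sequence database.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class NucleotideRecord:
    """One record; the fields mirror the ones listed in the text above."""

    contact_name: str
    sequence: str
    molecule_type: str
    organism: str
    citations: list = field(default_factory=list)


def dump_records(records):
    """Serialize records one per line, as a simple single-file database."""
    return "\n".join(json.dumps(asdict(rec)) for rec in records)


def load_records(text):
    """Rebuild the records from the single-file text form."""
    return [
        NucleotideRecord(**json.loads(line))
        for line in text.splitlines()
        if line
    ]


if __name__ == "__main__":
    # Hypothetical example data.
    rec = NucleotideRecord(
        contact_name="J. Doe",
        sequence="ATGGCCTAA",
        molecule_type="DNA",
        organism="Escherichia coli",
        citations=["Hypothetical citation, 1994"],
    )
    print(dump_records([rec]))
```

Because every line holds the same fields, the file can be updated, queried, and read back without any further schema information.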

For researchers to benefit from the data stored in a
database, two additional requirements must be met:
1. Easy access to the information; and
2. A method for extracting only that information
needed to answer a specific biological question.
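The second requirement, extracting only the information needed, can be sketched as a small query function in Python. The records and field names are hypothetical, and a real database would use an index or a query language rather than a linear scan.

```python
# Hypothetical records; accession numbers and sequences are made up.
records = [
    {"accession": "X00001", "organism": "Escherichia coli",
     "molecule": "DNA", "sequence": "ATGAAATAA"},
    {"accession": "X00002", "organism": "Drosophila melanogaster",
     "molecule": "DNA", "sequence": "ATGCCCTGA"},
    {"accession": "X00003", "organism": "Escherichia coli",
     "molecule": "RNA", "sequence": "AUGGGGUAA"},
]


def query(recs, fields, **criteria):
    """Return only the requested fields of records matching all criteria."""
    return [
        {name: rec[name] for name in fields}
        for rec in recs
        if all(rec.get(key) == value for key, value in criteria.items())
    ]


if __name__ == "__main__":
    # Which E. coli entries exist, and what kind of molecule is each?
    print(query(records, ["accession", "molecule"], organism="Escherichia coli"))
```

The caller names both the filter (`organism=...`) and the fields wanted, so the result contains nothing beyond what the question requires.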
Much current bioinformatics work is concerned
with the technology of databases. These databases
include both "public" repositories of gene data like
GenBank or the Protein Data Bank (the PDB), and
private databases like those used by research
groups involved in gene mapping projects or those
held by biotech companies. Making such databases
accessible via open standards like the Web is very
important since consumers of bioinformatics data
use a range of computer platforms: from the more
powerful and forbidding UNIX boxes favoured by the
developers and curators to the far friendlier Macs
often found populating the labs of computer-wary
biologists. RNA and DNA are the nucleic acids that store
the hereditary information about an organism. These
macromolecules have a fixed structure, which can be
analyzed by biologists with the help of bioinformatics
tools and databases.
A few popular databases are GenBank from NCBI
(National Center for Biotechnology Information),
SwissProt from the Swiss Institute of Bioinformatics
and PIR from the Protein Information Resource.
GenBank:
GenBank (Genetic Sequence Databank) is one of
the fastest growing repositories of known genetic
sequences. It has a flat file structure, that is, an ASCII
text file readable by both humans and computers. In
addition to sequence data, GenBank files contain
information like accession numbers and gene
names, phylogenetic classification and references to
published literature. As of June 1994 it held approximately
191,400,000 bases in 183,000 sequences.
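How a program reads such a flat file can be sketched with a deliberately minimal Python parser. The sample record is invented, and the parser is an assumption-laden simplification: it handles only top-level keywords, the ORIGIN sequence block, and the `//` record terminator, and it ignores indented continuation lines and feature tables that real GenBank files contain.

```python
# Invented sample record in a simplified GenBank-like layout.
SAMPLE = """\
LOCUS       EXAMPLE1     12 bp    DNA
DEFINITION  Hypothetical example record.
ACCESSION   X00001
ORIGIN
        1 atggccattg ta
//
"""


def parse_flat_file(text):
    """Parse simplified flat-file text into a list of dicts, one per record."""
    parsed = []
    current = {}
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            # '//' closes the current record.
            parsed.append(current)
            current, in_sequence = {}, False
        elif line.startswith("ORIGIN"):
            in_sequence = True
            current["sequence"] = ""
        elif in_sequence:
            # Sequence lines mix position numbers, spaces, and bases;
            # keep only the letters.
            current["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line and not line[0].isspace():
            # A keyword starts in the first column; the rest is its value.
            keyword, _, value = line.partition(" ")
            current[keyword.lower()] = value.strip()
    return parsed


if __name__ == "__main__":
    for rec in parse_flat_file(SAMPLE):
        print(rec["accession"], len(rec["sequence"]))
```

The same text is readable by a person and trivially split into fields by a program, which is the property the flat file format is valued for.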
EMBL:
The EMBL Nucleotide Sequence Database is a
comprehensive database of DNA and RNA
sequences collected from the scientific literature and
patent applications and directly submitted by
researchers and sequencing groups. Data collection
is done in collaboration with GenBank (USA) and the
DNA Data Bank of Japan (DDBJ). The database
currently doubles in size every 18 months and,
as of June 1994, contains nearly 200 million bases
from 182,615 sequence entries.
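A doubling time of 18 months corresponds to exponential growth, size(t) = size(0) x 2^(t/18) with t in months. The short Python sketch below works this out; it assumes the doubling time stays constant, which is a simplification, and the function name is chosen here for illustration.

```python
def projected_size(current_size, months_ahead, doubling_months=18):
    """Project a size forward assuming a constant doubling time."""
    return current_size * 2 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    entries = 182_615  # entry count quoted in the text for June 1994
    for months in (18, 36, 54):
        print(months, round(projected_size(entries, months)))
```

Under this assumption the entry count would double after 18 months, quadruple after 36, and grow eightfold after 54.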
SwissProt:
This is a protein sequence database that provides a
high level of integration with other databases and
also has a very low level of redundancy (that is, few
identical sequences are present in the database).
PROSITE:
PROSITE is a dictionary of sites and patterns in
proteins, prepared by Amos Bairoch at the University
of Geneva.
EC-ENZYME:
The 'ENZYME' data bank contains the following data
for each type of characterized enzyme for which an
EC number has been provided: EC number,
recommended name, alternative names, catalytic
activity, cofactors, pointers to the SWISS-PROT
entries that correspond to the enzyme, and pointers to
diseases associated with a deficiency of the
enzyme.
RCSB PDB :
The RCSB PDB contains 3-D biological
macromolecular structure data from X-ray
crystallography, NMR, and cryo-EM. It is operated by
Rutgers, The State University of New Jersey and the
San Diego Supercomputer Center at the University
of California, San Diego.
GDB:
The GDB Human Genome Data Base supports
biomedical research, clinical medicine, and
professional and scientific education by providing for
the storage and dissemination of data about genes
and other DNA markers, map location, genetic
disease and locus information, and bibliographic
information.
OMIM:
The Mendelian Inheritance in Man data bank (MIM)
is prepared by Victor McKusick with the assistance
of Claire A. Francomano and Stylianos E.
Antonarakis at Johns Hopkins University.
PIR-PSD:
PIR (Protein Information Resource) produces and
distributes the PIR-International Protein Sequence
Database (PSD). It is the most comprehensive and
expertly annotated protein sequence database. The
PIR serves the scientific community through on-line
access, distributing magnetic tapes, and performing
off-line sequence identification services for
researchers. Release 40.00 (March 31, 1994) contains
67,423 entries and 19,747,297 residues.
Protein sequence databases are classified as
primary, secondary and composite depending upon
the content stored in them. PIR and SwissProt are
primary databases that contain protein sequences as
'raw' data. Secondary databases (like PROSITE)
contain information derived from protein
sequences. Primary databases are combined and
filtered to form non-redundant composite databases.
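The combine-and-filter step that produces a composite database can be sketched in Python as below. The two input databases and their entries are hypothetical, and exact sequence identity is used as the redundancy test only for simplicity; real composite databases may apply more elaborate criteria.

```python
def build_composite(*databases):
    """Merge primary databases, keeping the first entry per distinct sequence."""
    seen = set()
    composite = []
    for database in databases:
        for entry in database:
            key = entry["sequence"].upper()  # compare case-insensitively
            if key not in seen:
                seen.add(key)
                composite.append(entry)
    return composite


if __name__ == "__main__":
    # Hypothetical entries; identifiers and sequences are made up.
    primary_a = [
        {"id": "A1", "sequence": "MKTAYIAK"},
        {"id": "A2", "sequence": "MSLLTEVE"},
    ]
    primary_b = [
        {"id": "B1", "sequence": "mktayiak"},  # same sequence as A1
        {"id": "B2", "sequence": "MGDVEKGK"},
    ]
    for entry in build_composite(primary_a, primary_b):
        print(entry["id"], entry["sequence"])
```

Entries from earlier databases take precedence, so the order of the arguments decides which copy of a duplicated sequence survives.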
Genethon Genome Databases:
PHYSICAL MAP: computation of the human genetic
map using DNA fragments in the form of YAC
contigs.
GENETIC MAP: production of micro-satellite
probes and the localization of chromosomes, to
create a genetic map to aid in the study of hereditary
diseases.
GENEXPRESS (cDNA): catalogue of the
transcripts required for protein synthesis obtained
from specific tissues, for example neuromuscular
tissues.
21Bdb: LBL's Human Chr 21 database:
This is a W3 interface to LBL's ACeDB-style
database for Chromosome 21, 21Bdb, using the
ACeDB gateway software developed and provided
by Guy Decoux at INRA.
MEDLINE:
MEDLINE is NLM's premier bibliographic database
covering the fields of medicine, nursing, dentistry,
veterinary medicine, and the preclinical sciences.
Journal articles are indexed for MEDLINE, and their
citations are searchable, using NLM's controlled
vocabulary, MeSH (Medical Subject Headings).

1) www.ncbi.nlm.nih.gov
2) www.wikipedia.org/wiki/sanger_institute
3) Biotechnology, U. Satyanarayana
