Dr. Ashok Sharma Head, Bioinformatics and Co-ordinator, Bioinformatics Centre Central Institute of Medicinal and Aromatic Plants PO. CIMAP, Lucknow-226015, India. Web site: www.cimap.res.in E-mail: ashoksharma@cimap.res.in
Sequences
Biological Knowledge
Bioinformatics
Databases
Bioinformatics:
Why
What
If you are one of the many biologists for whom genome databases are as comprehensible as a mass of supermarket barcodes, it is a good time to team up with a friendly bioinformaticist and join the action before it is too late.
If biologists do not adapt to the powerful computational tools needed to exploit huge data sets, they could find themselves floundering in the wake of advances in genomics.
It is predicted that the potential to integrate different levels of genomic data such as raw sequence from the human genome and those of model organisms, data on genetic variability between individuals and on gene expression in different tissues will radically change biological research. It is also agreed that small experiments driven by individual investigators will give way to a world in which multidisciplinary teams, sharing huge online data sets, emerge as key players.
Multidisciplinary teams sharing huge online data will be the key players
Era of systems biology: the ability to create mathematical models describing the function of networks of genes and proteins is just as important as traditional lab skills.
Outcome of this natural selection will see many current top scientists, research groups and even whole institutes relegated to the second division
In the long run, the change will come through the emergence of a new breed of biologists who are steeped in computational biology as an integral part of their education. This means that the subject must be included as a core module in all undergraduate biology courses, rather than as a specialist option. Although this is starting to happen, the availability of teachers with the appropriate expertise is still a limiting factor.
The emerging new breed
So, if the majority of biologists are not to be disenfranchised, what is the solution? The emergence of a new breed of biologists who are steeped in computational biology as an integral part of their education. Limiting factor: availability of teachers with the appropriate expertise.
Complete map of interactions between some 1,000 proteins in two types of cells
Modern Biology and particularly Biotechnology are very much information-dependent fields. In fact, the symbiosis between information technology and biotechnology today is as intricately entwined as the two strands of the genetic material that make up the DNA helix.
The Human Genome Project and other genome projects, such as the sequencing of bacterial and yeast genomes, have produced enormous amounts of DNA sequence data.
Large-scale biological research involving micro-sequencing of proteins, 2D gel patterns of proteins and polypeptides, metabolic pathways, physical and genetic maps of organisms, cell line information, microbial strain data, etc. has been responsible for the unprecedented growth of biological data.
Projects such as Species-2000, the global plant checklist, information on the release of organisms into the environment, and Animal Virus Information are producing hard data at the species level in multimedia format.
The rate of growth of the biological data is estimated to be more than 200 million base pairs per year.
Nucleotide and protein sequences are not the only data that are accumulating rapidly. The number of characterized genes from a variety of organisms and the number of solved protein structures are also doubling every two years.
The enormous growth of biological data and its availability in the major international databases is serving as a source of knowledge to the life scientists.
The whole paradigm shift in molecular biology towards data-intensive research in search of useful genes is basically due to the fact that genetic data is becoming the major driving force in drug discovery, protein engineering, design of new molecules, and other related areas.
The large stores of biological data are holding the promise to serve as the Discovery Super Highway for innovations in biotechnology through a process of analysis and transformation of molecular and structural data into biological knowledge for prosperity.
In the face of the challenges imposed by the growing size and complexity of biological data, a new discipline of science, known as Bioinformatics, has emerged in the recent past.
Bioinformatics deals with the various issues related to the biological data. It also covers the development of data analysis tools, modeling of biological macromolecules and their complexes, metabolic pathways, designing of new molecules such as drugs, peptide vaccines, proteins, etc.
Gradually, Bioinformatics has evolved to deal with four related but still distinct problem areas, viz.:
a) Handling and management of biological data, including its organization, control, linkages, analysis, and so forth.
b) Communication among people, projects, and institutions engaged in biological research and applications. The communication may include e-mail, file transfer, remote login, computer conferencing, electronic bulletin boards, or establishment of web-based information resources.
c) Organization, access, search and retrieval of biological information, documents, and literature.
d) Analysis and interpretation of biological data through computational approaches including visualization, mathematical modeling, and development of algorithms for highly parallel processing of complex biological structure.
Bioinformatics may be defined as a scientific discipline that encompasses all aspects of biological information, viz., acquisition, processing, storage, distribution, analysis and interpretation, and that combines the tools and techniques of mathematics, computer science, and biology with the aim of understanding the biological significance of a variety of data.
Bioinformatics has acquired great importance due to its application in the genome projects. The target of decoding the three billion base pairs of human DNA has become achievable only through the use of various innovative techniques and methods evolved by Bioinformatics scientists. Bioinformatics has become an essential component of biotechnology-based product and process development. The process of drug design and development is expensive and time-consuming. The application of the tools and techniques of Bioinformatics has reduced the cost and the development cycle of drugs. This aspect has a tremendous impact on society. If a newly discovered drug is a life-saving one, then the resulting gains are not only in terms of financial savings but also in saving the lives of several million people. Major pharmaceutical and biotechnology companies have set up large R&D groups in Bioinformatics.
Bioinformatics is a multidisciplinary subject. Though only about a decade old, it has become very important for the growth of biosciences, biotechnology, and the economic prosperity of nations. Three well-identified divisions of Bioinformatics may be considered: a) Molecular Bioinformatics, b) Cellular and sub-cellular Bioinformatics, and c) Organismic and community Bioinformatics.
ii. To provide a computer-based information storage and retrieval system of databases that collects structured information generated by research and industrial institutions in the identified fields of biotechnology, continually update the databases and make the information available to the users.
iii. An active network mode, in which the scientists get access to the biotechnology community in the identified areas, answer requests for information in an interactive and discussive mode and actively initiate dialogue among groups with common interest.
iv. To provide retrieval service either online or offline in their specialized areas and to give overall information support even in areas other than those assigned to them.
v. To provide a communication link with international databases for selective bibliographic information for the user scientist.
vi. To develop software packages and databases specific to user needs.
vii. To conduct training courses in the specialized areas periodically to meet the special requirements of manpower development in the area and to promote awareness about the computerized storage and retrieval facility among bio scientists and information scientists.
Bioinformatics What?
A mixture of biochemistry, molecular biology, and computer science.
Obtaining, storing, organizing, and analyzing biological and genetic information to understand its activity in living organisms.
The main goal is to convert a multitude of complex data into useful information and knowledge.
Data include gene and protein sequences, cDNA, and nucleotide sequences.
Data come from gene sequencing, combinatorial chemical synthesis, gene-expression investigations, pharmacogenomics, proteomic studies, and other methods of study.
The information is used to build synthetic and predictive models, allowing scientists to better understand complex living systems.
Future applications lie in biology, chemistry, pharmaceuticals, medicine, and agriculture.
TECHNOLOGIES IN BIOINFORMATICS
Data-acquisition Systems
These are required mainly at research labs generating large amounts of data. These systems include inventory control software for tracking hundreds of thousands of reagents, gels and other materials; reagent manipulation software; robotic systems to carry out high-volume, high-precision laboratory manipulation in genome research; and sequence production software that will help improve sequencing.
Data Analysis Systems
Studying sequences, predicting protein structure and comparing genomes on an extensive scale all require informatics tools such as sequence analysis software that performs alignments, detects homologies, identifies coding regions and extracts features. Protein folding software addresses how genetic information is transformed into function via proteins whose functional specificities are determined by their 3-D shapes. Genetic mapping software systems play a key role in the analysis of genetic mapping data. Classification software extracts features from DNA sequences, places proteins into gene families and tracks protein motifs.
Data-management Systems
Various genome projects are generating information that cannot be accommodated by traditional publishing. Electronic data management and publishing systems are crucial components of genomic research.
CHALLENGES IN BIOINFORMATICS
Bioinformatics, which is the intersection of Information Technology and Mathematics with molecular biology / genetics, has created several challenges for the Computer Science community.
Information Storage
Storing huge amounts of genetic information, amenable to rapid access and manipulation, is a great challenge. One million bases (1 Mb) occupy roughly one megabyte (1 MB). Thus, one would require 3 gigabytes (3 GB) of computer data storage space to store the entire human genome, comprising three gigabases (3 Gb). This includes nucleotide sequence data only and does not include data annotations and other information associated with the sequence data. With time, more annotations, entered either
(a) by scientists as a result of laboratory findings, literature searches, data analysis, or personal communications, and/or
(b) as a result of automated data analysis programs or autoannotators,
will be associated with the sequence data, increasing the storage requirements significantly beyond the 3 GB for the human genome.
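The storage arithmetic above can be checked with a short calculation. This is only a sketch: the one-byte-per-base figure follows the rule of thumb in the text, while the packed two-bits-per-base figure is an added assumption (four letters need only two bits each) and is not a claim about how any particular database stores sequences.

```python
# Storage estimate for raw nucleotide sequence (annotations excluded).
# The 1 byte/base case follows the text; the 2 bits/base case is an
# illustrative assumption about a packed A/C/G/T encoding.

HUMAN_GENOME_BASES = 3_000_000_000  # three gigabases, as stated in the text


def storage_bytes(n_bases: int, bits_per_base: int = 8) -> int:
    """Bytes needed to store n_bases, rounded up to whole bytes."""
    return (n_bases * bits_per_base + 7) // 8


one_byte_per_base = storage_bytes(HUMAN_GENOME_BASES)      # one character per base
two_bits_per_base = storage_bytes(HUMAN_GENOME_BASES, 2)   # packed encoding

print(f"1 byte/base : {one_byte_per_base / 1e9:.2f} GB")
print(f"2 bits/base : {two_bits_per_base / 1e9:.2f} GB")
```

Either way, the figure covers sequence only; annotation storage comes on top of it.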
The development of database management systems is an active subspecialty of computer science. The need to organise terabytes of heterogeneous biological data in a form that is easily usable, and which employs sophisticated data visualisation capabilities, is essential for progress in modern biology.
Genetic data cannot be analysed efficiently without computer systems. Studying sequences, predicting protein structures, and comparing genomes on an extensive scale all require additional informatics tools, such as:
Sequence analysis is so far the best known, best established area of bioinformatics. Performing alignments, detecting homologies, identifying coding regions, extracting features, and other computerised analyses of sequences are now performed as routine. At the same time, sequence analysis is a multi-faceted and biologically profound area of research, demanding much continued work.
Genetic information is transformed into function via proteins, whose functional specificities are determined by their three-dimensional shapes. Prediction of protein structure from amino acid sequences is an important and challenging problem. Computation plays an increasingly central role in the assembly and integration of large maps composed of different kinds and combinations of data. As the genome projects mature and large amounts of genomic information become available for a number of species, comparative genomics is emerging as an active area of study. It includes methods for mapping genes to their physical locations on the genome; searching for related genes; analysing the database to find families of related genes and to understand their coordinated expression; and finding correlations between specific diseases and the expression of related genes.
Gene Mining
SERENDIPITY EFFECT
One of the most exciting aspects of the information revolution is that it allows us to combine many different items of information and many different kinds of information on a scale never seen before. Large international databases, for instance, include contributions from thousands of different sources. Also, the hypertext links (Information Super Highway) between sites make it possible to draw together many different kinds of information that bear on a particular problem. These activities not only promote collaboration on a truly vast scale, they also enrich research. One important effect is the serendipity effect: combining different datasets makes possible entirely new kinds of study. New studies inevitably lead to new and unexpected discoveries.
Computational Methods
A core set of computational approaches has emerged for dealing with the types of data that are currently shared in public databases: DNA, protein sequence, and protein structure.
Using public databases and data formats
The first key skill for biologists is to learn to use online search tools to find information. Literature searching is no longer a matter of looking up references in a printed index. You can find links to most of the scientific publications you need online. There are central databases that collect reference information so you can search dozens of journals at once. You can even set up agents that notify you when new articles are published in an area of interest. Searching the public molecular-biology databases requires the same skills as searching for literature references: you need to know how to construct a query statement that will pluck the particular needle you're looking for out of the database haystack.
Sequence alignment and sequence searching
Being able to compare pairs of DNA or protein sequences and extract partial matches has made it possible to use a biological sequence as a database query. Sequence-based searching is another key skill for biologists; a little exploration of the biological databases at the beginning of a project often saves a lot of valuable time in the lab. Identifying homologous sequences provides a basis for phylogenetic analysis and sequence-pattern recognition. Sequence-based searching can be done online through web forms, so it requires no special computing skills, but to judge the quality of your search results you need to understand how the underlying sequence-alignment method works and go beyond simple sequence alignment to other types of analysis.
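To make the idea of extracting partial matches concrete, here is a minimal sketch of local alignment scoring in the style of the Smith-Waterman dynamic-programming method. The scoring values (match +2, mismatch -1, gap -2) and the function name are illustrative assumptions; production search tools use heuristics and substitution matrices that this sketch does not attempt to reproduce.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (Smith-Waterman style).

    Only the score is returned, not the alignment itself, so two rows of
    the dynamic-programming table are enough.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two hypothetical sequences that share only the core "ACGT".
print(local_alignment_score("GGGACGTCCC", "TTTACGTAAA"))
```

The floor at zero is what distinguishes local from global alignment: unrelated flanking regions do not drag down the score of a well-matching core.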
Gene prediction
Gene prediction is only one of a cluster of methods for attempting to detect meaningful signals in uncharacterized DNA sequences. Until recently, most sequences deposited in GenBank were already characterized at the time of deposition. That is, someone had already gone in and, using molecular biology, genetic, or biochemical methods, figured out what the gene did. However, now that the genome projects are in full swing, there's a lot of DNA sequence out there that isn't characterized. Software for prediction of open reading frames, genes, exon splice sites, promoter binding sites, repeat sequences, and tRNA genes helps molecular biologists make sense out of this unmapped DNA.
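The simplest of the signals listed above, the open reading frame, can be sketched in a few lines. The version below is deliberately naive: it scans only the three forward-strand frames, treats ATG as the only start codon, and uses a hypothetical minimum length parameter. Real gene finders also handle the reverse strand, introns, and statistical models of coding sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int, int]]:
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is inclusive, end is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next start codon
    return orfs


example = "CCATGAAATTTTAGGG"   # hypothetical sequence, ORF in frame 2
for start, end, frame in find_orfs(example):
    print(frame, example[start:end])
```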
Phylogenetic analysis
Phylogenetic analysis attempts to describe the evolutionary relatedness of a group of sequences. A traditional phylogenetic tree or cladogram groups species into a diagram that represents their relative evolutionary divergence. Branchings of the tree that occur furthest from the root separate individual species; branchings that occur close to the root group species into kingdoms, phyla, classes, families, genera, and so on. The information in a molecular sequence alignment can be used to compute a phylogenetic tree for a particular family of gene sequences. The branchings in phylogenetic trees represent evolutionary distance based on sequence similarity scores or on information-theoretic modeling of the number of mutational steps required to change one sequence into another. Phylogenetic analyses of protein sequence families talk not about the evolution of the entire organism but about evolutionary change in specific coding regions.
Protein structure property analysis
Protein structures have many measurable properties that are of interest to crystallographers and structural biologists. Protein structure validation tools are used by crystallographers to measure how well a structure model conforms to structural rules extracted from existing structures or chemical model compounds. These tools may also analyze the fitness of every amino acid in a structure model for its environment, flagging such oddities as buried charges with no countercharge or large patches of hydrophobic amino acids found on a protein surface. These tools are useful for evaluating both experimental and theoretical structure models.
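Distance-based tree building starts from a table of pairwise distances computed from an alignment. The sketch below computes the simplest such measure, the proportion of differing sites (often called the p-distance), for three made-up aligned sequences. The sequence names and data are hypothetical, and the step from a distance table to an actual tree is not shown.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned, ungapped positions at which a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no ungapped positions to compare")
    return sum(x != y for x, y in compared) / len(compared)


def distance_table(aligned: dict[str, str]) -> dict[tuple[str, str], float]:
    """Pairwise p-distances for every pair of named sequences."""
    return {(m, n): p_distance(aligned[m], aligned[n])
            for m, n in combinations(sorted(aligned), 2)}


# Hypothetical aligned sequences, for illustration only.
aligned = {"seqA": "ACGTACGT", "seqB": "ACGTACGA", "seqC": "TCGAACTA"}
table = distance_table(aligned)
closest = min(table, key=table.get)   # the pair a clustering method would join first
print(table, closest)
```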
Biochemical simulation
Biochemical simulation uses the tools of dynamical systems modeling to simulate the chemical reactions involved in metabolism. Simulations can extend from individual metabolic pathways to transmembrane transport processes and even properties of whole cells or tissues. Biochemical and cellular simulations traditionally have relied on the ability of the scientist to describe a system mathematically, developing a system of differential equations that represent the different reactions and fluxes occurring in the system. However, new software tools can build the mathematical framework of a simulation automatically from a description provided interactively by the user, making mathematical modeling accessible to any biologist who knows enough about a system to describe it according to the conventions of dynamical systems modeling.
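As a small illustration of what "a system of differential equations" means here, the sketch below integrates a hypothetical two-step pathway A -> B -> C with first-order kinetics, using the forward Euler method. The rate constants, step size, and pathway are assumptions chosen for illustration; real simulation packages use more robust integrators and richer rate laws.

```python
def simulate_chain(k1: float, k2: float, a0: float,
                   dt: float = 0.001, t_end: float = 10.0) -> tuple[float, float, float]:
    """Integrate dA/dt = -k1*A, dB/dt = k1*A - k2*B, dC/dt = k2*B by forward Euler.

    Returns the concentrations (A, B, C) at time t_end.
    """
    a, b, c = a0, 0.0, 0.0
    for _ in range(int(round(t_end / dt))):
        r1 = k1 * a          # flux through the first reaction
        r2 = k2 * b          # flux through the second reaction
        a -= r1 * dt
        b += (r1 - r2) * dt
        c += r2 * dt
    return a, b, c


a, b, c = simulate_chain(k1=1.0, k2=0.5, a0=1.0)
print(f"A={a:.5f}  B={b:.5f}  C={c:.5f}")
```

Because every unit of flux leaving one pool enters the next, the total A + B + C stays equal to the starting amount, which is a quick sanity check on any such model.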
Primer design
Many molecular biology protocols require the design of oligonucleotide primers. Proper primer design is critical for the success of polymerase chain reaction (PCR), oligo hybridization, DNA sequencing, and microarray experiments. Primers must hybridize with the target DNA to provide a clear answer to the question being asked, but they must also have appropriate physicochemical properties; they must not self-hybridize or dimerize; and they should not have multiple targets within the sequence under investigation. There are several web-based services that allow users to submit a DNA sequence and automatically detect appropriate primers, or to compute the properties of a desired primer DNA sequence.
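A few of the checks described above can be sketched directly. The melting-temperature estimate uses the Wallace rule of thumb (2 °C per A/T, 4 °C per G/C), which is only a rough guide for short oligonucleotides; the 3'-end self-complementarity test and its window of four bases are simplifying assumptions, not the criteria of any particular primer-design service.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_fraction(seq: str) -> float:
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq: str) -> int:
    """Rough melting temperature in degrees C by the Wallace rule."""
    seq = seq.upper()
    return 2 * (seq.count("A") + seq.count("T")) + 4 * (seq.count("G") + seq.count("C"))


def three_prime_self_match(primer: str, n: int = 4) -> bool:
    """True if the last n bases could pair with another part of the same primer."""
    primer = primer.upper()
    return reverse_complement(primer[-n:]) in primer


def count_exact_sites(primer: str, template: str) -> int:
    """Number of (possibly overlapping) exact occurrences of primer in template."""
    primer, template = primer.upper(), template.upper()
    count, pos = 0, template.find(primer)
    while pos != -1:
        count += 1
        pos = template.find(primer, pos + 1)
    return count


primer = "ATGCATGC"   # hypothetical primer
print(gc_fraction(primer), wallace_tm(primer), three_prime_self_match(primer))
```

To check the opposite strand of a template, the same site count can be run on `reverse_complement(template)`.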
Proteomics analysis
Before they're ever crystallized and biochemically characterized, proteins are often studied using a combination of gel electrophoresis, partial sequencing, and mass spectrometry. 2-D gel electrophoresis can separate a mixture of thousands of proteins into distinct components; the individual spots of material can be blotted or even cut from the gel and analyzed. Simple computational tools can provide some information to aid in the process of analyzing protein mixtures. It's trivial to compute molecular weight and pI from a protein sequence; by using these values in combination, sets of candidate identities can be found for each spot on a gel. It's also possible to compute, from a protein sequence, the peptide fingerprint that is created when that protein is broken down into fragments by enzymes with specific protein cleavage sites.
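The peptide fingerprint idea can be illustrated with an in-silico digest. The sketch below assumes trypsin as the enzyme and applies its commonly cited rule: cleave after lysine (K) or arginine (R), except when the next residue is proline (P). Missed cleavages and fragment masses are left out, and the example sequence is made up.

```python
def tryptic_digest(protein: str) -> list[str]:
    """Split a protein sequence into peptides using the simple trypsin rule.

    Cleaves on the C-terminal side of K or R unless the following residue is P.
    """
    protein = protein.upper()
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in ("K", "R") and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])   # C-terminal remainder
    return peptides


# Hypothetical sequence: the R before P is not cleaved.
print(tryptic_digest("MKWVTFRPLLKA"))
```

The list of fragment lengths (or, with a residue mass table, fragment masses) is the computed fingerprint that would be compared against experimental data.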
Databases
The internet is a powerful resource containing a large volume of data and tools to manipulate them; unfortunately, connecting data between them can sometimes be tricky.
What is a database?
An organized body of related information.
A collection of information organized and presented to serve a specific purpose.
A computerized database is an updated, organized file of machine-readable information that is rapidly searched and retrieved by computer.
A computerized storehouse of data (records) that:
- allows user-defined queries
- allows extraction of specified records
- allows adding, changing, removing, and merging of records
- uses standardized formats
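The operations listed above (adding, changing, removing, and querying records) can be tried out with Python's built-in sqlite3 module. The table layout, accession numbers, and values below are hypothetical and only serve to show each operation once; they do not reflect the schema of any real sequence database.

```python
import sqlite3

# In-memory database so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences (accession TEXT PRIMARY KEY, organism TEXT, length INTEGER)"
)

# Adding records (made-up accessions).
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    [
        ("DEMO0001", "Homo sapiens", 1200),
        ("DEMO0002", "Homo sapiens", 800),
        ("DEMO0003", "Saccharomyces cerevisiae", 950),
    ],
)

# Changing a record.
conn.execute("UPDATE sequences SET length = ? WHERE accession = ?", (1250, "DEMO0001"))

# Removing a record.
conn.execute("DELETE FROM sequences WHERE accession = ?", ("DEMO0003",))

# A user-defined query that extracts specified records.
rows = conn.execute(
    "SELECT accession, length FROM sequences WHERE organism = ? ORDER BY accession",
    ("Homo sapiens",),
).fetchall()
print(rows)
```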
Nucleotide Sequence Databases RNA sequence databases Protein sequence databases Structure Databases Genomics Databases (non-vertebrate) Metabolic and Signaling Pathways Human and other Vertebrate Genomes Human Genes and Diseases Microarray Data and other Gene Expression Databases Proteomics Resources Other Molecular Biology Databases Organelle databases Plant databases Immunological databases
The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the scientific community and making it freely available. The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. These data are heterogeneous; they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single pass sequences), the extent of sequence annotation and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). The nucleotide databases are distributed free of charge over the internet.
DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal synchronisation. The result is that they contain exactly the same information, except for sequences that have been added in the last 24 hours. Nucleotide sequence databases can be further subdivided into the following:
1) International Nucleotide Sequence Database Collaboration
2) Coding and non-coding DNA
3) Gene structure, introns and exons, splice sites
4) Transcriptional regulator sites and transcription factors
Database name    URL
GenBank          http://www.ncbi.nlm.nih.gov/
EMBL-Bank        http://www.ebi.ac.uk/embl.html
DDBJ             http://www.ddbj.nig.ac.jp
Online databases
Primary repositories of sequence data:
- European Bioinformatics Institute (EBI)
- DNA Data Bank of Japan (DDBJ)
- GenBank, National Center for Biotechnology Information (NCBI)
Each of these databases contains equivalent information (formats vary slightly).
1.2. DNA sequences: genes, motifs and regulatory sites
1.2.1. Coding and non-coding DNA
ACLAME: A classification of genetic mobile elements. http://aclame.ulb.ac.be/
CUTG: Codon usage tabulated from GenBank. http://www.kazusa.or.jp/codon/
Deviations from the standard genetic code in various organisms and organelles. http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c
Human endogenous retrovirus database. http://herv.img.cas.cz
Immunoglobulin, T cell receptor and MHC nucleotide sequences from human and other vertebrates. http://imgt.cines.fr/cgi-bin/IMGTlect.jv
http://www.otago.ac.nz/IGC
Islander MICdb
http://www.indiana.edu/islander http://www.cdfd.org.in/micas
Short tandem DNA repeats database Organism-specific databases of EST and gene sequences
http://www.cstl.nist.gov/div831/strbase/ http://www.tigr.org/tdb/tgi.shtml
Transterm
http://uther.otago.ac.nz/Transterm.html
UniGene
http://www.ncbi.nlm.nih.gov/UniGene/
UniVec
Vector sequences, adapters, linkers and primers used in DNA cloning, can be used to check for vector contamination
VectorDB
Characterization and classification of nucleic acid vectors Eukaryotic protein-encoding DNA sequences, both intron-containing and intron-less genes
Xpro
http://origin.bic.nus.edu.sg/xpro/
ASAP
http://www.bioinformatics.ucla.edu/ASAP
ASD
EBI's alternative splicing database project includes three databases: AltSplice, AltExtron and AEdb
http://www.ebi.ac.uk/asd
Alternative splicing database: protein products and expression patterns of alternatively-spliced genes Extended alternatively spliced EST database Exon-intron database: introns in protein-coding genes Exon-intron structure of eukaryotic genes Homo sapiens splice sites dataset
IDB/IEDB
Intron sequence and evolution databases Introns and alternative splicing in C.elegans and C.briggsae
http://nutmeg.bio.indiana.edu/intron/index.html http://www.cse.ucsc.edu/kent/intronerator/
Intronerator
SpliceDB
http://genomic.sanger.ac.uk/spldb/SpliceDB.html
SpliceNest
http://splicenest.molgen.mpg.de/
YIDB
http://www.embl-heidelberg.DE/ExternalInfo/seraphin/yidb.html
DPInteract
http://arep.med.harvard.edu/dpinteract
EPD
Eukaryotic promoter database Hematopoietic promoter database: transcriptional regulation in hematopoiesis Primate mitochondrial DNA control region sequences PSSMs for transcription factor DNA-binding sites Plant cis-acting regulatory DNA elements
PlantCARE PlantProm
Plant promoters and cis-acting regulatory elements Plant promoter sequences for RNA polymerase II
http://intra.psb.ugent.be:8080/PlantCARE/ http://mendel.cs.rhul.ac.uk/
16S and 23S ribosomal RNA mutations 5S rRNA sequences Small RNA/DNA molecules binding nucleic acids, proteins AU-rich element-containing mRNA database A database of group II introns, self-splicing catalytic RNAs All complete or nearly complete rRNA sequences Genomic tRNA database
RNA editing in various kinetoplastid species HIV RNA sequences Hybrid pattern library: structural elements in classes of RNA Internal ribosome entry site database
HyPaLib IRESdb
ncRNAs Database
http://biobases.ibch.poznan.pl/ncRNA/
PLANTncRNAs
http://www.prl.msu.edu/PLANTncRNAs
Plant snoRNA DB
http://www.scri.sari.ac.uk/plant_snoRNA/
PLMItRNA
PseudoBase RDP
http://www.sanger.ac.uk/Software/Rfam/ http://ulises.umh.es/RISSC
http://medlib.med.utah.edu/RNAmods/ http://rrndb.cme.msu.edu/
http://mbcr.bcm.tmc.edu/smallRNA
SRPDB
http://psyche.uthct.edu/dbs/SRPDB/SRPDB.html
http://subviral.med.uottawa.ca/cgi-bin/home.cgi
tmRNA Website
http://www.indiana.edu/tmrna
tmRDB
tmRNA database
http://psyche.uthct.edu/dbs/tmRDB/tmRDB.html
tRNA viewer and sequence editor 5'- and 3'-UTRs of eukaryotic mRNAs
http://www.uni-bayreuth.de/departments/biochemie/trna/ http://bighost.area.ba.cnr.it/srs6/
All protein sequences: translated from GenBank and imported from other protein databases
http://www.ncbi.nlm.nih.gov/entrez
PIR
Protein information resource: a collection of protein sequence databases, part of the UniProt project
PIR-NREF
PRF
Protein research foundation database of peptides: sequences, literature and unnatural amino acids
http://www.prf.or.jp/en
Swiss-Prot
Curated protein sequence database with a high level of annotation (protein function, domain structure, modifications) Translations of EMBL nucleotide sequence entries: computer-annotated supplement to Swiss-Prot
http://www.expasy.org/sprot
TrEMBL
http://www.expasy.org/sprot
UniProt
Universal protein knowledgebase: a database of protein sequence from Swiss-Prot, TrEMBL and PIR
http://www.uniprot.org/
AAindex ProTherm
Physicochemical properties of amino acids Thermodynamic data for wild-type and mutant proteins
DBSubLoc
MitoDrome
http://bighost.area.ba.cnr.it/BIG/MitoDrome
NESbase NLSdb
http://www.cbs.dtu.dk/databases/NESbase http://cubic.bioc.columbia.edu/db/NLSdb/
THGS
http://pranag.physics.iisc.ernet.in/thgs/
TMPDB
http://bioinfo.si.hirosaki-.ac.jp/TMPDB/
Blocks
http://blocks.fhcrc.org/
CSA
Catalytic site atlas: enzyme active sites and catalytic residues in enzymes of known 3D structure
http://www.ebi.ac.uk/thornton-srv/databases/CSA/
COMe
Co-ordination of metals etc.: classification of bioinorganic proteins (metalloproteins and some other complex proteins)
http://www.ebi.ac.uk/come
eMOTIF
http://motif.stanford.edu/emotif
http://metallo.scripps.edu/
O-GlycBase
http://www.cbs.dtu.dk/databases/OGLYCBASE/
PhosphoBase
http://www.cbs.dtu.dk/databases/PhosphoBase/
PROMISE
http://metallo.scripps.edu/PROMISE
PROSITE
http://www.expasy.org/prosite
Conserved domain database: includes protein domains from Pfam, SMART and COG databases Clusters of Swiss-Prot+TrEMBL proteins A database of protein domains and motifs Integrated resource of protein families, domains and functional sites
InterPro
http://www.ebi.ac.uk/interpro
iProClass MetaFam
Integrated protein classification database Database of protein family annotations Protein families: multiple sequence alignments and profile hidden Markov models of protein domains
http://pir.georgetown.edu/iproclass/ http://metafam.ahc.umn.edu/
Pfam
http://www.sanger.ac.uk/Software/Pfam/
PIRSF
PRINTS
PIR-ALN
Curated database of protein sequence alignments Protein families defined by PIR superfamilies and PROSITE patterns
ProClass
ProDom
http://www.toulouse.inra.fr/prodom.html
Hierarchical classification of Swiss-Prot proteins Hierarchical clustering of Swiss-Prot proteins Protein domain sequences and tools
SMART
Simple modular architecture research tool: signalling, extracellular and chromatin-associated protein domains
http://smart.embl-heidelberg.de/
SUPFAM
http://pauling.mbu.iisc.ernet.in/supfam
SYSTERS
http://systers.molgen.mpg.de/
TIGRFAMs
http://www.tigr.org/TIGRFAMs
ABCdb
http://ir2lcb.cnrs-mrs.fr/ABCdb/
ASPD
http://wwwmgs.bionet.nsc.ru/mgs/gnw/aspd/
BacTregulators
http://www.helicase.net/dexhd/dbhome.htm
http://www.tumor-gene.org/GPCR/gpcr.html
Esterases and other alpha/beta hydrolase enzymes Families of proteins functioning in the eye G protein-coupled receptors database
http://research.nhgri.nih.gov/histones/
HIV epitopes
http://hiv-web.lanl.gov/immunology/
http://hivdb.stanford.edu/ http://www.biosci.ki.se/groups/tbu/homeo.html
Homeobox Page
http://research.nhgri.nih.gov/homeodomain
Human olfactory receptor data exploratorium Inteins (protein splicing elements) database: properties, sequences, bibliography Sequences of proteins of immunological interest
http://bioinfo.weizmann.ac.il/HORDE/
http://www.neb.com/neb/inteins.html http://immuno.bme.nwu.edu/
KinG Knottins
Ser/Thr/Tyr-specific protein kinases encoded in complete genomes Database of knottins: small proteins with an unusual disulfide through disulfide knot
http://hodgkin.mbu.iisc.ernet.in/king http://knottin.cbs.cnrs.fr
http://www.pasteur.fr/recherche/banques/LGIC/LGIC.html
http://www.led.uni-stuttgart.de/
Mammalian, invertebrate, plant and fungal lipoxygenases Database of proteolytic enzymes (peptidases) MHC-binding peptides Mitochondrial protein import machinery of plants Nuclear protein database Nuclear receptor superfamily
NUREBASE
http://ycmi.med.yale.edu/senselab/ordb/
http://www.ifti.org/ootfd
PKR
Protein kinase resource: sequences, enzymology, genetics and molecular and structural properties
http://pkr.sdsc.edu/
PLANT-PIs
http://bighost.area.ba.cnr.it/PLANT-PIs http://plantsp.sdsc.edu/
PlantsP/PlantsT
Prolysis
http://delphi.phys.univ-tours.fr/Prolysis/
REBASE
http://rebase.neb.com/rebase/rebase.html
http://www.mbio.ncsu.edu/RNaseP/home.html http://ribosome.miyazaki-med.ac.jp/
RTKdb
http://pbil.univ-lyon1.fr/RTKdb/
S/MARt dB
http://smartdb.bioinf.med.uni-goettingen.de/ http://fermi.utmb.edu/SDAP
SDAP
SENTRA
SEVENS
SRPDB TrSDB
VIDA VKCDB
http://www.biochem.ucl.ac.uk/bsm/virus_database/VIDA.html
Wnt Database
Nucleotide Sequence Databases RNA sequence databases Protein sequence databases Structure Databases Genomics Databases (non-vertebrate) Metabolic and Signaling Pathways Human and other Vertebrate Genomes Human Genes and Diseases Microarray Data and other Gene Expression Databases Proteomics Resources Other Molecular Biology Databases Organelle databases Plant databases Immunological databases
Structure Databases
The number of known molecular structures is increasing very rapidly, and these structures are available through various databases that hold structural information about specific molecules. The subcategories in this division of molecular databases are: 1) Small molecules 2) Carbohydrates 3) Nucleic acid structure 4) Protein structure 5) Unicellular eukaryotes genome databases.
CSD
Cambridge structural database: crystal structure information for organic and metal-organic compounds
http://www.ccdc.cam.ac.uk/prods/csd/csd.html
HIC-Up
http://xray.bmc.uu.se/hicup
AANT
http://aant.icmb.utexas.edu/
Klotho
http://www.biocheminfo.org/klotho
LIGAND
http://www.genome.ad.jp/ligand/
4.2. Carbohydrates
http://bssv01.lancs.ac.uk/gig/pages/gag/carbbank.htm
CCSD
Glycan
http://glycan.genome.ad.jp/
GlycoSuiteDB
http://www.glycosuite.com/
Monosaccharide Browser
http://www.jonmaber.demon.co.uk/monosaccharide
SWEET-DB
http://www.dkfzheidelberg.de/spec2/sweetdb/
NTDB
http://ntdb.chem.cuhk.edu.hk/
RNABase
http://www.rnabase.org/
SCOR
Structural classification of RNA: RNA motifs by structure, function and tertiary interactions
http://scor.lbl.gov/
PRODORIC NET
http://prodoric.tu-bs.de/
PromEC
http://bioinfo.md.huji.ac.il/marg/promec
SELEX_DB
DNA and RNA binding sites for various proteins, found by systematic evolution of ligands by exponential enrichment
http://wwwmgs.bionet.nsc.ru/mgs/systems/selex/
TESS
http://www.cbil.upenn.edu/tess
TRANSCompel
http://www.generegulation.com/pub/databases.html#transcompel
TRANSFAC
http://transfac.gbf.de/TRANSFAC/index.html
TRRD
http://www.bionet.nsc.ru/trrd/
ASTRAL
http://astral.stanford.edu/
BAliBASE BioMagResBank
http://www-igbmc.ustrasbg.fr/BioInfo/BAliBASE2/index.html
http://www.bmrb.wisc.edu/
CADB
Protein domain structures database 3D Protein structure alignments Structurally-similar proteins with dissimilar sequences Protein fold classification using the Dali search engine
Decoys R Us
http://dd.stanford.edu/
DisProt
Database of Protein Disorder: information about proteins that lack fixed 3D structure in their native states
http://divac.ist.temple.edu/disprot
DomIns
DSDBASE
DSMM
http://projects.villaosch.de/dbase/dsmm/
eF-site
Electrostatic surface of Functional site: electrostatic potentials and hydrophobic properties of the active sites
http://ef-site.protein.osaka-u.ac.jp/eF-site
FSSP
Fold classification based on structure-structure alignment of proteins, currently maintained as Dali database
Gene3D
Precalculated structural assignments for whole genomes Genomic threading database: structural annotations of complete genomes
GTD
http://bioinf.cs.ucl.ac.uk/GTD
GTOP Het-PDB Navi HOMSTRAD IMB Jena Image Library IMGT/3Dstructure-DB ISSD LPFC MMDB E-MSD ModBase
Protein fold predictions from genome sequences Hetero-atoms in protein structures Homologous structure alignment database: curated structure-based alignments for protein families
http://spock.genes.nig.ac.jp/
genome/
http://daisy.nagahama-ibio.ac.jp/golab/hetpdbnavi.html http://www-cryst.bioc.cam.ac.uk/homstrad
Visualization and analysis of 3D biopolymer structures Sequences and 3D structures of vertebrate immunoglobulins, T cell receptors and MHC proteins Integrated sequence-structure database Library of protein family core structures NCBI's database of 3D structures, part of NCBI Entrez EBI's macromolecular structure database Annotated comparative protein structure models Database of macromolecular movements: descriptions of protein and macromolecular motions, including movies Phylogeny and alignment of homologous protein structures Structural motifs of protein superfamilies
faculty/mini/campass/pas
PepConfDB PDB PDB-REPRDB PDBsum SCOP Sloop StructureSuperposition Database SWISS-MODEL Repository SUPERFAMILY SURFACE TargetDB 3D-GENOMICS TOPS
A database of peptide conformations Protein structure databank: all publicly available 3D structures of proteins and nucleic acids Representative protein chains, based on PDB entries Summaries and analyses of PDB structures Structural classification of proteins Classification of protein loops
http://ssd.rbvi.ucsf.edu/
Database of annotated 3D protein structure models Assignments of proteins to structural superfamilies Surface residues and functions annotated, compared and evaluated: a database of protein surface patches Target data from worldwide structural genomics projects Structural annotations for complete proteomes Topology of protein structures database
Genomics Databases
For organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in electronic form, and a variety of new databases have been developed. These databases vary greatly in the classes of data captured and in how these data are stored. This category comprises databases holding information on various genomes, such as those of humans, plants, viruses, invertebrates and microbes: 1) Genome annotation terms, ontologies and nomenclature 2) Taxonomy and identification 3) General genomics databases 4) Viral genome databases 5) Prokaryotic genome databases 6) Unicellular eukaryotes genome databases 7) Fungal genome databases 8) Invertebrate genome databases 9) Human genome databases, maps and viewers.
5. Genomics Databases (non-human) 5.1. Genome annotation terms, ontologies and nomenclature
Human gene nomenclature: approved gene symbols Gene ontology consortium database Gene ontology annotation project Nomenclature of enzymes, membrane transporters, electron transport proteins and other proteins Nomenclature of biochemical and organic compounds approved by the IUBMB-IUPAC Joint Commission The International Union of Pharmacology recommendations on receptor nomenclature and drug classification Gene products organized by biological function http://www.gene.ucl.ac.uk/nomenclature http://www.geneontology.org/ http://www.ebi.ac.uk/GOA
http://www.chem.qmul.ac.uk/iubmb
http://www.chem.qmul.ac.uk/iupac
IUPHAR-RD PANTHER
http://www.iuphar-db.org/iuphar-rd/ http://panther.celera.com/
SOURCE UMLS
Functional genomic resource for annotations, ontologies and expression data Unified medical language system
http://source.stanford.edu/ http://umlsks.nlm.nih.gov/
ICB
http://www.mbio.co.jp/icb
NCBI Taxonomy
http://www.ncbi.nlm.nih.gov/Taxonomy/
RIDOM RDP
http://www.ridom-rdna.de/ http://rdp.cme.msu.edu
Tree of Life
http://phylogeny.arizona.edu/tree/phylogeny.html
CORG
http://corg.molgen.mpg.de/
DEG
http://tubic.tju.edu.cn/deg
EBI Genomes
EBI's collection of databases for the analysis of complete and unfinished viral, pro- and eukaryotic genomes Eukaryotic gene orthologs: orthologous DNA sequences in the TIGR gene indices Enhanced microbial genomes library: completely sequenced genomes of unicellular organisms
http://www.ebi.ac.uk/genomes
EGO
http://www.tigr.org/tdb/tgi/ego/
EMGlib
http://pbil.univ-lyon1.fr/emglib/emglib.html
Entrez Genomes
NCBI's collection of databases for the analysis of complete and unfinished viral, pro- and eukaryotic genomes Integrated biochemical data on seven bacterial genomes: publicly available portion of the ERGO database Database of bacterial and archaeal gene fusion events
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
ERGOLight FusionDB
http://www.ergo-light.com/ERGO http://igs-server.cnrs-mrs.fr/FusionDB
DDBJ's collection of databases for the analysis of complete and unfinished viral, pro- and eukaryotic genomes Genomes online database: a listing of completed and ongoing genome projects
http://gib.genes.nig.ac.jp
http://www.genomesonline.org/
Lists of completed and ongoing genome projects with links to complete genome sequences Putative horizontally transferred genes in prokaryotic genomes
http://www.tigr.org/tdb/mdb/mdbcomplete.html
HGT-DB
http://www.fut.es/
debb/HGT/
KEGG MBGD
Kyoto encyclopedia of genes and genomes: integrated suite of databases on genes, proteins, and metabolic pathways Microbial genome database for comparative analysis Database of orphan ORFs (ORFs with no homologs) in complete microbial genomes
http://www.genome.ad.jp/kegg http://mbgd.genome.ad.jp/
ORFanage
http://www.cs.bgu.ac.il/
nomsiew/ORFans
PACRAT
http://www.biosci.ohio-tate.edu/
pacrat
PEDANT
http://pedant.gsf.de
Various data on complete microbial genomes: uniform annotation, properties of DNA and predicted proteins
http://www.tigr.org/CMR
TransportDB
Predicted membrane transporters in complete genomes, classified according to the TC classification system
http://www.membranetransport.org
WIT
http://wit.mcs.anl.gov/WIT2/
Mutations in HIV genes that confer resistance to anti-HIV drugs Annotated and curated database for complete viral genome sequences
http://resdb.lanl.gov/Resist_DB/default.htm
VirGen
http://bioinfo.ernet.in/virgen/virgen.html
CyberCell database: E.coli database at U. Alberta A database for E.coli, Salmonella and Shigella E.coli genome database at Institut Pasteur
http://redpoll.pharmacy.ualberta.ca/CCDB http://colibase.bham.ac.uk/ http://genolist.pasteur.fr/Colibri/ http://magpie.genome.wisc.edu/ ntial.html http://ecoli.aist-nara.ac.jp/ http://genprotec.mbl.edu http://shigen.lab.nig.ac.jp/ecoli/pec http://ecocyc.org/ http://bmb.med.miami.edu/EcoGene/EcoWe b/ chris/esse
First results of an E.coli gene deletion project E.coli genome database at Nara Institute E.coli K-12 genome and proteome database Profiling of E.coli chromosome E.coli K-12 genes, metabolic pathways, transporters, and gene regulation Sequence and literature data on E.coli genes and proteins
EcoCyc EcoGene
RegulonDB
http://www.cifn.unam.mx/Computational_Genomics/regulondb/
BioCyc
http://biocyc.org/
CampyDB
http://campy.bham.ac.uk/
ClostriDB
http://clostri.bham.ac.uk/
CyanoBase
Cyanobacterial genomes
http://www.kazusa.or.jp/cyano
LeptoList
http://bioinfo.hku.hk/LeptoList
MolliGen
http://cbi.labri.fr/outils/molligen/
RsGDB
http://wwwmmg.med.uth.tmc.edu/sphaeroides
CYGD
http://mips.gsf.de/proj/yeast
Génolevures
http://cbi.labri.fr/Genolevures
MitoPD
http://bmerc-www.bu.edu/mito
SCMD SCPD
Saccharomyces cerevisiae morphological database: micrographs of budding yeast mutants Saccharomyces cerevisiae promoter database
http://yeast.gi.k.u-tokyo.ac.jp/ http://cgsigma.cshl.org/jian
TRIPLES
http://ygac.med.yale.edu/triples/
YDPM
http://wwwdeletion.stanford.edu/YDPM/YDPM_index.html
http://www.cse.ucsc.edu/research/compbio/yeast_introns.html
CryptoDB
http://cryptodb.org/
DictyBase
http://dictybase.org/
Full-Malaria
http://fullmal.ims.u-tokyo.ac.jp/
Curated database for Trypanosoma brucei, Leishmania major, S.pombe and other Sanger-sequenced genomes Plasmodium genome database Trypanosoma cruzi genome database Toxoplasma gondii genome database
FLAGdb++
GénoPlante-Info
Plant genomic data from the Génoplante consortium Molecular and phenotypic information on wheat, barley, rye, triticale and oats Database of plant EST and STS sequences annotated with gene family information
GrainGenes
Mendel
http://www.mendel.ac.uk/ http://genoplanteinfo.infobiogen.fr/phytoprot
PHYTOPROT
Clusters of (predicted) plant proteins Plant genome database: actively-transcribed plant genomic sequences Plant EST clustering and functional annotation
PlantGDB Sputnik
http://www.plantgdb.org/ http://mips.gsf.de/proj/sputnik
Classification of repetitive sequences in plant genomes Genetic and genomic information about tropical crops: sugarcane, banana, cocoa
http://www.tigr.org/tdb/e2k1/plant.repeats
TropGENE DB
http://tropgenedb.cirad.fr/
ARAMEMNON
http://aramemnon.botanik.uni-koeln.de/
AthaMap
http://www.athamap.de/
CATMA
http://www.catma.org
FLAGdb/FST
http://genoplante-info.infobiogen.fr/
MAtDB
http://mips.gsf.de/proj/thal/db
SeedGenes TAIR
http://www.seedgenes.org/ http://www.arabidopsis.org/
5.3.4.3. Rice
BGI-RISe
http://rise.genomics.org.cn/
INE
http://rgp.dna.affrc.go.jp/giot/INE.html
IRIS
http://www.iris.irri.org/
MOsDB
http://mips.gsf.de/proj/rice
Oryzabase
http://www.shigen.nig.ac.jp/rice/oryzabase/
RiceGAAS
http://ricegaas.dna.affrc.go.jp/
Rice PIPELINE
http://cdna01.dna.affrc.go.jp/PIPE
RPD
http://gene64.dna.affrc.go.jp/RPD/
5.3.5. Fungi
CADRE COGEME MagnaportheDB MNCDB Central Aspergillus data repository Phytopathogenic fungi and oomycete EST database http://www.cadre.man.ac.uk/ http://cogeme.ex.ac.uk http://www.fungalgenomics.ncsu.edu/Projects/mgdatabase/int.htm http://mips.gsf.de/proj/neurospora/
https://xgi.ncgr.org/pgc
http://www.sanger.ac.uk/Projects/C_elegans
Intronerator RNAiDB
Introns and alternative splicing in C.elegans and C.briggsae RNAi phenotypic analysis of C.elegans genes
http://www.cse.ucsc.edu/ / http://www.rnai.org/
kent/intronerator
WILMA
http://www.came.sbg.ac.at/wilma/
WorfDB
C.elegans ORFeome
http://worfdb.dfci.harvard.edu/
WormBase
Data repository for C.elegans and C.briggsae: curated genome annotation, genetic and physical maps, pathways
http://www.wormbase.org/
FlyTrap
http://www.flyarrays.com/fruitfly
CnidBase
http://cnidbase.bu.edu/
Nematode.net NEMBASE
Parasitic nematode sequencing project Nematode sequence and functional data database
http://nematode.net/ http://www.nematodes.org
The metabolic and signaling pathways category is a collection of pathway and signaling databases. Each database in this collection describes the genome and metabolic pathways of a single organism, with some exceptions. The categories in this division are: 1) Enzymes and enzyme nomenclature 2) Metabolic pathways 3) Intermolecular interactions and signaling pathways
6. Metabolic Enzymes and Pathways; Signaling Pathways 6.1. Enzymes and Enzyme Nomenclature
ENZYME Enzyme nomenclature and properties Enzyme names and properties: sequence, structure, specificity, stability, reaction parameters, isolation data Integrated enzyme database and enzyme nomenclature http://www.expasy.org/enzyme
http://www.brenda.uni-koeln.de http://www.ebi.ac.uk/intenz
http://www.chem.qmw.ac.uk/iubmb/enzyme
UM-BBD WIT2
http://umbbd.ahc.umn.edu/ http://wit.mcs.anl.gov/WIT2/
aMAZE BIND
A system for the annotation, management and analysis of biochemical and signaling pathway networks Biomolecular interaction network database
BioCarta
BRITE
Biomolecular relations in information transmission and expression, part of the KEGG system
http://www.genome.ad.jp/brite
DIP
http://dip.doe-mbi.ucla.edu
DRC
GeneNet
Protein-protein interaction data Putative protein domain interactions Functional and quantitative thermodynamic data on peptide binding to immunological biomacromolecules MHC-peptide interaction database Reactive oxygen species (ROS) signaling pathway
http://www.ebi.ac.uk/intact http://interdom.lit.org.sg
STCDB
http://www.techfak.unibielefeld.de/ mchen/STCDB
STRING
www.bork.emblheidelberg.de/STRING
TRANSPATH
http://www.biobase.de/pages/products/databases.html
The human and other vertebrate genomes category is a repository of databases covering the human genome as well as other vertebrate genomes: 1) Model organisms, comparative genomics 2) Human genome databases, maps and viewers 3) Human ORFs.
7. Human and other Vertebrate Genomes 7.1. Mitochondrial Genes and Proteins
AMmtDB Metazoan mitochondrial genes http://bighost.area.ba.cnr.it/mitochondriome http://megasun.bch.umontreal.ca/gobase/gobase.html
GOBASE
MitoDat MitoMap
http://www-lecb.ncifcrf.gov/mitoDat/ http://www.mitomap.org/
MitoNuc MITOP2
Nuclear genes coding for mitochondrial proteins Mitochondrial proteins, genes and diseases Mitochondrial protein sequences encoded by mitochondrial and nuclear genes Complete mitochondrial genome sequences for 200 metazoan species
http://biowww.ba.cnr.it:8000/BioWWW/#MitoNuc http://ihg.gsf.de/mitop2/
MitoProteome OGRe
http://www.mitoproteome.org http://www.bioinf.man.ac.uk/ogre
Human and mouse gene, transcript and protein annotation Genome databases for farm and other animals Cre transgenic mouse lines with links to publications Human cDNA clones homologous to Drosophila mutant genes Annotated information on eukaryotic genomes
DRESH Ensembl
FANTOM FREP
Functional annotation of mouse full-length cDNA clones Functional repeats in mouse cDNAs
http://fantom2.gsc.riken.go.jp http://facts.gsc.riken.go.jp/FREP/
IPD-MHC Database
http://www.ebi.ac.uk/ipd/mhc
GenetPig
http://www.infobiogen.fr/services/Genetpig
KOG LocusLink Mouse Genome Database Mouse SAGE Mouse Targeted Mutations MTID PEDE Rat Genome Database TIGR Gene Indices UniGene UniSTS ZFIN
Eukaryotic orthologous groups of proteins Curated sequences and descriptions of genetic loci
Mouse genome database SAGE libraries from various mouse tissues and cell lines
http://www.informatics.jax.org/ http://mouse.biomed.cas.cz/sage
Information on transgenic animals and targeted mutations Mouse transposon insertion database Pig EST data explorer: full-length cDNA libraries and ESTs Rat genetic and genomic data Organism-specific databases of EST and gene sequences Unified clusters of ESTs and full-length mRNA sequences Unified non-redundant view of sequence tagged sites with marker and mapping data from a variety of resources Genetic, genomic and developmental data from zebrafish
http://alugene.tau.ac.il/
http://bioinfo.weizmann.ac.il/crow21/ http://www-shgc.stanford.edu/RH/
Genebridge4 human radiation hybrid maps Human genes and genomic maps Human genes, markers and phenotypes Integrated database of human genes, maps, proteins and diseases
GeneCards
http://bioinfo.weizmann.ac.il/cards/
GeneLoc GeneNest
Gene location database (formerly UDB, Unified database for human genome mapping) Gene indices of human, mouse, zebrafish, etc.
http://genecards.weizmann.ac.il/geneloc/ http://genenest.molgen.mpg.de/
Mapped human BAC clones Alignment of ESTs with finished human sequence
http://genomics.med.upenn.edu/genmapdb http://grl.gi.k.u-tokyo.ac.jp/
HOWDY
Non-redundant DNA and protein sequence collection Genome assemblies and annotation Paralogy mapping in human genomes Radiation hybrid map data
STACK
http://www.sanbi.ac.za/Dbases.html
Human Genes and Diseases
Human genes and diseases is the category of databases that hold information on disease-causing genes, including databases of cancer genes, human ORFs, etc. 1) Human ORFs 2) General human genetics databases 3) General polymorphism databases 4) Cancer gene databases 5) Gene-system or disease-specific databases
HPRD HUNT
http://www.hprd.org http://www.hri.co.jp/HUNT
HUGE
http://www.kazusa.or.jp/huge
http://www.dkfz.de/LIFEdb
ftp://ftp.isrec.isb-sib.ch/pub/databases/
IMGT
International immunogenetics information system: immunoglobulins, T cell receptors, MHC and RPI
http://imgt.cines.fr/
http://info.med.yale.edu/mutbase/
OMIA
Online Mendelian inheritance in animals: a catalog of animal genetic and genomic disorders
http://www.angis.org.au/omia
OMIM
Online Mendelian inheritance in man: a catalog of human genetic and genomic disorders Collection of ORFs that are sold by Invitrogen European mutant mice pathology database: histopathology photomicrographs and macroscopic images Compilation of protein mutant data
http://www.ncbi.nlm.nih.gov/Omim/ http://orf.invitrogen.com/
ORFDB
PathBase PMD
http://www.pathbase.net/ http://pmd.ddbj.nig.ac.jp/
http://snp.cshl.org/ http://gila.bioengr.uic.edu/snp/toposnp
8.2.2. Cancer
Atlas of Genetics and Cytogenetics in Oncology and Haematology CGED Database of Germline p53 Mutations IARC TP53 Database MTB Oral Cancer Gene Database RB1 Gene Mutation Database RTCGD SNP500Cancer SV40 Large T-Antigen Mutant Database
Cancer related genes, chromosomal abnormalities in oncology and haematology, and cancer-prone diseases Cancer gene expression database
Mutations in human tumor and cell line p53 gene Human TP53 somatic and germline mutations Mouse tumor biology database: mouse tumor types, genes, classification, incidence, pathology Cellular and molecular data for genes involved in oral cancer
http://www.tumor-gene.org/Oral/oral.html
Mutations in the human retinoblastoma (RB1) gene Mouse retroviral tagged cancer gene database Re-sequenced SNPs from 102 reference samples
http://www.d-lohmann.de/Rb/
http://rtcgd.ncifcrf.gov/
http://snp500cancer.nci.nih.gov
http://bigdaddy.bio.pitt.edu/SV40/
Cellular, molecular and biological data about genes involved in various cancers
http://www.tumor-gene.org/tgdf.html
8.2.3. Gene, system or disease-specific
ALPSbase Androgen Receptor Gene Mutations Database BTKbase CASRDB Cytokine Gene Polymorphism in Human Disease Collagen Mutation Database ERGDB FUNPEP GOLD.db Autoimmune lymphoproliferative syndrome database http://research.nhgri.nih.gov/alps/
Mutations in the androgen receptor gene Mutation registry for X-linked agammaglobulinemia Calcium-sensing receptor database: CASR mutations causing hypercalcemia and/or hyperparathyroidism
http://bris.ac.uk/pathandmicro/services/GAI/cytokine4.htm
Human type I and type III collagen gene mutations Estrogen responsive genes database Low-complexity peptides capable of forming amyloid plaque Genomics of lipid-associated disorders database
tGRAP
http://tinygrap.uit.no/GRAP/
http://www.kcl.ac.uk/ip/petergreen/haemBdatabase.html
HaemB
HbVar
http://globin.cse.psu.edu/globin/hbvar
Human p53/hprt, rodent lacI/lacZ databases Human PAX2 Allelic Variant Database
Mutations at the human p53 and hprt genes; rodent transgenic lacI and lacZ mutations
http://www.ibiblio.org/dnam/mainpage.html
http://pax2.hgu.mrc.ac.uk/
http://pax6.hgu.mrc.ac.uk/
IL2Rgbase
X-linked severe combined immunodeficiency mutations Vertebrate immunoglobulin and T cell receptor genes
http://research.nhgri.nih.gov/scid/
IMGT/Gene-DB
http://imgt.cines.fr/cgi-bin/GENElect.jv
IMGT/HLA
Polymorphism of human MHC and related genes Hereditary inflammatory disorder and familial mediterranean fever mutation data
http://www.ebi.ac.uk/imgt/hla/
INFEVERS
http://fmf.igh.cnrs.fr/infevers
KinMutBase
http://www.uta.fi/imt/bioinfo/KinMutBase/
http://research.nhgri.nih.gov/lowe/
Polymorphisms in neuronal ceroid lipofuscinoses genes Mutations at the phenylalanine hydroxylase locus Prostate and prostatic diseases gene database
http://www.phexdb.mcgill.ca/
http://www.cybergene.se/PTCH/ptchbase.html
Microarrays are producing massive amounts of data. These data, like genome sequence data, can be used to gain insights into underlying biological processes only if they are carefully recorded and stored in databases, where they can be queried, compared and analysed by different computer software programs. A gene expression database can be regarded as consisting of three parts: the gene expression data matrix, gene annotation and sample annotation. Hence the microarray data and other gene expression databases category consists of repositories of microarray data and gene expression data.
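The three-part view of a gene expression database can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ExpressionDataset`, its field names, and the toy gene and sample identifiers are all hypothetical and do not come from any of the databases listed here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ExpressionDataset:
    """Hypothetical container mirroring the three parts named in the text."""
    matrix: dict             # gene_id -> {sample_id: expression value}
    gene_annotation: dict    # gene_id -> descriptive fields for the gene
    sample_annotation: dict  # sample_id -> descriptive fields for the sample

    def mean_expression(self, gene_id, **sample_filter):
        """Average one gene over all samples whose annotation matches the filter."""
        samples = [
            sample_id
            for sample_id, annotation in self.sample_annotation.items()
            if all(annotation.get(key) == value for key, value in sample_filter.items())
        ]
        return mean(self.matrix[gene_id][sample_id] for sample_id in samples)


# Toy data, invented purely for illustration.
dataset = ExpressionDataset(
    matrix={
        "geneA": {"s1": 8.0, "s2": 9.0, "s3": 2.0},
        "geneB": {"s1": 1.0, "s2": 1.5, "s3": 7.5},
    },
    gene_annotation={
        "geneA": {"description": "hypothetical liver enzyme"},
        "geneB": {"description": "hypothetical neuronal receptor"},
    },
    sample_annotation={
        "s1": {"tissue": "liver"},
        "s2": {"tissue": "liver"},
        "s3": {"tissue": "brain"},
    },
)

print(dataset.mean_expression("geneA", tissue="liver"))
```

Keeping the annotations separate from the matrix is what allows a query such as "expression of this gene in liver samples" to be answered without touching the measured values themselves.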
Gene expression in Xenopus laevis Human and mouse gene expression data Brain gene expression database
CleanEx
Expression reference database, linking heterogeneous expression data to facilitate cross-dataset comparisons
http://www.cleanex.isb-sib.ch/
EICO DB
Expression-based imprint candidate organiser: a database for discovery of novel imprinted genes
http://fantom2.gsc.riken.jp/EICODB/
emap Atlas
Edinburgh mouse atlas: a digital atlas of mouse embryo development and spatially-mapped gene expression
http://genex.hgu.mrc.ac.uk/
EPConDB
http://www.cbil.upenn.edu/EPConDB
EpoDB FlyView GeneAnnot GeneNote GenePaint GeneTrap GermOnline GXD HemBase HugeIndex Interferon Stimulated Gene Database Kidney Development Database
Genes expressed during human erythropoiesis Drosophila development and genetics Revised and improved annotation of Affymetrix human gene probe sets Human genes expression profiles in healthy tissues Gene expression patterns in the mouse Expression patterns in an embryonic stem library of gene trap insertions Expression data relevant for the mitotic and meiotic cell cycle and gametogenesis in yeast and higher eukaryotes Mouse gene expression database Genes transcribed in differentiating human erythroid cells Expression levels of human genes in normal tissues
http://www.cbil.upenn.edu/EpoDB/ http://pbio07.uni-muenster.de/ http://genecards.weizmann.ac.il/geneannot/ http://genecards.weizmann.ac.il/genenote/ http://www.genepaint.org/Frameset.html http://www.cmhd.ca/sub/genetrap.asp http://www.germonline.org/ http://www.informatics.jax.org/menus/expression_menu.shtml http://hembase.niddk.nih.gov/ http://hugeindex.org/
Genes induced by treatment with interferons Kidney development and gene expression
MAGEST
Ascidian (Halocynthia roretzi) gene expression patterns Medaka (freshwater fish Oryzias latipes) gene expression pattern database DNA methylation data, patterns and profiles
http://www.genome.ad.jp/magest
MEPD MethDB
http://medaka.dsp.jst.go.jp/MEPD http://www.methdb.de/
NASCarrays NetAffx
Nottingham Arabidopsis Stock Centre microarray database Public Affymetrix probesets and annotations Prostate expression database: ESTs from prostate tissue and cell type-specific cDNA libraries Public expression profiling resource: expression profiles in a variety of diseases and conditions Genes using programmed translational recoding in their expression Reference database for human gene expression analysis
http://affymetrix.arabidopsis.info http://www.affymetrix.com/
PEDB
PEPR
http://recode.genetics.utah.edu/ http://www.lsbm.org/db/index_e.html
Raw and normalized data from microarray experiments Gene expression in dental tissue
http://genomewww.stanford.edu/microarray
http://bite-it.helsinki.fi/
Proteomics Resources
Applications of Proteomics
Characterization of Protein Complexes Protein Expression Profiling Proteome Mining Protein Arrays
The proteomic resources have databases containing
What is Proteomics?
Defined as the analysis of the entire protein complement in a given cell, tissue, or organism.
Proteomics also assesses activities, modifications, localization, and interactions of proteins in complexes.
Technology of Proteomics
1D and 2D PAGE
Types of Proteomics
Protein Expression
Quantitative study of protein expression between samples that differ by some variable
Structural Proteomics
The goal is to map out the 3-D structure of proteins and protein complexes
Functional Proteomics
GelBank
http://gelbank.anl.gov/
PEP
http://cubic.bioc.columbia.edu/pep/
RESID SWISS2DPAGE
http://www.expasy.org/ch2d/
This category holds the remaining types of databases. It can be subdivided into the following divisions: 1) BioImage 2) MetaRouter 3) PubMed 4) Drugs and drug design 5) Molecular probes and primers
11. Other Molecular Biology Databases 11.1. Drugs and drug design
ANTIMIC APD Database of natural antimicrobial peptides Antimicrobial peptide database Biodegradative strain database: microorganisms that can degrade aromatic and other organic compounds http://research.i2r.astar.edu.sg/Templar/DB/ANTIMIC/
http://aps.unmc.edu/AP/main.php
BSD
http://bsd.cme.msu.edu/
DART
Peptaibol
http://www.pharmgkb.org/
TTD
http://xin.cz3.nus.edu.sg/group/cjttd/ttd.asp
11.2. Probes
IMGT/PRIMER-DB
http://imgt3d.igh.cnrs.fr/PrimerDB/Query_PrDB.pl
MPDB
http://www.biotech.ist.unige.it/interlab/mpdb.html
probeBase
rRNA-targeted oligonucleotide probe sequences, DNA microarray layouts and associated information
http://www.microbialecology.net/probebase
RTPrimerDB
http://medgen31.ugent.be/primerdatabase/index.php
VirOligo
http://viroligo.okstate.edu/
http://pubmed.gov/
http://www.bioimage.org/
Bioinformatics Tools
BLAST (Basic Local Alignment Search Tool)
BLAST is the algorithm used by a family of five programs that align your query sequence against sequences in a molecular database. Statistical methods are applied to judge the significance of matches. Alignments (i.e. sequences in the database that are similar to your query sequence) are reported in order of significance, as estimated by the applied statistics.
BLASTN
BLASTP
BLASTX
Compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.
TBLASTN
Compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
TBLASTX
Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
CLUSTALX
Clustal X (Thompson et al. 1997) is a version of Clustal W with a graphical user interface. This programme is used for multiple sequence alignment.
Multiple Alignment
Phylogenetic Analysis
Nucleic acid and protein sequences are used to infer phylogenetic relationships. Molecular phylogeny methods allow phylogenetic trees to be proposed from a given set of aligned sequences. The phylogenetic trees aim at reconstructing the history of successive divergence that took place during evolution between the considered sequences and their common ancestor.
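As a minimal sketch of the first step many distance-based tree methods take, the Python below computes pairwise p-distances (the proportion of differing sites) from aligned sequences. The toy alignment and function names are invented for illustration, and the choice to skip gapped columns is an assumption; programmes such as PHYLIP or MEGA offer more refined distance corrections.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences, gaps skipped."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        raise ValueError("no ungapped sites to compare")
    return sum(a != b for a, b in compared) / len(compared)


def distance_matrix(alignment):
    """Map each pair of sequence names to its p-distance."""
    return {
        (first, second): p_distance(alignment[first], alignment[second])
        for first, second in combinations(sorted(alignment), 2)
    }


# Toy alignment, invented for illustration only.
alignment = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "AC-TTCGA",
}
distances = distance_matrix(alignment)
closest_pair = min(distances, key=distances.get)
print(distances)
print("closest pair:", closest_pair)
```

A clustering method would join the closest pair first and then repeat on the reduced matrix; sequences with the smallest distance are taken to have diverged most recently from a common ancestor.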
Phylogenetic programmes
PHYLIP PAUP MEGA Treeview ODEN PHYLOWIN TREECON DENDRON
Gene Identification
AAT: Analysis and Annotation Tool FGENESH: Splice sites, protein coding exons & gene models Genie: Gene finder based on hidden Markov models GenScan: Identification of gene structures in genomic DNA Grail: DNA sequence analysis tool
3D-PSSM: Protein Fold Recognition Multicoil: Predict coiled coil structures NNPredict: Protein secondary structure prediction PredictProtein: Sequence analysis and structure prediction SAPS: Statistical analysis of protein sequences
FUGUE: Sequence-structure homology recognition SequencePDB Viewer: Protein structure database Proinformatix: Modeling oligopeptides for energetically minimized structures SWISS-MODEL: An automated knowledge-based protein modelling server