INDO Thai What Is Bioinformatics, A.sharMA

CIMAP Summer Training on Biotechnology & Bioinformatics
20th June to 20th July, 2006
Bioinformatics : Techniques and usage
Dr. Ashok Sharma Head, Bioinformatics and Co-ordinator, Bioinformatics Centre Central Institute of Medicinal and Aromatic Plants PO. CIMAP, Lucknow-226015, India. Web site: www.cimap.res.in E-mail: ashoksharma@cimap.res.in
[Diagram labels: Sequences; Databases; Biological Knowledge; Bioinformatics; Greater Biological Knowledge]
Bioinformatics:
Why
What
Computational Methods Resources and Tools
If you are one of the many biologists for whom genome databases are as comprehensible as a mass of supermarket barcodes, it is a good time to team up with a friendly bioinformaticist and join the action before it is too late.
If biologists do not adapt to the powerful computational tools needed to exploit huge data sets, they could find themselves floundering in the wake of advances in genomics.
It is predicted that the potential to integrate different levels of genomic data, such as raw sequence from the human genome and those of model organisms, data on genetic variability between individuals, and data on gene expression in different tissues, will radically change biological research. It is also agreed that small experiments driven by individual investigators will give way to a world in which multidisciplinary teams, sharing huge online data sets, emerge as key players.
Bioinformatics : a brave new world

Radical change in biological research from small experiments driven by individual investigators
Multidisciplinary teams sharing huge online data will be the key players
Era of systems biology: the ability to create mathematical models describing the function of networks of genes and proteins is just as important as traditional lab skills
Who will have competitive advantage?

Those who learn to conduct high throughput genomic analyses, and who can master the computational tools needed to exploit biological databases
Outcome of this natural selection will see many current top scientists, research groups and even whole institutes relegated to the second division
What is the solution?
In the long run, the change will come through the emergence of a new breed of biologists who are steeped in computational biology as an integral part of their education. This means that the subject must be included as a core module in all undergraduate biology courses, rather than as a specialist option. Although this is starting to happen, the availability of teachers with the appropriate expertise is still a limiting factor.
The emerging new breed
So, if the majority of biologists are not to be disenfranchised, what is the solution?
Emergence of a new breed of biologists who are steeped in computational biology as an integral part of their education.
Limiting factor: availability of teachers with the appropriate expertise.
One model solution has emerged in the U.S.A.

Funding agencies are also trying to drive change by ploughing money into initiatives that require a multidisciplinary approach and a strong computational component. The US National Institutes of Health, for instance, through its National Institute of General Medical Sciences, has created a programme of glue grants for integrative and collaborative approaches to research. Under this programme, the Alliance for Cellular Signaling will draw a complete map of interactions between some 1,000 proteins in two types of cells. The consortium unites traditional experimentalists with computational biologists.
Glue grants Integrative and collaborative approaches to research
US National Institutes of Health Alliance for Cellular Signaling
Complete map of interactions between some 1,000 proteins in two types of cells
Consortium unites traditional experimentalists with computational biologists.
Modern Biology and particularly Biotechnology are very much information-dependent fields. In fact, the symbiosis between information technology and biotechnology today is as intricately entwined as the two strands of the genetic material that make up the DNA helix.
Human Genome Project and other genome projects, such as sequencing of bacterial genomes and yeast genomes, have produced enormous amounts of DNA sequence data.
Large-scale biological research involving microsequencing of proteins, 2D gel patterns of proteins and polypeptides, metabolic pathways, physical and genetic maps of the organisms, cell line information, and microbial strain data has been responsible for the unprecedented growth of biological data.
Projects such as Species-2000, global plant check list, information on release of organisms in environment, and Animal Virus Information are producing hard data at the species level in multimedia format.
The rate of growth of the biological data is estimated to be more than 200 million base pairs per year.
The database content itself is doubling in size approximately every year.
Nucleotide and protein sequences are not the only data that are accumulating rapidly. The number of characterized genes from a variety of organisms and the number of solved protein structures are also doubling every two years.
The enormous growth of biological data and its availability in the major international databases is serving as a source of knowledge to the life scientists.
The whole paradigm shift in molecular biology towards data-intensive research in search of useful genes is basically due to the fact that genetic data is becoming the major driving force in drug discovery, protein engineering, design of new molecules, and other related areas.
The large stores of biological data are holding the promise to serve as the Discovery Super Highway for innovations in biotechnology through a process of analysis and transformation of molecular and structural data into biological knowledge for prosperity.
In the face of the challenges imposed by the growing size and complexity of the biological data, a new discipline of science, known as Bioinformatics, has emerged in the recent past.
Bioinformatics deals with the various issues related to the biological data. It also covers the development of data analysis tools, modeling of biological macromolecules and their complexes, metabolic pathways, designing of new molecules such as drugs, peptide vaccines, proteins, etc.
Gradually, Bioinformatics has evolved to deal with four related but still distinct problem areas, viz.:
a) Handling and management of biological data, including its organization, control, linkages, analysis, and so forth.
b) Communication among people, projects, and institutions engaged in biological research and applications. The communication may include e-mail, file transfer, remote login, computer conferencing, electronic bulletin boards, or establishment of web-based information resources.
c) Organization, access, search and retrieval of biological information, documents, and literature.
d) Analysis and interpretation of the biological data through computational approaches, including visualization, mathematical modeling, and development of algorithms for highly parallel processing of complex biological structures.
Bioinformatics may be defined as a scientific discipline that encompasses all the aspects of biological information, viz., acquisition, processing, storage, distribution, analysis and interpretation, and that combines the tools and techniques of mathematics, computer science, and biology with the aim of understanding the biological significance of a variety of data.
Bioinformatics has acquired great importance due to its application in the genome projects. The target of decoding the three billion base pairs of human DNA has become achievable only through the use of various innovative techniques and methods evolved by bioinformatics scientists. Bioinformatics has become an essential component of biotechnology-based product and process development. The process of drug design and development is expensive and time-consuming. The application of the tools and techniques of Bioinformatics has reduced both the cost and the development cycle of drugs. This aspect has a tremendous impact on society. If a newly discovered drug is a life-saving one, then the resulting gains are not only in terms of financial savings but also in saving the lives of several million people. Major pharmaceutical and biotechnology companies have set up large R&D groups in Bioinformatics.
Bioinformatics is a multidisciplinary subject. Though only about a decade old, it has become very important for the growth of biosciences, biotechnology, and the economic prosperity of nations. Three well-identified divisions of Bioinformatics may be considered: a) Molecular Bioinformatics, b) Cellular and sub-cellular Bioinformatics, and c) Organismic and community Bioinformatics.
FUNCTIONS OF A BIOINFORMATICS CENTRE

i. The principal objective of a Bioinformatics Centre is to function as an information base in each specialty so that the scientists have ready access to the computer-based information on resources, databases in subject fields, and build up expertise in bioinformatics in keeping with the rapid development in this area.
ii. To provide a computer-based information storage and retrieval system of databases that collects structured information generated by research and industrial institutions in the identified fields of biotechnology, continually updates the databases and makes the information available to the users.
iii. To operate in an active network mode, in which the scientists get access to the biotechnology community in the identified areas, answer requests for information in an interactive and discussive mode and actively initiate dialogue among groups with common interests.
iv. To provide retrieval service either online or offline in their specialized areas and to give overall information support even in areas other than those assigned to them.
v. To provide a communication link with international databases for selective bibliographic information for the user scientist.
vi. To develop software packages and databases specific to user needs.
vii. To conduct training courses in the specialized areas periodically to meet the special requirements of manpower development in the area and to promote awareness about the computerized storage and retrieval facility among bio scientists and information scientists.
Bioinformatics: What?
A mixture of Biochemistry, Molecular Biology, and Computer Science
Obtaining, storing, organizing, and analyzing biological and genetic information for understanding its activity in living organisms
Main goal is to convert a multitude of complex data into useful information and knowledge
Data includes gene and protein sequences, cDNA, nucleotide sequences
Data from gene sequencing, combinatorial chemical synthesis, gene-expression investigations, pharmacogenomics, proteomic studies, and other methods of study
Information used to build synthetic and predictive models, allowing scientists to better understand complex living systems
Future applications in biology, chemistry, pharmaceuticals, medicine, and agriculture
What is the Role of Bioinformatics?

The Role of the Bioinformatics group is to:
Research and develop tools and systems that provide understanding and integration of genomic data across technologies
Work with other Research Information staff to make these tools available to research scientists
What kinds of data are we interested in?

Sequence data
Profile data: gene expression and proteins
Mapping data
Function and phenotype
Pathways
TECHNOLOGIES IN BIOINFORMATICS
Data-acquisition Systems
These are required mainly at research labs generating large amounts of data. These systems include inventory control software, tracking hundreds of thousands of reagents, gels and other materials; reagent manipulation software; robotic systems to carry out high-volume, high-precision laboratory manipulation in genome research; and sequence production software that will help improve sequencing.
Data Analysis Systems
Studying sequences, predicting protein structure and comparing genomes on an extensive scale all require informatics tools such as sequence analysis software that performs alignments, detects homologies, identifies coding regions and extracts features. Protein folding software addresses how genetic information is transformed into function via proteins, whose functional specificities are determined by their 3-D shapes. Genetic mapping software systems play a key role in the analysis of genetic mapping data. Classification software extracts features from DNA sequences, places proteins into gene families and tracks protein motifs.
Data-Management Systems
Various genome projects are generating information that cannot be accommodated by traditional publishing. Electronic data management and publishing systems are crucial components of genomic research.
CHALLENGES IN BIOINFORMATICS
Bioinformatics, which is the intersection of Information Technology and Mathematics with molecular biology / genetics, has created several challenges for the Computer Science community.

Information Storage

Storing huge amounts of genetic information, amenable to rapid access and manipulation, is a great challenge. One million bases (1 Mb) occupy roughly 1 megabyte (1 MB) when each base is stored as one character. Thus, one would require 3 gigabytes (3 GB) of computer data storage space to store the entire human genome, comprising three gigabases (3 Gb). This includes nucleotide sequence data only and does not include data annotations and other information associated with the sequence data. With time, more annotations, entered either
(a) by scientists as a result of laboratory findings, literature searches, data analysis, or personal communications, and/or
(b) as a result of automated data analysis programs or autoannotators,
will be associated with the sequence data, increasing the storage requirements significantly beyond the 3 GB for the human genome.
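The storage arithmetic above can be checked with a short Python sketch. The function name and the two-bit packing comparison are illustrative assumptions, not part of the original slides; the 3-gigabase genome size is the figure quoted in the text.

```python
def genome_storage_bytes(n_bases: int, bits_per_base: int = 8) -> int:
    """Bytes needed to store n_bases at a fixed number of bits per base."""
    total_bits = n_bases * bits_per_base
    return (total_bits + 7) // 8  # round up to whole bytes


HUMAN_GENOME_BASES = 3_000_000_000  # three gigabases, as stated in the text

# One character (8 bits) per base, the convention used in the text.
as_text = genome_storage_bytes(HUMAN_GENOME_BASES)
# Assumption for comparison: A, C, G, T packed into 2 bits each.
packed = genome_storage_bytes(HUMAN_GENOME_BASES, bits_per_base=2)

print(as_text / 1e9, "GB at one byte per base")
print(packed / 1e9, "GB at two bits per base")
```

Neither figure includes annotations, which is why real database sizes exceed the raw sequence estimate.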
The Management and Integration of Biological Information

The development of database management systems is an active subspecialty of computer science. The need to organise terabytes of heterogeneous biological data in a form that is easily usable, and which employs sophisticated data visualisation capabilities, is essential for progress in modern biology.
Data Analysis Systems

Genetic data cannot be analysed efficiently without computer systems. Studying sequences, predicting protein structures, and comparing genomes on an extensive scale all require additional informatics tools, such as:
Sequence Analysis Software

Sequence analysis is so far the best known, best established area of bioinformatics. Performing alignments, detecting homologies, identifying coding regions, extracting features, and other computerised analyses of sequences are now performed as routine. At the same time, sequence analysis is a multi-faceted and biologically profound area of research, demanding much continued work.
Protein Folding Software

Genetic information is transformed into function via proteins, whose functional specificities are determined by their three-dimensional shapes. Prediction of the protein structure from amino acid sequences is an important and challenging problem.
Map Assembly & Integration Software

Computation plays an increasingly central role in the assembly and integration of large maps composed of different kinds and combinations of data.
Comparative Genomics Tools

As the genome projects mature and large amounts of genomic information become available for a number of species, comparative genomics is emerging as an active area of study.
Gene Mining

Methods for mapping genes to their physical locations on the genome; searching for related genes; analysing the database to find families of related genes and to understand their coordinated expression; finding correlations between specific diseases and the expression of related genes.
SERENDIPITY EFFECT

One of the most exciting aspects of the information revolution is that it allows us to combine many different items of information and many different kinds of information on a scale never seen before. Large international databases, for instance, include contributions from thousands of different sources. Also, the hypertext links (Information Super Highway) between sites make it possible to draw together many different kinds of information that bear on a particular problem. These activities not only promote collaboration on a truly vast scale, they also enrich research. One important effect is the serendipity effect: combining different datasets makes possible entirely new kinds of study. New studies inevitably lead to new and unexpected discoveries.
Computational Methods
A core set of computational approaches has emerged for dealing with the types of data that are currently shared in public databases: DNA sequence, protein sequence, and protein structure.
Using public databases and data formats
The first key skill for biologists is to learn to use online search tools to find information. Literature searching is no longer a matter of looking up references in a printed index. You can find links to most of the scientific publications you need online. There are central databases that collect reference information so you can search dozens of journals at once. You can even set up agents that notify you when new articles are published in an area of interest. Searching the public molecular-biology databases requires the same skills as searching for literature references: you need to know how to construct a query statement that will pluck the particular needle you're looking for out of the database haystack.
Sequence alignment and sequence searching
Being able to compare pairs of DNA or protein sequences and extract partial matches has made it possible to use a biological sequence as a database query. Sequence-based searching is another key skill for biologists; a little exploration of the biological databases at the beginning of a project often saves a lot of valuable time in the lab. Identifying homologous sequences provides a basis for phylogenetic analysis and sequence-pattern recognition. Sequence-based searching can be done online through web forms, so it requires no special computing skills, but to judge the quality of your search results you need to understand how the underlying sequence-alignment method works and go beyond simple sequence alignment to other types of analysis.
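To give a feel for how "extracting partial matches" works, here is a minimal Python sketch of local alignment scoring by dynamic programming (the Smith-Waterman recurrence). The slides do not name a specific method, so the algorithm choice, the function name and the scoring values (match +2, mismatch -1, gap -2) are all assumptions made for illustration; production search tools use faster heuristics and substitution matrices.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: that is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two toy sequences that share the partial match "ACGT".
print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

A high score relative to the sequence lengths suggests a shared region; judging whether it is biologically meaningful still requires the statistical measures reported by the search tool.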
Gene prediction
Gene prediction is only one of a cluster of methods for attempting to detect meaningful signals in uncharacterized DNA sequences. Until recently, most sequences deposited in GenBank were already characterized at the time of deposition. That is, someone had already gone in and, using molecular biology, genetic, or biochemical methods, figured out what the gene did. However, now that the genome projects are in full swing, there's a lot of DNA sequence out there that isn't characterized. Software for prediction of open reading frames, genes, exon splice sites, promoter binding sites, repeat sequences, and tRNA genes helps molecular biologists make sense out of this unmapped DNA.
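The simplest of these signals, the open reading frame, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it scans the forward strand only, uses the standard start codon ATG and the three standard stop codons, and the function name and `min_codons` threshold are hypothetical. Real gene finders also model the reverse strand, splice sites and codon statistics.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return sorted (start, end) pairs of forward-strand ORFs.

    start is the index of the A in ATG; end is exclusive and includes the
    stop codon. min_codons counts the codons before the stop, ATG included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a reading frame at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the frame and keep scanning
    return sorted(orfs)


print(find_orfs("CCATGGCCTGACC"))
```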
Multiple sequence alignment

Multiple sequence-alignment methods assemble pairwise sequence alignments for many related sequences into a picture of sequence homology among all members of a gene family. Multiple sequence alignments aid in visual identification of sites in a DNA or protein sequence that may be functionally important. Such sites are usually conserved; that is, the same amino acid is present at that site in each one of a group of related sequences. Multiple sequence alignments can also be quantitatively analyzed to extract information about a gene family. Multiple sequence alignments are an integral step in phylogenetic analysis of a family of related sequences, and they also provide the basis for identifying sequence patterns that characterize particular protein families.
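As a small illustration of the quantitative analysis mentioned above, the following Python sketch measures per-column conservation in an alignment that has already been computed. The function names, the toy alignment and the use of "-" as the gap symbol are assumptions for the example; it does not perform the alignment itself.

```python
from collections import Counter


def column_conservation(alignment):
    """For each column, return (most common symbol, fraction of sequences with it)."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):
        symbol, count = Counter(column).most_common(1)[0]
        result.append((symbol, count / len(column)))
    return result


def conserved_positions(alignment, threshold=1.0):
    """Indices of columns whose dominant residue (not a gap) reaches the threshold."""
    return [i for i, (symbol, fraction) in enumerate(column_conservation(alignment))
            if symbol != "-" and fraction >= threshold]


toy_alignment = ["MKVLA", "MKILA", "MRVLA"]  # made-up aligned sequences
print(conserved_positions(toy_alignment))
```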
Phylogenetic analysis
Phylogenetic analysis attempts to describe the evolutionary relatedness of a group of sequences. A traditional phylogenetic tree or cladogram groups species into a diagram that represents their relative evolutionary divergence. Branchings of the tree that occur furthest from the root separate individual species; branchings that occur close to the root group species into kingdoms, phyla, classes, families, genera, and so on. The information in a molecular sequence alignment can be used to compute a phylogenetic tree for a particular family of gene sequences. The branchings in phylogenetic trees represent evolutionary distance based on sequence similarity scores or on information-theoretic modeling of the number of mutational steps required to change one sequence into the other. Phylogenetic analyses of protein sequence families talk not about the evolution of the entire organism but about evolutionary change in specific coding regions.
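A distance-based tree starts from a matrix of pairwise distances. The Python sketch below computes the simplest such measure, the proportion of differing sites (p-distance), for a set of pre-aligned sequences and reports the closest pair, which is the first pair a clustering method would join. The sequence names and data are made up, and p-distance is chosen here for simplicity; it is an assumption, not the scoring model described in the slides.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned, ungapped positions at which a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differences = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        differences += x != y
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return differences / compared


def distance_matrix(aligned: dict) -> dict:
    """Map each unordered pair of names to its p-distance."""
    return {(m, n): p_distance(aligned[m], aligned[n])
            for m, n in combinations(sorted(aligned), 2)}


def closest_pair(aligned: dict):
    distances = distance_matrix(aligned)
    return min(distances, key=distances.get)


toy = {"seq1": "ACGTACGT", "seq2": "ACGTACGA", "seq3": "ACCTTCGA"}  # hypothetical data
print(distance_matrix(toy))
print(closest_pair(toy))
```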
Extraction of patterns and profiles from sequence data

A motif is a sequence of amino acids that defines a substructure in a protein that can be connected to function or to structural stability. In a group of evolutionarily related gene sequences, motifs appear as conserved sites. Sites in a gene sequence tend to be conserved (to remain the same in all or most representatives of a sequence family) when there is selection pressure against copies of the gene that have mutations at that site. Nonessential parts of the gene sequence will diverge from each other in the course of evolution, so the conserved motif regions show up as a signal in a sea of mutational noise. Sequence profiles are statistical descriptions of these motif signals; profiles can help identify distantly related proteins by picking out a motif signal even in a sequence that has diverged radically from other members of the same family.
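One common way to turn motif instances into a statistical description is a position-specific scoring matrix of log-odds values. The Python sketch below builds such a profile and slides it along a sequence. It is a minimal illustration: the uniform background frequency, the pseudocount of 1, the function names and the motif instances are all assumptions for the example rather than values from any real protein family.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(instances, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Log2-odds score per position and residue, against a uniform background."""
    background = 1.0 / len(alphabet)
    total = len(instances) + pseudocount * len(alphabet)
    profile = []
    for pos in range(len(instances[0])):
        counts = Counter(seq[pos] for seq in instances)
        profile.append({
            residue: math.log2(((counts[residue] + pseudocount) / total) / background)
            for residue in alphabet
        })
    return profile


def best_match(profile, sequence):
    """Return (score, start) of the highest-scoring window in sequence."""
    width = len(profile)
    windows = [
        (sum(profile[k][sequence[i + k]] for k in range(width)), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(windows)


motif_instances = ["GAGKST", "GSGKST", "GAGKTT", "GTGKST"]  # made-up examples
profile = build_profile(motif_instances)
print(best_match(profile, "MLPGAGKSTLLR"))
```

Because the scores come from column frequencies rather than one fixed pattern, a window can still score well when it differs from every individual instance at a tolerant position.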
Protein sequence analysis

The amino-acid content of a protein sequence can be used as the basis for many analyses, from computing the isoelectric point and molecular weight of the protein and the characteristic peptide mass fingerprints that will form when it is digested with a particular protease, to predicting secondary structure features and post-translational modification sites.
Protein structure prediction

It is a lot harder to determine the structure of a protein experimentally than it is to obtain DNA sequence data. One very active area of bioinformatics and computational biology research is the development of methods for predicting protein structure from protein sequence. Methods such as secondary structure prediction and threading can help determine how a protein might fold, classifying it with other proteins that have similar topology, but they don't provide a detailed structure model.
Protein structure property analysis
Protein structures have many measurable properties that are of interest to crystallographers and structural biologists. Protein structure validation tools are used by crystallographers to measure how well a structure model conforms to structural rules extracted from existing structures or chemical model compounds. These tools may also analyze the fitness of every amino acid in a structure model for its environment, flagging such oddities as buried charges with no countercharge or large patches of hydrophobic amino acids found on a protein surface. These tools are useful for evaluating both experimental and theoretical structure models.
Protein structure alignment and comparison

Even when two gene sequences aren't apparently homologous, the structures of the proteins they encode can be similar. New tools for computing structural similarity are making it possible to detect distant homologies by comparing structures, even in the absence of much sequence similarity.
Biochemical simulation
Biochemical simulation uses the tools of dynamical systems modeling to simulate the chemical reactions involved in metabolism. Simulations can extend from individual metabolic pathways to transmembrane transport processes and even properties of whole cells or tissues. Biochemical and cellular simulations traditionally have relied on the ability of the scientist to describe a system mathematically, developing a system of differential equations that represent the different reactions and fluxes occurring in the system. However, new software tools can build the mathematical framework of a simulation automatically from a description provided interactively by the user, making mathematical modeling accessible to any biologist who knows enough about a system to describe it according to the conventions of dynamical systems modeling.
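To make the idea of "a system of differential equations that represent the different reactions and fluxes" concrete, here is a minimal Python sketch of a hypothetical two-step pathway A -> B -> C with first-order kinetics, integrated with the forward Euler method. The pathway, the rate constants and the step size are assumptions chosen for illustration; real simulation packages use more accurate integrators and richer rate laws.

```python
def simulate_pathway(a0: float, k1: float, k2: float,
                     dt: float = 0.01, steps: int = 1000):
    """Integrate dA/dt = -k1*A, dB/dt = k1*A - k2*B, dC/dt = k2*B with forward Euler.

    Returns a list of (time, A, B, C) tuples, starting from A = a0, B = C = 0.
    """
    a, b, c = a0, 0.0, 0.0
    trajectory = [(0.0, a, b, c)]
    for n in range(1, steps + 1):
        flux1 = k1 * a  # rate of A -> B
        flux2 = k2 * b  # rate of B -> C
        a -= flux1 * dt
        b += (flux1 - flux2) * dt
        c += flux2 * dt
        trajectory.append((n * dt, a, b, c))
    return trajectory


final_time, a, b, c = simulate_pathway(a0=1.0, k1=1.0, k2=0.5)[-1]
print(final_time, a, b, c)
```

Each flux leaves one pool and enters the next, so the total amount A + B + C stays constant throughout the run, which is a quick sanity check on any such model.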
Whole genome analysis

As more and more genomes are sequenced completely, the analysis of raw genome data has become a more important task. There are a number of perspectives from which one can look at genome data: for example, it can be treated as a long linear sequence, but it's often more useful to integrate DNA sequence information with existing genetic and physical map data. This allows you to navigate a very large genome and find what you want.
Primer design
Many molecular biology protocols require the design of oligonucleotide primers. Proper primer design is critical for the success of polymerase chain reaction (PCR), oligo hybridization, DNA sequencing, and microarray experiments. Primers must hybridize with the target DNA to provide a clear answer to the question being asked, but they must also have appropriate physicochemical properties; they must not self-hybridize or dimerize; and they should not have multiple targets within the sequence under investigation. There are several web-based services that allow users to submit a DNA sequence and automatically detect appropriate primers, or to compute the properties of a desired primer DNA sequence.
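The checks listed above can be approximated with a few simple functions. The Python sketch below is a rough illustration, not a replacement for a primer design service: the melting temperature uses the Wallace rule of thumb (2 degrees C per A/T, 4 degrees C per G/C), which is only a crude estimate for short oligos, and the function names, the 4-base window for the self-complementarity check and the exact-match site count are all assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_content(primer: str) -> float:
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> int:
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def has_3prime_self_complementarity(primer: str, window: int = 4) -> bool:
    """True if the last `window` bases could pair with another part of the primer."""
    primer = primer.upper()
    return reverse_complement(primer[-window:]) in primer


def count_occurrences(pattern: str, text: str) -> int:
    """Count possibly overlapping exact occurrences of pattern in text."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def count_target_sites(primer: str, template: str) -> int:
    """Exact primer matches on the given strand plus those on its reverse complement."""
    primer, template = primer.upper(), template.upper()
    return (count_occurrences(primer, template)
            + count_occurrences(primer, reverse_complement(template)))


print(gc_content("ATGC"), wallace_tm("ATGC"), has_3prime_self_complementarity("GAATTC"))
```

A candidate with more than one target site, or with a self-complementary 3' end, would normally be rejected before any laboratory work.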
DNA microarray analysis

DNA microarray analysis is a relatively new molecular biology method that expands on classic probe hybridization methods to provide access to thousands of genes at once. The main tasks in microarray analysis as it is currently done include an image analysis step, in which individual spots on the array image are identified and signal intensities are quantified.
Proteomics analysis
Before they're ever crystallized and biochemically characterized, proteins are often studied using a combination of gel electrophoresis, partial sequencing, and mass spectroscopy. 2-D gel electrophoresis can separate a mixture of thousands of proteins into distinct components; the individual spots of material can be blotted or even cut from the gel and analyzed. Simple computational tools can provide some information to aid in the process of analyzing protein mixtures. It's trivial to compute molecular weight and pI from a protein sequence; by using these values in combination, sets of candidate identities can be found for each spot on a gel. It's also possible to compute, from a protein sequence, the peptide fingerprint that is created when that protein is broken down into fragments by enzymes with specific protein cleavage sites.
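A peptide fingerprint of the kind described above can be sketched in Python. The example assumes trypsin as the enzyme, with the commonly used simplified rule that it cleaves after K or R except when the next residue is P; the mass table holds approximate average residue masses in daltons, and the function names and test sequence are hypothetical. Missed cleavages and modifications are ignored.

```python
import re

# Approximate average residue masses in daltons (assumed reference values).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added per free peptide


def tryptic_peptides(protein: str):
    """Split after K or R, except when the following residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein.upper()) if p]


def peptide_mass(peptide: str) -> float:
    """Average molecular mass of a peptide: residue masses plus one water."""
    return sum(RESIDUE_MASS[residue] for residue in peptide.upper()) + WATER_MASS


def peptide_fingerprint(protein: str):
    """List of (peptide, mass) pairs for a complete tryptic digest."""
    return [(p, round(peptide_mass(p), 2)) for p in tryptic_peptides(protein)]


print(peptide_fingerprint("MKWVTFRPLLK"))  # made-up sequence
```

Matching a measured list of peptide masses against fingerprints computed this way for every database entry is what narrows a gel spot down to candidate identities.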
Databases
The internet is a powerful resource containing a large volume of data and tools to manipulate them; unfortunately, connecting data between them can sometimes be tricky.
What is a database?
An organized body of related information.
A collection of information organized and presented to serve a specific purpose.
A computerized database is an updated, organized file of machine-readable information that is rapidly searched and retrieved by computer.
A computerized database:
is a storehouse of data (records);
allows user-defined queries;
allows extraction of specified records;
allows adding, changing, removing, and merging of records;
uses standardized formats.
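These properties can be seen in miniature with Python's built-in sqlite3 module. The sketch below is only an illustration: the table layout, the accession numbers and the sequences are invented, and real sequence databases use their own formats and far richer annotation.

```python
import sqlite3


def build_demo_database():
    """Create an in-memory table of toy sequence records (all values are made up)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequences ("
        "accession TEXT PRIMARY KEY, organism TEXT, molecule TEXT, residues TEXT)"
    )
    conn.executemany(
        "INSERT INTO sequences VALUES (?, ?, ?, ?)",
        [
            ("DEMO0001", "Escherichia coli", "DNA", "ATGAAACGTTAA"),
            ("DEMO0002", "Saccharomyces cerevisiae", "DNA", "ATGGCCTGA"),
            ("DEMO0003", "Escherichia coli", "protein", "MKR"),
        ],
    )
    return conn


def accessions_for(conn, organism, molecule):
    """User-defined query: extract the records matching both criteria."""
    rows = conn.execute(
        "SELECT accession FROM sequences "
        "WHERE organism = ? AND molecule = ? ORDER BY accession",
        (organism, molecule),
    )
    return [row[0] for row in rows]


conn = build_demo_database()
print(accessions_for(conn, "Escherichia coli", "DNA"))

# Changing and removing records.
conn.execute("UPDATE sequences SET residues = ? WHERE accession = ?",
             ("ATGGCCTAA", "DEMO0002"))
conn.execute("DELETE FROM sequences WHERE accession = ?", ("DEMO0003",))
print(conn.execute("SELECT COUNT(*) FROM sequences").fetchone()[0])
```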
The ideal sequence database for computational analyses and data mining:

It must be complete with minimal redundancy.
It must contain as much up-to-date information (annotation) as possible on each sequence.
All the information items must be retrievable by computer programs in a consistent manner.
It must be highly interoperable with other databases.
Database Categories List
Nucleotide Sequence Databases
RNA sequence databases
Protein sequence databases
Structure Databases
Genomics Databases (non-vertebrate)
Metabolic and Signaling Pathways
Human and other Vertebrate Genomes
Human Genes and Diseases
Microarray Data and other Gene Expression Databases
Proteomics Resources
Other Molecular Biology Databases
Organelle databases
Plant databases
Immunological databases
Nucleotide Sequence Databases
The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the scientific community and making it freely available. The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. These data are heterogeneous: they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single-pass sequences), the extent of sequence annotation and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). The nucleotide databases are distributed free of charge over the internet.
DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal synchronisation. The result is that they contain exactly the same information, except for sequences that have been added in the last 24 hours. Nucleotide Sequence Databases can be further subdivided into the following:
1) International Nucleotide Sequence Database Collaboration
2) Coding and non-coding DNA
3) Gene structure, introns and exons, splice sites
4) Transcriptional regulator sites and transcription factors
Nucleotide Sequence Databases
Database name | Full name and/or description | URL
1.1. International Nucleotide Sequence Database Collaboration
GenBank | An annotated collection of all publicly available nucleotide and protein sequences | http://www.ncbi.nlm.nih.gov/
EMBL | EMBL Nucleotide Sequence Database | http://www.ebi.ac.uk/embl.html
DDBJ | DNA Data Bank of Japan | http://www.ddbj.nig.ac.jp
Online databases

Primary repositories of sequence data:
- European Bioinformatics Institute (EBI)
- DNA Data Bank of Japan (DDBJ)
- GenBank, National Center for Biotechnology Information (NCBI)
Each of these databases contains equivalent information (formats vary slightly).
1.2. DNA sequences: genes, motifs and regulatory sites 1.2.1. Coding and coding DNA
ACLAME CUTG A classification of genetic mobile elements Codon usage tabulated from GenBank Deviations from the standard genetic code in various organisms and organelles Human endogenous retrovirus database Immunoglobulin, T cell receptor and MHC nucleotide sequences from human and other vertebrates http://aclame.ulb.ac.be/ http://www.kazusa.or.jp/codon/ http://www.ncbi.nlm.nih.gov/Taxonomy/ Utils/wprintgc.cgi?mode=c http://herv.img.cas.cz
Genetic Codes HERVd IMGT/LIGMDB Imprinted Gene Catalogue
http://imgt.cines.fr/cgi-bin/IMGTlect.jv
Imprinted genes and parent-of-origin effects in animals
http://www.otago.ac.nz/IGC
Islander MICdb
Pathogenicity islands and prophages in bacterial genomes Prokaryotic microsatellites
http://www.indiana.edu/islander http://www.cdfd.org.in/micas
STRBase TIGR Gene Indices
Short tandem DNA repeats database Organism-specific databases of EST and gene sequences
http://www.cstl.nist.gov/div831/strbase/ http://www.tigr.org/tdb/tgi.shtml
Transterm
Codon usage, start and stop signals
http://uther.otago.ac.nz/Transterm.html
UniGene
Unified clusters of ESTs and full-length mRNA sequences
http://www.ncbi.nlm.nih.gov/UniGene/
UniVec
Vector sequences, adapters, linkers and primers used in DNA cloning; can be used to check for vector contamination
http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html
VectorDB
Characterization and classification of nucleic acid vectors
http://genome-www2.stanford.edu/vectordb/
Xpro
Eukaryotic protein-encoding DNA sequences, both intron-containing and intron-less genes
http://origin.bic.nus.edu.sg/xpro/
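Several resources in this section (CUTG, Transterm) tabulate codon usage. As a rough illustration of what such a table contains, the sketch below counts codons in a single coding sequence. It assumes the sequence is already in frame from its first base, and the function names are hypothetical rather than taken from any of the listed databases.

```python
from collections import Counter


def codon_usage(cds):
    """Count codons in a coding sequence read in frame from its first base."""
    cds = cds.upper().replace("U", "T")   # accept RNA as well as DNA
    usable = len(cds) - len(cds) % 3      # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def relative_usage(counts):
    """Convert raw codon counts into fractions of all codons counted."""
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


counts = codon_usage("ATGGCTGCTTAA")      # ATG GCT GCT TAA
fractions = relative_usage(counts)
```

Real codon usage tables aggregate such counts over all annotated coding sequences of an organism.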
1.2.2. Gene structure, introns and exons, splice sites
ASAP
Alternative spliced isoforms
http://www.bioinformatics.ucla.edu/ASAP
ASD
EBI's alternative splicing database project; includes three databases: AltSplice, AltExtron and AEdb
http://www.ebi.ac.uk/asd
ASDB
Alternative splicing database: protein products and expression patterns of alternatively-spliced genes
http://hazelton.lbl.gov/teplitski/alt
EASED
Extended alternatively spliced EST database
http://eased.bioinf.mdc-berlin.de/
EID
Exon-intron database: introns in protein-coding genes
http://mcb.harvard.edu/gilbert/EID/
ExInt
Exon-intron structure of eukaryotic genes
http://intron.bic.nus.edu.sg/exint/exint.html
HS3D
Homo sapiens splice sites dataset
http://www.sci.unisannio.it/docenti/rampone/
IDB/IEDB
Intron sequence and evolution databases
http://nutmeg.bio.indiana.edu/intron/index.html
Intronerator
Introns and alternative splicing in C.elegans and C.briggsae
http://www.cse.ucsc.edu/kent/intronerator/
SpliceDB
Canonical and non-canonical mammalian splice sites
http://genomic.sanger.ac.uk/spldb/SpliceDB.html
SpliceNest
A tool for visualizing splicing of genes from EST data
http://splicenest.molgen.mpg.de/
YIDB
Yeast nuclear and mitochondrial intron sequences
http://www.embl-heidelberg.de/ExternalInfo/seraphin/yidb.html
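The distinction between canonical and non-canonical splice sites used by resources such as SpliceDB rests on the dinucleotides at the two ends of an intron: most introns start with GT and end with AG. A minimal sketch of that check is shown below. The function name is hypothetical, and the input is assumed to be the intron sequence alone, 5' to 3'.

```python
def classify_intron(intron):
    """Label an intron by its terminal dinucleotides (donor...acceptor)."""
    seq = intron.upper().replace("U", "T")   # accept RNA as well as DNA
    if len(seq) < 4:
        raise ValueError("intron too short to have both splice sites")
    donor, acceptor = seq[:2], seq[-2:]
    if (donor, acceptor) == ("GT", "AG"):
        return "canonical (GT-AG)"
    return "non-canonical (%s-%s)" % (donor, acceptor)
```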
1.2.3. Transcriptional regulator sites and transcription factors

ACTIVITY
Functional DNA/RNA site activity
http://util.bionet.nsc.ru/databases/activity.html
DBTBS
Bacillus subtilis promoters and transcription factors
http://dbtbs.hgc.jp/
DBTSS
A database of transcriptional start sites
http://dbtss.hgc.jp/
DPInteract
Binding sites for E.coli DNA-binding proteins
http://arep.med.harvard.edu/dpinteract
EPD
Eukaryotic promoter database
http://www.epd.isb-sib.ch
HemoPDB
Hematopoietic promoter database: transcriptional regulation in hematopoiesis
http://bioinformatics.med.ohio-state.edu/HemoPDB
HvrBase
Primate mitochondrial DNA control region sequences
http://www.hvrbase.org/
JASPAR
PSSMs for transcription factor DNA-binding sites
http://jaspar.cgb.ki.se
PLACE
Plant cis-acting regulatory DNA elements
http://www.dna.affrc.go.jp/htdocs/PLACE
PlantCARE
Plant promoters and cis-acting regulatory elements
http://intra.psb.ugent.be:8080/PlantCARE/
PlantProm
Plant promoter sequences for RNA polymerase II
http://mendel.cs.rhul.ac.uk/
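Position-specific scoring matrices (PSSMs) of the kind JASPAR distributes are typically used by sliding the matrix along a sequence and summing per-position scores. The sketch below illustrates the idea under simple assumptions: log2-odds scores against a uniform background and a fixed pseudocount. The count matrix TOY_COUNTS is invented for illustration and is not taken from any database; the function names are likewise hypothetical.

```python
import math

BASES = "ACGT"


def counts_to_pssm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores."""
    pssm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + pseudocount * len(BASES)
        pssm.append({b: math.log2((column[b] + pseudocount) / total / background)
                     for b in BASES})
    return pssm


def best_site(pssm, sequence):
    """Slide the matrix along the sequence; return (score, start) of the best window."""
    width = len(pssm)
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if all(ch in BASES for ch in window):   # skip windows with N or other symbols
            hits.append((sum(col[ch] for col, ch in zip(pssm, window)), start))
    return max(hits) if hits else None


# Invented counts for a 4-bp site that strongly prefers T-A-T-A (illustration only).
TOY_COUNTS = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]
toy_pssm = counts_to_pssm(TOY_COUNTS)
hit = best_site(toy_pssm, "GGCTATAGC")   # best window starts at index 3 (TATA)
```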
RNA sequence databases

The RNA sequence databases compile complete or nearly complete RNA sequences, either for all RNA classes or for specific ones such as ribosomal RNA. Some of them contain secondary structure information. Additional information about the sequences, such as the taxonomic classification of the organism from which they were obtained, and literature references are also provided. There are databases covering 16S and 23S ribosomal RNA mutations, 5S rRNA sequences, genomic tRNA, all complete or nearly complete rRNA sequences, and so on.
2. RNA sequence databases

16S and 23S rRNA Mutation Database
16S and 23S ribosomal RNA mutations
http://ribosome.fandm.edu/
5S rRNA Database
5S rRNA sequences
http://biobases.ibch.poznan.pl/5SData/
Aptamer database
Small RNA/DNA molecules binding nucleic acids, proteins
http://aptamer.icmb.utexas.edu/
ARED
AU-rich element-containing mRNA database
http://rc.kfshrc.edu.sa/ared
Mobile group II introns
A database of group II introns, self-splicing catalytic RNAs
http://www.fp.ucalgary.ca/group2introns/
European rRNA database
All complete or nearly complete rRNA sequences
http://www.psb.ugent.be/rRNA/
GtRDB
Genomic tRNA database
http://rna.wustl.edu/GtRDB
Guide RNA Database
RNA editing in various kinetoplastid species
http://biosun.bio.tu-darmstadt.de/goringer/gRNA/gRNA.html
HIV Sequence Database
HIV RNA sequences
http://hiv-web.lanl.gov/
HyPaLib
Hybrid pattern library: structural elements in classes of RNA
http://bibiserv.techfak.uni-bielefeld.de/HyPa/
IRESdb
Internal ribosome entry site database
http://ifr31w3.toulouse.inserm.fr/IRESdatabase/
miRNA Registry
Database of microRNAs (small non-coding RNAs)
http://www.sanger.ac.uk/Software/Rfam/mirna/
NCIR
Non-canonical interactions in RNA structures
http://prion.bchs.uh.edu/bp_type/
ncRNAs Database
Non-coding RNAs with regulatory functions
http://biobases.ibch.poznan.pl/ncRNA/
PLANTncRNAs
Plant non-coding RNAs
http://www.prl.msu.edu/PLANTncRNAs
Plant snoRNA DB
snoRNA genes in plant species
http://www.scri.sari.ac.uk/plant_snoRNA/
PLMItRNA
Plant mitochondrial tRNA
http://bighost.area.ba.cnr.it/PLMItRNA/
PseudoBase
Database of RNA pseudoknots
http://wwwbio.leidenuniv.nl/~Batenburg/PKB.html
RDP
Ribosomal database project: rRNA sequence data
http://rdp.cme.msu.edu
Rfam
Non-coding RNA families
http://www.sanger.ac.uk/Software/Rfam/
RISSC
Ribosomal internal spacer sequence collection
http://ulises.umh.es/RISSC
RNA Modification Database
Naturally modified nucleosides in RNA
http://medlib.med.utah.edu/RNAmods/
RRNDB
rRNA operon numbers in various prokaryotes
http://rrndb.cme.msu.edu/
Small RNA Database
Small RNAs from prokaryotes and eukaryotes
http://mbcr.bcm.tmc.edu/smallRNA
SRPDB
Signal recognition particle database
http://psyche.uthct.edu/dbs/SRPDB/SRPDB.html
Subviral RNA Database
Viroids and viroid-like RNAs
http://subviral.med.uottawa.ca/cgi-bin/home.cgi
tmRNA Website
tmRNA sequences and alignments
http://www.indiana.edu/tmrna
tmRDB
tmRNA database
http://psyche.uthct.edu/dbs/tmRDB/tmRDB.html
tRNA database
tRNA viewer and sequence editor
http://www.uni-bayreuth.de/departments/biochemie/trna/
UTRdb/UTRsite
5'- and 3'-UTRs of eukaryotic mRNAs
http://bighost.area.ba.cnr.it/srs6/
Types of protein databases

1. Protein sequence databases
2. Protein motif databases
3. Protein structure databases
[Figure: illustrative alignment of two peptide sequences]
Protein sequence databases

The protein databases are the most comprehensive source of information on proteins. It is necessary to distinguish between universal databases covering proteins from all species and specialised data collections storing information about specific families or groups of proteins, or about the proteins of a specific organism. Two categories of universal protein databases can be discerned: simple archives of sequence data, and annotated databases where additional information has been added to the sequence record.
The upcoming slides list databases such as:
- Primary protein sequence databases such as UniProt/Swiss-Prot
- Specialised protein sequence databases such as GOA
- Specialised protein databases such as ENZYME
- Secondary protein databases such as InterPro
- Structure databases such as PDB
3. Protein sequence databases
3.1. General sequence databases

EXProt
Sequences of proteins with experimentally verified function
http://www.cmbi.kun.nl/EXProt/
NCBI Protein database
All protein sequences: translated from GenBank and imported from other protein databases
http://www.ncbi.nlm.nih.gov/entrez
PIR
Protein information resource: a collection of protein sequence databases, part of the UniProt project
http://pir.georgetown.edu/
PIR-NREF
PIR's non-redundant reference protein database
http://pir.georgetown.edu/pirwww/pirnref.shtml
PRF
Protein research foundation database of peptides: sequences, literature and unnatural amino acids
http://www.prf.or.jp/en
Swiss-Prot
Curated protein sequence database with a high level of annotation (protein function, domain structure, modifications)
http://www.expasy.org/sprot
TrEMBL
Translations of EMBL nucleotide sequence entries: computer-annotated supplement to Swiss-Prot
http://www.expasy.org/sprot
UniProt
Universal protein knowledgebase: a database of protein sequence from Swiss-Prot, TrEMBL and PIR
http://www.uniprot.org/
3.2. Protein properties
AAindex
Physicochemical properties of amino acids
http://www.genome.ad.jp/aaindex/
ProTherm
Thermodynamic data for wild-type and mutant proteins
http://gibk26.bse.kyutech.ac.jp/jouhou/Protherm/protherm.html
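An amino acid index is simply a table assigning one number to each of the 20 residues, and a common use is to average it over a sequence. The sketch below does this with the Kyte-Doolittle hydropathy scale, a widely used index of the kind AAindex collects. The function name is hypothetical, and the values are typed in here for illustration rather than read from the database.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(protein):
    """Average hydropathy over all residues (often called the GRAVY score)."""
    residues = protein.upper()
    if not residues:
        raise ValueError("empty sequence")
    try:
        return sum(KYTE_DOOLITTLE[r] for r in residues) / len(residues)
    except KeyError as err:
        # Report residues outside the 20 standard amino acids explicitly.
        raise ValueError("unknown residue: %s" % err.args[0]) from None
```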
3.3. Protein localization and targeting

DBSubLoc
Database of protein subcellular localization
http://www.bioinfo.tsinghua.edu.cn/dbsubloc.html
MitoDrome
Nuclear-encoded mitochondrial proteins of Drosophila
http://bighost.area.ba.cnr.it/BIG/MitoDrome
NESbase
Nuclear export signals database
http://www.cbs.dtu.dk/databases/NESbase
NLSdb
Nuclear localization signals
http://cubic.bioc.columbia.edu/db/NLSdb/
THGS
Transmembrane helices in genome sequences
http://pranag.physics.iisc.ernet.in/thgs/
TMPDB
Experimentally characterized transmembrane topologies
http://bioinfo.si.hirosaki-u.ac.jp/TMPDB/
3.4. Protein sequence motifs and active sites

ASC
Active sequence collection: biologically active peptides
http://bioinformatica.isa.cnr.it/ASC/
Blocks
Alignments of conserved regions in protein families
http://blocks.fhcrc.org/
CSA
Catalytic site atlas: enzyme active sites and catalytic residues in enzymes of known 3D structure
http://www.ebi.ac.uk/thornton-srv/databases/CSA/
COMe
Co-ordination of metals etc.: classification of bioinorganic proteins (metalloproteins and some other complex proteins)
http://www.ebi.ac.uk/come
eMOTIF
Protein sequence motif determination and searches
http://motif.stanford.edu/emotif
Metalloprotein Site Database
Metal-binding sites in metalloproteins
http://metallo.scripps.edu/
O-GlycBase
O- and C-linked glycosylation sites in proteins
http://www.cbs.dtu.dk/databases/OGLYCBASE/
PhosphoBase
Protein phosphorylation sites
http://www.cbs.dtu.dk/databases/PhosphoBase/
PROMISE
Prosthetic centers and metal ions in protein active sites
http://metallo.scripps.edu/PROMISE
PROSITE
Biologically significant protein patterns and profiles
http://www.expasy.org/prosite
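PROSITE patterns are written in a compact notation (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats) that maps closely onto regular expressions. The sketch below translates a subset of that notation. The function name is hypothetical, and the example pattern N-{P}-[ST]-{P} is the well-known N-glycosylation site motif.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (subset of the syntax) into a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        if element.startswith("<"):          # anchor at the N-terminus
            parts.append("^")
            element = element[1:]
        anchored_end = element.endswith(">")  # anchor at the C-terminus
        if anchored_end:
            element = element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # forbidden residues
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
        if anchored_end:
            parts.append("$")
    return "".join(parts)


glyco = re.compile(prosite_to_regex("N-{P}-[ST]-{P}."))
```

A compiled pattern can then be applied with glyco.search(sequence) or glyco.finditer(sequence).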
3.5. Protein domain databases; protein classification
CDD
Conserved domain database: includes protein domains from Pfam, SMART and COG databases
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
CluSTr
Clusters of Swiss-Prot+TrEMBL proteins
http://www.ebi.ac.uk/clustr
Hits
A database of protein domains and motifs
http://hits.isb-sib.ch/
InterPro
Integrated resource of protein families, domains and functional sites
http://www.ebi.ac.uk/interpro
iProClass
Integrated protein classification database
http://pir.georgetown.edu/iproclass/
MetaFam
Database of protein family annotations
http://metafam.ahc.umn.edu/
Pfam
Protein families: multiple sequence alignments and profile hidden Markov models of protein domains
http://www.sanger.ac.uk/Software/Pfam/
PIRSF
Family/superfamily classification of whole proteins
http://pir.georgetown.edu/pirsf/
PRINTS
Hierarchical gene family fingerprints
http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
PIR-ALN
Curated database of protein sequence alignments
http://pir.georgetown.edu/pirwww/dbinfo/piraln.html
ProClass
Protein families defined by PIR superfamilies and PROSITE patterns
http://pir.georgetown.edu/gfserver/proclass.html
ProDom
Protein domain families
http://www.toulouse.inra.fr/prodom.html
ProtoMap
Hierarchical classification of Swiss-Prot proteins
http://protomap.cornell.edu/
ProtoNet
Hierarchical clustering of Swiss-Prot proteins
http://www.protonet.cs.huji.ac.il/
SBASE
Protein domain sequences and tools
http://www.icgeb.org/sbase
SMART
Simple modular architecture research tool: signalling, extracellular and chromatin-associated protein domains
http://smart.embl-heidelberg.de/
SUPFAM
Grouping of sequence families into superfamilies
http://pauling.mbu.iisc.ernet.in/supfam
SYSTERS
Systematic re-searching and clustering of proteins
http://systers.molgen.mpg.de/
TIGRFAMs
TIGR protein families adapted for functional annotation
http://www.tigr.org/TIGRFAMs
3.6. Databases of individual protein families

AARSDB
Aminoacyl-tRNA synthetase database
http://rose.man.poznan.pl/aars/index.html
ABCdb
ABC transporters database
http://ir2lcb.cnrs-mrs.fr/ABCdb/
ASPD
Artificial selected proteins/peptides database
http://wwwmgs.bionet.nsc.ru/mgs/gnw/aspd/
BacTregulators
Transcriptional regulators of AraC and TetR families
http://www.bactregulators.org/
CSDBase
Cold shock domain-containing proteins
http://www.chemie.uni-marburg.de/~csdbase/
DExH/D Family Database
DEAD-box, DEAH-box and DExH-box proteins
http://www.helicase.net/dexhd/dbhome.htm
Endogenous GPCR List
G protein-coupled receptors; expression in cell lines
http://www.tumor-gene.org/GPCR/gpcr.html
ESTHER
Esterases and other alpha/beta hydrolase enzymes
http://www.ensam.inra.fr/esther
EyeSite
Families of proteins functioning in the eye
http://eyesite.cryst.bbk.ac.uk/
GPCRDB
G protein-coupled receptors database
http://www.gpcr.org/7tm/
Histone Database
Histone fold sequences and structures
http://research.nhgri.nih.gov/histones/
HIV Molecular Immunology Database
HIV epitopes
http://hiv-web.lanl.gov/immunology/
HIV Protease Database
HIV reverse transcriptase and protease sequences
http://hivdb.stanford.edu/
Homeobox Page
Homeobox proteins, classification and evolution
http://www.biosci.ki.se/groups/tbu/homeo.html
Homeodomain Resource
Homeodomain sequences, structures and related genetic and genomic information
http://research.nhgri.nih.gov/homeodomain
HORDE
Human olfactory receptor data exploratorium
http://bioinfo.weizmann.ac.il/HORDE/
InBase
Inteins (protein splicing elements) database: properties, sequences, bibliography
http://www.neb.com/neb/inteins.html
Kabat Database
Sequences of proteins of immunological interest
http://immuno.bme.nwu.edu/
KinG
Ser/Thr/Tyr-specific protein kinases encoded in complete genomes
http://hodgkin.mbu.iisc.ernet.in/king
Knottins
Database of knottins: small proteins with an unusual disulfide-through-disulfide knot
http://knottin.cbs.cnrs.fr
LGICdb
Ligand-gated ion channel subunit sequences database
http://www.pasteur.fr/recherche/banques/LGIC/LGIC.html
Lipase Engineering Database
Sequence, structure and function of lipases and esterases
http://www.led.uni-stuttgart.de/
LOX-DB
Mammalian, invertebrate, plant and fungal lipoxygenases
http://www.dkfz-heidelberg.de/spec/lox-db/
MEROPS
Database of proteolytic enzymes (peptidases)
http://www.merops.ac.uk/
MHCPEP
MHC-binding peptides
http://wehih.wehi.edu.au/mhcpep/
MPIMP
Mitochondrial protein import machinery of plants
http://millar3.biochem.uwa.edu.au/~lister/index.html
NPD
Nuclear protein database
http://npd.hgu.mrc.ac.uk/
NucleaRDB
Nuclear receptor superfamily
http://www.receptors.org/NR/
Nuclear Receptor Resource
Nuclear receptor superfamily
http://nrr.georgetown.edu/nrr/nrr.html
NUREBASE
Nuclear hormone receptors database
http://www.ens-lyon.fr/LBMC/laudet/nurebase/nurebase.html
Olfactory Receptor Database
Sequences for olfactory receptor-like molecules
http://ycmi.med.yale.edu/senselab/ordb/
ooTFD
Object-oriented transcription factors database
http://www.ifti.org/ootfd
PKR
Protein kinase resource: sequences, enzymology, genetics and molecular and structural properties
http://pkr.sdsc.edu/
PLANT-PIs
Plant protease inhibitors
http://bighost.area.ba.cnr.it/PLANT-PIs
PlantsP/PlantsT
Plant proteins involved in phosphorylation and membrane transport
http://plantsp.sdsc.edu/
Prolysis
Proteases and natural and synthetic protease inhibitors
http://delphi.phys.univ-tours.fr/Prolysis/
REBASE
Restriction enzymes and associated methylases
http://rebase.neb.com/rebase/rebase.html
Ribonuclease P Database
RNase P sequences, alignments and structures
http://www.mbio.ncsu.edu/RNaseP/home.html
RPG
Ribosomal protein gene database
http://ribosome.miyazaki-med.ac.jp/
RTKdb
Receptor tyrosine kinase sequences
http://pbil.univ-lyon1.fr/RTKdb/
S/MARt dB
Nuclear scaffold/matrix attached regions
http://smartdb.bioinf.med.uni-goettingen.de/
SDAP
Structural database of allergenic proteins and food allergens
http://fermi.utmb.edu/SDAP
SENTRA
Sensory signal transduction proteins
http://wit.mcs.anl.gov/WIT2/Sentra/HTML/sentra.html
SEVENS
7-transmembrane helix receptors (G-protein-coupled)
http://sevens.cbrc.jp/
SRPDB
Proteins of the signal recognition particles
http://bio.lundberg.gu.se/dbs/SRPDB/SRPDB.html
TrSDB
Transcription factor database
http://ibb.uab.es/trsdb
VIDA
Homologous viral protein families database
http://www.biochem.ucl.ac.uk/bsm/virus_database/VIDA.html
VKCDB
Voltage-gated potassium channel database
http://vkcdb.biology.ualberta.ca/
Wnt Database
Wnt proteins and phenotypes
http://www.stanford.edu/rnusse/wntwindow.html
Structure Databases
The number of known molecular structures is increasing very rapidly, and these structures are available through databases that hold structural information on specific classes of molecule. The subcategories in this division of molecular databases are:
1) Small molecules
2) Carbohydrates
3) Nucleic acid structure
4) Protein structure
4. Structure Databases
4.1. Small molecules
CSD
Cambridge structural database: crystal structure information for organic and metal-organic compounds
http://www.ccdc.cam.ac.uk/prods/csd/csd.html
HIC-Up
Hetero-compound Information Centre, Uppsala
http://xray.bmc.uu.se/hicup
AANT
Amino acid-nucleotide interaction database
http://aant.icmb.utexas.edu/
Klotho
Collection and categorization of biological compounds
http://www.biocheminfo.org/klotho
LIGAND
Chemical compounds and reactions in biological pathways
http://www.genome.ad.jp/ligand/
4.2. Carbohydrates
CCSD
Complex carbohydrate structure database (CarbBank)
http://bssv01.lancs.ac.uk/gig/pages/gag/carbbank.htm
Glycan
Carbohydrate database, part of the KEGG system
http://glycan.genome.ad.jp/
GlycoSuiteDB
N- and O-linked glycan structures and biological sources
http://www.glycosuite.com/
Monosaccharide Browser
Space filling Fischer projections of monosaccharides
http://www.jonmaber.demon.co.uk/monosaccharide
SWEET-DB
Annotated carbohydrate structure and substance information
http://www.dkfz-heidelberg.de/spec2/sweetdb/
4.3. Nucleic acid structure

NDB
Nucleic acid-containing structures
http://ndbserver.rutgers.edu/
NTDB
Thermodynamic data for nucleic acids
http://ntdb.chem.cuhk.edu.hk/
RNABase
RNA-containing structures from PDB and NDB
http://www.rnabase.org/
SCOR
Structural classification of RNA: RNA motifs by structure, function and tertiary interactions
http://scor.lbl.gov/
1.2.3. Transcriptional regulator sites and transcription factors (continued)
PRODORIC NET
Prokaryotic database of gene regulation networks
http://prodoric.tu-bs.de/
PromEC
E.coli promoters with experimentally-identified transcriptional start sites
http://bioinfo.md.huji.ac.il/marg/promec
SELEX_DB
DNA and RNA binding sites for various proteins, found by systematic evolution of ligands by exponential enrichment
http://wwwmgs.bionet.nsc.ru/mgs/systems/selex/
TESS
Transcription element search system
http://www.cbil.upenn.edu/tess
TRANSCompel
Composite regulatory elements affecting gene transcription in eukaryotes
http://www.gene-regulation.com/pub/databases.html#transcompel
TRANSFAC
Transcription factors and binding sites
http://transfac.gbf.de/TRANSFAC/index.html
TRRD
Transcription regulatory regions of eukaryotic genes
http://www.bionet.nsc.ru/trrd/
4.4. Protein structure

ArchDB
Automated classification of protein loop structures
http://gurion.imim.es/archdb
ASTRAL
Sequences of domains of known structure, selected subsets and sequence-structure correspondences
http://astral.stanford.edu/
BAliBASE
A database for comparison of multiple sequence alignments
http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE2/index.html
BioMagResBank
NMR spectroscopic data for proteins and nucleic acids
http://www.bmrb.wisc.edu/
CADB
Conformational angles in proteins database
http://cluster.physics.iisc.ernet.in/cadb/
CATH
Protein domain structures database
http://www.biochem.ucl.ac.uk/bsm/cath_new
CE
3D Protein structure alignments
http://cl.sdsc.edu/ce.html
CKAAPs DB
Structurally-similar proteins with dissimilar sequences
http://ckaap.sdsc.edu/
Dali
Protein fold classification using the Dali search engine
http://www.bioinfo.biocenter.helsinki.fi:8080/dali/
Decoys R Us
Computer-generated protein conformations
http://dd.stanford.edu/
DisProt
Database of Protein Disorder: information about proteins that lack fixed 3D structure in their native states
http://divac.ist.temple.edu/disprot
DomIns
Domain insertions in known protein structures
http://stash.mrc-lmb.cam.ac.uk/DomIns
DSDBASE
Native and modeled disulfide bonds in proteins
http://www.ncbs.res.in/~faculty/mini/dsdbase/dsdbase.html
DSMM
Database of simulated molecular motions
http://projects.villa-bosch.de/dbase/dsmm/
eF-site
Electrostatic surface of Functional site: electrostatic potentials and hydrophobic properties of the active sites
http://ef-site.protein.osaka-u.ac.jp/eF-site
FSSP
Fold classification based on structure-structure alignment of proteins, currently maintained as Dali database
http://www.ebi.ac.uk/dali/fssp
Gene3D
Precalculated structural assignments for whole genomes
http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D/
GTD
Genomic threading database: structural annotations of complete genomes
http://bioinf.cs.ucl.ac.uk/GTD
GTOP
Protein fold predictions from genome sequences
http://spock.genes.nig.ac.jp/~genome/
Het-PDB Navi
Hetero-atoms in protein structures
http://daisy.nagahama-ibio.ac.jp/golab/hetpdbnavi.html
HOMSTRAD
Homologous structure alignment database: curated structure-based alignments for protein families
http://www-cryst.bioc.cam.ac.uk/homstrad
IMB Jena Image Library
Visualization and analysis of 3D biopolymer structures
http://www.imb-jena.de/IMAGE.html
IMGT/3Dstructure-DB
Sequences and 3D structures of vertebrate immunoglobulins, T cell receptors and MHC proteins
http://imgt3d.igh.cnrs.fr
ISSD
Integrated sequence-structure database
http://www.protein.bio.msu.su/issd
LPFC
Library of protein family core structures
http://www-smi.stanford.edu/projects/helix/LPFC
MMDB
NCBI's database of 3D structures, part of NCBI Entrez
http://www.ncbi.nlm.nih.gov/Structure
E-MSD
EBI's macromolecular structure database
http://www.ebi.ac.uk/msd
ModBase
Annotated comparative protein structure models
http://salilab.org/modbase
MolMovDB
Database of macromolecular movements: descriptions of protein and macromolecular motions, including movies
http://bioinfo.mbb.yale.edu/MolMovDB/
PALI
Phylogeny and alignment of homologous protein structures
http://pauling.mbu.iisc.ernet.in/~pali
PASS2
Structural motifs of protein superfamilies
http://ncbs.res.in/~faculty/mini/campass/pass.html
PepConfDB
A database of peptide conformations
http://202.41.70.49:8080/pepconfdb/index.htm
PDB
Protein structure databank: all publicly available 3D structures of proteins and nucleic acids
http://www.rcsb.org/pdb
PDB-REPRDB
Representative protein chains, based on PDB entries
http://www.cbrc.jp/pdbreprdb/
PDBsum
Summaries and analyses of PDB structures
http://www.biochem.ucl.ac.uk/bsm/pdbsum
SCOP
Structural classification of proteins
http://scop.mrc-lmb.cam.ac.uk/scop
Sloop
Classification of protein loops
http://www-cryst.bioc.cam.ac.uk/~sloop/
Structure Superposition Database
Pairwise superposition of TIM-barrel structures
http://ssd.rbvi.ucsf.edu/
SWISS-MODEL Repository
Database of annotated 3D protein structure models
http://swissmodel.expasy.org/repository
SUPERFAMILY
Assignments of proteins to structural superfamilies
http://supfam.org/
SURFACE
Surface residues and functions annotated, compared and evaluated: a database of protein surface patches
http://cbm.bio.uniroma2.it/surface
TargetDB
Target data from worldwide structural genomics projects
http://targetdb.pdb.org/
3D-GENOMICS
Structural annotations for complete proteomes
http://www.sbg.bio.ic.ac.uk/3dgenomics
TOPS
Topology of protein structures database
http://www.tops.leeds.ac.uk
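Most of the structure resources above build on coordinate files in the fixed-column PDB format, where each ATOM or HETATM line stores the atom name, residue, chain and x/y/z coordinates at set column positions. The sketch below reads a few of those fields from one line. The function name is hypothetical, and the sample line is constructed for illustration rather than taken from a real entry.

```python
def parse_atom_line(line):
    """Extract a few fixed-width fields from a PDB ATOM/HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
    }


# Build an illustrative record with the same column layout (values are made up).
sample = "ATOM  %5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f" % (
    1, " CA", " ", "ALA", "A", 12, " ", 11.104, 6.134, -6.504)
atom = parse_atom_line(sample)
```

Slicing by column rather than splitting on whitespace matters here, because adjacent fields in real files can run together without a separating space.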
Genomics Databases
For organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in electronic form, and a variety of new databases have been developed. These databases vary greatly in the classes of data captured and in how these data are stored. This category covers databases holding information on various genomes, such as human, plant, viral, invertebrate and microbial genomes:
1) Genome annotation terms, ontologies and nomenclature
2) Taxonomy and identification
3) General genomics databases
4) Viral genome databases
5) Prokaryotic genome databases
6) Unicellular eukaryotes genome databases
7) Fungal genome databases
8) Invertebrate genome databases
9) Human genome databases, maps and viewers
5. Genomics Databases (non-human)
5.1. Genome annotation terms, ontologies and nomenclature
Genew
Human gene nomenclature: approved gene symbols
http://www.gene.ucl.ac.uk/nomenclature
GO
Gene ontology consortium database
http://www.geneontology.org/
GOA
Gene ontology annotation project
http://www.ebi.ac.uk/GOA
IUBMB Nomenclature database
Nomenclature of enzymes, membrane transporters, electron transport proteins and other proteins
http://www.chem.qmul.ac.uk/iubmb
IUPAC Nomenclature database
Nomenclature of biochemical and organic compounds approved by the IUBMB-IUPAC Joint Commission
http://www.chem.qmul.ac.uk/iupac
IUPHAR-RD
The International Union of Pharmacology recommendations on receptor nomenclature and drug classification
http://www.iuphar-db.org/iuphar-rd/
PANTHER
Gene products organized by biological function
http://panther.celera.com/
SOURCE
Functional genomic resource for annotations, ontologies and expression data
http://source.stanford.edu/
UMLS
Unified medical language system
http://umlsks.nlm.nih.gov/
5.1.1. Taxonomy and Identification
ICB
gyrB database for identification and classification of bacteria
http://www.mbio.co.jp/icb
NCBI Taxonomy
Names and taxonomic lineages of all organisms in GenBank
http://www.ncbi.nlm.nih.gov/Taxonomy/
RIDOM
rRNA-based differentiation of medical microorganisms
http://www.ridom-rdna.de/
RDP
Ribosomal database project
http://rdp.cme.msu.edu
Tree of Life
Information on phylogeny and biodiversity
http://phylogeny.arizona.edu/tree/phylogeny.html
5.2. General genomics databases

COG
Clusters of orthologous groups of proteins from unicellular microorganisms
http://www.ncbi.nlm.nih.gov/COG
CORG
Comparative regulatory genomics: conserved non-coding sequence blocks
http://corg.molgen.mpg.de/
DEG
Database of essential genes from bacteria and yeast
http://tubic.tju.edu.cn/deg
EBI Genomes
EBI's collection of databases for the analysis of complete and unfinished viral, pro- and eukaryotic genomes
http://www.ebi.ac.uk/genomes
EGO
Eukaryotic gene orthologs: orthologous DNA sequences in the TIGR gene indices
http://www.tigr.org/tdb/tgi/ego/
EMGlib
Enhanced microbial genomes library: completely sequenced genomes of unicellular organisms
http://pbil.univ-lyon1.fr/emglib/emglib.html
Entrez Genomes
NCBI's collection of databases for the analysis of complete and unfinished viral, pro- and eukaryotic genomes
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
ERGOLight
Integrated biochemical data on seven bacterial genomes: publicly available portion of the ERGO database
http://www.ergo-light.com/ERGO
FusionDB
Database of bacterial and archaeal gene fusion events
http://igs-server.cnrs-mrs.fr/FusionDB
Genome information broker
DDBJ's collection of databases for the analysis of complete and unfinished viral, pro- and eukaryotic genomes
http://gib.genes.nig.ac.jp
GOLD
Genomes online database: a listing of completed and ongoing genome projects
http://www.genomesonline.org/
TIGR Microbial Database
Lists of completed and ongoing genome projects with links to complete genome sequences
http://www.tigr.org/tdb/mdb/mdbcomplete.html
HGT-DB
Putative horizontally transferred genes in prokaryotic genomes
http://www.fut.es/~debb/HGT/
KEGG
Kyoto encyclopedia of genes and genomes: integrated suite of databases on genes, proteins, and metabolic pathways
http://www.genome.ad.jp/kegg
MBGD
Microbial genome database for comparative analysis
http://mbgd.genome.ad.jp/
ORFanage
Database of orphan ORFs (ORFs with no homologs) in complete microbial genomes
http://www.cs.bgu.ac.il/~nomsiew/ORFans
PACRAT
Archaeal and bacterial intergenic sequence features
http://www.biosci.ohio-state.edu/~pacrat
PEDANT
Results of an automated analysis of genomic sequences
http://pedant.gsf.de
TIGR Comprehensive Microbial Resource
Various data on complete microbial genomes: uniform annotation, properties of DNA and predicted proteins
http://www.tigr.org/CMR
TransportDB
Predicted membrane transporters in complete genomes, classified according to the TC classification system
http://www.membranetransport.org
WIT
What is there? Metabolic reconstruction for completely sequenced microbial genomes
http://wit.mcs.anl.gov/WIT2/
5.3. Organism-specific genomic databases
5.3.1. Viruses

HCVDB
The hepatitis C virus database
http://hepatitis.ibcp.fr/
HIV Drug Resistance Database
Mutations in HIV genes that confer resistance to anti-HIV drugs
http://resdb.lanl.gov/Resist_DB/default.htm
VirGen
Annotated and curated database for complete viral genome sequences
http://bioinfo.ernet.in/virgen/virgen.html
5.3.2. Prokaryotes
5.3.2.1. Escherichia coli

ASAP
A systematic annotation package for community analysis of E.coli and related genomes
https://asap.ahabs.wisc.edu/annotation/php/ASAP1.htm
CCDB
CyberCell database: E.coli database at U. Alberta
http://redpoll.pharmacy.ualberta.ca/CCDB
coliBase
A database for E.coli, Salmonella and Shigella
http://colibase.bham.ac.uk/
Colibri
E.coli genome database at Institut Pasteur
http://genolist.pasteur.fr/Colibri/
Essential genes in E.coli
First results of an E.coli gene deletion project
http://magpie.genome.wisc.edu/~chris/essential.html
GenoBase
E.coli genome database at Nara Institute
http://ecoli.aist-nara.ac.jp/
GenProtEC
E.coli K-12 genome and proteome database
http://genprotec.mbl.edu
PEC
Profiling of E.coli chromosome
http://shigen.lab.nig.ac.jp/ecoli/pec
EcoCyc
E.coli K-12 genes, metabolic pathways, transporters, and gene regulation
http://ecocyc.org/
EcoGene
Sequence and literature data on E.coli genes and proteins
http://bmb.med.miami.edu/EcoGene/EcoWeb/
RegulonDB
Transcriptional regulation and operon organization in E.coli
http://www.cifn.unam.mx/Computational_Genomics/regulondb/
5.3.2.2. Bacillus subtilis

BSORF
Bacillus subtilis genome database at Kyoto U.
http://bacillus.genome.ad.jp/
NRSub
Non-redundant Bacillus subtilis database at U. Lyon
http://pbil.univ-lyon1.fr/nrsub/nrsub.html
SubtiList
Bacillus subtilis genome database at Institut Pasteur
http://genolist.pasteur.fr/SubtiList/
5.3.2.3. Other bacteria
BioCyc
Pathway/genome databases for many bacteria
http://biocyc.org/
CampyDB
Database for Campylobacter genome analysis
http://campy.bham.ac.uk/
ClostriDB
Finished and unfinished genomes of Clostridium spp.
http://clostri.bham.ac.uk/
CyanoBase
Cyanobacterial genomes
http://www.kazusa.or.jp/cyano
LeptoList
Leptospira interrogans genome
http://bioinfo.hku.hk/LeptoList
MolliGen
Genomic data on mollicutes
http://cbi.labri.fr/outils/molligen/
RsGDB
Rhodobacter sphaeroides genome
http://wwwmmg.med.uth.tmc.edu/sphaeroides
5.3.3. Unicellular eukaryotes
5.3.3.1. Yeast

SGD
Saccharomyces genome database
http://www.yeastgenome.org/
CYGD
MIPS Comprehensive yeast genome database
http://mips.gsf.de/proj/yeast
Génolevures
A comparison of S.cerevisiae and 14 other yeast species
http://cbi.labri.fr/Genolevures
MitoPD
Yeast mitochondrial protein database
http://bmerc-www.bu.edu/mito
SCMD
Saccharomyces cerevisiae morphological database: micrographs of budding yeast mutants
http://yeast.gi.k.u-tokyo.ac.jp/
SCPD
Saccharomyces cerevisiae promoter database
http://cgsigma.cshl.org/jian
TRIPLES
Transposon-insertion phenotypes, localization, and expression in Saccharomyces
http://ygac.med.yale.edu/triples/
YDPM
Yeast deletion project and mitochondria database
http://www-deletion.stanford.edu/YDPM/YDPM_index.html
Yeast Intron Database
Ares laboratory database of spliceosomal introns in S.cerevisiae
http://www.cse.ucsc.edu/research/compbio/yeast_introns.html
Yeast snoRNA Database
Yeast small nucleolar RNAs
http://www.bio.umass.edu/biochem/rna-sequence/Yeast_snoRNA_Database/snoRNA_DataBase.html
yMGV
Yeast microarray global viewer
http://www.transcriptome.ens.fr/ymgv/
5.3.3.2. Other unicellular eukaryotes

ApiEST-DB
EST sequences from various Apicomplexan parasites
http://www.cbil.upenn.edu/paradbs-servlet
CryptoDB
Cryptosporidium parvum genome database
http://cryptodb.org/
DictyBase
Genome information, literature and experimental resources for Dictyostelium discoideum
http://dictybase.org/
Full-Malaria
Full-length cDNA library from erythrocytic-stage Plasmodium falciparum
http://fullmal.ims.u-tokyo.ac.jp/
GeneDB
Curated database for Trypanosoma brucei, Leishmania major, S.pombe and other Sanger-sequenced genomes
http://www.genedb.org/
PlasmoDB
Plasmodium genome database
http://plasmodb.org/
TcruziDB
Trypanosoma cruzi genome database
http://tcruzidb.org/
ToxoDB
Toxoplasma gondii genome database
http://toxodb.org/
5.3.4. Plants
5.3.4.1. General plant databases

CropNet
Genome mapping in crop plants
http://ukcrop.net/
FLAGdb++
Integrative database about plant genomes
http://genoplante-info.infobiogen.fr/FLAGdb/
GénoPlante-Info
Plant genomic data from the Génoplante consortium
http://genoplante-info.infobiogen.fr/
GrainGenes
Molecular and phenotypic information on wheat, barley, rye, triticale and oats
http://wheat.pw.usda.gov or http://www.graingenes.org
Mendel
Database of plant EST and STS sequences annotated with gene family information
http://www.mendel.ac.uk/
PHYTOPROT
Clusters of (predicted) plant proteins
http://genoplante-info.infobiogen.fr/phytoprot
PlantGDB
Plant genome database: actively-transcribed plant genomic sequences
http://www.plantgdb.org/
Sputnik
Plant EST clustering and functional annotation
http://mips.gsf.de/proj/sputnik
TIGR plant repeat database
Classification of repetitive sequences in plant genomes
http://www.tigr.org/tdb/e2k1/plant.repeats
TropGENE DB
Genetic and genomic information about tropical crops: sugarcane, banana, cocoa
http://tropgenedb.cirad.fr/
5.3.4.2. Arabidopsis thaliana
ARAMEMNON
Arabidopsis thaliana membrane proteins and transporters
http://aramemnon.botanik.uni-koeln.de/
AthaMap
Genome-wide map of putative transcription factor binding sites in Arabidopsis thaliana
http://www.athamap.de/
CATMA
Complete Arabidopsis transcriptome microarray: gene sequence tags
http://www.catma.org
FLAGdb/FST
Arabidopsis thaliana T-DNA transformants
http://genoplante-info.infobiogen.fr/
MAtDB
MIPS Arabidopsis thaliana database
http://mips.gsf.de/proj/thal/db
SeedGenes
Genes essential for Arabidopsis development
http://www.seedgenes.org/
TAIR
The Arabidopsis information resource
http://www.arabidopsis.org/
5.3.4.3. Rice
BGI-RISe
Beijing genomics institute rice information system
http://rise.genomics.org.cn/
INE
Integrated rice genome explorer
http://rgp.dna.affrc.go.jp/giot/INE.html
IRIS
International rice information system: all rice data
http://www.iris.irri.org/
MOsDB
MIPS Oryza sativa database
http://mips.gsf.de/proj/rice
Oryzabase
Rice genetics and genomics
http://www.shigen.nig.ac.jp/rice/oryzabase/
RiceGAAS
Rice genome automated annotation system
http://ricegaas.dna.affrc.go.jp/
Rice PIPELINE
Unification tool for rice databases
http://cdna01.dna.affrc.go.jp/PIPE
RPD
Rice proteome database
http://gene64.dna.affrc.go.jp/RPD/
5.3.4.4. Other plants

MaizeGDB
Maize genetics and genomics database, a successor to MaizeDB and ZmDB databases
http://www.maizegdb.org/
MGI
Medicago genome initiative: ESTs, gene expression and proteomic data
http://xgi.ncgr.org/mgi
MtDB
Medicago truncatula genome
http://www.medicago.org/MtDB
SGMD
Soybean genomics and microarray database
http://psi081.ba.ars.usda.gov/SGMD/default.htm
5.3.5. Fungi
CADRE
Central Aspergillus data repository
http://www.cadre.man.ac.uk/
COGEME
Phytopathogenic fungi and oomycete EST database
http://cogeme.ex.ac.uk
MagnaportheDB
Magnaporthe grisea integrated physical/genetic map
http://www.fungalgenomics.ncsu.edu/Projects/mgdatabase/int.htm
MNCDB
MIPS Neurospora crassa database
http://mips.gsf.de/proj/neurospora/
Phytophthora Genome Consortium Database
ESTs from Phytophthora infestans and P.sojae
https://xgi.ncgr.org/pgc
5.3.6. Invertebrates
5.3.6.1. Caenorhabditis elegans

C.elegans Project
Genome sequencing data at the Sanger Institute
http://www.sanger.ac.uk/Projects/C_elegans
Intronerator
Introns and alternative splicing in C.elegans and C.briggsae
http://www.cse.ucsc.edu/~kent/intronerator
RNAiDB
RNAi phenotypic analysis of C.elegans genes
http://www.rnai.org/
WILMA
C.elegans annotation database
http://www.came.sbg.ac.at/wilma/
WorfDB
C.elegans ORFeome
http://worfdb.dfci.harvard.edu/
WormBase
Data repository for C.elegans and C.briggsae: curated genome annotation, genetic and physical maps, pathways
http://www.wormbase.org/
5.3.6.2. Drosophila melanogaster

FlyBase
Drosophila sequences and genomic information
http://flybase.bio.indiana.edu/
GadFly
Genome annotation database of Drosophila
http://www.fruitfly.org
FlyBrain
Database of the Drosophila nervous system
http://flybrain.neurobio.arizona.edu
FlyTrap
Drosophila transgenic lines created using an intron protein trap strategy
http://flytrap.med.yale.edu/
InterActive Fly
Drosophila genes and their roles in development
http://sdb.bio.purdue.edu/fly/aimain/1aahome.htm
Drosophila microarray centre
Data and tools for Drosophila gene expression studies
http://www.flyarrays.com/fruitfly
5.3.6.3. Other invertebrates

AppaDB
A database on the nematode Pristionchus pacificus
http://appadb.eb.tuebingen.mpg.de
CnidBase
Cnidarian evolution and gene expression database
http://cnidbase.bu.edu/
Nematode.net
Parasitic nematode sequencing project
http://nematode.net/
NEMBASE
Nematode sequence and functional data database
http://www.nematodes.org
Metabolic and Signaling Pathways
The metabolic and signaling pathways category is a collection of pathway and signaling databases. Most databases in this collection describe the genome and metabolic pathways of a single organism, with some exceptions. The categories in this collection are:
1) Enzymes and enzyme nomenclature
2) Metabolic pathways
3) Intermolecular interactions and signaling pathways
6. Metabolic Enzymes and Pathways; Signaling Pathways
6.1. Enzymes and Enzyme Nomenclature
ENZYME
Enzyme nomenclature and properties
http://www.expasy.org/enzyme
BRENDA
Enzyme names and properties: sequence, structure, specificity, stability, reaction parameters, isolation data
http://www.brenda.uni-koeln.de
IntEnz
Integrated enzyme database and enzyme nomenclature
http://www.ebi.ac.uk/intenz
Enzyme Nomenclature
IUBMB Nomenclature Committee recommendations
http://www.chem.qmw.ac.uk/iubmb/enzyme
6.2. Metabolic Pathways

KEGG
Kyoto encyclopedia of genes and genomes: metabolic and regulatory pathways encoded in complete genomes
http://www.genome.ad.jp/kegg
MetaCyc
Metabolic pathways and enzymes from various organisms
http://metacyc.org
PathDB
Biochemical pathways, compounds and metabolism
http://www.ncgr.org/pathdb
UM-BBD
University of Minnesota biocatalysis and biodegradation database: microbial catabolism and biotransformations
http://umbbd.ahc.umn.edu/
WIT2
Integrated system for functional curation and development of metabolic models
http://wit.mcs.anl.gov/WIT2/
6.3. Intermolecular Interactions and Signaling Pathways
aMAZE
A system for the annotation, management and analysis of biochemical and signaling pathway networks
http://www.amaze.ulb.ac.be/
BIND
Biomolecular interaction network database
http://www.bind.ca
BioCarta
Online maps of metabolic and signaling pathways
http://www.biocarta.com/genes/allPathways.asp
BRITE
Biomolecular relations in information transmission and expression, part of the KEGG system
http://www.genome.ad.jp/brite
DIP
Database of interacting proteins: experimentally determined protein-protein interactions
http://dip.doe-mbi.ucla.edu
DRC
Database of ribosomal crosslinks
http://www.mpimg-berlin-dahlem.mpg.de/~ag_ribo/ag_brimacombe/drc
GeneNet
Database on gene network components
http://wwwmgs.bionet.nsc.ru/mgs/gnw/genenet
IntAct project
Protein-protein interaction data
http://www.ebi.ac.uk/intact
InterDom
Putative protein domain interactions
http://interdom.lit.org.sg
JenPep
Functional and quantitative thermodynamic data on peptide binding to immunological biomacromolecules
http://www.jenner.ac.uk/Jenpep2
MPID
MHC-peptide interaction database
http://surya.bic.nus.edu.sg/mpid
ROSPath
Reactive oxygen species (ROS) signaling pathway
http://rospath.ewha.ac.kr
STCDB
Signal transductions classification database
http://www.techfak.uni-bielefeld.de/~mchen/STCDB
STRING
Predicted functional associations between proteins
http://www.bork.embl-heidelberg.de/STRING
TRANSPATH
Gene regulatory networks and microarray analysis
http://www.biobase.de/pages/products/databases.html
Human and other Vertebrate Genomes
The human and other vertebrate genomes category collects databases covering the human genome and the genomes of other vertebrates. Its subcategories are:
1) Model organisms, comparative genomics
2) Human genome databases, maps and viewers
3) Human ORFs
7. Human and other Vertebrate Genomes
7.1. Mitochondrial Genes and Proteins
AMmtDB
Metazoan mitochondrial genes
http://bighost.area.ba.cnr.it/mitochondriome
GOBASE
Organelle genome database
http://megasun.bch.umontreal.ca/gobase/gobase.html
MitoDat
Mitochondrial proteins (predominantly human)
http://www-lecb.ncifcrf.gov/mitoDat/
MitoMap
Human mitochondrial genome
http://www.mitomap.org/
MitoNuc
Nuclear genes coding for mitochondrial proteins
http://biowww.ba.cnr.it:8000/BioWWW/#MitoNuc
MITOP2
Mitochondrial proteins, genes and diseases
http://ihg.gsf.de/mitop2/
MitoProteome
Mitochondrial protein sequences encoded by mitochondrial and nuclear genes
http://www.mitoproteome.org
OGRe
Complete mitochondrial genome sequences for 200 metazoan species
http://www.bioinf.man.ac.uk/ogre
7.2. Model organisms, comparative genomics

ACeDB
C.elegans, S.pombe, and human sequences and genomic information
http://www.acedb.org/
AllGenes
Human and mouse gene, transcript and protein annotation
http://www.allgenes.org/
ArkDB
Genome databases for farm and other animals
http://www.thearkdb.org/
Cre Transgenic Database
Cre transgenic mouse lines with links to publications
http://www.mshri.on.ca/nagy/
DRESH
Human cDNA clones homologous to Drosophila mutant genes
http://www.tigem.it/LOCAL/drosophila/dros.html
Ensembl
Annotated information on eukaryotic genomes
http://www.ensembl.org/
FANTOM
Functional annotation of mouse full-length cDNA clones
http://fantom2.gsc.riken.go.jp
FREP
Functional repeats in mouse cDNAs
http://facts.gsc.riken.go.jp/FREP/
IPD-MHC Database
Non-human major histocompatibility complex sequences
http://www.ebi.ac.uk/ipd/mhc
GenetPig
Genes controlling economic traits in pig
http://www.infobiogen.fr/services/Genetpig
KOG
Eukaryotic orthologous groups of proteins
http://www.ncbi.nlm.nih.gov/COG/new/shokog.cgi
LocusLink
Curated sequences and descriptions of genetic loci
http://www.ncbi.nlm.nih.gov/LocusLink
Mouse Genome Database
Mouse genome database
http://www.informatics.jax.org/
Mouse SAGE
SAGE libraries from various mouse tissues and cell lines
http://mouse.biomed.cas.cz/sage
Mouse Targeted Mutations
Information on transgenic animals and targeted mutations
http://tbase.jax.org/
MTID
Mouse transposon insertion database
http://mouse.ccgb.umn.edu/transposon/
PEDE
Pig EST data explorer: full-length cDNA libraries and ESTs
http://pede.gene.staff.or.jp/
Rat Genome Database
Rat genetic and genomic data
http://rgd.mcw.edu/
TIGR Gene Indices
Organism-specific databases of EST and gene sequences
http://www.tigr.org/tdb/tgi.shtml
UniGene
Unified clusters of ESTs and full-length mRNA sequences
http://www.ncbi.nlm.nih.gov/UniGene/
UniSTS
Unified non-redundant view of sequence tagged sites with marker and mapping data from a variety of resources
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unists
ZFIN
Genetic, genomic and developmental data from zebrafish
http://zfin.org/
7.3. Human genome databases, maps and viewers

Ensembl
Annotated information on eukaryotic genomes
http://www.ensembl.org/
AluGene
Complete Alu map in the human genome
http://alugene.tau.ac.il/
CroW 21
Human chromosome 21 database
http://bioinfo.weizmann.ac.il/crow21/
G3-RH
Stanford G3 and TNG radiation hybrid maps
http://www-shgc.stanford.edu/RH/
GB4-RH
Genebridge4 human radiation hybrid maps
http://www.sanger.ac.uk/Software/RHserver/RHserver.shtml
GDB
Human genes and genomic maps
http://www.gdb.org/
GenAtlas
Human genes, markers and phenotypes
http://www.citi2.fr/GENATLAS/
GeneCards
Integrated database of human genes, maps, proteins and diseases
http://bioinfo.weizmann.ac.il/cards/
GeneLoc
Gene location database (formerly UDB, the Unified database for human genome mapping)
http://genecards.weizmann.ac.il/geneloc/
GeneNest
Gene indices of human, mouse, zebrafish, etc.
http://genenest.molgen.mpg.de/
GenMapDB
Mapped human BAC clones
http://genomics.med.upenn.edu/genmapdb
Gene Resource Locator
Alignment of ESTs with finished human sequence
http://grl.gi.k.u-tokyo.ac.jp/
HOWDY
Human organized whole genome database
http://www-alis.tokyo.jst.go.jp/HOWDY/
HuGeMap
Human genome genetic and physical map data
http://www.infobiogen.fr/services/Hugemap
Human BAC Ends Database
Non-redundant human BAC end sequences
http://www.tigr.org/tdb/humgen/bac_end_search/bac_end_intro.html
IXDB
Physical maps of human chromosome X
http://ixdb.mpimg-berlin-dahlem.mpg.de/
NCBI RefSeq
Non-redundant DNA and protein sequence collection
http://www.ncbi.nlm.nih.gov/RefSeq/
UCSC Genome Browser
Genome assemblies and annotation
http://genome.ucsc.edu/
ParaDB
Paralogy mapping in human genomes
http://abi.marseille.inserm.fr/paradb/
RHdb
Radiation hybrid map data
http://www.ebi.ac.uk/RHdb
STACK
Sequence tag alignment and consensus knowledgebase
http://www.sanbi.ac.za/Dbases.html
Human Genes and Diseases
Human genes and diseases is a category of databases holding information on disease-causing genes, including databases of cancer genes, human ORFs, etc.
1) Human ORFs
2) General human genetics databases
3) General polymorphism databases
4) Cancer gene databases
5) Gene, system or disease-specific databases
7.4. Human proteins

HPMR
Human plasma membrane receptome: protein sequences, literature, and expression database
http://receptome.stanford.edu/
HPRD
Human protein reference database: domain architecture, post-translational modifications, and disease association
http://www.hprd.org
HUNT
Human novel transcripts: annotated full-length cDNAs
http://www.hri.co.jp/HUNT
HUGE
Human unidentified gene-encoded large (>50 kDa) protein and cDNA sequences
http://www.kazusa.or.jp/huge
LIFEdb
Localization, interaction and functional assays of human proteins
http://www.dkfz.de/LIFEdb
trome, trEST and trGEN
Databases of predicted human protein sequences
ftp://ftp.isrec.isb-sib.ch/pub/databases/
8. Human Genes and Diseases
8.1. General Databases

Genetics Home Reference
A general guide on human hereditary diseases
http://ghr.nlm.nih.gov/
Homophila
Drosophila homologs of human disease genes
http://homophila.sdsc.edu/
IMGT
International immunogenetics information system: immunoglobulins, T cell receptors, MHC and RPI
http://imgt.cines.fr/
Mutation Spectra Database
Mutations in viral, bacterial, yeast and mammalian genes
http://info.med.yale.edu/mutbase/
OMIA
Online Mendelian inheritance in animals: a catalog of animal genetic and genomic disorders
http://www.angis.org.au/omia
OMIM
Online Mendelian inheritance in man: a catalog of human genetic and genomic disorders
http://www.ncbi.nlm.nih.gov/Omim/
ORFDB
Collection of ORFs that are sold by Invitrogen
http://orf.invitrogen.com/
PathBase
European mutant mice pathology database: histopathology photomicrographs and macroscopic images
http://www.pathbase.net/
PMD
Compilation of protein mutant data
http://pmd.ddbj.nig.ac.jp/
8.2. Human Mutations Databases
8.2.1. General polymorphism database

ALFRED
Allele frequencies and DNA polymorphisms
http://alfred.med.yale.edu/
BayGenomics
Genes relevant to cardiovascular and pulmonary disease
http://baygenomics.ucsf.edu/
dbSNP
Database of single nucleotide polymorphisms
http://www.ncbi.nlm.nih.gov/SNP/
FIMM
Functional molecular immunology data
http://sdmc.krdl.org.sg:8080/fimm/
HGVS Databases
A compilation of human mutation databases
http://www.hgvs.org/
HGVbase
Human genome variation database: curated human polymorphisms
http://hgvbase.cgb.ki.se/
HGMD
Human gene mutation database
http://www.hgmd.org/
IPD
Immuno polymorphism database: data on human killer-cell Ig-like receptors and human platelet antigens
http://www.ebi.ac.uk/ipd
JSNP
Japanese SNP database
http://snp.ims.u-tokyo.ac.jp/
rSNP Guide
SNPs in regulatory gene regions
http://util.bionet.nsc.ru/databases/rsnp.html
SNP Consortium database
SNP Consortium data
http://snp.cshl.org/
TopoSNP
Topographic database of non-synonymous SNPs
http://gila.bioengr.uic.edu/snp/toposnp
8.2.2. Cancer
Atlas of Genetics and Cytogenetics in Oncology and Haematology
Cancer related genes, chromosomal abnormalities in oncology and haematology, and cancer-prone diseases
http://www.infobiogen.fr/services/chromcancer/
CGED
Cancer gene expression database
http://love2.aist-nara.ac.jp/CGED
Database of Germline p53 Mutations
Mutations in human tumor and cell line p53 gene
http://www.lf2.cuni.cz/win/projects/germline_mut_p53.htm
IARC TP53 Database
Human TP53 somatic and germline mutations
http://www.iarc.fr/p53/
MTB
Mouse tumor biology database: mouse tumor types, genes, classification, incidence, pathology
http://tumor.informatics.jax.org/
Oral Cancer Gene Database
Cellular and molecular data for genes involved in oral cancer
http://www.tumor-gene.org/Oral/oral.html
RB1 Gene Mutation Database
Mutations in the human retinoblastoma (RB1) gene
http://www.d-lohmann.de/Rb/
RTCGD
Mouse retroviral tagged cancer gene database
http://rtcgd.ncifcrf.gov/
SNP500Cancer
Re-sequenced SNPs from 102 reference samples
http://snp500cancer.nci.nih.gov
SV40 Large T-Antigen Mutant Database
Mutations in SV40 large tumor antigen gene
http://bigdaddy.bio.pitt.edu/SV40/
Tumor Gene Family Databases
Cellular, molecular and biological data about genes involved in various cancers
http://www.tumor-gene.org/tgdf.html
8.2.3. Gene, system or disease-specific databases
ALPSbase
Autoimmune lymphoproliferative syndrome database
http://research.nhgri.nih.gov/alps/
Androgen Receptor Gene Mutations Database
Mutations in the androgen receptor gene
http://www.mcgill.ca/androgendb/
BTKbase
Mutation registry for X-linked agammaglobulinemia
http://bioinf.uta.fi/BTKbase/
CASRDB
Calcium-sensing receptor database: CASR mutations causing hypercalcemia and/or hyperparathyroidism
http://www.casrdb.mcgill.ca/
Cytokine Gene Polymorphism in Human Disease
Cytokine gene polymorphism literature database
http://bris.ac.uk/pathandmicro/services/GAI/cytokine4.htm
Collagen Mutation Database
Human type I and type III collagen gene mutations
http://www.le.ac.uk/genetics/collagen/
ERGDB
Estrogen responsive genes database
http://sdmc.lit.org.sg/ergdb/cgi-bin/explore.pl
FUNPEP
Low-complexity peptides capable of forming amyloid plaque
http://www.cmbi.kun.nl/swift/FUNPEP/gergo/
GOLD.db
Genomics of lipid-associated disorders database
http://gold.tugraz.at
tGRAP
Mutants of G-protein coupled receptors of family A
http://tinygrap.uit.no/GRAP/
HaemB
Factor IX gene mutations, insertions and deletions
http://www.kcl.ac.uk/ip/petergreen/haemBdatabase.html
HbVar
Human hemoglobin variants and thalassemias
http://globin.cse.psu.edu/globin/hbvar
Human p53/hprt, rodent lacI/lacZ databases
Mutations at the human p53 and hprt genes; rodent transgenic lacI and lacZ mutations
http://www.ibiblio.org/dnam/mainpage.html
Human PAX2 Allelic Variant Database
Mutations in human PAX2 gene
http://pax2.hgu.mrc.ac.uk/
Human PAX6 Allelic Variant Database
Mutations in human PAX6 gene
http://pax6.hgu.mrc.ac.uk/
IL2Rgbase
X-linked severe combined immunodeficiency mutations
http://research.nhgri.nih.gov/scid/
IMGT/Gene-DB
Vertebrate immunoglobulin and T cell receptor genes
http://imgt.cines.fr/cgi-bin/GENElect.jv
IMGT/HLA
Polymorphism of human MHC and related genes
http://www.ebi.ac.uk/imgt/hla/
INFEVERS
Hereditary inflammatory disorder and familial Mediterranean fever mutation data
http://fmf.igh.cnrs.fr/infevers
KinMutBase
Disease-causing protein kinase mutations
http://www.uta.fi/imt/bioinfo/KinMutBase/
Lowe Syndrome Mutation Database
Phosphatidylinositol-4,5-bisphosphate 5-phosphatase mutations causing Lowe oculocerebrorenal syndrome
http://research.nhgri.nih.gov/lowe/
NCL Mutation Database
Polymorphisms in neuronal ceroid lipofuscinoses genes
http://www.ucl.ac.uk/ncl/
PAHdb
Mutations at the phenylalanine hydroxylase locus
http://www.pahdb.mcgill.ca/
PGDB
Prostate and prostatic diseases gene database
http://www.ucsf.edu/PGDB
PHEXdb
PHEX mutations causing X-linked hypophosphatemia
http://www.phexdb.mcgill.ca/
PTCH1 Mutation Database
Mutations and SNPs found in PTCH1 gene
http://www.cybergene.se/PTCH/ptchbase.html
Microarray Data and other Gene Expression Databases
Microarrays are producing massive amounts of data. These data, like genome sequence data, can be used to gain insights into underlying biological processes only if they are carefully recorded and stored in databases, where they can be queried, compared and analysed by different computer programs. A gene expression database can be regarded as consisting of three parts: the gene expression data matrix, gene annotation and sample annotation. The microarray data and other gene expression databases category therefore consists of repositories of microarray and gene expression data.
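The three-part structure described above can be sketched in a few lines of Python. This is only an illustration: the class name, field names, gene identifiers and values below are hypothetical and do not come from any of the listed databases.

```python
from dataclasses import dataclass


@dataclass
class ExpressionDataset:
    """Toy gene expression database with the three parts named on the slide."""
    genes: list    # gene annotation: one dict per matrix row
    samples: list  # sample annotation: one dict per matrix column
    matrix: list   # data matrix: matrix[i][j] = level of gene i in sample j

    def profile(self, gene_id):
        """Return {sample name: expression value} for one gene."""
        for i, gene in enumerate(self.genes):
            if gene["id"] == gene_id:
                return {s["name"]: self.matrix[i][j]
                        for j, s in enumerate(self.samples)}
        raise KeyError(gene_id)


# Hypothetical example data (two genes, two samples).
dataset = ExpressionDataset(
    genes=[{"id": "geneA", "description": "hypothetical kinase"},
           {"id": "geneB", "description": "hypothetical transporter"}],
    samples=[{"name": "leaf", "condition": "control"},
             {"name": "root", "condition": "control"}],
    matrix=[[2.0, 8.0],
            [5.0, 5.0]],
)

print(dataset.profile("geneA"))
```

Keeping the annotations separate from the matrix is what lets different programs query the same values by gene or by sample.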
9. Microarray Data and other Gene Expression Databases

ArrayExpress
Public collection of microarray gene expression data
http://www.ebi.ac.uk/arrayexpress
Axeldb
Gene expression in Xenopus laevis
http://www.dkfz-heidelberg.de/abt0135/axeldb.htm
BodyMap
Human and mouse gene expression data
http://bodymap.ims.u-tokyo.ac.jp/
BGED
Brain gene expression database
http://love2.aist-nara.ac.jp/BGED
CleanEx
Expression reference database, linking heterogeneous expression data to facilitate cross-dataset comparisons
http://www.cleanex.isb-sib.ch/
EICO DB
Expression-based imprint candidate organiser: a database for discovery of novel imprinted genes
http://fantom2.gsc.riken.jp/EICODB/
emap Atlas
Edinburgh mouse atlas: a digital atlas of mouse embryo development and spatially-mapped gene expression
http://genex.hgu.mrc.ac.uk/
EPConDB
Endocrine pancreas consortium database
http://www.cbil.upenn.edu/EPConDB
EpoDB
Genes expressed during human erythropoiesis
http://www.cbil.upenn.edu/EpoDB/
FlyView
Drosophila development and genetics
http://pbio07.uni-muenster.de/
GeneAnnot
Revised and improved annotation of Affymetrix human gene probe sets
http://genecards.weizmann.ac.il/geneannot/
GeneNote
Human genes expression profiles in healthy tissues
http://genecards.weizmann.ac.il/genenote/
GenePaint
Gene expression patterns in the mouse
http://www.genepaint.org/Frameset.html
GeneTrap
Expression patterns in an embryonic stem library of gene trap insertions
http://www.cmhd.ca/sub/genetrap.asp
GermOnline
Expression data relevant for the mitotic and meiotic cell cycle and gametogenesis in yeast and higher eukaryotes
http://www.germonline.org/
GXD
Mouse gene expression database
http://www.informatics.jax.org/menus/expression_menu.shtml
HemBase
Genes transcribed in differentiating human erythroid cells
http://hembase.niddk.nih.gov/
HugeIndex
Expression levels of human genes in normal tissues
http://hugeindex.org/
Interferon Stimulated Gene Database
Genes induced by treatment with interferons
http://www.lerner.ccf.org/labs/williams/xchip-html.cgi
Kidney Development Database
Kidney development and gene expression
http://golgi.ana.ed.ac.uk/kidhome.html
MAGEST
Ascidian (Halocynthia roretzi) gene expression patterns
http://www.genome.ad.jp/magest
MEPD
Medaka (freshwater fish Oryzias latipes) gene expression pattern database
http://medaka.dsp.jst.go.jp/MEPD
MethDB
DNA methylation data, patterns and profiles
http://www.methdb.de/
NASCarrays
Nottingham Arabidopsis Stock Centre microarray database
http://affymetrix.arabidopsis.info
NetAffx
Public Affymetrix probesets and annotations
http://www.affymetrix.com/
PEDB
Prostate expression database: ESTs from prostate tissue and cell type-specific cDNA libraries
http://www.pedb.org/
PEPR
Public expression profiling resource: expression profiles in a variety of diseases and conditions
http://microarray.cnmcresearch.org/pgadatatable.asp
RECODE
Genes using programmed translational recoding in their expression
http://recode.genetics.utah.edu/
RefExA
Reference database for human gene expression analysis
http://www.lsbm.org/db/index_e.html
Stanford Microarray Database
Raw and normalized data from microarray experiments
http://genome-www.stanford.edu/microarray
Tooth Development Database
Gene expression in dental tissue
http://bite-it.helsinki.fi/
Proteomics Resources
Applications of Proteomics
Characterization of Protein Complexes
Protein Expression Profiling
Proteome Mining
Protein Arrays
The proteomic resources have databases containing proteomics information from various genomes/proteomes.
What is Proteomics?
Defined as the analysis of the entire protein complement in a given cell, tissue, or organism.
Proteomics also assesses activities, modifications, localization, and interactions of proteins in complexes.
Technology of Proteomics
Separation and Isolation of Proteins

1D and 2D PAGE
Edman Sequencing
Mass Spectrometry
Database utilization
Types of Proteomics
Protein Expression

Quantitative study of protein expression between samples that differ by some variable
Structural Proteomics

Goal is to map out the 3-D structure of proteins and protein complexes
Functional Proteomics
10. Proteomics Resources
GelBank
2D gel electrophoresis patterns of proteins from complete microbial genomes
http://gelbank.anl.gov/
PEP
Predictions for entire proteomes: summarized analyses of protein sequences
http://cubic.bioc.columbia.edu/pep/
Proteome Analysis Database
Functional classification of proteins in whole genomes
http://www.ebi.ac.uk/proteome/
RESID
Pre-, co- and post-translational protein modifications
http://www-nbrf.georgetown.edu/pirwww/dbinfo/resid.html
SWISS-2DPAGE
Annotated 2D gel electrophoresis database
http://www.expasy.org/ch2d/
Other Molecular Biology Databases
This category holds the remaining types of databases. It can be subdivided into the following divisions:
1) BioImage
2) MetaRouter
3) PubMed
4) Drugs and drug design
5) Molecular probes and primers
11. Other Molecular Biology Databases
11.1. Drugs and drug design
ANTIMIC
Database of natural antimicrobial peptides
http://research.i2r.astar.edu.sg/Templar/DB/ANTIMIC/
APD
Antimicrobial peptide database
http://aps.unmc.edu/AP/main.php
BSD
Biodegradative strain database: microorganisms that can degrade aromatic and other organic compounds
http://bsd.cme.msu.edu/
DART
Drug adverse reaction target database
http://xin.cz3.nus.edu.sg/group/drt/dart.asp
Peptaibol
Peptaibol (antibiotic peptide) sequences
http://www.cryst.bbk.ac.uk/peptaibol/welcome.html
Pharmacogenomics and Pharmacogenetics Knowledge Base
Variation in drug response based on human variation
http://www.pharmgkb.org/
TTD
Therapeutic target database
http://xin.cz3.nus.edu.sg/group/cjttd/ttd.asp
11.2. Probes
IMGT/PRIMER-DB
Immunogenetics oligonucleotide primer database
http://imgt3d.igh.cnrs.fr/PrimerDB/Query_PrDB.pl
MPDB
Information on synthetic oligonucleotides proven useful as primers or probes
http://www.biotech.ist.unige.it/interlab/mpdb.html
probeBase
rRNA-targeted oligonucleotide probe sequences, DNA microarray layouts and associated information
http://www.microbialecology.net/probebase
RTPrimerDB
Real-time PCR primer and probe sequences
http://medgen31.ugent.be/primerdatabase/index.php
VirOligo
Virus-specific oligonucleotides for PCR and hybridization
http://viroligo.okstate.edu/
11.3. Unclassified databases

PubMed
Citations and abstracts of biomedical literature
http://pubmed.gov/
BioImage
Database of multidimensional biological images
http://www.bioimage.org/
Bioinformatics Tools
BLAST(Basic Local Alignment Search Tool)
BLAST is the algorithm used by a family of five programs that align your query sequence against sequences in a molecular database. Statistical methods are applied to judge the significance of matches. Alignments (i.e. sequences in the database that are similar to your query sequence) are reported in order of significance, as estimated by the applied statistics.
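As a rough illustration of how significance can be estimated and used for ordering, the sketch below applies the standard relation between a normalised bit score S' and the expected number of chance matches, E = m · n · 2^(−S'), where m is the query length and n the database length. The hit names, scores and lengths are made up, and real BLAST programs apply further corrections (for example effective lengths), so this is a simplified model rather than the program's actual procedure.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance alignments scoring at least bit_score.

    Simplified formula E = m * n * 2**(-S'); no edge-effect correction.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical hits: (name, bit score).
hits = [("hitA", 52.0), ("hitB", 30.0), ("hitC", 41.5)]

# Report the most significant (smallest E-value) hits first.
ranked = sorted(hits, key=lambda h: expect_value(h[1], 100, 1_000_000))

for name, bits in ranked:
    print(name, bits, expect_value(bits, 100, 1_000_000))
```

A smaller E-value means the match is less likely to have arisen by chance, which is why the list is sorted in ascending order.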
BLASTN
Compares a nucleotide query sequence against a nucleotide sequence database.

BLASTP
Compares an amino acid query sequence against a protein sequence database.

BLASTX
Compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.

TBLASTN
Compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).

TBLASTX
Compares a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
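The "six-frame translation" used by BLASTX, TBLASTN and TBLASTX can be sketched as follows: three reading frames on the given strand plus three on its reverse complement. This is a minimal illustration using the standard genetic code; the function names are chosen here for the example and are not part of any BLAST program.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return {'+1', '+2', '+3', '-1', '-2', '-3'} -> protein string."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

Translating every frame is what allows a nucleotide sequence to be compared at the protein level without knowing in advance where, or on which strand, the coding region lies.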
CLUSTALX

Clustal X (Thompson et al. 1997) is a version of Clustal W with a graphical user interface. This programme is used for multiple sequence alignment.
Multiple Alignment
Phylogenetic Analysis

Nucleic acid and protein sequences are used to infer phylogenetic relationships. Molecular phylogeny methods propose phylogenetic trees from a given set of aligned sequences. The trees aim to reconstruct the history of successive divergences that took place during evolution between the considered sequences and their common ancestor.
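One simple way to go from aligned sequences to a tree is a distance method. The sketch below computes the proportion of differing sites between each pair of sequences and then clusters them with UPGMA, printing the topology in parenthesis notation. UPGMA is used here only as an easy-to-follow example; the slides do not say which method any particular programme uses, and the sequences are invented.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites, ignoring columns with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(aligned):
    """aligned: {name: aligned sequence}. Returns the tree topology as text."""
    clusters = {name: 1 for name in aligned}  # label -> number of leaves
    dist = {frozenset(p): p_distance(aligned[p[0]], aligned[p[1]])
            for p in combinations(aligned, 2)}
    while len(clusters) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        merged = f"({a},{b})"
        na, nb = clusters.pop(a), clusters.pop(b)
        # Distance to the new cluster is the size-weighted average.
        for c in clusters:
            dist[frozenset((merged, c))] = (
                na * dist[frozenset((a, c))] + nb * dist[frozenset((b, c))]
            ) / (na + nb)
        clusters[merged] = na + nb
    return next(iter(clusters)) + ";"


# Hypothetical aligned sequences.
example = {"A": "AAAAAAAA", "B": "AAAAAAAT", "C": "TTTTAAAA"}
print(upgma(example))
```

Here A and B differ at one site out of eight, so they are joined first, and C, which is more distant from both, attaches last.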
Phylogenetic programmes
PHYLIP
PAUP
MEGA
Treeview
ODEN
PHYLOWIN
TREECON
DENDRON
Gene Identification

AAT: Analysis and Annotation Tool
FGENESH: Splice sites, protein coding exons & gene models
Genie: Gene finder based on hidden Markov models
GenScan: Identification of gene structures in genomic DNA
Grail: DNA sequence analysis tool
ORF Finder: Search for open reading frame, at NCBI
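The simplest of these ideas, searching for open reading frames, can be sketched as a scan for an ATG start codon followed in the same frame by a stop codon. This is a toy illustration, not the method of NCBI ORF Finder or of any gene finder listed above: it looks at the forward strand only, and the `min_codons` threshold is an assumption made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Find ATG...stop ORFs in the three forward reading frames.

    Returns (start, end, frame) tuples; start is 0-based, end is exclusive
    and includes the stop codon. min_codons counts codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame + 1))
                start = None                   # look for the next ATG
    return orfs


print(find_orfs("CCATGAAATGA"))
```

To cover both strands, the same function could be run again on the reverse complement of the sequence.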
Protein Structure Prediction

3D-PSSM: Protein Fold Recognition
Multicoil: Predict coiled coil structures
NNPredict: Protein secondary structure prediction
PredictProtein: Sequence analysis and structure prediction
SAPS: Statistical analysis of protein sequences
Protein 3D Structure / Modelling

FUGUE: Sequence-structure homology recognition
SequencePDB Viewer: Protein structure database
Proinformatix: Modeling oligopeptides for energetically minimized structures
SWISS-MODEL: An automated knowledge-based protein modelling server

CIMAP Summer Training on Biotechnology & Bioinformatics

20th June 20th July, 2006

Bioinformatics : Techniques and usage

Greater Biological Knowledge

Computational Methods Resources and Tools

Bioinformatics : a brave new world

Who will have competitive advantage?

What is the solution

One of the model solution has come out in U.S.A

Glue grants Integrative and collaborative approaches to research

US National Institutes of Health Alliance for Cellular Signaling

Consortium unites traditional experimentalist with computational biologists.

The database content itself is doubling in size approximately every year.

FUNCTIONS OF A BIOINFORMATICS CENTRE

What is the Role of Bioinformatics

What kinds of data are we interested in?

The Management and Integration of Biological Information

Data Analysis Systems

Sequence Analysis Software

Protein Folding Software

Map Assembly & Integration Software

Comparative Genomics Tools

Multiple sequence alignment

Extraction of patterns and profiles from sequence data

Protein sequence analysis

Protein structure prediction

Protein structure alignment and comparison

Whole genome analysis

DNA microarray analysis

The ideal sequence database for computational analyses and data mining:

Database Categories List

Nucleotide Sequence Databases

Full name and/or description

1.1. International Nucleotide Sequence Database Collaboration

An annotated collection of all publicly available nucleotide and protein sequences

EMBL Nucleotide Sequence Database

An annotated collection of all publicly available nucleotide and protein sequences

DDBJ: DNA Data Bank of Japan

An annotated collection of all publicly available nucleotide and protein sequences

Genetic Codes HERVd IMGT/LIGM-DB Imprinted Gene Catalogue

Imprinted genes and parent-of-origin effects in animals

Pathogenicity islands and prophages in bacterial genomes Prokaryotic microsatellites

STRBase TIGR Gene Indices

Codon usage, start and stop signals

Unified clusters of ESTs and full-length mRNA sequences

http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html
http://genome-www2.stanford.edu/vectordb/

1.2.2. Gene structure, introns and exons, splice sites

Alternative spliced isoforms

ASDB: http://hazelton.lbl.gov/teplitski/alt
EASED: http://eased.bioinf.mdc-berlin.de/
EID: http://mcb.harvard.edu/gilbert/EID/
ExInt: http://intron.bic.nus.edu.sg/exint/exint.html
HS3D: http://www.sci.unisannio.it/docenti/rampone/

Canonical and non-canonical mammalian splice sites

A tool for visualizing splicing of genes from EST data

Yeast nuclear and mitochondrial intron sequences

1.2.3. Transcriptional regulator sites and transcription factors

Binding sites for E.coli DNA-binding proteins

http://www.epd.isb-sib.ch
HemoPDB: http://bioinformatics.med.ohio-state.edu/HemoPDB
HvrBase: http://www.hvrbase.org/
JASPAR: http://jaspar.cgb.ki.se
PLACE: http://www.dna.affrc.go.jp/htdocs/PLACE

Database Categories List

RNA sequence databases