
Introduction to bioinformatics

Sylvia B. Nagl
What is bioinformatics?
an emerging interdisciplinary research area

deals with the computational management
and analysis of biological information: genes,
genomes, proteins, cells, ecological systems,
medical information, robots, artificial
intelligence...
Relationships between







sequence, 3D structure, protein functions

Properties and evolution of genes, genomes,
proteins, metabolic pathways in cells

Use of this knowledge for prediction, modelling, and
design
The Core of Bioinformatics to date
TDQAAFDTNIVTLTRFVM
EQGRKARGTGEMTQLLNS
LCTAVKAISTAVRKAGIA
HLYGIAGSTNVTGDQVKK
LDVLSNDLVINVLKSSFA
TCVLVTEEDKNAIIVEPE
KRGKYVVCFDPLDGSSNI
DCLVSIGTIFGIYRKNST
DEPSEKDALQPGRNLVAA
GYALYGSATMLV
The holy grail of bioinformatics
GCTCCTCACTGTCTGTGTTTATTC
TTTTAGCTTCTTCAGATCTTTTAG
TCTGAGGAAGCCTGGCATGTGCA
AATGAAGTTAACCTAA...
> 500,000 genes sequenced to date
Expected number of unique protein structures: ~700-1,000
Basic concepts
conceptual foundations of bioinformatics:
evolution
protein folding
protein function

bioinformatics builds mathematical models
of these processes -
to infer relationships between components
of complex biological systems


Information processing in cells

coding regions
regulatory
sites
nucleic acids
transcripts
proteins
One-to-many mappings!
Context-dependence!
Global cell state
Genome activation
patterns: transcriptomics
Protein population:
proteomics
Organisation:
tissue imaging EM X-ray, NMR
cells
molecular complexes
Global approaches: Toward a new Systems Biology
How does the spatial and
temporal organisation of
living matter give rise to
biological processes?
Genome
Living cell
Virtual cell
Perturbation
Dynamic response
Biological knowledge
(computerised)
Sequence information
Structural information
Basic principles
Practical
applications
Global approaches: Toward a new Systems Biology

Bioinformatics
Mathematical
modelling
Simulation
External environment
Internal environment
Metabolic net
Genetic networks
DNA hRNA mRNAs proteins
We do not yet know whether the information in the genome is sufficient
to reconstruct an entire biological system. Information on the building blocks
is not enough; information on their interactions is essential.
Bioinformatics in context

Genomics
Molecular
evolution
Biophysics
Molecular
biology
Ethical, legal,
and social
implications
Bioinformatics
Mathematics/
computer
science
Current challenges to users
Potential hurdles:
Methods are in flux and not fully developed
Resources are scattered and heterogeneous

Remedies: Web resources
navigation guides
integration of tools and databanks

http://www.biochem.ucl.ac.uk/~nagl/bioinformatics.html
Example 1

Sequence homology search of the
genome of Plasmodium
falciparum






Target identification for antimalarial
drugs
The search for new antimalarial
drugs

Malaria is one of the leading causes of morbidity
and mortality in the tropics.
An estimated 300 to 500 million clinical cases and 1.5
to 2.7 million deaths occur each year.
Nearly all fatal cases are caused by Plasmodium
falciparum.
The parasite's resistance to conventional
antimalarial drugs such as chloroquine is growing
at an alarming rate.
P. falciparum has a plastid-like organelle, called the
apicoplast, acquired by endosymbiosis of an alga.




Self-replicating, maternally inherited (35 kb circular DNA).
Comparative genome analysis: Search for orthologs.
The apicoplast contains enzymes found in plant and bacterial,
but not animal, metabolic pathways.
Potential target for antimalarial drugs:
DOXP reductoisomerase

Jomaa et al. (1999) Science 285: 1573-1576
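The slides do not say which algorithm the homology search used. As an illustration only of what such a search scores, here is a minimal sketch of Smith-Waterman local alignment scoring. The function name and the match, mismatch and gap values are assumptions made for this example; real searches typically rely on substitution matrices and faster heuristic tools.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of a and b.

    Uses a simple match/mismatch score and a linear gap penalty
    (illustrative values, not taken from the slides).
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Toy sequences, invented for illustration.
print(local_alignment_score("TTACGTTT", "GGACGGG"))
```

A database search would compute such a score between the query and every database entry and report the highest-scoring hits as candidate homologs.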
Biological databases
In 1995, the number of genes in the database started to exceed
the number of papers on molecular biology and genetics in the
literature!
(Boguski, 1999)
The challenge
Data types
primary data
secondary data
tertiary data
sequence
DNA
amino acid
AATGCGTATAGGC
DMPVERILEALAVE
primary database
secondary
protein structure
motifs: regular
expressions, blocks,
profiles, fingerprints

e.g., alpha-helices, beta-strands
secondary db
domains, folding units
tertiary protein
structure
tertiary db
atomic co-ordinates
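Motifs stored as regular expressions, listed above among the secondary data, can be sketched in a few lines. The example below converts a small subset of PROSITE-style pattern syntax into a Python regular expression and scans a sequence with it. The function names and the toy sequence are assumptions made for illustration, and only the basic syntax elements are handled.

```python
import re


def prosite_to_regex(pattern):
    """Convert a basic PROSITE-style pattern into a Python regex string.

    Handles: '-' separators, x (any residue), [ABC] (one of),
    {ABC} (none of), (n) / (n,m) repeats, and < / > terminal anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # exclusion
        element = element.replace("(", "{").replace(")", "}")   # repetition
        element = element.replace("x", ".")                     # wildcard
        element = element.replace("<", "^").replace(">", "$")   # anchors
        parts.append(element)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return 0-based start positions of non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in regex.finditer(sequence)]


# N-glycosylation style pattern on a made-up sequence.
print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_motif("N-{P}-[ST]-{P}.", "AANGSAANPSA"))
```

Blocks, profiles and fingerprints encode motifs differently (as alignments or position-specific scores) and are not covered by this sketch.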
Primary biological databases
Nucleic acid

EMBL
GenBank
DDBJ (DNA
Data Bank of Japan)
Protein

PIR
MIPS
SWISS-PROT
TrEMBL
NRL-3D
International nucleotide data banks
EMBL: Europe (EMBL, EBI)
GenBank: USA (NLM, NCBI)
DDBJ: Japan (NIG, CIB)
International
Advisory Meeting
Collaborative Meeting
TrEMBL NRDB
GenBank file format
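The slide images showing the format are not reproduced here. As a rough sketch of how a GenBank flat file is laid out (keyword lines, an ORIGIN section holding the numbered sequence, and a `//` terminator), the following reads a few fields from a record. The record is made up for illustration and reuses a DNA fragment from an earlier slide; the parser handles only single-line fields and is not a substitute for an established library.

```python
# Made-up toy record; the accession and description are not real.
TOY_RECORD = """\
LOCUS       TOY0001                   24 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Toy record for illustration only.
ACCESSION   TOY0001
ORIGIN
        1 gctcctcact gtctgtgttt attc
//
"""


def parse_genbank(text):
    """Extract locus, definition, accession and sequence from one record."""
    record = {"locus": None, "definition": "", "accession": None, "sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_origin:
            # Sequence lines: a position number, then blocks of bases.
            record["sequence"] += "".join(c for c in line if c.isalpha()).upper()
        elif line.startswith("LOCUS"):
            record["locus"] = line.split()[1]
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
    return record


print(parse_genbank(TOY_RECORD))
```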
Swiss-Prot
SWISS-PROT file format
Other primary protein databases
TrEMBL (translated EMBL) in SWISS-PROT format
rapid access to sequence data from genome projects
computer-annotated supplement to SWISS-PROT
translations of all coding sequences (CDS) in EMBL

SP-TrEMBL

REM-TrEMBL: immunoglobulins, T-cell receptors, short
fragments, synthetic and patented sequences
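The translation of coding sequences mentioned above can be sketched with the standard genetic code. This is a minimal illustration, not the TrEMBL pipeline: the function name is an assumption, the input is assumed to start in frame at the first codon, and unknown codons are reported as X.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna):
    """Translate an in-frame coding sequence up to the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


# Toy coding sequence, for illustration only.
print(translate_cds("ATGAAGTTAACCTAA"))
```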

Other primary protein databases
The Protein Information Resource (PIR)

integrated system of protein sequence databases
and derived related databases, e.g., alignment
databases
rapid searching, comparison, and pattern matching of
protein sequences
retrieval of descriptive, bibliographic, feature, and
concurrent cross-reference information
aims to be comprehensive and consistently
annotated
PIR: related databases
NRL-3D Sequence-Structure Database


produced by PIR from sequence and annotation
information extracted from three-dimensional
structures in the Protein Databank (PDB)

allows keyword and similarity searches


PIR: related databases
PATCHX integrated with PIR

a non-redundant database of protein sequences
produced by MIPS, the European branch of PIR-International

The PIR Protein Sequence Database and PATCHX
together provide the most complete collection of
protein sequence data currently available in the
public domain.
Composite protein sequence dbs
NRDB       OWL        MIPSX (PIR+PATCHX)   SP+TrEMBL
PIR        PIR        PIR                  TrEMBL
SP         SP         SP                   SP
PDB        GenBank    MIPSOwn
GenPept    NRL-3D     NRL-3D
                      MIPSH
                      PIRMOD
                      MIPSTrn
                      EMTrans
                      GBTrans
                      Kabat
                      PseqIP
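A composite database merges several source databases and removes redundant entries. The sketch below shows one simple way to do this: sources are processed in priority order and a sequence is kept only the first time it is seen. The function name, the data layout and the exact-match criterion are assumptions made for illustration; the actual composite databases each apply their own redundancy rules.

```python
def build_composite(sources):
    """Merge sources in priority order, keeping each distinct sequence once.

    sources: list of (db_name, [(entry_id, sequence), ...]) tuples,
    highest-priority database first. Only exact duplicates are removed.
    """
    seen = {}  # sequence -> (db_name, entry_id) of the first occurrence
    for db_name, entries in sources:
        for entry_id, sequence in entries:
            key = sequence.upper()
            if key not in seen:
                seen[key] = (db_name, entry_id)
    return [(db, entry_id, seq) for seq, (db, entry_id) in seen.items()]


# Hypothetical entries; identifiers and sequences are invented.
toy_sources = [
    ("SP", [("sp1", "MKLT"), ("sp2", "MAGW")]),
    ("PIR", [("pir1", "MKLT"), ("pir2", "MSTN")]),
]
print(build_composite(toy_sources))
```

Changing the order of the sources changes which database an entry is attributed to, which is why the priority order of the component databases matters.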


OWL composite database
OWL only released every 6-8
weeks
By accession number
By database code
By text
By sequence
By title
By author
By query language
By regular expression
Direct OWL access:

OWL Blast server
Two other useful sites
INFOBIOGEN-The Public Catalog of Databases
http://www.infobiogen.fr/services/dbcat/

KEGG-Kyoto Encyclopedia of Genes and Genomes
http://www.genome.ad.jp/kegg/
Kyoto Encyclopedia of Genes and Genomes (KEGG) is an effort to
computerize current knowledge of molecular and cellular biology in
terms of the information pathways that consist of interacting molecules
or genes and to provide links from the gene catalogs produced by
genome sequencing projects.
Sequence Retrieval System (SRS)
Database browser that allows
users to
retrieve
link
access
entries from all interconnected
resources.
Users can formulate queries
across a range of different
database types.

Guide to Protein Databases:
http://www.biochem.ucl.ac.uk/~robert/bioinf/lecture1/index.html
http://www.biochem.ucl.ac.uk/~robert/bioinf/lecture2/index.html


With thanks to Dr Roman Laskowski.
