Professional Documents
Culture Documents
I t d ti to
Introduction
t Bioinformatics
Bi i f
ti
Spring 2008
2008-2009
2009
Tolga
g Can ((Office: B-109))
e-mail: tcan@ceng.metu.edu.tr
Course Web Page:
http://www ceng metu edu tr/~tcan/ceng465 s0809/
http://www.ceng.metu.edu.tr/~tcan/ceng465_s0809/
1
T
Teaching
hi Assistant
A i t t
Dr. Ahmet Saan
ee-mail:
mail: ahmet@ceng.metu.edu.tr
ahmet@ceng metu edu tr
Office: A-206
A general introduction
Course outline
Motivation and introduction to biology (1 week)
Sequence analysis
l i (4
( weeks)
k)
Analyze DNA and protein sequences for clues regarding
function
Identification of homologues
Pairwise sequence alignment
Course outline
Protein structures (4 weeks)
Analyze
y pprotein structures for clues regarding
g
g function
Structure alignment
Gene/Protein networks,
networks pathways (2 weeks)
Protein-protein, protein/DNA interactions
Construction
Constr ction and analysis
anal sis of large scale networks
net orks
6
Grading
Miscellaneous
Course webpage
hhttp://www.ceng.metu.edu.tr/~tcan/ceng465_s0809/
//
d /
/
465 0809/
Lecture slides and reading materials
Assignments
Other relevant information
Newsgroup
metu.ceng.course.465
You should follow the newsgroup for course related
announcements
Students from other departments
p
should gget a CENG
account for this semester (Room: A-210) in order to
access the newsgroup
8
What is Bioinformatics?
(Molecular) Bio - informatics
One idea for a definition?
p
g biology
gy in
Bioinformatics is conceptualizing
terms of molecules (in the sense of physicalchemistry) and then applying informatics
informatics
techniques (derived from disciplines such as
applied
li d math,
th CS,
CS andd statistics)
t ti ti ) to
t understand
d t d
and organize the information associated with
these molecules, on a large-scale.
Bioinformatics is a practical discipline with
many applications.
Introductory Biology
DNA
(Genotype)
Protein
Phenotype
10
Scales of life
11
Animal Cell
Mitochondrion
Cytoplasm
Nucleus
Plasma membrane
Cell coat
Chromatin
Animal CELL
13
Eukaryotes
y
have nucleus ((animal,plants)
,p
)
Linear genomes with multiple chromosomes in
pairs When pairing up,
pairs.
up they look like
Middle: centromere
Top: p-arm
Bottom: q-arm
14
Molecular Biology
gy Information - DNA
Raw DNA
Sequence
Coding or Not?
Parse into ggenes?
4 bases: AGCT
~1 Kb in a gene, ~2
Mb iin genome
~3 Gb Human
atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgca
gcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatac
atggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtg
aaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatcca
gcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattc
ttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaact
ggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgca
t t
t
t t tt tt t t t
tt
t
ggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgt
gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact
gcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgca
tcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacct
gcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgtt
gttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatc
aaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacact
gaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgca
gacgctggtatcgcattaactgattctttcgttaaattggtatc . . .
. . .
caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaa
caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtgg
cgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgctt
gctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgg
gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact
acaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaacc
aatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtc
ggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaa
aaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg
15
DNA structure
16
LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI
TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF
d1dhfa_
d8dfr__
d4dfra
d4dfra_
d3dfr__
LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
ISLIAALAVDRVIGMENAMPW-NLPADLAWFKRNTLD--------KPVIMGRHTWESI
ISLIAALAVDRVIGMENAMPW
NLPADLAWFKRNTLD
KPVIMGRHTWESI
TAFLWAQDRNGLIGKDGHLPW-HLPDDLHYFRAQTVG--------KIMVVGRRTYESF
17
18
More on
M
Macromolecular
l
l Structure
St
t
Primary structure of proteins
Linear polymers linked by peptide bonds
Sense of direction
19
Secondary Structure
Polypeptide chains fold into regular local
structures
alpha helix, beta sheet, turn, loop
based on energy
gy considerations
Ramachandran plots
20
Alpha helix
21
Beta sheet
anti-parallel
parallel
schematic
22
Tertiary Structure
3-d structure of a polypeptide sequence
interactions between non
non-local
local and foreign atoms
often separated into domains
tertiary structure of
myoglobin
domains of CD4
23
Quaternary Structure
Arrangement of protein subunits
dimers,
di
tetramers
quaternary structure
of Cro
human hemoglobin
g
tetramer
24
Structure summary
y
3-d structure determined byy pprotein sequence
q
Cooperative and progressive stabilization
Prediction remains a challenge
ab-initio (energy minimization)
knowledge-based
g
Chou-Fasman and GOR methods for SSE prediction
Comparative modeling and protein threading for tertiary
structure prediction
26
27
28
1995
Bacteria,
Bacteria
1.6 Mb,
~1600 genes
[Science 269: 496]
1997
Eukaryote,
Eukaryote
13 Mb,
~6K genes
[Nature 387: 1]
Genomes
hi hli ht
highlight
the
Finiteness
of the
Parts in
Biology
1998
Animal,
A
i l
~100 Mb,
~20K genes
[Science 282:
1945]
2000?
Human,
~3 Gb,
~100K
genes [???]
29
30
Young/Lander, Chips,
Young/Lander
Chips
Abs. Exp.
Brown, array,
Rel. Exp. over
Timecourse
Also: SAGE;
Samson and
Church, Chips;
Aebersold,
Protein
E
Expression
i
Snyder,
S
d
Transposons,
Protein Exp
Exp.
31
Array Data
Yeast Expression Data in
Academia:
l
levels
l ffor allll 6000 genes!!
Can onl
only seq
sequence
ence genome
once but can do an infinite
variety of these array
experiments
at 10 time points
points,
6000 x 10 = 60K floats
telling signal from
background
(courtesy of J Hager)
32
Other Whole-Genome
E
Experiments
i
t
Systematic Knockouts
Winzeler, E. A., Shoemaker, D. D.,
Astromoff, A., Liang, H., Anderson, K.,
Andre, B., Bangham, R., Benito, R.,
Boeke JJ. D
Boeke,
D., Bussey
Bussey, H
H., Chu
Chu, A
A. M
M.,
Connelly, C., Davis, K., Dietrich, F., Dow,
S. W., El Bakkoury, M., Foury, F., Friend,
S. H., Gentalen, E., Giaever, G.,
Hegemann, J. H., Jones, T., Laub, M.,
Liao H
Liao,
H., Davis
Davis, R
R. W
W. & et al
al. (1999)
(1999).
Functional characterization of the S.
cerevisiae genome by gene deletion and
parallel analysis. Science 285, 901-6
2h
hybrids,
b id linkage
li k
maps
Hua, S. B., Luo, Y., Qiu, M., Chan, E., Zhou, H. &
Zhu, L. (1998). Construction of a modular yeast twohybrid cDNA library from human EST clones for the
human genome protein linkage map. Gene 215,
143-52
For yeast:
6000 x 6000 / 2
~ 18M interactions
33
Information to understand
genomes
Metabolic Pathways
(glycolysis),
traditional
biochemistry
Regulatory Networks
Whole Organisms
Phylogeny,
y g y, traditional
zoology
Environments,
Habitats,, ecology
gy
The Literature
(MEDLINE)
The Future....
Future
34
Organizing
Molecular Biology Information:
Redundancy and Multiplicity
Different Sequences Have the Same
Structure
Organism has many similar genes
Single Gene May Have Multiple
Functions
Genes are grouped into Pathways
Genomic Sequence Redundancy due
to the
h Genetic
G
i Code
C d
How do we find the
similarities?
similarities?.....
35
Human genome
Noncoding
DNA
810Mb
Genes and generelated sequences
900Mb
Coding
DNA
90Mb
Pseudogenes
d
Gene fragments
Introns, leaders, trailers
Single-copy genes
Multi-gene families
Dispersed
Regulatory sequences
Repetitive DNA
420Mb
Non-coding
tandem
repeats
Genomewide
interspersed
p
repeats
Extragenic DNA
2100Mb
Tandemly
repeated
Satellite DNA
Minisatellites
c osate tes
Microsatellites
DNA transposons
LTR elements
LINEs
SINEs
36
Protein Databases
SWISS-PROT: http://www.expasy.ch/sprot
PDB: http://www.pdb.bnl.gov/
37
Bibliography
38
Computer
Calculations
39
Application domains
Bio-defense
40
Kinds of activities
41
Motivation
Diversity
y and size of information
Sequences, 3-D structures, microarrays, protein
interaction networks,, in silico models,, bio-images
g
Bioinformatics - A Revolution
Biological Experiment
Collect
Data
Information
Characterize
Knowledge
Compare
Discovery
Model
Infer
Data
5MHz
Emphasis
Genomes
Microarrays
106
Solved
structure
# People/Website
20K websites in 1995
36M websites in 2004
Ribosome
Sequencing cost
2c/bp
$10/bp
Goal is .0001c/bp
E.Coli
90
Models &
Pathways
102
Virus
Structure
Sequenced
genome
Processing
speed
2 GHz
Technology
95
Yeast
Year
C.Elegans
Human
00
05
44
Like physics, where general rules and laws are taught at the start, biology will
surely be presented to future generations of students as a set of basic systems
....... duplicated and adapted to a very wide range of cellular and organismic
functions, following basic evolutionary principles constrained by Earths
geological history.
history
--Temple
Temple Smith,
Smith Current Topics in Computational Molecular
Biology
45
Scalability challenges
Recent issue of NAR devoted to data collections
contains
co
a s 719
7 9 databases
da abases
Sequence
Genomes ((more than 150),
), ESTs,, Promoters,, transcription
p
factor binding sites, repeats, ..
Structure
Domains, motifs, classifications, ..
Others
Mi
Microarrays, subcellular
b ll l localization,
l li i ontologies,
l i pathways,
h
SNPs, ..
46
47
Skill set
Artificial intelligence
Machine learning
Statistics & probability
Algorithms
Databases
P
Programming
i
48
Bioinformatics Topics
G
Genome
Sequence
S
Fi
Finding
di Genes
G
in
i Genomic
G
i DNA
introns
exons
promotors
Characterizing Repeats in Genomic DNA
Statistics
Patterns
Duplications in the Genome
Large scale genomic alignment
49
Sequence Alignment
non-exact string
g matching,
g gaps
g p
How to align two strings
optimally via Dynamic
Programming
Local vs Global Alignment
Suboptimal Alignment
Hashing to increase speed
(BLAST, FASTA)
Amino
A i acid
id substitution
b tit ti scoring
i
matrices
Multiple Alignment and Consensus
Patterns
How to align more than one
sequence and then fuse the result
in a consensus representation
Transitive Comparisons
HMMs, Profiles
Motifs
Bioinformatics Topics
p
Protein Sequence
Computationally
p
y challenging
g g problems
p
More sensitive ppairwise alignment
g
Dynamic programming is O(mn)
m is the length of the query
n is
i the
h length
l
h off the
h database
d b
Shotgun
g alignment
g
Current techniques will take over 200 days on a single
machine to align the mouse genome
51
Bioinformatics Topics
Sequence / Structure
Secondary Structure
Prediction
Prediction
via Propensities
Neural Networks,
Genetic Alg.
Simple Statistics
TM-helix finding
Assessing Secondary
Structure Prediction
Structure Prediction: Protein
and RNA
Topics -- Structures
Structural Alignment
Aligning sequences on
the basis of 3D structure.
DP does not converge,
unlike
lik sequences, what
h
to do?
Other Approaches:
Distance Matrices,
Hashing
Fold Library
53
Computationally
p
y challenging
g g problems
p
Alignment against a database
Single comparison usually takes seconds.
Comparison
C
i
against
i t a database
d t b
takes
t k hours.
h
All-against-all comparison takes weeks.
Topics
p
-- Databases
Basic clustering
UPGMA
single-linkage
multiple linkage
Protein Units?
What are the units of
biological information?
sequence, structure
motifs, modules, domains
Other Methods
Parsimony, Maximum likelihood
Evolutionary implications
Visualization
Vi
li ti off Large
L
Amounts
A
t
of Information
The Bias Problem
sequence weighting
sampling
55
Topics -- Genomics
Expression Analysis
Time Courses clustering
Measuring differences
Identifying Regulatory
Regions
Large scale cross referencing of
information
Function Classification and
Orthologs
The Genomic vs. Single-molecule
P
Perspective
i
Genome Comparisons
Ortholog Families, pathways
Large-scale censuses
Frequent Words Analysis
Genome Annotation
Trees from Genomes
Identification of interacting
proteins
Structural Genomics
Folds in Genomes, shared &
common folds
Bulk Structure Prediction
G
Genome
T
Trees
56
Topics -- Simulation
Molecular Simulation
> Energy ->
>
Geometry ->
Forces
Basic interactions, potential
energy functions
Electrostatics
VDW Forces
Bonds as Springs
How structure changes over
time?
How to measure the
change in a vector
(gradient)
Molecular Dynamics & MC
Energy Minimization
Parameter Sets
Number Densityy
Poisson-Boltzman
Equation
Lattice Models and
Simplification
57
General Types of
Informatics techniques
in Bioinformatics
Databases
Building, querying
Schema design
Heterogeneous,
distributed
Similarity search
Sequence, structure
Significance statistics
Finding Patterns
AI / Machine Learning
Clustering
Data mining
Modeling & simulation
Programming
Perl
Java/C/C++/..
58