Computational biology
vs bioinformatics
• According to the NIH:
– Computational biology:
The development and application of data-analytical and theoretical
methods, mathematical modeling and computational simulation
techniques to the study of biological, behavioral, and social systems.
– Bioinformatics:
Research, development, or application of computational tools and
approaches for expanding the use of biological, medical, behavioral or
health data, including those to acquire, store, organize, archive, analyze,
or visualize such data.
Computer scientists do biology?
Molecular biology:
A computer science perspective
• Biopolymers:
– Long molecules with repeating structure which
contain genetic information
The Central Dogma
of Molecular Biology (Crick, 1956)
Image due to Genome Research Limited, but retrieved from http://www.yourgenome.org/facts/what-is-the-central-dogma
• Once you go protein, you can’t go back!
Genetic information as strings
• DNA can be represented as strings over
{A,G,C,T} (nucleotides)
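A minimal Python sketch of this representation, assuming sequences are stored as ordinary strings. The names `DNA_ALPHABET` and `is_dna` are hypothetical helpers introduced only for illustration.

```python
# The four-letter nucleotide alphabet from the slide.
DNA_ALPHABET = frozenset("AGCT")


def is_dna(sequence: str) -> bool:
    """Return True if every character of `sequence` is one of A, G, C, T."""
    return all(base in DNA_ALPHABET for base in sequence.upper())


print(is_dna("GATTACA"))  # a valid string over {A,G,C,T}
print(is_dna("GATUACA"))  # contains U, which is not in the DNA alphabet
```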
Genetic information as strings
• Proteins are chains of amino acids.
There are 20 common amino acids, encoded
by the genetic code
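The mapping from nucleotide triplets (codons) to amino acids can be sketched as a lookup table. The example below assumes the standard genetic code (NCBI translation table 1) and reads codons directly from a coding-strand DNA string. `CODON_TABLE` and `translate` are hypothetical names, not something the slides define.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in the order
# TTT, TTC, TTA, TTG, TCT, ... ; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate DNA codon by codon, stopping at the first stop codon."""
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

The table has 64 codons but only 20 distinct amino acid letters, so several codons encode the same amino acid.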
Sequence alignment
• Evolutionary processes perturb sequences via
deletions, insertions, substitutions
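One way to make these three operations concrete is to count the fewest deletions, insertions, and substitutions that turn one sequence into another. The sketch below uses the classic edit-distance dynamic program as an illustration; it is a swapped-in simplification with unit costs, not the alignment method of the slides, and `edit_distance` is a hypothetical name.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of deletions, insertions, and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string needs i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("ACGT", "AGT"))   # one deletion
print(edit_distance("ACGT", "ACCT"))  # one substitution
```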
T. Warnow - An Introduction to Computational Phylogenetics
Multiple sequence alignment
Durbin et al. - Biological sequence analysis: probabilistic models of proteins and nucleic acids
Profile HMMs
(Haussler et al., 1993; Krogh et al., 1994).
• Hidden Markov models for multiple sequence
alignment
– The model encodes a generative process for the
sequences in the sequence family
• Aligned residues
• Perturbations caused by evolution
Profile HMMs
• Simple case: substitutions occur
– No insertions or deletions
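In this gap-free case the model reduces to one emission distribution per alignment column, and the probability of a sequence is the product of its per-column emission probabilities. A minimal sketch, assuming a DNA alphabet and add-one pseudocounts (both choices are assumptions); `fit_profile` and `log_probability` are hypothetical names.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def fit_profile(alignment, pseudocount=1.0):
    """Estimate one emission distribution per column of a gap-free alignment."""
    if len({len(sequence) for sequence in alignment}) != 1:
        raise ValueError("all sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append(
            {symbol: (counts[symbol] + pseudocount) / total for symbol in ALPHABET}
        )
    return profile


def log_probability(profile, sequence):
    """Log-probability of a sequence: the sum of per-column log emissions."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must match the number of columns")
    return sum(math.log(column[symbol]) for column, symbol in zip(profile, sequence))


profile = fit_profile(["ACGT", "ACGA", "ACGT"])
print(log_probability(profile, "ACGT") > log_probability(profile, "TTTT"))
```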
Dealing with gaps (insertions)
• Insert states draw residues from a background
distribution (denoted with diamonds)
• Insert states do not advance the position of the alignment
• Self-loop transitions allow multiple insertions
Dealing with deletions
• Delete states are “silent” and do not emit symbols
The full profile HMM model
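The generative process with match, insert, and delete states can be sketched as a sampler. This is a simplified illustration, not the model from Durbin et al.: it assumes the same transition probabilities at every position and from every state, and a uniform background distribution for insert states. The function name and the parameters `p_insert`, `p_delete`, and `p_extend` are hypothetical.

```python
import random

ALPHABET = "ACGT"


def sample_sequence(match_emissions, rng, p_insert=0.1, p_delete=0.1, p_extend=0.3):
    """Sample (sequence, state path) from a simplified profile HMM.

    match_emissions holds one {symbol: probability} dict per match column.
    """
    residues, path = [], []
    for k in range(len(match_emissions) + 1):
        if k > 0:
            if rng.random() < p_delete:
                # Delete state D_k is silent: it skips column k without emitting.
                path.append(f"D{k}")
            else:
                # Match state M_k emits from the column's own distribution.
                column = match_emissions[k - 1]
                symbol = rng.choices(list(column), weights=list(column.values()))[0]
                residues.append(symbol)
                path.append(f"M{k}")
        # Insert state I_k emits background residues without advancing the
        # column; its self-loop (probability p_extend) allows several in a row.
        stay = p_insert
        while rng.random() < stay:
            residues.append(rng.choice(ALPHABET))
            path.append(f"I{k}")
            stay = p_extend
    return "".join(residues), path


columns = [{"A": 0.9, "C": 0.1}, {"C": 1.0}, {"G": 0.8, "T": 0.2}]
print(sample_sequence(columns, random.Random(0)))
```

Setting `p_insert` and `p_delete` to zero recovers the substitution-only case, where the state path visits every match state exactly once.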
Example profile HMM
trained from an alignment
Phylogenetics
• Study evolutionary history and relationships
between organisms, species, populations…
• Finding the “tree of life”
– A phylogenetic tree,
a.k.a. a phylogeny
Molecular phylogeny
• Input: aligned DNA sequences
(Figure: tree with numeric edge labels 0.1, 0.01, 0.05, 0.32, 0.27, 0.12)
Branch length parameterization
• The number of changes on each edge e depends on
the “length” of the branch
• If evolution is “clock-like,” lengths are proportional
to time.
Jukes-Cantor model
• Extends the CFN (Cavender-Farris-Neyman) model to four states,
one per nucleotide
• On each edge, the state changes with probability p. If it changes,
the new nucleotide is picked uniformly from the other 3
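The per-edge rule can be sketched directly in Python. The link between p and branch length used here is the standard Jukes-Cantor formula, p = 3/4 (1 - exp(-4t/3)) for a branch of t expected substitutions per site; the slides do not state it, so treat it as an assumption. All function names are hypothetical.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def change_probability(branch_length: float) -> float:
    """Probability that a site differs across an edge of the given length,
    measured in expected substitutions per site (standard Jukes-Cantor)."""
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))


def evolve_site(nucleotide: str, p: float, rng: random.Random) -> str:
    """With probability p, replace the nucleotide by one of the other three."""
    if rng.random() < p:
        return rng.choice([n for n in NUCLEOTIDES if n != nucleotide])
    return nucleotide


def evolve_sequence(sequence: str, p: float, rng: random.Random) -> str:
    """Evolve every site independently along one edge."""
    return "".join(evolve_site(n, p, rng) for n in sequence)


rng = random.Random(0)
print(evolve_sequence("ACGTACGTACGT", change_probability(0.1), rng))
```

As the branch length grows, `change_probability` approaches 3/4, the point at which the descendant nucleotide carries no information about the ancestor.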
Kingman’s coalescent
• Represent the tree by partitions of the data
(i.e. clusters) at each timepoint
Kingman’s coalescent
• Start with all n data points in their own
clusters, at time 0
Racing exponentials
• Memorylessness: P(T > s + t | T > s) = P(T > t) for an exponentially distributed waiting time T
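A small simulation can illustrate the race. For independent exponential clocks with rates λ_1, ..., λ_k, the first clock to ring does so at a time that is exponential with rate λ_1 + ... + λ_k, and clock i wins with probability λ_i divided by that sum. The sketch below checks both facts empirically; the function names, rates, and trial count are illustrative assumptions.

```python
import random


def race(rates, rng):
    """Run one race of independent exponential clocks; return (winner, time)."""
    times = [rng.expovariate(rate) for rate in rates]
    winner = min(range(len(rates)), key=times.__getitem__)
    return winner, times[winner]


def summarize(rates, trials, rng):
    """Empirical win frequencies and mean winning time over many races."""
    wins = [0] * len(rates)
    total_time = 0.0
    for _ in range(trials):
        winner, time = race(rates, rng)
        wins[winner] += 1
        total_time += time
    return [count / trials for count in wins], total_time / trials


frequencies, mean_time = summarize([1.0, 2.0, 3.0], 20000, random.Random(0))
print(frequencies)  # should be close to 1/6, 1/3, 1/2
print(mean_time)    # should be close to 1 / (1 + 2 + 3)
```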
Equivalent implementation of
Kingman’s coalescent
• For i = 1:n-1
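The loop body is not shown in the slides. The sketch below fills it with the standard construction of Kingman's coalescent, which is an assumption about what the slide intends: with k clusters remaining, wait an exponential time with rate k(k-1)/2 (one unit-rate clock per pair), then merge a uniformly chosen pair. Times are reported as elapsed time since time 0, and `sample_coalescent` is a hypothetical name.

```python
import random


def sample_coalescent(n, rng):
    """Sample merge events (time, left cluster, right cluster) for n data points."""
    clusters = [frozenset([i]) for i in range(n)]  # every point starts alone
    time = 0.0
    events = []
    for i in range(1, n):  # i = 1 .. n-1, one merge per iteration
        k = n - i + 1  # clusters remaining before this merge
        # The minimum of k(k-1)/2 unit-rate exponentials is exponential
        # with rate k(k-1)/2, so a single draw replaces the whole race.
        time += rng.expovariate(k * (k - 1) / 2)
        a, b = rng.sample(range(k), 2)  # uniformly chosen pair of clusters
        left, right = clusters[a], clusters[b]
        clusters = [c for index, c in enumerate(clusters) if index not in (a, b)]
        clusters.append(left | right)
        events.append((time, left, right))
    return events


for time, left, right in sample_coalescent(4, random.Random(0)):
    print(round(time, 3), sorted(left), sorted(right))
```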
Generating data given a tree
• Generate data forward in time via a Markov
process (e.g. Brownian motion)
– Each split point initiates an independent Markov
process for each branch
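A minimal sketch of this forward process with Brownian motion, where the displacement along a branch of length t is Gaussian with variance sigma² t in each coordinate. The tree encoding (a leaf is a label, an internal node is a list of `(branch_length, subtree)` pairs), the function name, and the `sigma` parameter are all assumptions made for illustration.

```python
import math
import random


def sample_leaves(tree, start, rng, sigma=1.0):
    """Diffuse a point down the tree and return {leaf label: final position}."""
    if isinstance(tree, str):  # a leaf: record where the process ended up
        return {tree: start}
    leaves = {}
    for branch_length, subtree in tree:
        # Each branch runs its own independent Brownian motion from the split.
        std = sigma * math.sqrt(branch_length)
        end = tuple(x + rng.gauss(0.0, std) for x in start)
        leaves.update(sample_leaves(subtree, end, rng, sigma))
    return leaves


# Hypothetical tree: leaf "a" splits off first, then "b" and "c" split later.
tree = [(1.0, "a"), (0.5, [(0.5, "b"), (0.5, "c")])]
for label, position in sample_leaves(tree, (0.0, 0.0), random.Random(0)).items():
    print(label, position)
```

Leaves that share more of their path from the root ("b" and "c" here) tend to land closer together, which is what makes the tree recoverable from the data.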
Generating 2D data
Inference: particle filter
Think-pair-share