Cse291d 19

CSE291D Lecture 19
Models for computational biology
1
Computational biology
vs bioinformatics
• According to the NIH:
– Computational biology:
The development and application of data-analytical and theoretical
methods, mathematical modeling and computational simulation
techniques to the study of biological, behavioral, and social systems.
Biologists do computer science?
– Bioinformatics:
Research, development, or application of computational tools and
approaches for expanding the use of biological, medical, behavioral or
health data, including those to acquire, store, organize, archive, analyze,
or visualize such data.
Computer scientists do biology?
2
Molecular biology:
A computer science perspective
• Biopolymers:
– Long molecules with repeating structure which
contain genetic information
– Poly = many. Mer = part. Each part is a monomer
– Can represent as strings
• Include DNA, RNA, proteins
3
The Central Dogma
of Molecular Biology (Crick, 1956)
• DNA makes RNA
• RNA makes protein
4
Image due to Genome Research Limited, but retrieved from http://www.yourgenome.org/facts/what-is-the-central-dogma
The Central Dogma
• Once you go
protein, you
can’t go back!
5
Genetic information as strings
• DNA can be represented as strings over
{A,G,C,T} (nucleotides)
• RNA can be represented as strings over
{A,G,C,U}
7
Genetic information as strings
• Proteins are chains of amino acids.
There are 20 common amino acids, encoded
by the genetic code
• The sequence determines the 3D shape that
the molecule folds into, and thereby its function.
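The string view of DNA, RNA, and proteins can be sketched in a few lines. This is a minimal illustration, assuming the input DNA is the coding strand read in frame from its first base; the function names `transcribe` and `translate` are hypothetical, and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(dna):
    """DNA coding strand -> RNA string (T is replaced by U)."""
    return dna.replace("T", "U")

def translate(rna):
    """RNA string -> protein string, stopping at the first stop codon."""
    dna = rna.replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

rna = transcribe("ATGGCCTAA")
print(rna, translate(rna))  # AUGGCCUAA MA
```

Note that the mapping from codons to amino acids is many-to-one, which is one way to see why the protein string cannot be mapped back to a unique nucleotide string.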
8
Genes
• DNA is organized into long strands called
chromosomes. Humans have 46
chromosomes (23 pairs)
• Each chromosome includes many genes:
sequences that encode proteins or functional
RNA
• These determine our genetic traits,
e.g. eye color
9
Learning outcomes
By the end of the lesson, you should be able to:
• Apply profile HMMs for multiple sequence alignment
• Model phylogenetic trees and perform hierarchical
clustering with Kingman’s coalescent
• Contrast the properties of Kingman’s coalescent with
other simple phylogenetic models
10
Sequence alignment
• Evolutionary processes perturb sequences via
deletions, insertions, substitutions
14
T. Warnow - An Introduction to Computational Phylogenetics
Multiple sequence alignment
• Given: A set of biological sequences

– protein, DNA, or RNA
– assumed to be evolutionarily related
• Output: A matrix where the sequences are aligned:
– each column contains residues (i.e. letters) that came from
the same ancestral residue
– and that have similar 3D structural positions
15
• x: consensus sequence (aligned columns)

• -: gaps/deletions
16
Durbin et al. - Biological sequence analysis: probabilistic models of proteins and nucleic acids
Profile HMMs
(Haussler et al., 1993; Krogh et al., 1994).
• Hidden Markov models for multiple sequence
alignment
– The model encodes a generative process for the
sequences in the sequence family
• Aligned residues
• Perturbations caused by evolution
– Latent HMM states encode where perturbations
happen for a given sequence
– Parameters encode probabilities over residues, and
transition probabilities between states
18
Profile HMMs
• Simple case: substitutions occur
– No insertions or deletions
• A match state for each column in the alignment
• Discrete distribution over residues,
for each match state (i.e. position in the alignment)
• Denote with a square
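For the substitution-only case, the match-state parameters can be estimated by counting residues in each alignment column. The sketch below assumes a gap-free alignment over the DNA alphabet and adds a pseudocount so that unseen residues keep nonzero probability; the function name `match_emissions` and the pseudocount value are illustrative choices, not taken from the slides.

```python
from collections import Counter

def match_emissions(alignment, alphabet="ACGT", pseudocount=1.0):
    """One discrete distribution over residues per alignment column."""
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment) + pseudocount * len(alphabet)
    profiles = []
    for col in range(n_cols):
        counts = Counter(row[col] for row in alignment)
        profiles.append(
            {a: (counts[a] + pseudocount) / total for a in alphabet}
        )
    return profiles

profiles = match_emissions(["ACG", "ACG", "ATG"])
print(profiles[1])  # column 2: C is most probable, then T
```

With three sequences and a pseudocount of 1, column 2 assigns probability 3/7 to C, 2/7 to T, and 1/7 each to A and G.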
19
Dealing with gaps (insertions)
• Insert states draw residues from a background
distribution. (Denote with diamonds)
• Insert states do not advance the position in the alignment
• Self-loop transitions allow multiple consecutive insertions
20
Dealing with deletions
• Delete states are “silent” and do not emit symbols
21
The full profile HMM model
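The state graph of the full model can be enumerated programmatically. This sketch assumes the architecture described in Durbin et al., in which every match, insert, and delete state at position k may move to the insert state at k or to the match or delete state at k+1; the state names (`Begin`, `M1`, `I0`, `D1`, `End`) are labels chosen here for illustration.

```python
def profile_hmm_transitions(n_match):
    """List the allowed (source, target) transitions for n_match match states."""
    edges = []
    for k in range(n_match + 1):
        # States sitting at alignment position k (Begin plays the role of M0).
        sources = ["Begin" if k == 0 else f"M{k}", f"I{k}"]
        if k > 0:
            sources.append(f"D{k}")
        # Insert self-loop or entry keeps position k; the rest advance to k + 1.
        targets = [f"I{k}"]
        if k < n_match:
            targets += [f"M{k + 1}", f"D{k + 1}"]
        else:
            targets.append("End")
        edges += [(s, t) for s in sources for t in targets]
    return edges

edges = profile_hmm_transitions(3)
states = {s for edge in edges for s in edge}
print(len(states), len(edges))  # 12 states, 30 transitions
```

Under this assumption a model with L match states has 3L + 3 states and 9L + 3 transitions, so the parameter count grows linearly in the alignment length.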
22
Example profile HMM
trained from an alignment
Shaded residues treated as inserts

23
Performing multiple sequence
alignment using profile HMMs
• Set number of match states, e.g. to average
length of training sequences
• Given unaligned sequences, learn the model
parameters using EM
• Align all sequences to the model using Viterbi
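The Viterbi alignment step can be sketched as a dynamic program over (sequence position, model position). This is a simplified illustration, not a full implementation: it assumes transition probabilities shared across all positions, a single background distribution for all insert states, and the transition into the end state scored like a move into a match state. The function name `viterbi_profile` and all parameter values are hypothetical.

```python
import math

NEG_INF = float("-inf")

def _log(p):
    return math.log(p) if p > 0 else NEG_INF

def viterbi_profile(seq, match_emit, insert_emit, trans):
    """Return (log-probability, path) of the best state path for seq.

    match_emit: list of dicts, one per match state (residue -> probability).
    insert_emit: dict used by every insert state (background distribution).
    trans: dict (from_type, to_type) -> probability, with types "M", "I", "D".
    """
    n, L = len(seq), len(match_emit)
    # score[s][i][j]: best log-prob of emitting seq[:i], ending in state s_j.
    score = {s: [[NEG_INF] * (L + 1) for _ in range(n + 1)] for s in "MID"}
    back = {s: [[None] * (L + 1) for _ in range(n + 1)] for s in "MID"}
    score["M"][0][0] = 0.0  # the begin state is treated as M_0

    def best(i, j, to):
        # Best predecessor type at cell (i, j) for a move into type `to`.
        return max((score[s][i][j] + _log(trans[(s, to)]), s) for s in "MID")

    for i in range(n + 1):
        for j in range(L + 1):
            if i > 0 and j > 0:  # match: consumes a residue and a column
                v, back["M"][i][j] = best(i - 1, j - 1, "M")
                score["M"][i][j] = v + _log(match_emit[j - 1].get(seq[i - 1], 0.0))
            if i > 0:  # insert: consumes a residue, stays in the column
                v, back["I"][i][j] = best(i - 1, j, "I")
                score["I"][i][j] = v + _log(insert_emit.get(seq[i - 1], 0.0))
            if j > 0:  # delete: silent, advances the column only
                score["D"][i][j], back["D"][i][j] = best(i, j - 1, "D")

    final, state = best(n, L, "M")
    if final == NEG_INF:
        raise ValueError("sequence has zero probability under the model")
    path, i, j = [], n, L
    while (i, j, state) != (0, 0, "M"):
        path.append((state, j))
        prev = back[state][i][j]
        if state == "M":
            i, j = i - 1, j - 1
        elif state == "I":
            i -= 1
        else:
            j -= 1
        state = prev
    return final, path[::-1]

# Hypothetical three-column model with consensus "ACG".
consensus = "ACG"
match_emit = [{a: (0.97 if a == c else 0.01) for a in "ACGT"} for c in consensus]
insert_emit = {a: 0.25 for a in "ACGT"}
trans = {("M", "M"): 0.9, ("M", "I"): 0.05, ("M", "D"): 0.05,
         ("I", "M"): 0.6, ("I", "I"): 0.3, ("I", "D"): 0.1,
         ("D", "M"): 0.6, ("D", "I"): 0.1, ("D", "D"): 0.3}
print(viterbi_profile("AG", match_emit, insert_emit, trans)[1])
# [('M', 1), ('D', 2), ('M', 3)]
```

Each sequence's path tells which residues sit in which match column, so running this over all sequences and stacking the results by column yields the multiple alignment.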
24
Phylogenetics
• Study evolutionary history and relationships
between organisms, species, populations…
• Finding the “tree of life”
– A phylogenetic tree,
a.k.a. a phylogeny
25
Molecular phylogeny
• Input: aligned DNA sequences
• Output: phylogenetic tree
[Figure: phylogenetic tree with leaves bat, rat, cat, goat, gnat]
27

Cavender-Farris-Neyman (CFN) model
• A model for a particular trait, which is either present or
absent (e.g. has fins)
• A tree, and probabilities of flipping the value at each edge
[Figure: tree with leaves bat, rat, cat, goat; edges labeled with flip probabilities 0.1, 0.01, 0.05, 0.32, 0.27, 0.12]
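A small simulation makes the CFN model concrete. The tree topology below is hypothetical: the slide's figure is not recoverable from the text, so the six edge probabilities are assigned to an assumed ((bat, rat), (cat, goat)) tree purely for illustration. The helper `path_flip_probability` uses the fact that independent flips compose as (1 - 2p) = product of (1 - 2p_e) along a path.

```python
import random

def path_flip_probability(edge_probs):
    """Probability that the trait differs between the two ends of a path."""
    agree = 1.0
    for p in edge_probs:
        agree *= 1.0 - 2.0 * p
    return (1.0 - agree) / 2.0

def simulate_cfn(parent_and_prob, root, rng):
    """Sample a 0/1 trait at every node; parents must be listed before children."""
    states = {root: rng.randint(0, 1)}
    for node, (parent, p) in parent_and_prob.items():
        flipped = rng.random() < p
        states[node] = states[parent] ^ int(flipped)
    return states

# Hypothetical topology and assignment of the edge probabilities.
TREE = {
    "u": ("root", 0.1),
    "v": ("root", 0.12),
    "bat": ("u", 0.01),
    "rat": ("u", 0.05),
    "cat": ("v", 0.32),
    "goat": ("v", 0.27),
}

print(simulate_cfn(TREE, "root", random.Random(0)))
print(path_flip_probability([0.01, 0.05]))  # bat vs rat, about 0.059
```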
28
Branch length parameterization
• The number of changes on each edge e depends
on the “length” of the branch
• If evolution is “clock-like,” lengths are proportional
to time.
[Figure: tree with leaves bat, rat, cat, goat]
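One common way to connect branch lengths to flip probabilities, stated here as an assumption rather than as the slide's exact formula: if changes on an edge arrive as a Poisson process and the branch length is the expected number of changes, then a two-state trait ends up flipped exactly when the number of changes is odd, which has probability (1 - exp(-2 * length)) / 2. The function names are hypothetical.

```python
import math

def flip_prob_from_length(length):
    """P(odd number of changes) when changes ~ Poisson(length)."""
    return 0.5 * (1.0 - math.exp(-2.0 * length))

def length_from_flip_prob(p):
    """Inverse map; requires 0 <= p < 0.5."""
    return -0.5 * math.log(1.0 - 2.0 * p)

# Lengths add along a path, while flip probabilities do not.
a, b = 0.1, 0.3
pa, pb = flip_prob_from_length(a), flip_prob_from_length(b)
print(flip_prob_from_length(a + b))
print(pa * (1 - pb) + pb * (1 - pa))  # same value, composed edge by edge
```

The additivity of lengths is what makes this parameterization convenient for clock-like trees, where length is proportional to elapsed time.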
29
Jukes-Cantor model
• Extends the CFN model to four states, i.e. the value
of a single nucleotide
• On each edge, the nucleotide changes with
probability p. If it changes, the new value is
picked uniformly from the other 3
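The per-edge Jukes-Cantor step described above can be written directly as a transition matrix and a sampler. This is a minimal sketch that applies the same change probability p independently to every site of a sequence; the function names are illustrative.

```python
import random

NUCLEOTIDES = "ACGT"

def jc_transition_matrix(p):
    """P[a][b]: probability that nucleotide a becomes b across one edge."""
    return {
        a: {b: (1.0 - p if a == b else p / 3.0) for b in NUCLEOTIDES}
        for a in NUCLEOTIDES
    }

def jc_evolve(seq, p, rng):
    """Evolve each site of seq independently along one edge."""
    out = []
    for base in seq:
        if rng.random() < p:
            # Change: pick uniformly among the three other nucleotides.
            base = rng.choice([b for b in NUCLEOTIDES if b != base])
        out.append(base)
    return "".join(out)

rng = random.Random(0)
print(jc_transition_matrix(0.3)["A"])
print(jc_evolve("ACGTACGTAC", 0.3, rng))
```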
[Figure: phylogenetic tree with leaves bat, rat, cat, goat, gnat]
30

Kingman’s coalescent
• The models so far do not have a prior on the
tree. They can be estimated via maximum
likelihood
• Kingman’s coalescent is a prior on trees

– Teh et al. (2008) present a Bayesian analysis,
and use the model for hierarchical clustering
31
• Represent the tree by partitions of the data
(i.e. clusters) at each timepoint
• Tree formation runs backwards in time
• Recursively merge clusters until all are merged
(coalesced)
32
• Start with all n data points in their own
clusters, at time 0
• Each pair of clusters merges with exponential
rate 1
33
Racing exponentials
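• Memorylessness: if $T \sim \mathrm{Exp}(\lambda)$, then $P(T > s + t \mid T > s) = P(T > t)$
• The minimum of independent $\mathrm{Exp}(\lambda_j)$ variables is $\mathrm{Exp}(\sum_j \lambda_j)$, so with $k$ clusters the first of the $\binom{k}{2}$ pairwise merges arrives at rate $\binom{k}{2}$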
37
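Equivalent implementation of
Kingman’s coalescent
• For i = 1:n-1
– Time to next merge is $\delta_i \sim \mathrm{Exp}\left(\binom{n-i+1}{2}\right)$ (with n-i+1 clusters remaining)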
– Select two clusters to merge uniformly at random
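The loop above can be sketched as a short sampler. It assumes the waiting time with k remaining clusters is exponential with rate C(k, 2), which follows from each pair merging at rate 1; times are reported as positive distances back from the present, and clusters are represented as tuples of leaf indices. The function name `sample_coalescent` is hypothetical.

```python
import random
from math import comb

def sample_coalescent(n, rng):
    """Return a list of (time_before_present, left_cluster, right_cluster)."""
    clusters = [(i,) for i in range(n)]  # every data point starts alone
    t = 0.0
    merges = []
    for i in range(1, n):
        k = n - i + 1                     # clusters remaining before this merge
        t += rng.expovariate(comb(k, 2))  # fastest of comb(k, 2) racing pairs
        a, b = rng.sample(range(k), 2)    # pair chosen uniformly at random
        left, right = clusters[a], clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(left + right)
        merges.append((t, left, right))
    return merges

for time, left, right in sample_coalescent(5, random.Random(0)):
    print(f"{time:.3f}: merge {left} and {right}")
```

Early merges happen quickly because many pairs are racing; the last merge, with only one pair left, has expected waiting time 1, and the expected total height of the tree is 2(1 - 1/n).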
38
Generating data given a tree
• Generate data forward in time via a Markov
process (e.g. Brownian motion)
– Each split point initiates an independent Markov
process for each branch
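Generating data forward in time can be sketched with Brownian motion on a fixed tree. The tree, its branch lengths, and the function name below are hypothetical: the root starts at the origin, and each child's value is its parent's value plus independent Gaussian noise whose variance is proportional to the branch length.

```python
import random

def brownian_on_tree(children, branch_length, root, dim, rng, sigma=1.0):
    """Sample a dim-dimensional value at every node of the tree."""
    values = {root: [0.0] * dim}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            # Brownian motion over time t has standard deviation sigma * sqrt(t).
            sd = sigma * branch_length[child] ** 0.5
            values[child] = [x + rng.gauss(0.0, sd) for x in values[node]]
            stack.append(child)
    return values

# Hypothetical tree: root splits into u and c; u splits into a and b.
children = {"root": ["u", "c"], "u": ["a", "b"]}
branch_length = {"u": 0.5, "a": 0.5, "b": 0.5, "c": 1.0}

values = brownian_on_tree(children, branch_length, "root", 2, random.Random(0))
for leaf in ("a", "b", "c"):
    print(leaf, [round(x, 3) for x in values[leaf]])
```

Leaves a and b share the path from the root to u, so their values are correlated, while c evolves independently from the root onward.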
40
Generating 2D data
41
Inference: particle filter
• Simulate the tree bottom-up via a proposal distribution

– The prior, or the local posterior
• Importance weights correct for the proposal
• Resampling prunes particles stuck in low-probability
regions
• Alternatively, greedy algorithms are also possible
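The weighting and resampling steps are generic to particle filters and can be sketched independently of the tree proposal. The code below is not the specific algorithm of Teh et al.; it assumes each particle carries an unnormalized log importance weight and uses plain multinomial resampling. All function names are illustrative.

```python
import math
import random

def normalize_log_weights(log_weights):
    """Turn unnormalized log weights into probabilities, stably."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    total = sum(w)
    return [x / total for x in w]

def effective_sample_size(weights):
    """Close to len(weights) when weights are even, close to 1 when one dominates."""
    return 1.0 / sum(w * w for w in weights)

def resample(particles, weights, rng):
    """Multinomial resampling: low-weight particles tend to be dropped."""
    return rng.choices(particles, weights=weights, k=len(particles))

rng = random.Random(0)
particles = ["tree_a", "tree_b", "tree_c", "tree_d"]  # stand-ins for partial trees
weights = normalize_log_weights([-1.0, -1.2, -9.0, -15.0])
print(effective_sample_size(weights))
print(resample(particles, weights, rng))
```

A common design choice is to resample only when the effective sample size drops below some fraction of the particle count, since resampling itself adds variance.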
42
Think-pair-share
• Can you come up with a modified version of the
profile HMM to take into account a given
phylogenetic tree over the input sequences?
43
