
CSE291D Lecture 19

Models for computational biology

1
Computational biology
vs bioinformatics
• According to the NIH:

– Computational biology:
The development and application of data-analytical and theoretical
methods, mathematical modeling and computational simulation
techniques to the study of biological, behavioral, and social systems.

Biologists do computer science?

– Bioinformatics:
Research, development, or application of computational tools and
approaches for expanding the use of biological, medical, behavioral or
health data, including those to acquire, store, organize, archive, analyze,
or visualize such data.
Computer scientists do biology?
2
Molecular biology:
A computer science perspective
• Biopolymers:
– Long molecules with repeating structure which
contain genetic information

– Poly = many. Mer = part. Each part is a monomer

– Can be represented as strings

• Include DNA, RNA, proteins

3
The Central Dogma
of Molecular Biology (Crick, 1956)

• DNA makes RNA

• RNA makes protein

4
Image due to Genome Research Limited, but retrieved from http://www.yourgenome.org/facts/what-is-the-central-dogma
The Central Dogma
of Molecular Biology (Crick, 1956)

• Once you go
protein, you
can’t go back!

5
Genetic information as strings
• DNA can be represented as strings over
{A,G,C,T} (nucleotides)

• RNA can be represented as strings over
{A,G,C,U}
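
Treating sequences as strings makes transcription a simple substitution. A minimal sketch, assuming the DNA string is the coding strand, so that the RNA is the same sequence with T replaced by U; the name `transcribe` is illustrative and not from the lecture.

```python
DNA_ALPHABET = set("AGCT")

def transcribe(dna: str) -> str:
    """Return the RNA string for a DNA coding-strand string (T -> U)."""
    dna = dna.upper()
    unknown = set(dna) - DNA_ALPHABET
    if unknown:
        raise ValueError(f"not a DNA string, found: {sorted(unknown)}")
    return dna.replace("T", "U")

print(transcribe("ATGGCCTGA"))  # AUGGCCUGA
```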

7
Genetic information as strings
• Proteins are chains of amino acids.
There are 20 common amino acids, encoded
by the genetic code

• The sequence determines the 3D shape that
the molecule folds into, and thereby its function.
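
The genetic code can itself be written as a lookup table from 3-letter codons to amino acids. The sketch below assumes the standard genetic code and reads a DNA coding strand in frame from its first letter; `translate` is an illustrative name, and real genes need start-codon and reading-frame handling that is omitted here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna: str) -> str:
    """Translate codon by codon until a stop codon or the end of the string."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCCTGA"))  # MA
```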
8
Genes
• DNA is organized into long strands called
chromosomes. Humans have 46
chromosomes (23 pairs)

• Each chromosome includes many genes:
sequences that encode proteins or functional
RNA
• These determine our genetic traits,
e.g. eye color
9
Learning outcomes
By the end of the lesson, you should be able to:

• Apply profile HMMs for multiple sequence alignment

• Model phylogenetic trees and perform hierarchical
clustering with Kingman’s coalescent

• Contrast the properties of Kingman’s coalescent with
other simple phylogenetic models

10
11
12
13
Sequence alignment
• Evolutionary processes perturb sequences via
deletions, insertions, substitutions

14
T. Warnow - An Introduction to Computational Phylogenetics
Multiple sequence alignment

• Given: A set of biological sequences
– protein, DNA, or RNA
– assumed to be evolutionarily related

• Output: A matrix where the sequences are aligned, so that
each column contains residues (i.e. letters) that
– came from the same ancestral residue, or
– have similar 3D structural positions
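
An alignment matrix can be held as a list of equal-length strings, one row per sequence. The sketch below uses a hypothetical toy alignment with `-` as the gap symbol, extracts the columns, and takes a majority vote per column as a crude consensus; the helper names are illustrative.

```python
from collections import Counter

alignment = [  # hypothetical toy alignment, '-' is a gap
    "ACG-T",
    "A-GCT",
    "ACG-T",
]

def columns(rows):
    """Return the alignment columns as strings."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have equal length")
    return ["".join(col) for col in zip(*rows)]

def consensus(rows):
    """Most common symbol in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in columns(rows))

print(columns(alignment))    # ['AAA', 'C-C', 'GGG', '-C-', 'TTT']
print(consensus(alignment))  # ACG-T
```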

15
Multiple sequence alignment

• x: consensus sequence (aligned columns)


• -: gaps/deletions

16
Durbin et al. - Biological sequence analysis: probabilistic models of proteins and nucleic acids
Multiple sequence alignment

17
Profile HMMs
(Haussler et al., 1993; Krogh et al., 1994).
• Hidden Markov models for multiple sequence
alignment
– The model encodes a generative process for the
sequences in the sequence family
• Aligned residues
• Perturbations caused by evolution

– Latent HMM states encode where perturbations
happen for a given sequence
– Parameters encode probabilities over residues, and
transition probabilities between states

18
Profile HMMs
• Simple case: only substitutions occur
– No insertions or deletions

• A match state for each column in the alignment

• Discrete distribution over residues,
for each match state (i.e. position in the alignment)
• Denote with a square
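
In the substitution-only case the model is one discrete distribution per column, so the probability of a sequence is a product of per-position emission probabilities. A minimal sketch, assuming an ungapped alignment with rows of equal length, a DNA alphabet, and add-one pseudocounts (a smoothing choice made here, not stated on the slide); all names are illustrative.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def match_emissions(rows, pseudocount=1.0):
    """One discrete distribution over ALPHABET per alignment column."""
    dists = []
    for col in zip(*rows):
        counts = Counter(col)
        total = len(col) + pseudocount * len(ALPHABET)
        dists.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return dists

def log_prob(seq, dists):
    """Log-probability of an ungapped sequence: sum of per-position log emissions."""
    if len(seq) != len(dists):
        raise ValueError("sequence length must equal the number of match states")
    return sum(math.log(d[x]) for x, d in zip(seq, dists))

family = ["ACGT", "ACGA", "ACCT"]  # hypothetical ungapped alignment
model = match_emissions(family)
print(log_prob("ACGT", model) > log_prob("TTTT", model))  # True
```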

19
Dealing with gaps (insertions)
• Insert states draw residues from a background
distribution (denoted with diamonds)
• They do not advance the position in the alignment
• Self-loop transitions allow multiple insertions

20
Dealing with deletions
• Delete states are “silent” and do not emit symbols

21
The full profile HMM model
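
The connectivity of the full model can be enumerated programmatically. The sketch below assumes the begin state is written `M0` and the end state `M{K+1}` for K match states, and it allows every transition between match, insert, and delete states, including insert-to-delete and delete-to-insert; some implementations omit those two, so treat that choice as an assumption.

```python
def profile_hmm_transitions(num_match: int):
    """Allowed (source, target) pairs; 'M0' is begin, f'M{num_match + 1}' is end."""
    edges = []
    for k in range(num_match + 1):
        # There is no delete state at position 0 and none after the last match state.
        sources = [f"M{k}", f"I{k}"] + ([f"D{k}"] if k >= 1 else [])
        targets = [f"M{k + 1}", f"I{k}"] + ([f"D{k + 1}"] if k < num_match else [])
        edges += [(s, t) for s in sources for t in targets]
    return edges

edges = profile_hmm_transitions(3)
print(len(edges))            # 30
print(("I1", "I1") in edges)  # True: insert states have a self-loop
```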

22
Example profile HMM
trained from an alignment

Shaded residues treated as inserts
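
The slide does not say how insert columns are chosen. One common heuristic, described by Durbin et al., treats a column as an insert when more than half of its entries are gaps. A minimal sketch under that assumption, on a hypothetical alignment:

```python
def assign_columns(rows, max_gap_fraction=0.5):
    """Label each alignment column 'M' (match) or 'I' (insert)."""
    labels = []
    for col in zip(*rows):
        gap_fraction = col.count("-") / len(col)
        labels.append("I" if gap_fraction > max_gap_fraction else "M")
    return "".join(labels)

alignment = ["AC--GT", "A---GT", "AC--G-", "ACTAGT"]  # hypothetical
print(assign_columns(alignment))  # MMIIMM
```

The number of `M` labels then gives the number of match states, and residues in `I` columns are counted as insert emissions.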


23
Performing multiple sequence
alignment using profile HMMs
• Set number of match states, e.g. to average
length of training sequences

• Given unaligned sequences, learn the model
parameters using EM

• Align all sequences to the model using Viterbi
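
The Viterbi step can be sketched as a dynamic program over (model position, sequence position). This is a simplified illustration, not the lecture's implementation: it assumes transition probabilities that depend only on the state types (M, I, D) rather than on position, strictly positive probabilities, and no explicit end-state transition. It covers only the alignment step, not EM training.

```python
import math

def viterbi_profile(seq, match_emit, insert_emit, trans):
    """Return (log-prob, state path) of the best path of seq through the model.

    match_emit: list of K dicts symbol -> prob; insert_emit: dict symbol -> prob;
    trans: dict (from_type, to_type) -> prob over the types 'M', 'I', 'D'.
    """
    K, L = len(match_emit), len(seq)
    log_t = {key: math.log(p) for key, p in trans.items()}
    # V[s][j][i]: best log-prob of emitting seq[:i] and ending in state s at position j
    V = {s: [[-math.inf] * (L + 1) for _ in range(K + 1)] for s in "MID"}
    V["M"][0][0] = 0.0  # the match state at position 0 plays the role of begin
    back = {}
    for j in range(K + 1):
        for i in range(L + 1):
            moves = {}
            if j > 0 and i > 0:  # match: emits a symbol and advances the position
                moves["M"] = (j - 1, i - 1, math.log(match_emit[j - 1][seq[i - 1]]))
            if i > 0:            # insert: emits a symbol, position unchanged
                moves["I"] = (j, i - 1, math.log(insert_emit[seq[i - 1]]))
            if j > 0:            # delete: silent, advances the position
                moves["D"] = (j - 1, i, 0.0)
            for s, (pj, pi, log_emit) in moves.items():
                prev = max("MID", key=lambda p: V[p][pj][pi] + log_t[(p, s)])
                V[s][j][i] = V[prev][pj][pi] + log_t[(prev, s)] + log_emit
                back[(s, j, i)] = (prev, pj, pi)
    s = max("MID", key=lambda p: V[p][K][L])
    score, j, i, path = V[s][K][L], K, L, []
    while (s, j, i) in back:  # walk back to the begin state
        path.append(f"{s}{j}")
        s, j, i = back[(s, j, i)]
    return score, path[::-1]

# Hypothetical 3-column model whose match states favor A, C, G respectively.
match_emit = [{a: (0.97 if a == c else 0.01) for a in "ACGT"} for c in "ACG"]
insert_emit = {a: 0.25 for a in "ACGT"}
trans = {(p, s): (0.8 if s == "M" else 0.1) for p in "MID" for s in "MID"}

print(viterbi_profile("ACG", match_emit, insert_emit, trans)[1])  # ['M1', 'M2', 'M3']
print(viterbi_profile("AG", match_emit, insert_emit, trans)[1])   # ['M1', 'D2', 'M3']
```

Sequences whose paths visit the same match state are placed in the same alignment column; delete states become gaps.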

24
Phylogenetics
• Study evolutionary history and relationships
between organisms, species, populations…
• Finding the “tree of life”
– A phylogenetic tree,
a.k.a. a phylogeny

25
26
Molecular phylogeny
• Input: aligned DNA sequences

• Output: phylogenetic tree

[Figure: phylogenetic tree with leaves bat, rat, cat, goat, gnat]
27


Cavender-Farris-Neyman (CFN) model
• A model for a particular trait, which is either present or
absent (e.g. has fins)
• A tree, and probabilities of flipping the value at each edge

[Figure: tree over bat, rat, cat, goat, with edges labeled by flip probabilities 0.1, 0.01, 0.05, 0.32, 0.27, 0.12]
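
The generative process of the CFN model can be sketched as a recursive walk down the tree. The tree below is hypothetical: its topology and the assignment of probabilities to edges are chosen for illustration and are not read off the figure. A leaf is a name, and an internal node is a list of `(flip_probability, subtree)` pairs; the root state is assumed uniform.

```python
import random

# Hypothetical tree; the edge-probability assignment is illustrative only.
tree = [
    (0.1, [(0.32, "bat"), (0.27, "rat")]),
    (0.01, [(0.12, "cat"), (0.05, "goat")]),
]

def simulate_cfn(node, state, rng):
    """Return {leaf_name: 0 or 1}, flipping the trait independently on each edge."""
    if isinstance(node, str):
        return {node: state}
    leaves = {}
    for flip_prob, child in node:
        child_state = 1 - state if rng.random() < flip_prob else state
        leaves.update(simulate_cfn(child, child_state, rng))
    return leaves

rng = random.Random(0)
root_state = rng.randint(0, 1)  # root state drawn uniformly (an assumption)
print(simulate_cfn(tree, root_state, rng))
```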

28
Branch length parameterization
• The number of changes on edge e depends
on the “length” of the branch
• If evolution is “clock-like,” lengths are proportional
to time.
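
One common way to connect branch lengths to flip probabilities, assumed here rather than taken from the slide, is to let the number of changes on an edge be Poisson with mean equal to the branch length. A binary trait ends up flipped exactly when that number is odd, which has probability (1 - exp(-2 * length)) / 2.

```python
import math

def cfn_flip_probability(branch_length: float) -> float:
    """P(odd number of changes) when the number of changes is Poisson(branch_length)."""
    if branch_length < 0:
        raise ValueError("branch length must be non-negative")
    return 0.5 * (1.0 - math.exp(-2.0 * branch_length))

for length in (0.0, 0.1, 1.0, 10.0):
    print(length, cfn_flip_probability(length))
```

Under this parameterization the flip probability is 0 for a zero-length edge and approaches 1/2 from below as the edge grows, so very long branches carry almost no information about the ancestral state.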

[Figure: tree with leaves bat, rat, cat, goat]

29
Jukes-Cantor model
• Extend CFN model to four states, i.e. a single
nucleotide
• On each edge, the nucleotide changes with
probability p. If it changes, the new nucleotide is
picked uniformly from the other 3
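
The per-edge step described above can be written directly. A minimal sketch for a single site on a single edge; `jc_edge` is an illustrative name.

```python
import random

NUCLEOTIDES = "ACGT"

def jc_edge(nucleotide, p, rng):
    """Evolve one nucleotide along one edge with change probability p."""
    if rng.random() < p:
        # On a change, each of the other three nucleotides is equally likely.
        return rng.choice([n for n in NUCLEOTIDES if n != nucleotide])
    return nucleotide

rng = random.Random(0)
print([jc_edge("A", 0.3, rng) for _ in range(10)])
```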

[Figure: phylogenetic tree with leaves bat, rat, cat, goat, gnat]
30


Kingman’s coalescent
• The models so far do not have a prior on the
tree. They can be estimated via maximum
likelihood

• Kingman’s coalescent is a prior on trees
– Teh et al. (2008) present a Bayesian analysis,
and use the model for hierarchical clustering

31
Kingman’s coalescent
• Represent the tree by partitions of the data
(i.e. clusters) at each timepoint

• Tree formation runs backwards in time

• Recursively merge clusters until all are merged
(coalesced)
32
Kingman’s coalescent
• Start with all n data points in their own
clusters, at time 0

• Each pair of clusters merges with exponential
rate 1

33
Racing exponentials

• Memorylessness:
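
The two standard properties of exponential random variables that this slide names can be stated as follows. The symbols T_i and \lambda_i are notation chosen here, not taken from the slide.

```latex
% Racing exponentials: for independent T_i \sim \mathrm{Exp}(\lambda_i),
\min_i T_i \sim \mathrm{Exp}\Big(\sum_i \lambda_i\Big),
\qquad
P\big(\arg\min_i T_i = k\big) = \frac{\lambda_k}{\sum_i \lambda_i}

% Memorylessness: for T \sim \mathrm{Exp}(\lambda) and s, t \ge 0,
P(T > s + t \mid T > s) = P(T > t) = e^{-\lambda t}
```

With k clusters and one unit-rate exponential per pair, the first merge therefore happens after an Exp(k(k-1)/2) waiting time, every pair is equally likely to be the one that merges, and by memorylessness the race can simply be restarted after each merge.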

37
Equivalent implementation of
Kingman’s coalescent

• For i = 1:n-1

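– Time to next merge is Exp(C(n-i+1, 2)), where C(k, 2) = k(k-1)/2 is the number of pairs among the k = n-i+1 remaining clusters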

– Select two clusters to merge uniformly at random
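
The loop above can be sketched as a short simulation. It assumes the waiting time with k remaining clusters is exponential with rate k(k-1)/2, one unit-rate exponential per pair, and it records elapsed time going backwards from time 0 as a positive number; `sample_coalescent` is an illustrative name.

```python
import random

def sample_coalescent(n, rng):
    """Return a list of (time, merged_cluster) events, going backwards from time 0."""
    clusters = [frozenset([i]) for i in range(n)]
    t, events = 0.0, []
    while len(clusters) > 1:
        k = len(clusters)
        rate = k * (k - 1) / 2          # one unit-rate exponential per pair
        t += rng.expovariate(rate)      # waiting time until the first pair merges
        a, b = rng.sample(range(k), 2)  # the merging pair is uniform over pairs
        merged = clusters[a] | clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
        events.append((t, merged))
    return events

for time, cluster in sample_coalescent(5, random.Random(0)):
    print(round(time, 3), sorted(cluster))
```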

38
39
Generating data given a tree
• Generate data forward in time via a Markov
process (e.g. Brownian motion)
– Each split point initiates an independent Markov
process for each branch
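
For Brownian motion, the value at a child node is the parent's value plus Gaussian noise whose variance equals the branch length. A minimal 1D sketch on a hypothetical tree, where a leaf is a name and an internal node is a list of `(branch_length, subtree)` pairs; the representation and names are assumptions for illustration.

```python
import random

def diffuse(node, value, rng):
    """Propagate a 1D Brownian motion from the root value down to the leaves."""
    if isinstance(node, str):
        return {node: value}
    leaves = {}
    for branch_length, child in node:
        # Increment over a branch: Gaussian with variance equal to the branch length.
        child_value = value + rng.gauss(0.0, branch_length ** 0.5)
        leaves.update(diffuse(child, child_value, rng))
    return leaves

tree = [(1.0, [(0.2, "a"), (0.2, "b")]), (1.2, "c")]  # hypothetical tree
print(diffuse(tree, 0.0, random.Random(1)))
```

Leaves that split recently share most of their path from the root, so their values tend to be close; this is what makes the tree act as a hierarchical clustering of the data.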

40
Generating 2D data

41
Inference: particle filter

• Simulate the tree bottom-up via a proposal distribution
– The prior, or the local posterior

• Importance weights correct for the proposal

• Resampling to prune particles stuck in low probability
regions

• Alternatively, greedy algorithms are also possible
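
The weighting and resampling steps are generic to particle filters and can be sketched independently of the tree model. The code below is not the specific algorithm of Teh et al.; it only shows normalizing importance weights kept in log space and multinomial resampling, with particles represented as arbitrary Python objects.

```python
import math
import random

def normalize_log_weights(log_weights):
    """Turn log importance weights into probabilities, subtracting the max for stability."""
    m = max(log_weights)
    unnorm = [math.exp(lw - m) for lw in log_weights]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def resample(particles, log_weights, rng):
    """Multinomial resampling: draw len(particles) particles in proportion to weight."""
    weights = normalize_log_weights(log_weights)
    return rng.choices(particles, weights=weights, k=len(particles))

particles = ["tree_a", "tree_b", "tree_c"]  # hypothetical partial trees
log_w = [-50.0, -1.0, -2.0]
print(normalize_log_weights(log_w))
print(resample(particles, log_w, random.Random(0)))
```

After resampling, the weights are reset to uniform, and particles with negligible weight (here `tree_a`) are very unlikely to survive.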

42
Think-pair-share

• Can you come up with a modified version of the
profile HMM to take into account a given
phylogenetic tree over the input sequences?

43
