Sequence Alignment
Colin Dewey
BMI/CS 576
www.biostat.wisc.edu/bmi576/
colin.dewey@wisc.edu
Fall 2016
Key concepts
What is multiple sequence alignment?
An example multiple sequence alignment
Why multiple sequence alignment?
The tasks in Multiple Sequence Alignment
Scoring an alignment
Algorithms for creating an alignment
Some notation
Example using notation
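[Figure: an example alignment annotated with the notation; two of its rows are]
G A R F I E L D T H E F A T C A T
G A R F I E L D T H E - - - C A T
$m_i$ denotes column $i$ of the alignment $m$ (the figure marks $m_2$, $m_3$, and $m_{10}$)
$c_{ia}$ denotes the number of occurrences of character $a$ in column $i$; in the figure, $c_{2A} = 4$, $c_{3A} = 0$, and $c_{10H} = 3$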
Scoring a Multiple Sequence Alignment (MSA)
Key issue: how do we score a multiple sequence alignment?
Usually, we assume that columns of an alignment are independent:
\[ \mathrm{Score}(m) = G(m) + \sum_i S(m_i) \]
where $G$ is the gap function and $S(m_i)$ is the score of the $i$th column.
Without a separate gap term:
\[ \mathrm{Score}(m) = \sum_i S(m_i) \]
Gap penalty (G)
Two common ways of scoring a multiple alignment
Entropy based scores
Sum of pairs
Entropy of a distribution
Score of a column: Entropy based
Score of the $i$th column of alignment $m$ is
\[ S(m_i) = -\sum_a c_{ia} \log p_{ia} \]
$p_{ia}$: probability of character $a$ in column $i$
$c_{ia}$: number of occurrences of $a$ in column $i$
This has an entropy-based interpretation
Let $X_i$ be a random variable representing a character in column $i$
Consider each entry of column $i$ to be an observation of $X_i$ across multiple independent experiments
We estimate $P(X_i = a)$ by $p_{ia} = c_{ia} / n$, where $n$ is the number of entries (rows) in the column
Column score is proportional to the entropy of $X_i$: with this estimate, $S(m_i) = n \, H(X_i)$
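The entropy-based column score can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: it assumes base-2 logarithms, treats the gap character `-` as an ordinary character when counting, and uses the hypothetical function names `entropy_column_score` and `entropy_alignment_score`.

```python
import math
from collections import Counter


def entropy_column_score(column):
    """Return -sum_a c_ia * log2(p_ia) for one column, with p_ia = c_ia / n.

    Assumption: '-' is counted like any other character.
    """
    n = len(column)
    counts = Counter(column)  # c_ia for every character a seen in the column
    return -sum(c * math.log2(c / n) for c in counts.values())


def entropy_alignment_score(rows):
    """Sum the column scores, assuming columns are independent."""
    # zip(*rows) walks the alignment column by column.
    return sum(entropy_column_score(col) for col in zip(*rows))


if __name__ == "__main__":
    # A fully conserved column scores 0; more mixed columns score higher.
    for col in ("AAAA", "AACC", "ACGT"):
        print(col, entropy_column_score(col))
```

Under this definition a lower score indicates a more conserved column, so a search procedure using it would minimize the total.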
Scoring an alignment: Entropy based score
Scoring of a column: Sum of Pairs
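\[ S(m_i) = \sum_{k < l} s(m_i^k, m_i^l) \]
where $m_i^k$ is the character in row $k$ of column $i$ and $s$ is a pairwise score
Iterate over all pairs of rows in the column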
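A sum-of-pairs column score can be sketched as follows. The pairwise scores used here (match 2, mismatch -1, gap -2, and 0 for a gap paired with a gap) are illustrative assumptions, as are the function names `pair_score` and `sum_of_pairs`; a real scorer would typically use a substitution matrix.

```python
from itertools import combinations


def pair_score(a, b, match=2, mismatch=-1, gap=-2):
    """Score one pair of characters (illustrative values, not from the slides)."""
    if a == "-" and b == "-":
        return 0  # assumption: two aligned gaps cost nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(column, score=pair_score):
    """Sum the pairwise score over all pairs of rows k < l in the column."""
    return sum(score(a, b) for a, b in combinations(column, 2))


if __name__ == "__main__":
    for col in ("AAA", "AAC", "A-A"):
        print(col, sum_of_pairs(col))
```

`itertools.combinations(column, 2)` yields each unordered pair of rows exactly once, which matches the `k < l` index restriction.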
Algorithms for performing a Multiple Sequence Alignment
Dynamic programming
  Not practical
Progressive alignment algorithms
  Star alignment
  Guide tree approach
Iterative alignment algorithms
Dynamic Programming (DP) for global multiple sequence alignment
Notation for DP
$F(i_1, i_2, \ldots, i_k)$
denotes the score of the best alignment of the prefixes of lengths $i_1, i_2, \ldots, i_k$ of the $k$ sequences
Recall the DP for the pairwise alignment
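\[
F(i_1, i_2) = \max
\begin{cases}
F(i_1 - 1, i_2 - 1) + S(x^1_{i_1}, x^2_{i_2}) \\
F(i_1 - 1, i_2) + S(x^1_{i_1}, -) \\
F(i_1, i_2 - 1) + S(-, x^2_{i_2})
\end{cases}
\]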
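As a reminder of how the pairwise recurrence is filled in, here is a minimal Python sketch of global pairwise alignment scoring. It assumes a linear gap penalty and illustrative scores (match 2, mismatch -1, gap -2); the function name `pairwise_score` is hypothetical.

```python
def pairwise_score(x, y, match=2, mismatch=-1, gap=-2):
    """Score of the best global alignment of x and y (score only, no traceback)."""
    rows, cols = len(x) + 1, len(y) + 1
    # F[i][j] = best score for aligning the first i characters of x
    # with the first j characters of y.
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap  # prefix of x aligned against gaps only
    for j in range(1, cols):
        F[0][j] = j * gap  # prefix of y aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # align x_i with y_j
                F[i - 1][j] + gap,    # align x_i with a gap
                F[i][j - 1] + gap,    # align y_j with a gap
            )
    return F[-1][-1]


if __name__ == "__main__":
    print(pairwise_score("ACGT", "AGT"))
```

The table has (len(x) + 1) * (len(y) + 1) cells and each cell looks at three predecessors, which is the baseline that the multiple-sequence version generalizes.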
DP for Multiple sequence alignment
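\[
F(i_1, \ldots, i_k) = \max
\begin{cases}
F(i_1 - 1, \ldots, i_k - 1) + S(x^1_{i_1}, \ldots, x^k_{i_k}) \\
\vdots \\
F(i_1, i_2 - 1, \ldots, i_k) + S(-, x^2_{i_2}, \ldots, -) \\
\vdots
\end{cases}
\]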
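The k-sequence recurrence can be sketched directly with memoized recursion. This is an illustration under stated assumptions, not the lecture's implementation: columns are scored with sum of pairs, the pair scores (match 2, mismatch -1, gap -2, gap against gap 0) are illustrative, and `msa_score` is a hypothetical name. Each cell considers every non-empty subset of sequences that could contribute a character to the last column, so there are 2^k - 1 cases per cell.

```python
from functools import lru_cache
from itertools import combinations, product


def msa_score(seqs, match=2, mismatch=-1, gap=-2):
    """Best sum-of-pairs score over all global alignments of seqs (score only)."""
    k = len(seqs)
    # A move says, per sequence, whether it contributes a character (1) or a gap (0).
    moves = [m for m in product((0, 1), repeat=k) if any(m)]

    def column_score(col):
        total = 0
        for a, b in combinations(col, 2):
            if a == "-" and b == "-":
                continue  # assumption: gap aligned with gap scores 0
            if a == "-" or b == "-":
                total += gap
            else:
                total += match if a == b else mismatch
        return total

    @lru_cache(maxsize=None)
    def F(idx):
        if not any(idx):
            return 0  # all prefixes are empty
        best = float("-inf")
        for move in moves:
            if any(step > i for step, i in zip(move, idx)):
                continue  # cannot consume a character from an empty prefix
            prev = tuple(i - step for i, step in zip(idx, move))
            col = tuple(
                seqs[s][idx[s] - 1] if move[s] else "-" for s in range(k)
            )
            best = max(best, F(prev) + column_score(col))
        return best

    return F(tuple(len(s) for s in seqs))


if __name__ == "__main__":
    print(msa_score(["AC", "AC", "AC"]))
```

With k sequences of length n the table has on the order of n^k cells, which is why this approach is only usable for a handful of short sequences.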
DP algorithm is too expensive
Heuristic algorithms for multiple sequence alignment
Progressive alignment
  Build the alignment of a larger number of sequences from partial alignments of subsets of sequences
Iterative alignment
  Possibly remove some of the aligned sequences and re-align to see if the score improves
Progressive alignment
Picking the center in star alignments
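The center $x_c$ is chosen using a sum over all other sequences, $\sum_{i \neq c}$, of a pairwise measure between $x_i$ and $x_c$ (for example, the pairwise alignment score)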
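One common way to pick the center, assumed here rather than taken from the slide, is to choose the sequence whose total pairwise global alignment score against all other sequences is largest. The sketch below uses illustrative scores (match 2, mismatch -1, gap -2) and the hypothetical names `nw_score` and `pick_center`.

```python
def nw_score(x, y, match=2, mismatch=-1, gap=-2):
    """Global pairwise alignment score, keeping only one DP row at a time."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def pick_center(seqs):
    """Return the index c maximizing the sum over i != c of nw_score(seqs[i], seqs[c])."""
    totals = [
        sum(nw_score(x, center) for i, x in enumerate(seqs) if i != c)
        for c, center in enumerate(seqs)
    ]
    # Ties are broken in favor of the earliest sequence.
    return max(range(len(seqs)), key=totals.__getitem__)


if __name__ == "__main__":
    seqs = ["AAAA", "AAAT", "AATT"]
    print(seqs[pick_center(seqs)])
```

If the pairwise measure is a distance rather than a similarity, the same loop applies with `min` in place of `max`.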
Aligning to an existing partial alignment
New sequence:
TGTTAAC
Existing partial alignment:
-TGTAAC
-TGT-AC
ATGT--C
ATGTGGC
Result (a gap column is added to the existing alignment):
-TGTTAAC
-TGT-AAC
-TGT--AC
ATGT---C
ATGT-GGC
Star Alignment Example
Given the sequences:
ATTGCCATT (center)
ATGGCCATT
ATCCAATTTT
ATCTTCTT
ATTGCCGATT
and the pairwise alignment of each sequence to the center:
ATGGCCATT
ATTGCCATT

ATC-CAATTTT
ATTGCCATT--

ATCTTC-TT
ATTGCCATT

ATTGCCGATT
ATTGCC-ATT
Star Alignment Example
1. Pairwise alignment:
ATGGCCATT
ATTGCCATT
Multiple alignment so far:
ATTGCCATT
ATGGCCATT
2. Pairwise alignment:
ATC-CAATTTT
ATTGCCATT--
Multiple alignment so far:
ATTGCCATT--
ATGGCCATT--
ATC-CAATTTT
Star Alignment Example
3. Pairwise alignment:
ATCTTC-TT
ATTGCCATT
Multiple alignment so far:
ATTGCCATT--
ATGGCCATT--
ATC-CAATTTT
ATCTTC-TT--
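The merging step in this example can be sketched as a small function that folds one pairwise alignment at a time into the growing multiple alignment, keeping every gap already placed. This is a minimal sketch under assumptions: the center is stored as the first row, gaps are written as `-`, and the name `merge_pairwise` is hypothetical.

```python
def merge_pairwise(msa, center_aln, other_aln):
    """Add other_aln to msa, given its pairwise alignment to the center.

    msa[0] is the (gapped) center row; center_aln and other_aln are the two
    rows of the pairwise alignment of the center with the new sequence.
    """
    if center_aln.replace("-", "") != msa[0].replace("-", ""):
        raise ValueError("pairwise alignment does not use the same center")
    out = [[] for _ in msa]
    new_row = []
    i = j = 0  # i walks the msa columns, j walks the pairwise columns
    width = len(msa[0])
    while i < width or j < len(center_aln):
        a = msa[0][i] if i < width else None
        b = center_aln[j] if j < len(center_aln) else None
        if a == "-" and b != "-":
            # Gap column that exists only in the msa: new sequence gets a gap.
            for r, row in enumerate(msa):
                out[r].append(row[i])
            new_row.append("-")
            i += 1
        elif b == "-" and a != "-":
            # Gap in the center that is new: open a gap column in every old row.
            for r in range(len(msa)):
                out[r].append("-")
            new_row.append(other_aln[j])
            j += 1
        else:
            # Same center character (or a gap in both): columns line up.
            for r, row in enumerate(msa):
                out[r].append(row[i])
            new_row.append(other_aln[j])
            i += 1
            j += 1
    return ["".join(r) for r in out] + ["".join(new_row)]


if __name__ == "__main__":
    msa = ["ATTGCCATT"]
    for center_aln, other_aln in [
        ("ATTGCCATT", "ATGGCCATT"),
        ("ATTGCCATT--", "ATC-CAATTTT"),
        ("ATTGCCATT", "ATCTTC-TT"),
    ]:
        msa = merge_pairwise(msa, center_aln, other_aln)
    print("\n".join(msa))
```

Run on the first three pairwise alignments of the example, this reproduces the four-row alignment of step 3. Merging gaps that occur at the same position in both center rows into a single column is a design choice of this sketch; other implementations may keep them as separate columns.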
Comments about star alignment
Conceptually simple
Dependent only upon pairwise alignments
Does not consider any position-specific information of the partial multiple sequence alignment while aligning a new sequence to it
Tree-based progressive alignments
Starting sequences
x1 TGTTAAC
x2 TGTAAC
x3 TGTAC
x4 ATGTC
x5 ATGTGGC
Create a guide tree
  Using pairwise distances (we will cover this in subsequent lectures)
  Approach similar to, but simpler than, building phylogenetic trees
[Figure: guide tree with leaves x1, x2, x3, x4, x5]
Tree Alignment Example
Align x2 and x3:
TGTAAC
TGT-AC
Align x4 and x5:
ATGT--C
ATGTGGC
Align the two partial alignments:
-TGTAAC
-TGT-AC
ATGT--C
ATGTGGC
Comments about tree-based progressive alignment
Exploits partial alignment information
But greedy
  The tree might not be correct; that is, it may reflect an incorrect ordering of how sequences should be stacked up in the alignment
Final results are prone to alignment errors
  Some positions might be misaligned (that is, have a lower score than if a different ordering were used)
Ordering matters
1 is better than 2, assuming a match score of 2, a mismatch score of 1, and a gap penalty of -2
Iterative refinement methods
Additional notes about the ClustalW algorithm
Applying ClustalW to SH3 domain proteins
[Figure: ClustalW alignment of SH3 domain protein sequences, annotated with secondary structure (strands). Source: Nucleic Acids Research, 1994, Vol. 22, No. 22, p. 4679]