
Multiple Sequence Alignment
Colin Dewey
BMI/CS 576
www.biostat.wisc.edu/bmi576/
colin.dewey@wisc.edu
Fall 2016
Key concepts

The Multiple Sequence Alignment Problem


Scoring Multiple Sequence Alignments
Scoring an alignment of a profile and a sequence
Heuristic Algorithms for Multiple Sequence Alignment
General strategies
Progressive alignment
Star alignment
Tree-based alignment
Iterative alignment

What is multiple sequence alignment?

Given: three or more related biological sequences
Do: identify the subsets of positions across sequences that are truly related
In other words: find a simultaneous alignment of all input sequences such that the implied pairwise alignments identify the truly related positions between each pair of sequences

An example multiple sequence alignment

Why multiple sequence alignment?

Build phylogenetic trees (next module)
Determine evolutionary relationships between sequences
A multiple sequence alignment can represent a family of proteins with similar function
Compare new sequence to a family of known proteins
For example, the BLOCKS database used for BLOSUM contains several ungapped alignments for known protein families
Discover common signatures or protein domains among a group of proteins
Identify genetic variation among individuals of a population

The tasks in Multiple Sequence Alignment

Scoring an alignment
Algorithms for creating an alignment

Some notation

Let $m$ denote a multiple sequence alignment
$m_i$ is the $i$th column of the alignment $m$
$m_i^j$ is the entry in the $i$th column and $j$th row
$c_i^a$ is the count of residue $a$ in column $i$

Example using notation

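Columns are indexed by i (left to right) and rows by j (top to bottom)

j = 1   G A R F I E L D T H E F A T C A T
j = 2   G A R F I E L D T H E - - - C A T
j = 3   G A R F I E L D T H A T - - C A T
j = 4   G A R R Y - L I K E D A - - C A T

$m_2$ is the column A, A, A, A and $m_{10}$ is the column H, H, H, E
$m_3^1 = R$
$c_3^A = 0$
$c_2^A = 4$
$c_{10}^H = 3$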
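
The notation can be checked mechanically on the example alignment. The following is a minimal Python sketch; the helper names `column` and `count` are illustrative and not part of the lecture, and indices are 1-based to match the slides.

```python
# Rows j = 1..4 of the example alignment.
alignment = [
    "GARFIELDTHEFATCAT",
    "GARFIELDTHE---CAT",
    "GARFIELDTHAT--CAT",
    "GARRY-LIKEDA--CAT",
]

def column(m, i):
    """Return column m_i as a string (i is 1-based, as on the slides)."""
    return "".join(row[i - 1] for row in m)

def count(m, i, a):
    """Return c_i^a, the number of occurrences of residue a in column i."""
    return column(m, i).count(a)

print(column(alignment, 3)[0])    # m_3^1
print(count(alignment, 3, "A"))   # c_3^A
print(count(alignment, 2, "A"))   # c_2^A
print(count(alignment, 10, "H"))  # c_10^H
```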

Scoring a Multiple Sequence Alignment (MSA)
Key issue: how do we score a multiple sequence alignment?
Usually, we assume that columns of an alignment are independent
$$\mathrm{Score}(m) = G(m) + \sum_i S(m_i)$$
where $G(m)$ is the gap function and $S(m_i)$ is the score of the $i$th column

For now, we will simplify the score by assuming a linear gap penalty
$$\mathrm{Score}(m) = \sum_i S(m_i)$$
Gap penalty (G)

We will use a simple linear gap penalty function
Penalty for a space: $s$
Let $S(a,b)$ denote the score of substituting $a$ by $b$
The linear gap penalty can be incorporated into the substitution matrix:
$S(a,-) = S(-,a) = -s$
$S(-,-) = 0$

Two common ways of scoring a multiple alignment
Entropy based scores
Sum of pairs

Entropy of a distribution

A measure of uncertainty of an outcome
For a discrete distribution $P(X)$, where $X$ takes $k$ values $x_1, \ldots, x_k$, it is defined as
$$H(X) = -\sum_{i=1}^{k} P(x_i) \log P(x_i)$$
Entropy is greatest when we are most uncertain, that is, for a uniform distribution
Entropy is least when we are most certain, e.g. for a deterministic event
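
The two extremes named above can be checked numerically. This is a small sketch assuming natural logarithms and the usual convention that outcomes with probability 0 contribute nothing; the function name `entropy` is illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum_i P(x_i) log P(x_i), using natural log.
    Outcomes with probability 0 are skipped (they contribute 0 by convention)."""
    return sum(-p * math.log(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # most uncertain over 4 outcomes
certain = [1.0, 0.0, 0.0, 0.0]       # deterministic outcome
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(uniform), entropy(skewed), entropy(certain))
```

For a uniform distribution over $k$ outcomes the value is $\log k$, and for a deterministic outcome it is 0.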

Score of a column: Entropy based
Score of the $i$th column of alignment $m$ is
$$S(m_i) = -\sum_a c_i^a \log p_i^a$$
$p_i^a$: probability of character $a$ in column $i$
$c_i^a$: number of occurrences of $a$ in column $i$
This has an entropy-based interpretation
Let $X_i$ be a random variable representing a character in column $i$
Consider each entry of column $i$ to be an observation of $X_i$ across multiple independent experiments
We estimate $P(X_i = a)$ by $p_i^a = c_i^a / n$, where $n$ is the number of entries in the column
Column score is proportional to the entropy of $X_i$
Scoring an alignment: Entropy based score

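High entropy: more uniform distribution / more variability of characters
Low entropy: less uniform distribution / less variability of characters
$$S(m_i) = -\sum_a c_i^a \log p_i^a$$
Under this score, a perfectly conserved column scores 0 and more variable columns score higher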
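
As a hedged illustration of the entropy-based column score, the sketch below estimates each probability as the observed frequency of the character in the column. Treating the gap character like any other symbol is an assumption made here for simplicity, and the function name is illustrative.

```python
import math
from collections import Counter

def entropy_column_score(col):
    """S(m_i) = -sum_a c_i^a * log(p_i^a), with p_i^a estimated as c_i^a / n.
    Assumption: '-' is counted like any other character."""
    n = len(col)
    counts = Counter(col)
    return sum(-c * math.log(c / n) for c in counts.values())

conserved = "AAAA"   # one character only: lowest entropy
partly = "AAAC"
variable = "ACGT"    # uniform over four characters: highest entropy

for col in (conserved, partly, variable):
    print(col, entropy_column_score(col))
```

The value equals $n$ times the entropy of the empirical distribution of the column, which is why the column score is proportional to the entropy.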

Scoring of a column: Sum of Pairs

Compute the sum of the pairwise scores
$$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$$
$k<l$: iterate over all pairs of rows in the column
$s(m_i^k, m_i^l)$: substitution score from a substitution/match matrix such as BLOSUM or PAM
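
A minimal sketch of the sum-of-pairs column score follows. The scoring function `pair_score` is a toy stand-in (match 1, mismatch -1, space penalty 2) chosen for illustration; a real application would look up BLOSUM or PAM values instead.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy substitution score with the linear gap penalty folded in:
    S(a,-) = S(-,a) = gap and S(-,-) = 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(col, score=pair_score):
    """S(m_i) = sum over all row pairs k < l of s(m_i^k, m_i^l)."""
    return sum(score(a, b) for a, b in combinations(col, 2))

print(sum_of_pairs("AAA"))
print(sum_of_pairs("AA-"))
print(sum_of_pairs("A--"))
```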

Algorithms for performing a Multiple Sequence Alignment
Dynamic programming
Not practical
Progressive alignment algorithms
Star alignment
Guide tree approach
Iterative alignment algorithms

Dynamic Programming (DP) for global multiple sequence alignment

Assume columns are independent
Score of alignment is sum of column scores
Generalization of methods for pairwise alignment
consider k-dimensional matrix for k sequences (instead of 2-dimensional matrix)
each matrix element represents alignment score for k prefixes (instead of 2 prefixes)

Notation for DP

Assume we have $k$ sequences $x^1, \ldots, x^k$
$i_1$ denotes the length of the prefix for sequence 1
$i_2$ denotes the length of the prefix for sequence 2
...
$i_k$ denotes the length of the prefix for sequence $k$
$x^k_{i_k}$ denotes the character at position $i_k$ of sequence $x^k$
$F$: $k$-dimensional matrix where $F(i_1, i_2, \ldots, i_k)$ denotes the score of the best alignment of the $i_1, i_2, \ldots, i_k$ prefixes of the sequences

Recall the DP for the pairwise alignment

$$F(i_1, i_2) = \max \begin{cases} F(i_1 - 1, i_2 - 1) + S(x^1_{i_1}, x^2_{i_2}) \\ F(i_1, i_2 - 1) + S(-, x^2_{i_2}) \\ F(i_1 - 1, i_2) + S(x^1_{i_1}, -) \end{cases}$$
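
The pairwise recurrence can be sketched directly in Python. The scoring function `toy_S` (match 1, mismatch -1, space penalty 2) is an assumption for illustration, and only the optimal score is returned, not the alignment itself.

```python
def toy_S(a, b, match=1, mismatch=-1, gap=-2):
    """Toy pair score; '-' denotes a gap."""
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def pairwise_score(x1, x2, S=toy_S):
    """Fill F(i1, i2) for all prefix lengths and return the optimal global score."""
    n1, n2 = len(x1), len(x2)
    F = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    # Boundary: a prefix aligned against the empty prefix is all gaps.
    for i1 in range(1, n1 + 1):
        F[i1][0] = F[i1 - 1][0] + S(x1[i1 - 1], "-")
    for i2 in range(1, n2 + 1):
        F[0][i2] = F[0][i2 - 1] + S("-", x2[i2 - 1])
    for i1 in range(1, n1 + 1):
        for i2 in range(1, n2 + 1):
            F[i1][i2] = max(
                F[i1 - 1][i2 - 1] + S(x1[i1 - 1], x2[i2 - 1]),  # align both characters
                F[i1][i2 - 1] + S("-", x2[i2 - 1]),              # gap in sequence 1
                F[i1 - 1][i2] + S(x1[i1 - 1], "-"),              # gap in sequence 2
            )
    return F[n1][n2]

print(pairwise_score("TGTAAC", "TGTAC"))
```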

DP for Multiple sequence alignment
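$$F(i_1, \ldots, i_k) = \max \begin{cases} F(i_1 - 1, \ldots, i_k - 1) + S(x^1_{i_1}, \ldots, x^k_{i_k}) \\ F(i_1, i_2 - 1, \ldots, i_k - 1) + S(-, x^2_{i_2}, \ldots, x^k_{i_k}) \\ F(i_1 - 1, i_2, \ldots, i_k - 1) + S(x^1_{i_1}, -, \ldots, x^k_{i_k}) \\ \quad \vdots \\ F(i_1, i_2 - 1, \ldots, i_k) + S(-, x^2_{i_2}, \ldots, -) \\ \quad \vdots \end{cases}$$
$F(i_1, \ldots, i_k)$: max score of alignment for the $k$ prefixes

How many items do we need to maximize over? $2^k - 1$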
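
The $2^k - 1$ cases correspond to every way of choosing, per sequence, either its next character or a gap, excluding the all-gap column. The sketch below enumerates these moves and uses them in a memoized version of the recurrence. It assumes sum-of-pairs column scoring with toy values (match 2, mismatch -1, space penalty 2); the names `moves_for`, `toy_s`, and `msa_score` are illustrative, and the approach is only feasible for a few very short sequences.

```python
from functools import lru_cache
from itertools import combinations, product

def moves_for(k):
    """All ways to extend k prefixes by one column: 1 = consume a character,
    0 = place a gap. The all-gap move is excluded, leaving 2^k - 1 moves."""
    return [mv for mv in product((0, 1), repeat=k) if any(mv)]

def toy_s(a, b, match=2, mismatch=-1, gap=-2):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def msa_score(seqs, s=toy_s):
    """Optimal sum-of-pairs score of a global alignment of all sequences."""
    moves = moves_for(len(seqs))

    @lru_cache(maxsize=None)
    def F(idx):
        if not any(idx):
            return 0  # all prefixes are empty
        best = None
        for mv in moves:
            prev = tuple(i - d for i, d in zip(idx, mv))
            if min(prev) < 0:
                continue  # move would consume a character from an empty prefix
            col = [seq[i - 1] if d else "-" for seq, i, d in zip(seqs, idx, mv)]
            cand = F(prev) + sum(s(a, b) for a, b in combinations(col, 2))
            if best is None or cand > best:
                best = cand
        return best

    return F(tuple(len(seq) for seq in seqs))

print(len(moves_for(3)))
print(msa_score(["GG", "DGG", "DGD"]))
```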

DP algorithm is too expensive

For $k$ sequences each of length $n$
Space complexity: $O(n^k)$
Time complexity: $O(n^k 2^k)$
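
To get a feel for these bounds, the rough arithmetic below counts matrix cells and cell-by-move evaluations for sequences of length 100. It ignores constant factors and the cost of scoring each column, so the numbers are only order-of-magnitude estimates; `dp_cost` is an illustrative name.

```python
def dp_cost(n, k):
    """Rough counts for the k-dimensional DP: number of cells (n^k) and
    number of cell-by-move evaluations (n^k * (2^k - 1))."""
    cells = n ** k
    work = cells * (2 ** k - 1)
    return cells, work

for k in (2, 3, 5, 10):
    cells, work = dp_cost(100, k)
    print(f"k={k:2d}  cells={cells:.1e}  evaluations={work:.1e}")
```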

Heuristic algorithms for multiple sequence alignment
Progressive alignment
Build the alignment of a larger number of sequences from partial alignments of subsets of sequences
Iterative alignment
Possibly remove some of the aligned sequences and re-align to see if the score improves

Progressive alignment

Key heuristic: Align the most similar sequences first
Rely on pre-computed pairwise similarity/distance
Pairwise sequence alignments
Algorithms differ in the extent to which the pairwise similarity influences the final alignments
Two strategies
Star alignment
Tree alignments
Simple (quick and dirty) tree
At each step, combine two (possibly singleton) sets of sequences
Star Alignment Approach
Given: $k$ sequences to be aligned, $x_1, \ldots, x_k$
pick one sequence $x_c$ as the center
for each $x_i \ne x_c$, determine an optimal alignment between $x_i$ and $x_c$
Aggregate pairwise alignments
Shift entire columns when incorporating gaps
return: multiple alignment resulting from aggregate

Picking the center in star alignments

Two possible approaches:
1. try each sequence as the center, return the best multiple alignment
2. compute all pairwise alignments and select the string $x_c$ that maximizes
$$\sum_{i \ne c} \mathrm{sim}(x_i, x_c)$$
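
The second approach can be sketched as follows. Here `similarity` is assumed to be the optimal global pairwise alignment score under toy parameters (match 1, mismatch -1, space penalty 2); the lecture does not fix a particular similarity measure, so this choice and the function names are illustrative.

```python
def similarity(x, y, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of x and y with a linear gap penalty."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, cur[j - 1] + gap, prev[j] + gap))
        prev = cur
    return prev[-1]

def pick_center(seqs):
    """Return the index c that maximizes the sum over i != c of sim(x_i, x_c)."""
    def total(c):
        return sum(similarity(seqs[i], seqs[c]) for i in range(len(seqs)) if i != c)
    return max(range(len(seqs)), key=total)

toy = ["AAAA", "AAAT", "TTTT"]
print(toy[pick_center(toy)])
```

In the toy input, AAAT sits between the other two sequences and therefore has the largest total similarity.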

Aligning to an existing partial alignment

Need to treat each partial alignment as a single entity
Partial alignment should not be changed other than gap insertions
Shift entire columns when incorporating gaps

Before (new sequence, then partial alignment)    After
 TGTTAAC                                         -TGTTAAC
-TGT AAC                                         -TGT-AAC
-TGT -AC                                         -TGT--AC
ATGT --C                                         ATGT---C
ATGT GGC                                         ATGT-GGC
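
Shifting entire columns amounts to inserting the same gap column into every row of the partial alignment. A minimal sketch, with `insert_gap_column` as an illustrative helper name and a 0-based column index:

```python
def insert_gap_column(alignment, pos):
    """Insert an all-gap column before 0-based column index pos in every row,
    leaving the relative alignment of the existing rows unchanged."""
    return [row[:pos] + "-" + row[pos:] for row in alignment]

partial = ["-TGTAAC", "-TGT-AC", "ATGT--C", "ATGTGGC"]
widened = insert_gap_column(partial, 4)

# The new sequence TGTTAAC is placed on top, with a leading gap.
for row in ["-TGTTAAC"] + widened:
    print(row)
```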

Star Alignment Example

Given:
ATTGCCATT
ATGGCCATT
ATCCAATTTT
ATCTTCTT
ATTGCCGATT

Pairwise alignments of each sequence with the center ATTGCCATT:
ATGGCCATT    ATC-CAATTTT    ATCTTC-TT    ATTGCCGATT
ATTGCCATT    ATTGCCATT--    ATTGCCATT    ATTGCC-ATT
Star Alignment Example

Aggregate pairwise alignments

     Present pair     Current multiple alignment
1.   ATGGCCATT        ATTGCCATT
     ATTGCCATT        ATGGCCATT

2.   ATC-CAATTTT      ATTGCCATT--
     ATTGCCATT--      ATGGCCATT--
                      ATC-CAATTTT

Star Alignment Example

     Present pair     Current multiple alignment
3.   ATCTTC-TT        ATTGCCATT--
     ATTGCCATT        ATGGCCATT--
                      ATC-CAATTTT
                      ATCTTC-TT--

4.   ATTGCCGATT       ATTGCC-ATT--
     ATTGCC-ATT       ATGGCC-ATT--
                      ATC-CA-ATTTT
                      ATCTTC--TT--
                      ATTGCCGATT--

Shift entire columns when incorporating a gap
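
The aggregation steps can be sketched as a merge that walks the center row of the current multiple alignment and the center row of the next pairwise alignment in parallel. This is one possible implementation, not necessarily the one used in the course; `merge_pair` is an illustrative name, and the pairwise alignments are taken as given rather than computed.

```python
def merge_pair(msa, center_aln, other_aln):
    """Add one pairwise alignment (center vs. other) to msa, whose row 0 is the center.
    Gaps already in the MSA are kept; new gaps in the center shift entire columns."""
    rows = [[] for _ in msa]
    new_row = []
    p = q = 0
    while p < len(msa[0]) or q < len(center_aln):
        msa_gap = p < len(msa[0]) and msa[0][p] == "-"
        pair_gap = q < len(center_aln) and center_aln[q] == "-"
        if pair_gap and not msa_gap:
            # New gap in the center: open a gap column in every existing row.
            for r in rows:
                r.append("-")
            new_row.append(other_aln[q])
            q += 1
        elif msa_gap and not pair_gap:
            # Gap column already in the MSA: the new sequence gets a gap there.
            for r, old in zip(rows, msa):
                r.append(old[p])
            new_row.append("-")
            p += 1
        else:
            # Same center character (or a gap on both sides): columns line up.
            for r, old in zip(rows, msa):
                r.append(old[p])
            new_row.append(other_aln[q])
            p += 1
            q += 1
    return ["".join(r) for r in rows] + ["".join(new_row)]

# Pairwise alignments from the example, as (center row, other row).
pairs = [
    ("ATTGCCATT", "ATGGCCATT"),
    ("ATTGCCATT--", "ATC-CAATTTT"),
    ("ATTGCCATT", "ATCTTC-TT"),
    ("ATTGCC-ATT", "ATTGCCGATT"),
]
msa = ["ATTGCCATT"]
for center_aln, other_aln in pairs:
    msa = merge_pair(msa, center_aln, other_aln)
print("\n".join(msa))
```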
Comments about Star alignment

Conceptually simple
Dependent only upon pairwise alignments
Does not consider any position-specific information of the partial multiple sequence alignment while aligning a new sequence to it

Tree-based progressive alignments

Align sequences according to a guide tree
leaves represent sequences
internal nodes represent alignments
Determine alignments from bottom of tree upward
return multiple alignment represented at the root of the tree
One common variant: the CLUSTALW algorithm [Thompson et al. 1994]
Tree-based progressive alignment
Depending on the internal node in the tree, we may have to align
a sequence with a sequence
a sequence with a partial alignment
a partial alignment with a partial alignment
In all cases we have the option of inserting gaps or substitutions
For aligning alignments, we will use sum of pairs scoring
To choose between options we will use an idea similar to the pairwise sequence alignment case
Tree alignment example

Starting sequences
x1 TGTTAAC
x2 TGTAAC
x3 TGTAC
x4 ATGTC
x5 ATGTGGC
Create a guide tree
Using pairwise distances (we will cover this in subsequent lectures)
Approach similar to but simpler than phylogenetic trees

[Figure: guide tree with leaves x1, x2, x3, x4, x5]
Tree Alignment Example

TGTAAC
TGT-AC

TGTTAAC TGTAAC TGTAC ATGTC ATGTGGC


Tree Alignment Example

TGTAAC ATGT--C
TGT-AC ATGTGGC

TGTTAAC TGTAAC TGTAC ATGTC ATGTGGC


Tree Alignment Example

Aligning two alignments

-TGTAAC
-TGT-AC
ATGT--C
ATGTGGC

TGTAAC ATGT--C
TGT-AC ATGTGGC

TGTTAAC TGTAAC TGTAC ATGTC ATGTGGC


Tree Alignment Example
Aligning sequence to alignment

-TGTTAAC
-TGT-AAC
-TGT--AC
ATGT---C
ATGT-GGC

-TGTAAC
-TGT-AC
ATGT--C
ATGTGGC

TGTAAC ATGT--C
TGT-AC ATGTGGC

TGTTAAC TGTAAC TGTAC ATGTC ATGTGGC

Scoring an alignment of partial alignments

Recall the sum of pairs score for a column $i$
$$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$$
Let 1 to $n$ represent sequences from the first alignment
Let $n+1$ to $N$ represent sequences from the second alignment, where $N$ denotes the total number of sequences
The score of column $i$ can then be written as
$$S(m_i) = \sum_{k<l\le n} s(m_i^k, m_i^l) + \sum_{n<k<l\le N} s(m_i^k, m_i^l) + \sum_{k\le n,\; n<l\le N} s(m_i^k, m_i^l)$$
The first sum is within the first alignment, the second is within the second alignment, and the third is between the two alignments
Computing the sum of scores for two alignments
Assume we have two alignments corresponding to intermediate nodes of the guide tree
Alignment A1    Alignment A2
AAAC            AGC
-GAC            ACC
Alignment of two alignments = pairwise alignment of sequences of columns
Filling entry (i, j) of the DP matrix, we maximize over
aligning column i in A1 to column j in A2
aligning column i in A1 to gaps in A2
aligning column j in A2 to gaps in A1
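
A hedged sketch of this column-level DP on A1 and A2 follows. Because S(-,-) = 0, the within-alignment parts of the sum-of-pairs score do not change when gap columns are inserted, so only the between-alignment pairs are scored here. The toy scores (match 1, mismatch -1, space penalty 2) and the function names are assumptions for illustration.

```python
from itertools import product

def s(a, b, match=1, mismatch=-1, gap=-2):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def cross_score(col1, col2):
    """Between-alignment part of the sum-of-pairs score for one merged column."""
    return sum(s(a, b) for a, b in product(col1, col2))

def align_alignments(A1, A2):
    """Global DP over the columns of A1 and A2; returns (merged rows, score)."""
    cols1, cols2 = list(zip(*A1)), list(zip(*A2))
    gap1, gap2 = ("-",) * len(A1), ("-",) * len(A2)
    n, m = len(cols1), len(cols2)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0], back[i][0] = F[i - 1][0] + cross_score(cols1[i - 1], gap2), (1, 0)
    for j in range(1, m + 1):
        F[0][j], back[0][j] = F[0][j - 1] + cross_score(gap1, cols2[j - 1]), (0, 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (F[i - 1][j - 1] + cross_score(cols1[i - 1], cols2[j - 1]), (1, 1)),
                (F[i - 1][j] + cross_score(cols1[i - 1], gap2), (1, 0)),
                (F[i][j - 1] + cross_score(gap1, cols2[j - 1]), (0, 1)),
            ]
            F[i][j], back[i][j] = max(options, key=lambda t: t[0])
    # Trace back, emitting merged columns from right to left.
    merged = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        merged.append((cols1[i - 1] if di else gap1) + (cols2[j - 1] if dj else gap2))
        i, j = i - di, j - dj
    merged.reverse()
    return ["".join(row) for row in zip(*merged)], F[n][m]

rows, score = align_alignments(["AAAC", "-GAC"], ["AGC", "ACC"])
print("\n".join(rows))
```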

Comments about tree-based progressive alignment
Exploits partial alignment information
But, greedy
The tree might not be correct, that is, it may reflect an incorrect ordering of how sequences should be stacked up in the alignment
Final results are prone to errors in alignment
Some positions might be misaligned (that is, have a lower score than if a different ordering were used)

Ordering matters

Consider aligning GG, DGG and DGD

Aligning DGD and GG first, two alignments are possible:
  1          2
D G D      D G D
- G G      G G -

Both are equally good. But when we include DGG:
  1          2
D G D      D G D
- G G      G G -
D G G      D G G

1 is better than 2, assuming a match score of 2, mismatch score = 1, gap penalty = -2
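
The comparison can be verified with a short sum-of-pairs calculation. The slide text gives the mismatch score as 1; the sketch below uses -1 by default, on the assumption that a minus sign may have been lost, and also checks +1. The conclusion is the same in both cases. Function names are illustrative.

```python
from itertools import combinations

def s(a, b, match, mismatch, gap):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_score(rows, match=2, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length rows."""
    return sum(
        s(a, b, match, mismatch, gap)
        for col in zip(*rows)
        for a, b in combinations(col, 2)
    )

option1 = ["DGD", "-GG"]
option2 = ["DGD", "GG-"]

print(sp_score(option1), sp_score(option2))                      # tie before adding DGG
print(sp_score(option1 + ["DGG"]), sp_score(option2 + ["DGG"]))  # option 1 wins
```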

Iterative refinement methods

The order of selection of sequences can influence the alignment
ClustalW overcomes some of these issues but has many heuristics and parameters
How to avoid committing to a non-optimal pairwise decision?
Revisit alignments
This is the focus of iterative alignments
Basic iterative refinement algorithm
Remove a sequence from the current multiple alignment
Realign the removed sequence back to the multiple alignment
Repeat until removal and realignment of any sequence does not improve the alignment score
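
The basic refinement loop can be sketched as below. It is a simplified illustration under several assumptions: sum-of-pairs scoring with toy values (match 2, mismatch -1, space penalty 2), a plain DP for realigning one sequence to the remaining alignment, and acceptance only of strict improvements. The function names are illustrative and the input is assumed to have at least three rows.

```python
from itertools import combinations

def s(a, b, match=2, mismatch=-1, gap=-2):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_score(rows):
    return sum(s(a, b) for col in zip(*rows) for a, b in combinations(col, 2))

def strip_gap_columns(rows):
    cols = [c for c in zip(*rows) if any(ch != "-" for ch in c)]
    return ["".join(r) for r in zip(*cols)]

def realign(seq, rows):
    """Align seq to the fixed alignment rows by DP; returns rows + [aligned seq]."""
    cols = list(zip(*rows))
    gap_col = ("-",) * len(rows)
    n, m = len(seq), len(cols)
    cs = lambda ch, col: sum(s(ch, c) for c in col)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0], back[i][0] = F[i - 1][0] + cs(seq[i - 1], gap_col), (1, 0)
    for j in range(1, m + 1):
        F[0][j], back[0][j] = F[0][j - 1] + cs("-", cols[j - 1]), (0, 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j], back[i][j] = max(
                (F[i - 1][j - 1] + cs(seq[i - 1], cols[j - 1]), (1, 1)),
                (F[i - 1][j] + cs(seq[i - 1], gap_col), (1, 0)),
                (F[i][j - 1] + cs("-", cols[j - 1]), (0, 1)),
            )
    out, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        out.append((cols[j - 1] if dj else gap_col) + ((seq[i - 1] if di else "-"),))
        i, j = i - di, j - dj
    out.reverse()
    return ["".join(r) for r in zip(*out)]

def refine(rows):
    """Remove and realign one sequence at a time until no move improves the score."""
    rows, best, improved = list(rows), sp_score(rows), True
    while improved:
        improved = False
        for k in range(len(rows)):
            seq = rows[k].replace("-", "")
            rest = strip_gap_columns(rows[:k] + rows[k + 1:])
            cand = realign(seq, rest)
            cand = cand[:k] + [cand[-1]] + cand[k:-1]  # restore original row order
            if sp_score(cand) > best:
                rows, best, improved = cand, sp_score(cand), True
    return rows

start = ["DGD", "GG-", "DGG"]
print(refine(start))
```

Starting from the weaker of the two alignments of GG, DGG and DGD, removing and realigning GG recovers the better one.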

Additional notes about the ClustalW algorithm

Tailored to handle very divergent sequences: 25-30% similarity
Dynamically varies the gap penalties in a position- and residue-specific manner
Weights different sequences differently
Closely related sequences need to be down-weighted
Divergent sequences are up-weighted
Dynamically switches between substitution matrices depending upon the average similarity between sequences being aligned

Applying ClustalW to SH3 domain proteins
[Figure: CLUSTAL W multiple alignment of SH3 domain sequences, reproduced from Thompson et al., 1994, Nucleic Acids Research, Vol. 22, No. 22, p. 4679]

Proteins share <12% sequence identity
Alignment blocks correspond to beta strand secondary structures

Figure 4. CLUSTAL W alignment of a set of SH3 domains taken from Musacchio et al. (23). Secondary structure assignments for the solved Spectrin (24) and Fyn (39) domains are according to DSSP (40). The alignment was generated in two steps using default parameters. After full multiple alignment, the aligned sequences were realigned. Segments which were correctly aligned in the second pass are underlined. The single misaligned segment in H_P55 and the misaligned residue in H_NCK/2 are boxed. The sequences are coloured to illustrate significant features. All G (orange) and P (yellow) are coloured. Other residues matching a frequent occurrence of a property in a column are coloured: hydrophobic = blue; hydrophobic tendency = light blue; basic = red; acidic = purple; hydrophilic = green; unconserved = white. The alignment figure was prepared with the GDE sequence editor (S. Smith, Harvard University) and COLORMASK (J. Thompson, EMBL).
Summary
Multiple sequence alignment is the problem of finding corresponding positions among more than two sequences
Scoring function:
Entropy based
Sum of pairs
Algorithms
Progressive
Star
Dependent upon a center
Keep adding each pairwise alignment with the center to the current alignment
Tree
Create an approximate guide tree
Use tree to align the sequences
Iterative
Don't commit to a fixed ordering; revisit the alignment until the score does not change