
BSE 322A

Bioinformatics and Computational Biology

Instructor: Nitin Gupta


Feb 13, 2017

Slide and Image credits: Wikipedia, Prof. Pavel Pevzner, Prof. R. Sankararamakrishnan
Some useful applications of alignments
Given a newly sequenced organism,
Which subregions align with other organisms?
Potential genes
Other biological characteristics

Assume we try Smith-Waterman:

Our newly sequenced mammal: $3 \times 10^{9}$

The entire genomic database: $10^{10}$ - $10^{12}$

Time required is proportional to the product of the two lengths, O(mn). Slow!
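
To get a feel for "Slow!", the product of the two lengths can be turned into a rough running time. This is a back-of-the-envelope sketch: the lengths are the ones quoted above, while the rate of one billion table cells per second is an assumed figure, not a measurement.

```python
# Rough cost estimate for filling a full Smith-Waterman table.
# The lengths come from the slide; the cell rate is an assumed figure.
query_len = 3 * 10**9          # newly sequenced mammalian genome
db_lens = [10**10, 10**12]     # range quoted for the genomic database
cells_per_second = 10**9       # assumption: one billion DP cells per second

def sw_years(m, n, rate=cells_per_second):
    """Return the years needed to fill an m x n dynamic-programming table."""
    seconds = m * n / rate
    return seconds / (3600 * 24 * 365)

for n in db_lens:
    print(f"database length {n:.0e}: about {sw_years(query_len, n):,.0f} years")
```

Even at the optimistic assumed rate, the estimate comes out in centuries or more, which is why a faster heuristic is needed.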
Word matching
Intuition: Approximately-matching strings share some perfectly-matching substrings.

Perfect matching for substrings (also called words) can be done using indexing or dictionary-matching, which is very fast.

Instead of searching for approximately matching strings (difficult), search for perfectly matching substrings (easy).

Once you get a few candidate positions with exact substring matches, then try to extend the matches to get longer alignments
Advantage of substring matching:
Fast (because we only focus on regions that show a substring match)

Disadvantage:
Somewhat reduced sensitivity (might miss some genuine matches in
case there is no exact match of substring)

Substring matching is used by BLAST: Basic Local Alignment Search Tool

The paper describing BLAST is one of the most cited papers in the world!
Indexing-based local alignment
(BLAST- Basic Local Alignment Search Tool)

Main idea:

1. Construct a dictionary of all the words in the query

2. Initiate a local alignment for each word match between query and DB

Running Time: O(MN)

However, in practice, BLAST is orders of magnitude faster than Smith-Waterman
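
The two steps of the main idea can be sketched as follows. This is a minimal illustration of dictionary-based word matching, not BLAST's actual implementation; the function names and the database string `db` are hypothetical, and the query is the first sequence from the extension example.

```python
from collections import defaultdict

def build_word_index(query, k):
    """Map every length-k word of the query to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index

def find_seeds(query, db, k):
    """Return (query_pos, db_pos) pairs where a length-k word matches exactly."""
    index = build_word_index(query, k)
    seeds = []
    for j in range(len(db) - k + 1):          # single scan over the database
        for i in index.get(db[j:j + k], []):  # dictionary lookup per word
            seeds.append((i, j))
    return seeds

query = "ACGAAGTAAGGTCCAGT"   # first sequence from the extension example
db = "CCTTGTTAGGTCCTTAG"      # hypothetical database sequence
print(find_seeds(query, db, k=4))
```

Each seed is only a candidate position; the alignment itself is built in the extension step.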
Indexing-based local alignment
Dictionary:
All words of length k in the query are searched in the database
(Can also include approximate keywords:
e.g. if GKD is one keyword, we can include
GRD and GKE also in the list of keywords)
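
A small sketch of how approximate keywords could be generated for GKD. The five listed substitution scores agree with BLOSUM62, but the flat `DEFAULT_MISMATCH`, the reduced alphabet, and the threshold of 13 are assumptions made only to keep the example short.

```python
from itertools import product

# Partial substitution scores for illustration only; the five listed values
# match BLOSUM62, while DEFAULT_MISMATCH is a stand-in for all other pairs.
SCORES = {("G", "G"): 6, ("K", "K"): 5, ("D", "D"): 6,
          ("K", "R"): 2, ("D", "E"): 2}
DEFAULT_MISMATCH = -1   # assumption, not a real matrix entry
ALPHABET = "DEGKR"      # reduced alphabet to keep the example small

def pair_score(a, b):
    """Look up a pair score in either order, falling back to the default."""
    return SCORES.get((a, b), SCORES.get((b, a), DEFAULT_MISMATCH))

def neighborhood(word, threshold):
    """Return all words scoring at least `threshold` against `word`."""
    hits = []
    for cand in product(ALPHABET, repeat=len(word)):
        score = sum(pair_score(a, b) for a, b in zip(word, cand))
        if score >= threshold:
            hits.append("".join(cand))
    return sorted(hits)

print(neighborhood("GKD", threshold=13))
```

With these toy scores the list contains GKD itself plus GKE and GRD, the two approximate keywords mentioned above.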

Alignment:
Ungapped extensions until score falls below statistical threshold
Note that this step is done after perfect matching of words, i.e. we have already filtered the database quite a bit and therefore we save time compared to Smith-Waterman

Output:
All local alignments with score > statistical threshold
Indexing-based local alignment: Extensions
A C G A A G T A A G G T C C A G T
Example:

C C C T T C C T G G A T T G C G A
k=4

The matching word GGTC initiates an alignment

Extension to the left and right with no gaps until alignment score falls below a threshold

Output:
GTAAGGTCC
GTTAGGTCC
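
The ungapped extension can be sketched as below. The stopping rule (give up once the running score drops more than `drop` below the best score seen), the +1/-1 scoring, and the database string `d` are assumptions for illustration; the slide only says that extension stops when the alignment falls below a threshold.

```python
def extend_ungapped(q, d, qi, di, k, match=1, mismatch=-1, drop=2):
    """Extend an exact k-word match at q[qi:qi+k], d[di:di+k] without gaps.

    Extension in each direction stops once the running score falls more than
    `drop` below the best score seen; the best-scoring extent is kept.
    """
    def walk(step, q_pos, d_pos):
        best = score = 0
        best_len = length = 0
        while 0 <= q_pos < len(q) and 0 <= d_pos < len(d):
            score += match if q[q_pos] == d[d_pos] else mismatch
            length += 1
            if score >= best:
                best, best_len = score, length
            elif best - score > drop:
                break
            q_pos += step
            d_pos += step
        return best, best_len

    right_score, right_len = walk(+1, qi + k, di + k)
    left_score, left_len = walk(-1, qi - 1, di - 1)
    total = k * match + left_score + right_score
    start_q, start_d = qi - left_len, di - left_len
    span = left_len + k + right_len
    return q[start_q:start_q + span], d[start_d:start_d + span], total

q = "ACGAAGTAAGGTCCAGT"   # first sequence from the example
d = "CCTTGTTAGGTCCTTAG"   # hypothetical database sequence containing GGTC
print(extend_ungapped(q, d, qi=9, di=8, k=4))
```

With this hypothetical database string, the seed GGTC grows into the pair GTAAGGTCC / GTTAGGTCC shown in the output above: one mismatch is tolerated because the matches on either side of it recover the score.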
Indexing-based local alignment: Extensions
A C G A A G T A A G G T C C A G T
C T G A T C C T G G A T T G C G A
Extensions with gaps until score falls below a threshold

Output:

GTAAGGTCCAGT
GTTAGGTC-AGT
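
To see why a gap can be worth its penalty, the two outputs can be scored side by side. The scoring values (+1 match, -1 mismatch, -1 per gap position) are assumed for illustration and are not BLAST's defaults.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap=-1):
    """Score two equal-length aligned strings; '-' marks a gap position."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

ungapped = alignment_score("GTAAGGTCC", "GTTAGGTCC")
gapped = alignment_score("GTAAGGTCCAGT", "GTTAGGTC-AGT")
print(ungapped, gapped)
```

Under these assumed scores the gapped alignment scores higher because the three matches after the gap outweigh the gap penalty. With a harsher penalty such as `gap=-2` the two alignments tie, so the benefit depends on the scoring scheme.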
BLAST algorithm
Keyword search of all words of length w from
the query of length n in database of length m
(including approximate keywords with match
scores to original keywords above a threshold)
Word length w = 11 for DNA queries, w = 3 for proteins

Local alignment extension for each found keyword
Extend result until longest match above threshold is achieved
Sensitivity-Speed Tradeoff
[Figure: sensitivity and speed plotted against word length, from long words (k = 15) to short words (k = 7)]

A. Sensitivity decreases with k.

B. Speed increases with k.

Kent WJ, Genome Research 2002
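
A rough numerical illustration of the tradeoff, under simplifying assumptions that are not taken from the cited paper: bases are uniform and independent, and a true match has 90% identity at every position. The first function estimates how many word hits arise purely by chance (work to be done, so fewer means faster); the second gives the probability that one particular window of k aligned positions is perfectly conserved (a proxy for sensitivity, not the full sensitivity of a search).

```python
def expected_random_hits(query_len, db_len, k, alphabet_size=4):
    """Expected number of chance exact k-word matches between random sequences."""
    return (query_len - k + 1) * (db_len - k + 1) / alphabet_size ** k

def seed_survival(identity, k):
    """Probability that k consecutive aligned positions are all identical."""
    return identity ** k

for k in (7, 11, 15):
    hits = expected_random_hits(1_000, 1_000_000, k)   # assumed lengths
    keep = seed_survival(0.9, k)                       # assumed 90% identity
    print(f"k={k:2d}  chance hits ~ {hits:12.1f}  P(window conserved) = {keep:.2f}")
```

Both numbers fall as k grows: fewer chance hits to extend (faster), but a smaller chance that a genuine match contains a perfect word (less sensitive).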


How to interpret a BLAST search: Expect (E) value

For each alignment with score S, an Expect value (E) is provided by BLAST.

The expect value E is the number of alignments with scores greater than or equal
to score S that are expected to occur by chance in a database search.

The key equation describing an E value is: $E = K m n \, e^{-\lambda S}$

This equation is derived from a description of the Gumbel extreme value distribution

m, n = the lengths of the two sequences

λ, K = Karlin-Altschul statistics, which depend on the database and the substitution matrix used
Some properties of the equation $E = K m n \, e^{-\lambda S}$

The value of E decreases exponentially with increasing S (higher S values correspond to better alignments). Very high scores correspond to very low E values.

For E=1, one match with a similar score is expected to occur by chance. For a larger or smaller database, you would expect E to vary accordingly.

We normally want E-values to be much smaller than 1 (for example, less than 0.001 or even $10^{-10}$)
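
These properties can be checked numerically with the standard Karlin-Altschul form $E = K m n \, e^{-\lambda S}$. The values of `lam` and `K` below, and the sequence lengths, are placeholders chosen for illustration; real values are reported by BLAST for each search.

```python
import math

def expect_value(score, m, n, lam, K):
    """E = K * m * n * exp(-lam * score) for raw score `score`."""
    return K * m * n * math.exp(-lam * score)

lam, K = 0.3, 0.1            # placeholder Karlin-Altschul parameters
m, n = 300, 1_000_000        # assumed query and database lengths

for s in (40, 60, 80):
    print(f"S={s}: E={expect_value(s, m, n, lam, K):.3g}")

# Doubling the database length doubles E for the same score.
print(expect_value(60, m, 2 * n, lam, K) / expect_value(60, m, n, lam, K))
```

The printed E-values shrink quickly as S grows, and the final ratio shows the linear dependence on database size.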
E-values and Bit scores
Raw alignment scores depend on the scoring scheme (through λ and K), and E-values additionally depend on the database size; therefore they cannot be directly compared for alignments obtained from different databases or scoring schemes

Bit scores are comparable between different database searches because they are normalized using λ and K, which accounts for the use of different scoring matrices, and they do not depend on the database size
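
A sketch of the normalization, assuming the standard definition of the bit score, $S' = (\lambda S - \ln K) / \ln 2$, which gives $E = m n \, 2^{-S'}$. The parameter values and lengths are placeholders for illustration.

```python
import math

def bit_score(raw_score, lam, K):
    """Convert a raw score to bits: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def expect_from_bits(bits, m, n):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return m * n * 2.0 ** (-bits)

lam, K = 0.3, 0.1            # placeholder Karlin-Altschul parameters
bits = bit_score(60, lam, K)
print(f"bit score: {bits:.1f}")
print(f"E from bits: {expect_from_bits(bits, 300, 1_000_000):.3g}")
```

Once a score is expressed in bits, λ and K no longer appear in the E-value formula; only the search space m·n remains.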
NR database, BLOSUM62

NR (non-redundant) is a large database containing all known sequences.


PDB database, BLOSUM62

(the same alignment gets a better E-value, compared to last slide, when using a smaller database (PDB))
Practical problem:

How will the E-value on the slide change if the PAM250 matrix is used instead of BLOSUM62?
Online demonstration of BLAST
http://www.ncbi.nlm.nih.gov/BLAST/
