Bioinf 2017 02 13 BLAST

BSE 322A
Bioinformatics and Computational Biology
Instructor: Nitin Gupta

Feb 13, 2017
Slide and image credits: Wikipedia, Prof. Pavel Pevzner, Prof. R. Sankararamakrishnan
Some useful applications of alignments
Given a newly sequenced organism,
Which subregions align with other organisms?
Potential genes
Other biological characteristics
Assume we try Smith-Waterman:
Our newly sequenced mammal: 3 × 10^9 bases
The entire genomic database: 10^10 to 10^12 bases
Time required is proportional to the product of the two lengths, O(mn). Slow!
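To get a feel for these numbers, the sketch below multiplies the two lengths and converts the cell count into wall-clock time. The throughput of 10^9 dynamic-programming cells per second is an assumed figure for illustration only, not a measurement.

```python
# Rough cost of filling a full Smith-Waterman table (one cell per pair of positions).
QUERY_LEN = 3 * 10**9        # newly sequenced mammal, from the slide
DB_LEN = 10**10              # lower end of the database size range on the slide
CELLS_PER_SECOND = 10**9     # ASSUMPTION: illustrative throughput, not a benchmark

def sw_years(m, n, cells_per_second=CELLS_PER_SECOND):
    """Return (number of DP cells, estimated years) for an O(mn) alignment."""
    cells = m * n
    seconds = cells / cells_per_second
    return cells, seconds / (3600 * 24 * 365)

cells, years = sw_years(QUERY_LEN, DB_LEN)
print(f"{cells:.1e} cells, about {years:,.0f} years")
```

Under that assumption, even the smallest database size on the slide works out to several centuries of computation.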
word matching
Intuition: Approximately-matching strings share some perfectly-matching substrings.
Perfect matching for substrings (also called words) can be done using indexing or dictionary-matching, which is very fast.
Instead of searching for approximately matching strings (difficult), search for perfectly matching substrings (easy).
Once you get a few candidate positions with exact substring matches, then try to extend the matches to get longer alignments.
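The word-matching idea above can be sketched in a few lines. This is a minimal illustration, not BLAST's actual data structure: the function names and the toy sequences are hypothetical, and the index is a plain dictionary from each length-k word to its start positions.

```python
from collections import defaultdict

def build_word_index(text, k):
    """Map every length-k word in text to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index

def find_seeds(query, db, k):
    """Return (query_pos, db_pos) pairs where a length-k word matches exactly."""
    index = build_word_index(query, k)
    seeds = []
    for j in range(len(db) - k + 1):
        for i in index.get(db[j:j + k], []):
            seeds.append((i, j))
    return seeds

# Hypothetical toy sequences: the database contains a near-copy of a query region.
query = "ACGAAGTAAGGTCCAGT"
db = "TTCGTTAGGTCCTTGCA"
print(find_seeds(query, db, 4))
```

Each database word costs one dictionary lookup, so the scan is linear in the database length. Seeds that share the same difference between query and database position lie on one diagonal and are natural candidates for extension.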
Advantage of substring matching:
Fast (because we only focus on regions that show a substring match)
Disadvantage:
Somewhat reduced sensitivity (might miss some genuine matches when no substring matches exactly)
Substring matching is used by BLAST: Basic Local Alignment Search Tool
The paper describing BLAST is one of the most cited papers in the world!
Indexing-based local alignment
(BLAST: Basic Local Alignment Search Tool)
Main idea:
1. Construct a dictionary of all the words in the query
2. Initiate a local alignment for each word match between query and DB
Worst-case running time: O(MN)
However, in practice, BLAST is orders of magnitude faster than Smith-Waterman
Indexing-based local alignment
Dictionary:
All words of length k from the query are searched in the database
(Can also include approximate keywords: e.g. if GKD is one keyword, we can include GRD and GKE also in the list of keywords)
Alignment:
Ungapped extensions until score falls below statistical threshold
Note that this step is done after perfect matching of words, i.e. we have already filtered the database quite a bit and therefore we save time compared to Smith-Waterman
Output:
All local alignments with score > statistical threshold
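The approximate keywords in the dictionary step can be illustrated by enumerating every word that scores at least some threshold against the original keyword. The sketch below makes several assumptions: the score table covers only five residues and is intended to match BLOSUM62 (worth checking against the published matrix), and the threshold of 13 is a hypothetical value chosen so that the example stays small.

```python
from itertools import product

# ASSUMPTION: pairwise scores for five residues only, intended to match BLOSUM62;
# check against the published matrix before relying on them.
SCORES = {
    ("G", "G"): 6, ("K", "K"): 5, ("R", "R"): 5, ("D", "D"): 6, ("E", "E"): 5,
    ("K", "R"): 2, ("D", "E"): 2, ("K", "E"): 1, ("R", "E"): 0,
    ("G", "D"): -1, ("K", "D"): -1,
    ("G", "K"): -2, ("G", "R"): -2, ("G", "E"): -2, ("R", "D"): -2,
}
ALPHABET = "GKRDE"  # reduced alphabet, for illustration only

def score(a, b):
    """Symmetric lookup in the toy substitution table."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]

def neighborhood(word, threshold):
    """All words over ALPHABET whose score against `word` is >= threshold."""
    hits = {}
    for candidate in product(ALPHABET, repeat=len(word)):
        s = sum(score(a, b) for a, b in zip(word, candidate))
        if s >= threshold:
            hits["".join(candidate)] = s
    return hits

print(neighborhood("GKD", threshold=13))
```

With these assumed scores, GRD and GKE from the slide both pass the threshold. Lowering the threshold enlarges the keyword list, which raises sensitivity at the cost of more seeds to extend.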
Indexing-based local alignment: Extensions
Example (k = 4):
A C G A A G T A A G G T C C A G T
C C C T T C C T G G A T T G C G A
The matching word GGTC initiates an alignment
Extension to the left and right with no gaps until alignment score falls below a threshold
Output:
GTAAGGTCC
GTTAGGTCC
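Ungapped extension from a seed can be sketched as follows. This is a simplified version under stated assumptions: match = +1, mismatch = -1, and extension in each direction stops once the running score has dropped more than `xdrop` below the best score seen so far. The database string is hypothetical, constructed so that it contains the slide's output.

```python
def extend_ungapped(query, db, q_start, d_start, k, match=1, mismatch=-1, xdrop=2):
    """Extend a length-k seed at (q_start, d_start) in both directions without gaps."""

    def extend(q_pos, d_pos, step):
        # Walk outward from the seed; remember the best-scoring extension length.
        running = best = best_len = length = 0
        while 0 <= q_pos < len(query) and 0 <= d_pos < len(db):
            running += match if query[q_pos] == db[d_pos] else mismatch
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > xdrop:
                break  # score fell too far below the best: stop extending
            q_pos += step
            d_pos += step
        return best, best_len

    right_score, right_len = extend(q_start + k, d_start + k, +1)
    left_score, left_len = extend(q_start - 1, d_start - 1, -1)
    score = k * match + left_score + right_score
    q_lo, d_lo = q_start - left_len, d_start - left_len
    span = left_len + k + right_len
    return score, query[q_lo:q_lo + span], db[d_lo:d_lo + span]

query = "ACGAAGTAAGGTCCAGT"
db = "TTCGTTAGGTCCTTGCA"          # hypothetical database fragment
# Seed GGTC starts at position 9 in the query and position 7 in db.
print(extend_ungapped(query, db, 9, 7, 4))
```

The extension tolerates the single A/T mismatch because the score recovers before the drop limit is exceeded, which is why the reported alignment is longer than the seed.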
Indexing-based local alignment: Extensions
A C G A A G T A A G G T C C A G T
C T G A T C C T G G A T T G C G A
Gapped extensions: extensions with gaps until score falls below a threshold
Output:
GTAAGGTCCAGT
GTTAGGTC-AGT
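One way to picture a gapped extension is as a Smith-Waterman computation restricted to the small region around a seed. The sketch below computes only the best local score, not the traceback, and assumes match = +1, mismatch = -1, and a linear gap penalty of -1; these parameters are illustrative, not BLAST defaults.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never goes below 0, so an alignment can start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# The two aligned strings from the slide, with the gap character removed.
print(smith_waterman_score("GTAAGGTCCAGT", "GTTAGGTCAGT"))
```

Because the table is filled only for the region near the seed, its cost is the product of two short lengths rather than of the query and the whole database.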
BLAST algorithm
Keyword search of all words of length w from the query of length n in database of length m (including approximate keywords with match scores to original keywords above a threshold)
Word length w = 11 for DNA queries, w = 3 for proteins
Local alignment extension for each found keyword
Extend result until longest match above threshold is achieved
Sensitivity-Speed Tradeoff
[Figure: sensitivity (X%) and speed compared for long words (k = 15) and short words (k = 7)]
A. Sensitivity decreases with k.
B. Speed increases with k.
Kent WJ, Genome Research 2002
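Both trends can be reproduced with a simple model. The sketch below is not taken from the cited paper; it assumes an ungapped homologous region whose positions match independently with a fixed probability, and uniformly random DNA for the chance matches. Sensitivity is the probability that the region contains at least one run of k consecutive matches, and the number of chance seeds stands in for the work the extension step must do.

```python
def prob_seed_hit(length, identity, k):
    """P(at least one run of k consecutive matches) in `length` positions,
    each matching independently with probability `identity` (ASSUMPTION)."""
    # run[r] = probability of being in a run of r matches with no seed found yet
    run = [1.0] + [0.0] * (k - 1)
    hit = 0.0
    for _ in range(length):
        hit += run[k - 1] * identity          # run of k-1 grows to k: seed found
        nxt = [0.0] * k
        nxt[0] = sum(run) * (1.0 - identity)  # a mismatch resets the run
        for r in range(k - 1):
            nxt[r + 1] = run[r] * identity    # a match lengthens the run
        run = nxt
    return hit

def expected_chance_seeds(m, n, k):
    """Expected exact k-word matches between random DNA of lengths m and n
    (ASSUMPTION: four equally likely bases)."""
    return m * n / 4**k

for k in (7, 11, 15):
    print(k, prob_seed_hit(100, 0.9, k), expected_chance_seeds(10**3, 10**9, k))
```

Longer words are found less often in a true homolog (lower sensitivity) but produce far fewer chance seeds to extend (higher speed).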

How to interpret a BLAST search: Expect (E) value
For each alignment with score S, an Expect value (E) is provided by BLAST.
The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.
The key equation describing an E value is: $E = K m n \, e^{-\lambda S}$
This equation is derived from a description of the Gumbel extreme value distribution
m, n = the lengths of the two sequences
λ, K = Karlin-Altschul statistics, which depend on the database and the substitution matrix used
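A short numerical sketch shows how the formula behaves. The values of K and λ below are placeholders chosen for illustration; real values depend on the scoring system and are reported by BLAST with each search.

```python
import math

def expect_value(score, m, n, K, lam):
    """E = K * m * n * exp(-lam * S) for raw score S and sequence lengths m, n."""
    return K * m * n * math.exp(-lam * score)

# ASSUMPTION: K and LAM are placeholder values, not taken from any real search.
K, LAM = 0.1, 0.3
QUERY_LEN = 250

for db_len in (10**6, 10**9):
    for s in (50, 100):
        e = expect_value(s, QUERY_LEN, db_len, K, LAM)
        print(f"db={db_len:.0e} S={s} E={e:.3g}")
```

For a fixed score, E grows in direct proportion to the database length; for a fixed database, each extra unit of score multiplies E by exp(-λ).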
Some properties of the equation $E = K m n \, e^{-\lambda S}$
The value of E decreases exponentially with increasing S (higher S values correspond to better alignments). Very high scores correspond to very low E values.
For E = 1, one match with a similar score is expected to occur by chance. Because E is proportional to the database length, a larger or smaller database changes E proportionally.
We normally want E-values to be much smaller than 1 (for example, less than 0.001 or even 10^-10)
E-values and Bit scores
Raw alignment scores and E-values depend on λ, K (which in turn depend on the database size and scoring scheme) and therefore cannot be directly compared for alignments obtained from different databases
Bit scores are comparable between different database searches because they are normalized to account for the use of different scoring matrices and different database sizes
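The normalization can be written out explicitly. The definition of the bit score S' below is the standard one and is not shown on the slide; it rescales the raw score S using the same λ and K that appear in the E-value equation.

```latex
S' = \frac{\lambda S - \ln K}{\ln 2}
\qquad\Longrightarrow\qquad
E = K m n \, e^{-\lambda S} = m n \, 2^{-S'}
```

In this form λ and K no longer appear separately, so the scoring system is absorbed into S'. The search-space size m n still enters when a bit score is converted to an E-value.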
NR database, BLOSUM62
NR (non-redundant) is a large database containing all known sequences.

PDB database, BLOSUM62
(The same alignment gets a better, i.e. lower, E-value than on the previous slide when a smaller database (PDB) is used.)
Practical problem:
How will the E-value on the slide change if the PAM250 matrix is used instead of BLOSUM62?
Online demonstration of BLAST
http://www.ncbi.nlm.nih.gov/BLAST/
