Our newly sequenced mammal
10^10 - 10^12
Time required is proportional to the product of the two lengths O(mn). Slow!
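To make the O(mn) cost concrete, here is a minimal sketch of the Smith-Waterman score computation. The scoring values (match +1, mismatch -1, linear gap penalty -1) are illustrative assumptions and do not come from the slides.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):          # m rows ...
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):      # ... times n columns: O(mn) cells
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,                 # start a new local alignment
                          prev[j - 1] + s,   # align a[i-1] with b[j-1]
                          prev[j] + gap,     # gap in b
                          curr[j - 1] + gap) # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("GTAAGGTCC", "GTTAGGTCC"))
```

Every cell of the m x n table is filled once, which is why the running time grows with the product of the two lengths.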
Word matching
Intuition: Approximately matching strings share some perfectly matching substrings.
Disadvantage: Somewhat reduced sensitivity (a genuine match may be missed if it contains no exactly matching substring).
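The word-matching idea can be sketched as an index of all length-k words in the database, followed by a lookup of each query word. The function names `build_index` and `find_seeds` and the toy strings are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def build_index(db, k):
    """Map every length-k word of db to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(db) - k + 1):
        index[db[i:i + k]].append(i)
    return index


def find_seeds(query, index, k):
    """Return (query_pos, db_pos) pairs where a query word matches exactly."""
    seeds = []
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], []):
            seeds.append((q, d))
    return seeds


index = build_index("TTACGTACGG", 3)
print(find_seeds("ACGT", index, 3))   # [(0, 2), (0, 6), (1, 3)]

# The disadvantage in action: ACTT and ACGT differ at one position,
# but they share no exact 3-letter word, so no seed is found.
print(find_seeds("ACTT", build_index("ACGT", 3), 3))   # []
```

The index is built once per database, so each query word costs a dictionary lookup instead of a scan.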
The paper describing BLAST is one of the most cited papers in the world!
Indexing-based local alignment
(BLAST: Basic Local Alignment Search Tool)
Main idea: [figure: query scanned against the DB]
Alignment: Ungapped extensions until the score falls below a statistical threshold.
Note that this step is done after perfect matching of words, i.e. we have already filtered the database quite a bit and therefore save time compared to Smith-Waterman.
Output: All local alignments with score > statistical threshold.
Indexing-based local alignment: Extensions
Example (k=4):
A C G A A G T A A G G T C C A G T
C C C T T C C T G G A T T G C G A
Output:
GTAAGGTCC
GTTAGGTCC
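The ungapped extension in this example can be sketched as follows. Several things here are assumptions rather than content from the slides: the scoring (+1 match, -1 mismatch), the X-drop stopping rule standing in for the "statistical threshold", and the database string, which is the second sequence of the example read right to left (the orientation that contains the printed output, presumably because the figure drew it along the other axis).

```python
def extend_ungapped(query, db, q, d, k, match=1, mismatch=-1, x_drop=3):
    """Extend a seed query[q:q+k] == db[d:d+k] along its diagonal, without gaps."""

    def best_extension(qi, di, step):
        # Walk in one direction and remember where the running score peaked.
        score = best = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= di < len(db):
            score += match if query[qi] == db[di] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= x_drop:    # dropped too far below the peak
                break
            qi += step
            di += step
        return best, best_len

    right_score, right_len = best_extension(q + k, d + k, +1)
    left_score, left_len = best_extension(q - 1, d - 1, -1)
    q_start, d_start = q - left_len, d - left_len
    length = left_len + k + right_len
    score = k * match + left_score + right_score
    return score, query[q_start:q_start + length], db[d_start:d_start + length]


query = "ACGAAGTAAGGTCCAGT"
db = "AGCGTTAGGTCCTTCCC"      # assumption: second sequence read right to left
# Seed AGGT: query position 8, database position 6 (0-based).
print(extend_ungapped(query, db, q=8, d=6, k=4))
# (7, 'GTAAGGTCC', 'GTTAGGTCC')
```

The extension tolerates the single A/T mismatch on the left because the two matches beyond it raise the score again before the drop limit is reached.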
Indexing-based local alignment: Extensions
A C G A A G T A A G G T C C A G T
C T G A T C C T G G A T T G C G A
Extensions with gaps until score falls below a threshold
Output:
GTAAGGTCCAGT
GTTAGGTC-AGT
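A gapped extension can be approximated by running a local alignment with traceback inside a window around the seed. This is a simplified sketch, not the procedure BLAST itself uses: the window size, the scoring values, and the helper names `local_align` and `gapped_extension` are all assumptions, and nothing forces the reported alignment to pass through the seed.

```python
def local_align(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman with traceback; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


def gapped_extension(query, db, q, d, k, window=8):
    """Align only the neighbourhood of a seed at query[q:q+k] / db[d:d+k]."""
    q_lo, d_lo = max(0, q - window), max(0, d - window)
    return local_align(query[q_lo:q + k + window], db[d_lo:d + k + window])


score, top, bottom = gapped_extension("ACGTTGCA", "ACGTGCA", q=0, d=0, k=4)
print(score)
print(top)
print(bottom)
```

Restricting the dynamic programming to a window keeps the cost proportional to the window area rather than to the full product of query and database lengths.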
BLAST algorithm
Keyword search for all words of length w from the query (length n) in the database (length m), including approximate keywords whose match score against the original keyword is above a threshold.
Word length: w = 11 for DNA queries, w = 3 for proteins.
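The "approximate keywords" step can be sketched as enumerating every word of length w and keeping those that score at or above a threshold against the original word. The toy scoring function (+2 identity, -1 otherwise) is an assumption standing in for a real substitution matrix, and the exhaustive enumeration is only practical for short words.

```python
from itertools import product


def neighborhood(word, alphabet, score, threshold):
    """All words of len(word) whose summed per-position score vs word is >= threshold."""
    hits = []
    for cand in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, cand))
        if total >= threshold:
            hits.append("".join(cand))
    return hits


def simple_score(a, b):
    # Toy scoring (assumption): +2 for identical letters, -1 otherwise.
    return 2 if a == b else -1


# Threshold 3 admits the exact word (score 6) and every one-mismatch word (score 3).
words = neighborhood("ACG", "ACGT", simple_score, 3)
print(len(words))   # 10
```

Raising the threshold shrinks the neighbourhood toward the exact word alone, and lowering it adds more approximate keywords to look up.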
A. Sensitivity decreases with k.
B. Speed increases with k.
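The trade-off can be illustrated by counting exact word matches for several values of k. The strings are taken from the extension example earlier in these slides, with the database sequence read right to left (an assumption about the figure layout); `count_seeds` is a hypothetical helper.

```python
def count_seeds(query, db, k):
    """Number of (query_pos, db_pos) pairs whose length-k words match exactly."""
    positions = {}
    for d in range(len(db) - k + 1):
        positions.setdefault(db[d:d + k], []).append(d)
    return sum(len(positions.get(query[q:q + k], []))
               for q in range(len(query) - k + 1))


query = "ACGAAGTAAGGTCCAGT"
db = "AGCGTTAGGTCCTTCCC"      # assumption: second example sequence read right to left
for k in range(3, 8):
    print(k, count_seeds(query, db, k))
```

With these strings there are 3 seeds at k=4 and none at k=7. Fewer seeds mean fewer extensions to run (faster), but once no seed remains the alignment GTAAGGTCC / GTTAGGTCC is missed entirely (less sensitive).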
For each alignment with score S, an Expect value (E) is provided by BLAST.
The expect value E is the number of alignments with scores greater than or equal
to score S that are expected to occur by chance in a database search.
(The same alignment gets a better, i.e. lower, E-value compared to the previous slide when a smaller database (PDB) is searched.)
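The dependence on database size follows from the Karlin-Altschul estimate E = K * m * n * exp(-lambda * S), where n is the query length and m the database length. In the sketch below, the values of K and lambda, the score, and both database sizes are placeholders chosen for illustration; real values of K and lambda depend on the scoring system.

```python
import math


def expect_value(score, query_len, db_len, lam, K):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S)."""
    return K * query_len * db_len * math.exp(-lam * score)


# Placeholder parameters (assumptions, for illustration only).
LAM, K = 0.318, 0.134
S, N = 50, 300

e_large = expect_value(S, N, 10**9, LAM, K)   # hypothetical large database
e_small = expect_value(S, N, 10**6, LAM, K)   # hypothetical small database
print(e_large, e_small, e_large / e_small)
```

Because E is linear in the database length, the same score in a database 1000 times smaller yields an E-value 1000 times lower.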
Practical problem: