Unit-5 String Matching

@2008-09 Shankar Thawkar , Sr. Lect. IT dept.
String Matching
We formalize the string-matching problem as follows. We are given a text array T[1 . . n] of n characters and a pattern
array P[1 . . m] of m characters, with m ≤ n. The problem is to find an integer s, called a valid shift, where
0 ≤ s ≤ n-m and T[s+1 . . s+m] = P[1 . . m]. In other words, we want to find whether P occurs in T, i.e., whether P is a substring of T.
1) Naïve String Matching

The naïve approach simply tests all the possible placements of pattern P[1 . . m] relative to text T[1 . . n]. Specifically,
we try the shifts s = 0, 1, . . . , n-m successively, and for each shift s we compare T[s+1 . . s+m] to P[1 . . m].
NAÏVE_STRING_MATCHER (T, P)
n ← length[T]
m ← length[P]
for s ← 0 to n-m do
    if P[1 . . m] = T[s+1 . . s+m]
        then return valid shift s
Complexity: Worst case = O((n-m+1)m); if m = n/2, this is O(n^2).
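The naïve matcher above can be sketched in Python as follows. This is a minimal illustration, not the notes' own code: the function name is an assumption, shifts are 0-based as in the pseudocode, and the sketch collects every valid shift instead of returning only the first one.

```python
def naive_string_matcher(text, pattern):
    """Return every valid shift s with text[s:s+m] == pattern (0-based shifts)."""
    n, m = len(text), len(pattern)
    # Try each shift s = 0 .. n-m and compare the whole window against the pattern.
    return [s for s in range(n - m + 1) if text[s:s + m] == pattern]


if __name__ == "__main__":
    print(naive_string_matcher("abracadabra", "abra"))
```

Each slice comparison costs up to m character comparisons and there are n-m+1 shifts, which is where the O((n-m+1)m) bound comes from.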
Q. Write an algorithm for the naïve string matcher. What is its worst-case complexity? Show the
comparisons the naïve string matcher makes for the pattern P=0001 in the text
T=000010001010001.
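One way to check a hand-worked answer to this question is to count the character comparisons shift by shift. The sketch below assumes that the comparison at each shift proceeds left to right and stops at the first mismatch (the pseudocode does not specify this); the function name is hypothetical.

```python
def count_comparisons(text, pattern):
    """For each shift, record (shift, characters compared, whether it matched)."""
    n, m = len(text), len(pattern)
    per_shift = []
    for s in range(n - m + 1):
        compared = 0
        matched = True
        for j in range(m):
            compared += 1
            if text[s + j] != pattern[j]:
                matched = False
                break  # assumption: stop at the first mismatch
        per_shift.append((s, compared, matched))
    return per_shift


if __name__ == "__main__":
    for s, compared, matched in count_comparisons("000010001010001", "0001"):
        print(s, compared, "match" if matched else "")
```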

2) Rabin-Karp Algorithm
Key idea:
• Treat the pattern P[1..m] as a key and transform (hash) it into an equivalent integer p.
• Similarly, transform the substrings of the text T into integers: for s = 0, 1, …, n-m, transform T[s+1..s+m] into an
equivalent integer t_s.
• The pattern occurs at shift s if and only if p = t_s.
If we can compute p and t_s quickly, then the pattern matching problem is reduced to comparing p with n-m+1 integers.

How to compute p?
p = 2^(m-1) P[1] + 2^(m-2) P[2] + … + 2 P[m-1] + P[m]
Using Horner's rule, this takes O(m) time, assuming each arithmetic operation can be done in O(1) time.
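Horner's rule evaluates the polynomial for p with one multiplication and one addition per character. The sketch below is illustrative only: the function name and the choice to pass the characters as a list of integer digits are assumptions, and the base defaults to 2 as in the formula.

```python
def horner_value(digits, base=2):
    """Evaluate digits[0]*base^(m-1) + ... + digits[m-1] with m multiply-add steps."""
    value = 0
    for d in digits:
        # Shift the running value one position left, then add the next digit.
        value = base * value + d
    return value


if __name__ == "__main__":
    print(horner_value([1, 0, 1, 1]))      # binary digits
    print(horner_value([2, 6], base=10))   # decimal digits
```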
How it works
• Hash the pattern P into a numeric value.
• Let a string be represented by the sum of these digits (Horner's rule, § 30.1).
Example
{ A, B, C, ..., Z } → { 0, 1, 2, ... }
BAN → 1 + 0 + 13 = 14
CARD → 2 + 0 + 17 + 3 = 22

Upper limits
Problem
• For long patterns, or for large alphabets, the number representing a given string may be too large to be practical.
Solution
• Use the MOD operation.
• Let q be a prime number such that 2q can be stored in one computer word.
Example (q = 13)
BAN = 1 + 0 + 13 = 14, 14 mod 13 = 1, so BAN → 1
CARD = 2 + 0 + 17 + 3 = 22, 22 mod 13 = 9, so CARD → 9
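The BAN and CARD arithmetic can be reproduced with a few lines of Python. This is only a sketch of the slide's toy hash (sum of letter codes, then mod q), not the positional hash used by the full algorithm; the function name and the default q = 13 are assumptions taken from the example.

```python
def letter_sum_hash(word, q=13):
    """Map A..Z to 0..25, add the codes, and reduce modulo q."""
    total = sum(ord(c) - ord("A") for c in word.upper())
    return total % q


if __name__ == "__main__":
    for word in ("BAN", "CARD"):
        print(word, letter_sum_hash(word))
```

Because the sum ignores letter order, anagrams such as BAN and NAB collide, which is one reason a hash match must still be verified character by character.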

How it works
Once we use modular arithmetic, when p = t_s for some s, we can no longer be sure that P[1..m] is equal to
T[s+1..s+m].
• If the hash values match, the strings might not match; in those cases we have spurious hits.
Therefore, after the equality test p = t_s, we should compare P[1..m] with T[s+1..s+m] character by character to
ensure that we really have a match.
So the worst-case running time becomes O(nm), but the hash test avoids a lot of unnecessary string comparisons in
practice.
Algorithm : RabinKarp(T[1..n], P[1..m])
hsub = hash(P[1..m])                      i.e. p
hs = hash(T[1..m])                        i.e. t_0
for s = 0 to n - m do
    if hs = hsub then
        if T[s+1..s+m] = P then
            print “Pattern occurs with shift” s
    if s < n - m then
        hs = hash(T[s+2..s+m+1])          i.e. t_(s+1)
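A runnable sketch of the algorithm is given below. It goes one step beyond the pseudocode by updating the window hash in O(1) with the standard rolling formula t_(s+1) = (d (t_s - d^(m-1) T[s+1]) + T[s+m+1]) mod q, rather than rehashing each window. The function name, the radix d = 256 over character codes, and the modulus q = 101 are assumptions made for the illustration.

```python
def rabin_karp(text, pattern, d=256, q=101):
    """Return all valid shifts (0-based) of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)          # d^(m-1) mod q, weight of the leading character
    p = t = 0
    for i in range(m):            # Horner's rule for the pattern and first window
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Equal hashes may be a spurious hit, so verify character by character.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Drop text[s], shift left by one position, append text[s+m].
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


if __name__ == "__main__":
    print(rabin_karp("abracadabra", "abra"))
```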
Q. Write the Rabin-Karp algorithm for string matching. Working modulo q=11, how many spurious hits does the
Rabin-Karp matcher encounter in the text T=3151592653589793 when looking for the pattern P=26?
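Ans : Given q=11, T=3151592653589793 and P=26.

p = P mod q, so p = 26 mod 11 = 4. Then find t_s for each two-digit window of T: t_0 = 31 mod 11 = 9, t_1 = 15 mod 11 = 4, and so on.

s        0   1   2   3   4   5   6   7   8   9   10  11  12  13  14
window   31  15  51  15  59  92  26  65  53  35  58  89  97  79  93
t_s      9   4   7   4   4   4   4   10  9   2   3   1   9   2   5

The windows at s = 1, 3, 4 and 5 (15, 15, 59 and 92) have t_s = p but differ from P, so they are spurious hits. The window at s = 6 (26) is the match.
Spurious hits = 4 (from 3 distinct window values, since 15 occurs twice) and match = 1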
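The arithmetic in this answer is easy to check mechanically. The sketch below hashes every window directly as a decimal number mod q (no rolling update) and separates true matches from spurious hits. It counts each shift separately, so a window value that appears twice in T is counted twice. The function name is hypothetical, and the text and pattern are assumed to be strings of decimal digits.

```python
def classify_hits(text, pattern, q):
    """Return (matching shifts, spurious-hit shifts) for digit strings, working mod q."""
    m = len(pattern)
    p = int(pattern) % q
    matches, spurious = [], []
    for s in range(len(text) - m + 1):
        window = text[s:s + m]
        if int(window) % q == p:
            # Same hash value: a real match only if the digits agree.
            (matches if window == pattern else spurious).append(s)
    return matches, spurious


if __name__ == "__main__":
    print(classify_hits("3151592653589793", "26", 11))
```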

3) The KMP Algorithm
The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in a left-to-right order (like the brute force
algorithm). But it shifts the pattern more intelligently than the brute force algorithm.
If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful
comparisons?
Answer: the largest prefix of P[0 .. j-1] that is a suffix of P[1 .. j-1] (in this example P is indexed from 0).
[Figure: text T with index i aligned against pattern P; a mismatch at j = 5 gives jnew = 2]
The Prefix Function π
The KMP algorithm preprocesses the pattern P by computing a prefix function π that indicates the largest possible shift
s using previously performed comparisons. Specifically, the prefix function π(q) is defined as the length of the
longest prefix of P that is a proper suffix of P[1 . . q].
KNUTH-MORRIS-PRATT Prefix Function (P)
Input: Pattern with m characters
m = length[P]
π[1] = 0
k = 0
for q = 2 to m
    while k > 0 and P[k+1] <> P[q]
        k = π[k]
    if P[k+1] = P[q] then
        k = k+1
    π[q] = k
return π
Note that the prefix function for P, which maps q to the length of the longest prefix of P that is a proper suffix of
P[1 . . q], encodes repeated substrings inside the pattern itself.
As an example, consider the pattern P = a b b a b a. The prefix function computed by the above algorithm is
q      1  2  3  4  5  6
P[q]   a  b  b  a  b  a
π(q)   0  0  0  1  2  1
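The prefix-function pseudocode translates almost line for line into Python. The sketch below uses 0-based indexing, so pi[q-1] in the code corresponds to the value for position q in the table; the function name is an assumption.

```python
def prefix_function(pattern):
    """pi[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                  # length of the current matched prefix
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]                  # fall back to the next shorter border
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


if __name__ == "__main__":
    print(prefix_function("abbaba"))
```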
[Figure: example of pattern matching]
Analysis
The running time of Knuth-Morris-Pratt algorithm is proportional to the time needed to read the characters in text and
pattern. In other words, the worst-case running time of the algorithm is O(m+n) and it requires O(m) extra space. It is
important to note that these quantities are independent of the size of the underlying alphabet.
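The notes give pseudocode only for the prefix function, so the matching loop below is a sketch of the standard KMP matcher, not a transcription of the notes. Function names are assumptions and indexing is 0-based. The text index never moves backwards and k decreases at most as often as it increases, which is the reason for the O(m+n) bound.

```python
def prefix_function(pattern):
    """pi[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    pi = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_search(text, pattern):
    """Return all valid shifts (0-based) of pattern in text."""
    if not pattern:
        return []
    pi = prefix_function(pattern)
    shifts = []
    k = 0                                  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and pattern[k] != ch:
            k = pi[k - 1]                  # reuse earlier comparisons instead of restarting
        if pattern[k] == ch:
            k += 1
        if k == len(pattern):
            shifts.append(i - len(pattern) + 1)
            k = pi[k - 1]                  # keep going to find overlapping occurrences
    return shifts


if __name__ == "__main__":
    print(kmp_search("abbabbaba", "abbaba"))
```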
Q. Explain the Knuth-Morris-Pratt string matching algorithm. Write an algorithm to find the prefix
function. Calculate the prefix function π for the pattern a b b a b a. [Ans: shown in the prefix
function example above.]


