Rabin Karp Algorithm is used to find all occurrences of a pattern in a given text.
For example:
Input:
txt = “abdabcbabc”
pat = “abc”
Output:
3 7
Input:
Txt = “aaaaa”
pat = “aaa”
Output:
0 1 2
Input:
txt = “aabbccdd”
pat = “abc”
Output:
Not Found
The outputs of the above inputs showed the 0th based index of the beginning character of the pattern where it is found in the text. If there is no pattern matching in the text, the output displayed “Not Found”.
There is a naïve approach to this problem, which uses quadratic time in general. This algorithm also uses quadratic time in worst case. It performs much better than the naïve approach in general cases.
The basic idea of Rabin Karp algorithm is based on:
- Sliding the pattern over the text one-by-one.
- Calculating the hash value of the pattern as well as all the windows of the text.
- Matching the hash value of the pattern with the calculated hash value of the windows of the text. For each match, we will then match each character of the window. This will reduce the time complexity of the problem, specially in case of bigger patterns, as we won’t need to check for each window of the text in case of different hash value.
Let’s dig a bit deeper,
Index: 0 1 2 3 4 5 6 7 8 9
text = “a b d a b c b a b c”
pat = “a b c”
We will first compute the Hash value of the pattern.
P (hash value of pattern) = 1 + 2 + 3 = 6
(Note that we have taken the absolute value (a = 1, b = 2, c = 3, d = 4, and so on) of the alphabets for simplicity, instead of their ASCII values.)
T (hash value of current window)
Now, for each iteration, we will compute the hash value for the current window,
i = 0: T = (1 + 2 + 4) = 7
i = 1: T = (2 + 4 + 1) = 7
i = 2: T = (4 + 1 + 2) = 7
i = 3: T = (1 + 2 + 3) = 6 (Match)
i = 4: T = (2 + 3 + 2) = 7
i = 5: T = (3 + 2 + 1) = 6 (Spurious Hit)
i = 6: T = (2 + 1 + 2) = 7
i = 7: T = (1 + 2 + 3) = 6 (Match)
As we can observe, we got 3 matches out of which one is a Spurious Hit, i.e., the hash value of the window matched, but the character match failed, thus we would output, 3 and 7 as our answer.
One of the problems with the above approach is that, we are calculating the hash value of each window in O(N) time complexity. We can reduce this complexity with the help of Rolling Hash.
In Rolling Hash, we compute the Hash Value of the upcoming window with the help of previous window, by adding the value of last character of the next window to the net Hash value, and subtracting the value of first character of the previous window from the net Hash Value.
Rolling Hash Function:
Ti+1 = Ti + txt[i + m] – txt[i]
Now, we are left with the problem of Spurious Hits.
There can be any number of permutations present in the text. The hash value of all those permutations would be same, and we would need to check for the characters in that window using O(N) time complexity, and that would reduce the performance of the program in worst case.
Improved Hash:
h(“abc”) = 1 x r2 + 2 x r1 + 3 x r0 = 38
h(“dab”) = 4 x r2 + 1 x r1 + 2 x r0 = 107
Here, we are taking the value of r = 5.
Improved Rolling Hash Function:
Ti+1 = r x (Ti – txt[i] x rm-1) + txt[i + m]
m = Length of Pattern
Example:
Index: 0 1 2 3 4 5 6 7 8 9
text = “a b d a b c b a b c”
pat = “a b c”
P = 1 x 52 + 2 x 5 + 3 x 1 = 38
T0 = 1 x 52 + 2 x 51 + 4 x 50 = 39
T1 = 5 x (T0 – 1 x 52) + 1 = 71
T2 = 5 x (T1 – 2 x 52) + 2 = 107
T3 = 5 x (T2 – 4 x 52) + 3 = 38
// Rabin-Karp algorithm in C++
#include <string.h>
#include <iostream>
using namespace std;
#define d 10
void rabinKarp(char pattern[], char text[], int q) {
int m = strlen(pattern);
int n = strlen(text);
int i, j;
int p = 0;
int t = 0;
int h = 1;
for (i = 0; i < m - 1; i++)
h = (h * d) % q;
// Calculate hash value for pattern and text
for (i = 0; i < m; i++) {
p = (d * p + pattern[i]) % q;
t = (d * t + text[i]) % q;
}
// Find the match
for (i = 0; i <= n - m; i++) {
if (p == t) {
for (j = 0; j < m; j++) {
if (text[i + j] != pattern[j])
break;
}
if (j == m)
cout << "Pattern is found at position: " << i + 1 << endl;
}
if (i < n - m) {
t = (d * (t - text[i] * h) + text[i + m]) % q;
if (t < 0)
t = (t + q);
}
}
}
int main() {
char text[] = "ABCCDDAEFG";
char pattern[] = "CDD";
int q = 13;
rabinKarp(pattern, text, q);
}
// Rabin-Karp algorithm in Java
public class RabinKarp {
public final static int d = 10;
static void search(String pattern, String txt, int q) {
int m = pattern.length();
int n = txt.length();
int i, j;
int p = 0;
int t = 0;
int h = 1;
for (i = 0; i < m - 1; i++)
h = (h * d) % q;
// Calculate hash value for pattern and text
for (i = 0; i < m; i++) {
p = (d * p + pattern.charAt(i)) % q;
t = (d * t + txt.charAt(i)) % q;
}
// Find the match
for (i = 0; i <= n - m; i++) {
if (p == t) {
for (j = 0; j < m; j++) {
if (txt.charAt(i + j) != pattern.charAt(j))
break;
}
if (j == m)
System.out.println("Pattern is found at position: " + (i + 1));
}
if (i < n - m) {
t = (d * (t - txt.charAt(i) * h) + txt.charAt(i + m)) % q;
if (t < 0)
t = (t + q);
}
}
}
public static void main(String[] args) {
String txt = "ABCCDDAEFG";
String pattern = "CDD";
int q = 13;
search(pattern, txt, q);
}
}
The worst case time complexity of Rabin-Karp Algorithm is O(mxn), and average time complexity is O((n-m+1)xm).
There are many other algorithms such as KMP Algorithm which works on O(N) (linear) time complexity.
Rabin-Karp Algorithm is mainly used for searching various patterns in a single text. The hash value of the patterns is pre-computed, and then compared with each window of the text.
1 Comment
Thanks for such an informative article.