KMP Algorithm
You're reading a book searching for the word "abracadabra". You've matched 8 characters — abracada — then hit a mismatch. The naive approach throws away all that work and restarts one position later. But you already know what those 8 characters were! Why not use that knowledge?
That's Knuth-Morris-Pratt (KMP): never re-read a character you've already seen. The text pointer only moves forward, ever. By preprocessing the pattern into a failure function, KMP finds all matches in \(O(n+m)\) — where n is the text length and m is the pattern length.
Why Naive Matching Is Slow
Naive search tries every position: align pattern at position 0, scan until mismatch, shift by 1, repeat. On a mismatch after matching k characters, all k characters of work are discarded. In the worst case (pattern = "aaa...ab", text = "aaa...a"), this costs \(O(n×m)\) — catastrophic for large inputs.
The Failure Function (lps Array)
KMP preprocesses the pattern into the lps array (Longest Proper Prefix that is also a Suffix). For position i in the pattern, lps[i] tells you: "how long is the longest proper prefix of pattern[0..i] that is also a suffix?"
Example: pattern = "ABCABD"
- lps[0] = 0 ("A" — no proper prefix)
- lps[1] = 0 ("AB" — no match)
- lps[2] = 0 ("ABC" — no match)
- lps[3] = 1 ("ABCA" — prefix "A" = suffix "A")
- lps[4] = 2 ("ABCAB" — prefix "AB" = suffix "AB")
- lps[5] = 0 ("ABCABD" — no match)
The lps array is built in \(O(m)\) using a two-pointer scan on the pattern itself.
Using lps During Search
Walk through the text with pointer i and the pattern with pointer j (how many characters matched so far):
- If
text[i] == pattern[j]: both i and j advance. If j reaches m, report a match and reset j = lps[j-1]. - If mismatch and j > 0: set j = lps[j-1] without moving i. This is the key: we reuse lps[j-1] matched characters at the start of the pattern.
- If mismatch and j == 0: just advance i.
The text pointer i never goes backwards — each character is processed at most twice (once matched, once on a failure). Total search: \(O(n)\).
Why It Works
When a mismatch occurs after matching j characters, we know the last j characters of the text. The lps value lps[j-1] tells us how long a prefix of the pattern matches a suffix of those j characters — so we can continue from that overlap without re-reading anything.