String Algorithms9 min read

KMP Algorithm

When searching for a word, never go backwards — use what you already know
preprocessing:\(O(m)\)search:\(O(n)\)total:\(O(n+m)\)space:\(O(m)\)

You're reading a book searching for the word "abracadabra". You've matched 8 characters — abracada — then hit a mismatch. The naive approach throws away all that work and restarts one position later. But you already know what those 8 characters were! Why not use that knowledge?

That's Knuth-Morris-Pratt (KMP): never re-read a character you've already seen. The text pointer only moves forward, ever. By preprocessing the pattern into a failure function, KMP finds all matches in \(O(n+m)\) — where n is the text length and m is the pattern length.

Why Naive Matching Is Slow

Naive search tries every position: align pattern at position 0, scan until mismatch, shift by 1, repeat. On a mismatch after matching k characters, all k characters of work are discarded. In the worst case (pattern = "aaa...ab", text = "aaa...a"), this costs \(O(n×m)\) — catastrophic for large inputs.

The Failure Function (lps Array)

KMP preprocesses the pattern into the lps array (Longest Proper Prefix that is also a Suffix). For position i in the pattern, lps[i] tells you: "how long is the longest proper prefix of pattern[0..i] that is also a suffix?"

Example: pattern = "ABCABD"

  • lps[0] = 0 ("A" — no proper prefix)
  • lps[1] = 0 ("AB" — no match)
  • lps[2] = 0 ("ABC" — no match)
  • lps[3] = 1 ("ABCA" — prefix "A" = suffix "A")
  • lps[4] = 2 ("ABCAB" — prefix "AB" = suffix "AB")
  • lps[5] = 0 ("ABCABD" — no match)

The lps array is built in \(O(m)\) using a two-pointer scan on the pattern itself.

Using lps During Search

Walk through the text with pointer i and the pattern with pointer j (how many characters matched so far):

  • If text[i] == pattern[j]: both i and j advance. If j reaches m, report a match and reset j = lps[j-1].
  • If mismatch and j > 0: set j = lps[j-1] without moving i. This is the key: we reuse lps[j-1] matched characters at the start of the pattern.
  • If mismatch and j == 0: just advance i.

The text pointer i never goes backwards — each character is processed at most twice (once matched, once on a failure). Total search: \(O(n)\).

Why It Works

When a mismatch occurs after matching j characters, we know the last j characters of the text. The lps value lps[j-1] tells us how long a prefix of the pattern matches a suffix of those j characters — so we can continue from that overlap without re-reading anything.

Note: The failure function is the entire algorithm. Once you have it, the search is almost trivially simple. All the cleverness lives in lps: it encodes exactly how much you can reuse after a mismatch.

KMP — Build lps, Then Search

def build_lps(pattern):
m = len(pattern)
lps = [0] * m
k = 0 # length of current matching prefix
for i in range(1, m):
while k > 0 and pattern[k] != pattern[i]:
k = lps[k-1] # fall back
if pattern[k] == pattern[i]:
k += 1
lps[i] = k
return lps
def kmp_search(text, pattern):
n, m = len(text), len(pattern)
if m == 0: return []
lps = build_lps(pattern)
matches = []
j = 0 # chars matched in pattern
for i in range(n):
while j > 0 and text[i] != pattern[j]:
j = lps[j-1]
if text[i] == pattern[j]:
j += 1
if j == m:
matches.append(i - m + 1) # match starts here
j = lps[j-1]
return matches
text = "AABAACAADAABAABA"
pattern = "AABA"
print("lps:", build_lps(pattern))
print("Matches at:", kmp_search(text, pattern))
Output
lps: [0, 1, 0, 1]
Matches at: [0, 9, 12]

Complexity

🔨 Build lps (failure function)Two-pointer scan of the pattern; the pointer k can only increase n times total, so total steps ≤ 2m
Linear in pattern length\(O(m)\)
🔍 KMP search phaseText pointer i only advances; j's total decrease is bounded by its total increase (at most n)
Linear in text length\(O(n)\)
⚡ Total (build + search)Far better than naive \(O(n·m)\); especially powerful on large texts with repetitive patterns
Linear in text + pattern\(O(n+m)\)
💾 SpaceOnly the lps array of size m; no extra space proportional to the text size
Linear in pattern only\(O(m)\)

Quick check

What does lps[i] store for pattern position i?

Continue reading

Rabin-Karp
Find a word in a book by its fingerprint — check character-by-character only when fingerprints match
Z-Algorithm
Z-array: match length at every position — linear pattern matching
Aho-Corasick
Find all patterns at once — one pass, \(O(n + matches)\)