String Algorithms8 min read

Boyer-Moore

Read the pattern backwards, skip huge chunks — sublinear in practice
best case:Blazing · \(O(n/m)\)worst case:Slow · \(O(nm)\)preprocessing:\(O(m + σ)\)

Imagine you're proofreading a document for the word "CHAMPION". Instead of reading every letter left to right, you glance at the last letter of each 8-letter window. If you see a 'Q' there, you know immediately that none of the 8 letters can be 'N' through 'A' in the right positions — skip the whole window! That's the core insight behind Boyer-Moore.

By scanning the pattern right-to-left, a single mismatch at the end of the pattern can skip the entire window. On real English text, Boyer-Moore often reads only 1 in every m characters — sublinear performance.

The Bad Character Rule

When a mismatch occurs at position j in the pattern with text character c, the bad character rule asks: does c appear anywhere in the pattern to the left of position j?

  • If no: shift the pattern completely past c — a skip of up to m positions!
  • If yes: slide the pattern right until the rightmost occurrence of c in the pattern aligns with the text's c

Preprocessing: build a table bc[c] = last position of character c in the pattern (or -1 if absent). This takes \(O(m + σ)\) where σ is the alphabet size.

The Good Suffix Rule

When some suffix of the pattern matched before the mismatch, the good suffix rule looks for the rightmost copy of that matched suffix elsewhere in the pattern — preceded by a different character (so we won't repeat the same mismatch). If no such copy exists, we look for the longest prefix of the pattern that is also a suffix of the matched portion.

Boyer-Moore always takes the maximum shift from the two rules. This is why it can be so fast — either rule alone already skips a lot, and taking the max skips even more.

Boyer-Moore-Horspool: The Simpler Cousin

Horspool's simplification: use only the bad character rule, but always apply it to the rightmost character of the current window (not the actual mismatch position). This is slightly less optimal but dramatically simpler to implement — just one precomputed shift table — and nearly as fast in practice for most text.

Note: Boyer-Moore is like a speed reader who glances at the last word of each line first. If the last word is wrong, skip the whole line without reading it. On rich alphabets like English, most mismatches happen on characters that don't appear in the pattern at all — instant full-window skips.

Boyer-Moore-Horspool (Practical Variant)

def horspool(text, pattern):
n, m = len(text), len(pattern)
if m > n:
return []
# Build bad-character shift table
shift = {c: m for c in set(text)}
for i in range(m - 1):
shift[pattern[i]] = m - 1 - i
results = []
i = m - 1 # index of rightmost char in window
while i < n:
j = m - 1
k = i
while j >= 0 and text[k] == pattern[j]:
j -= 1
k -= 1
if j == -1:
results.append(k + 1)
i += shift.get(text[i], m)
return results
print(horspool("HERE IS A SIMPLE EXAMPLE", "EXAMPLE")) # [17]
print(horspool("abcabcabc", "abc")) # [0, 3, 6]
Output
[17]
[0, 3, 6]

When Is Boyer-Moore the Right Choice?

  • Large alphabets (English text, source code): bad character skips are big — most characters in the text simply won't appear in the pattern
  • Long patterns: skipping m positions at a time means fewer windows to check
  • Real-world text search: grep, most text editors, and search engines use Boyer-Moore variants

For small alphabets (like DNA with only A/C/G/T), the bad character table gives smaller skips and KMP may be competitive. For multiple patterns, use Aho-Corasick instead.

Complexity

🚀 Best case (large alphabet)Each mismatch at pattern end can skip m characters — typical on English text
Sublinear — skips most of text\(O(n/m)\)
💀 Worst case (adversarial input)Happens with repeated characters and a pattern like 'aaab' — bad char rule gives tiny shifts
Quadratic — pathological\(O(nm)\)
🔧 Preprocessingσ = alphabet size (256 for ASCII); bad char table \(O(σ)\), good suffix table \(O(m)\)
Linear in pattern + alphabet\(O(m+σ)\)
💾 SpaceHorspool only needs the bad char table \(O(σ)\) — very lightweight
Pattern + alphabet tables\(O(m+σ)\)

Quick check

Boyer-Moore scans the pattern right-to-left. Why does this enable sublinear search?

Continue reading

KMP Algorithm
When searching for a word, never go backwards — use what you already know
Z-Algorithm
Z-array: match length at every position — linear pattern matching
Rabin-Karp
Find a word in a book by its fingerprint — check character-by-character only when fingerprints match