Boyer-Moore
Imagine you're proofreading a document for the word "CHAMPION". Instead of reading every letter left to right, you glance at the last letter of each 8-letter window. If you see a 'Q' there, you know immediately that none of the 8 letters can be 'N' through 'A' in the right positions — skip the whole window! That's the core insight behind Boyer-Moore.
By scanning the pattern right-to-left, a single mismatch at the end of the pattern can skip the entire window. On real English text, Boyer-Moore often reads only 1 in every m characters — sublinear performance.
The Bad Character Rule
When a mismatch occurs at position j in the pattern with text character c, the bad character rule asks: does c appear anywhere in the pattern to the left of position j?
- If no: shift the pattern completely past
c— a skip of up to m positions! - If yes: slide the pattern right until the rightmost occurrence of
cin the pattern aligns with the text'sc
Preprocessing: build a table bc[c] = last position of character c in the pattern (or -1 if absent). This takes \(O(m + σ)\) where σ is the alphabet size.
The Good Suffix Rule
When some suffix of the pattern matched before the mismatch, the good suffix rule looks for the rightmost copy of that matched suffix elsewhere in the pattern — preceded by a different character (so we won't repeat the same mismatch). If no such copy exists, we look for the longest prefix of the pattern that is also a suffix of the matched portion.
Boyer-Moore always takes the maximum shift from the two rules. This is why it can be so fast — either rule alone already skips a lot, and taking the max skips even more.
Boyer-Moore-Horspool: The Simpler Cousin
Horspool's simplification: use only the bad character rule, but always apply it to the rightmost character of the current window (not the actual mismatch position). This is slightly less optimal but dramatically simpler to implement — just one precomputed shift table — and nearly as fast in practice for most text.
Boyer-Moore-Horspool (Practical Variant)
When Is Boyer-Moore the Right Choice?
- Large alphabets (English text, source code): bad character skips are big — most characters in the text simply won't appear in the pattern
- Long patterns: skipping m positions at a time means fewer windows to check
- Real-world text search: grep, most text editors, and search engines use Boyer-Moore variants
For small alphabets (like DNA with only A/C/G/T), the bad character table gives smaller skips and KMP may be competitive. For multiple patterns, use Aho-Corasick instead.