Aho-Corasick
Imagine you want to find 1,000 banned words in a 10 MB document. The obvious approach — run a separate search for each word — would read the document 1,000 times. That's 10 GB of reading for a 10 MB file.
Aho-Corasick does it in one pass. Build a smart lookup structure from all your words, then walk through the document exactly once. At every character, the structure instantly tells you which words (if any) end here. Done in time proportional to the text length plus the total match count.
Step 1: Build the Trie
A trie is a tree where each path from root to a node spells out a prefix. Insert all your patterns into the trie. Nodes at the end of a complete word are marked as output nodes. After inserting k patterns of total length M, the trie has at most M nodes.
So far, the trie lets you match one pattern at a time. The magic comes next.
Step 2: Add Failure Links (the KMP trick, on a trie)
When you're walking the trie and there's no edge for the next character, you need to fall back to a shorter match — just like KMP's failure function. The failure link of a node v points to the trie node representing the longest proper suffix of v's string that is also a prefix of some pattern.
Compute failure links via BFS (level by level): the root fails to itself. For a child c of node v on character a — if v's failure node also has an edge on a, c's failure link goes there; otherwise follow v's failure link's failure link, and so on. This is the exact KMP idea, applied to the entire trie.
Step 3: Add Output Links (catch suffix-pattern matches)
Sometimes a shorter pattern is a suffix of a longer one. When you reach a node representing the end of a long word, a shorter word might also be ending at the same position. Output links chase the suffix-link chain to find all such endings. Each node stores a shortcut to the nearest ancestor (via suffix links) that is itself an output node.
Step 4: Search in One Pass
Walk through the text character by character, maintaining a current node in the automaton. At each character: follow the trie edge if it exists, otherwise follow failure links until you can. Then output every pattern that ends at the current node (via output links). Each character triggers \(O(1)\) amortized transitions — same argument as KMP. Total search time: \(O(n + total matches)\).
Aho-Corasick Multi-Pattern Search
Where Is Aho-Corasick Used?
- Network security: Snort/Suricata match thousands of attack signatures against live traffic
- Spam and content filtering: detect prohibited keywords across messages
- Bioinformatics: search for multiple gene sequences in genomes simultaneously
- Search engines: highlight multiple query terms in a document in one pass
If you only need to search for a single pattern, KMP or Boyer-Moore are simpler. But the moment you have multiple patterns and a single large text, Aho-Corasick is the right tool.