String Algorithms9 min read

Aho-Corasick

Find all patterns at once — one pass, \(O(n + matches)\)
build trie:\(O(Σ|Pi|)\)search text:\(O(n + total matches)\)space:\(O(σ · Σ|Pi|)\)

Imagine you want to find 1,000 banned words in a 10 MB document. The obvious approach — run a separate search for each word — would read the document 1,000 times. That's 10 GB of reading for a 10 MB file.

Aho-Corasick does it in one pass. Build a smart lookup structure from all your words, then walk through the document exactly once. At every character, the structure instantly tells you which words (if any) end here. Done in time proportional to the text length plus the total match count.

Step 1: Build the Trie

A trie is a tree where each path from root to a node spells out a prefix. Insert all your patterns into the trie. Nodes at the end of a complete word are marked as output nodes. After inserting k patterns of total length M, the trie has at most M nodes.

So far, the trie lets you match one pattern at a time. The magic comes next.

Step 2: Add Failure Links (the KMP trick, on a trie)

When you're walking the trie and there's no edge for the next character, you need to fall back to a shorter match — just like KMP's failure function. The failure link of a node v points to the trie node representing the longest proper suffix of v's string that is also a prefix of some pattern.

Compute failure links via BFS (level by level): the root fails to itself. For a child c of node v on character a — if v's failure node also has an edge on a, c's failure link goes there; otherwise follow v's failure link's failure link, and so on. This is the exact KMP idea, applied to the entire trie.

Step 3: Add Output Links (catch suffix-pattern matches)

Sometimes a shorter pattern is a suffix of a longer one. When you reach a node representing the end of a long word, a shorter word might also be ending at the same position. Output links chase the suffix-link chain to find all such endings. Each node stores a shortcut to the nearest ancestor (via suffix links) that is itself an output node.

Step 4: Search in One Pass

Walk through the text character by character, maintaining a current node in the automaton. At each character: follow the trie edge if it exists, otherwise follow failure links until you can. Then output every pattern that ends at the current node (via output links). Each character triggers \(O(1)\) amortized transitions — same argument as KMP. Total search time: \(O(n + total matches)\).

Note: Aho-Corasick powers real-world security tools. Snort and Suricata (network intrusion detection) use it to match thousands of attack signatures against live network traffic — millions of packets per second — in a single pass.

Aho-Corasick Multi-Pattern Search

from collections import deque
class AhoCorasick:
def __init__(self):
self.goto = [{}] # trie nodes
self.fail = [0]
self.output = [[]]
def add_pattern(self, pattern, idx):
node = 0
for ch in pattern:
if ch not in self.goto[node]:
self.goto.append({})
self.fail.append(0)
self.output.append([])
self.goto[node][ch] = len(self.goto) - 1
node = self.goto[node][ch]
self.output[node].append(idx)
def build(self):
q = deque()
for ch, child in self.goto[0].items():
self.fail[child] = 0
q.append(child)
while q:
node = q.popleft()
for ch, child in self.goto[node].items():
f = self.fail[node]
while f and ch not in self.goto[f]:
f = self.fail[f]
self.fail[child] = self.goto[f].get(ch, 0)
if self.fail[child] == child:
self.fail[child] = 0
self.output[child] += self.output[self.fail[child]]
q.append(child)
def search(self, text):
node = 0
results = []
for i, ch in enumerate(text):
while node and ch not in self.goto[node]:
node = self.fail[node]
node = self.goto[node].get(ch, 0)
for pat_idx in self.output[node]:
results.append((i, pat_idx))
return results
ac = AhoCorasick()
patterns = ["he", "she", "his", "hers"]
for i, p in enumerate(patterns):
ac.add_pattern(p, i)
ac.build()
matches = ac.search("ushers")
for pos, idx in matches:
p = patterns[idx]
print(f"'{p}' ends at position {pos}")
Output
'she' ends at position 4
'he' ends at position 4
'hers' ends at position 5
'he' ends at position 5

Where Is Aho-Corasick Used?

  • Network security: Snort/Suricata match thousands of attack signatures against live traffic
  • Spam and content filtering: detect prohibited keywords across messages
  • Bioinformatics: search for multiple gene sequences in genomes simultaneously
  • Search engines: highlight multiple query terms in a document in one pass

If you only need to search for a single pattern, KMP or Boyer-Moore are simpler. But the moment you have multiple patterns and a single large text, Aho-Corasick is the right tool.

Complexity

🏗️ Trie constructionInsert each pattern character by character — total nodes ≤ total characters across all patterns
Linear in total pattern length\(O(Σ|Pi|)\)
🔗 Failure link BFSBFS visits each trie node exactly once; failure link computation is \(O(1)\) per node
Linear in total pattern length\(O(Σ|Pi|)\)
🔍 Search phaseEach character causes \(O(1)\) amortized automaton transitions — text is read exactly once
Linear in text + matches\(O(n + total matches)\)
💾 SpaceEach trie node stores an edge per alphabet character; use hash maps to reduce to \(O(Σ|Pi|)\)
Alphabet × trie size\(O(σ · Σ|Pi|)\)

Quick check

You need to find 500 keywords in a 1 GB log file. Why is Aho-Corasick better than running KMP 500 times?

Continue reading

KMP Algorithm
When searching for a word, never go backwards — use what you already know
Suffix Array Construction
All suffixes sorted — \(O(m \log n)\) substring search
Rabin-Karp
Find a word in a book by its fingerprint — check character-by-character only when fingerprints match