Aho-Corasick

Imagine you want to find 1,000 banned words in a 10 MB document. The obvious approach — run a separate search for each word — would read the document 1,000 times. That's 10 GB of reading for a 10 MB file.

Aho-Corasick does it in one pass. Build a smart lookup structure from all your words, then walk through the document exactly once. At every character, the structure instantly tells you which words (if any) end here. Done in time proportional to the text length plus the total match count.

Step 1: Build the Trie

A trie is a tree where each path from root to a node spells out a prefix. Insert all your patterns into the trie. Nodes at the end of a complete word are marked as output nodes. After inserting k patterns of total length M, the trie has at most M nodes.

So far, the trie lets you match one pattern at a time. The magic comes next.

Step 2: Add Failure Links (the KMP trick, on a trie)

When you're walking the trie and there's no edge for the next character, you need to fall back to a shorter match — just like KMP's failure function. The failure link of a node v points to the trie node representing the longest proper suffix of v's string that is also a prefix of some pattern.

Compute failure links via BFS (level by level): the root fails to itself. For a child c of node v on character a — if v's failure node also has an edge on a, c's failure link goes there; otherwise follow v's failure link's failure link, and so on. This is the exact KMP idea, applied to the entire trie.

Step 3: Add Output Links (catch suffix-pattern matches)

Sometimes a shorter pattern is a suffix of a longer one. When you reach a node representing the end of a long word, a shorter word might also be ending at the same position. Output links chase the suffix-link chain to find all such endings. Each node stores a shortcut to the nearest ancestor (via suffix links) that is itself an output node.

Step 4: Search in One Pass

Walk through the text character by character, maintaining a current node in the automaton. At each character: follow the trie edge if it exists, otherwise follow failure links until you can. Then output every pattern that ends at the current node (via output links). Each character triggers \(O(1)\) amortized transitions — same argument as KMP. Total search time: \(O(n + total matches)\).

Aho-Corasick Multi-Pattern Search

from collections import deque

class AhoCorasick:
    def __init__(self):
        self.goto = [{}]  # trie nodes
        self.fail = [0]
        self.output = [[]]

    def add_pattern(self, pattern, idx):
        node = 0
        for ch in pattern:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.output.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.output[node].append(idx)

    def build(self):
        q = deque()
        for ch, child in self.goto[0].items():
            self.fail[child] = 0
            q.append(child)
        while q:
            node = q.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                if self.fail[child] == child:
                    self.fail[child] = 0
                self.output[child] += self.output[self.fail[child]]
                q.append(child)

    def search(self, text):
        node = 0
        results = []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for pat_idx in self.output[node]:
                results.append((i, pat_idx))
        return results

ac = AhoCorasick()
patterns = ["he", "she", "his", "hers"]
for i, p in enumerate(patterns):
    ac.add_pattern(p, i)
ac.build()
matches = ac.search("ushers")
for pos, idx in matches:
    p = patterns[idx]
    print(f"'{p}' ends at position {pos}")

Output

'she' ends at position 4
'he' ends at position 4
'hers' ends at position 5
'he' ends at position 5

Where Is Aho-Corasick Used?

Network security: Snort/Suricata match thousands of attack signatures against live traffic
Spam and content filtering: detect prohibited keywords across messages
Bioinformatics: search for multiple gene sequences in genomes simultaneously
Search engines: highlight multiple query terms in a document in one pass

If you only need to search for a single pattern, KMP or Boyer-Moore are simpler. But the moment you have multiple patterns and a single large text, Aho-Corasick is the right tool.

Complexity

🏗️ Trie construction

Insert each pattern character by character — total nodes ≤ total characters across all patterns

Linear in total pattern length \(O(Σ|Pi|)\)

🔗 Failure link BFS

BFS visits each trie node exactly once; failure link computation is \(O(1)\) per node

Linear in total pattern length \(O(Σ|Pi|)\)

🔍 Search phase

Each character causes \(O(1)\) amortized automaton transitions — text is read exactly once

Linear in text + matches \(O(n + total matches)\)

💾 Space

Each trie node stores an edge per alphabet character; use hash maps to reduce to \(O(Σ|Pi|)\)

Alphabet × trie size \(O(σ · Σ|Pi|)\)