Suffix Array Construction
Take the word "banana". Now write out every possible suffix — every way to chop the beginning off:
- "banana"
- "anana"
- "nana"
- "ana"
- "na"
- "a"
Now sort them alphabetically:
- "a" (starts at index 5)
- "ana" (starts at index 3)
- "anana" (starts at index 1)
- "banana" (starts at index 0)
- "na" (starts at index 4)
- "nana" (starts at index 2)
The suffix array is just the list of starting indices in sorted order: [5, 3, 1, 0, 4, 2]. That's it! But this simple sorted list unlocks incredibly powerful operations.
Why Is Sorting Suffixes Useful?
Because sorted things are binary searchable! If you want to find every occurrence of pattern "ana" in "banana", you binary search the sorted suffix list for entries that start with "ana". You find them in \(O(m \log n)\) — no full scan of the text required.
Compare this to naively scanning the text character by character: \(O(nm)\). The suffix array turns text search into a sorted-lookup problem.
Building It Efficiently: Prefix Doubling
Naively sorting n suffixes takes \(O(n^2 \log n)\) — comparing two suffixes can take \(O(n)\) each, and you do \(O(n \log n)\) comparisons. For a 1 MB file, that's way too slow.
The clever \(O(n \log n)\) approach is prefix doubling:
- Round 0: Rank each suffix by just its first character.
- Round 1: Rank each suffix by its first 2 characters, using the round-0 ranks as keys.
- Round 2: Rank by first 4 characters using round-1 ranks. Each round doubles the comparison window.
After log n rounds, the window covers the entire suffix and you have the final sorted order. Each round uses radix sort (\(O(n)\)), so the total is \(O(n \log n)\).
The LCP Array: Bonus Superpower
Once you have the suffix array, you can compute the LCP array (Longest Common Prefix). LCP[i] tells you how many characters suffix SA[i] and suffix SA[i-1] share at the start — how long they overlap.
The LCP array is like reading the suffix array with X-ray vision. It reveals the longest repeated substring in \(O(n)\), the number of distinct substrings, and powers efficient range queries. Kasai's algorithm computes the entire LCP array in \(O(n)\).
Suffix Array + Binary Search
What You Can Do With a Suffix Array
- Pattern search: binary search for all occurrences of a pattern in \(O(m \log n)\)
- Longest repeated substring: the maximum value in the LCP array gives the length — just scan LCP once
- Count distinct substrings: n(n+1)/2 minus the sum of all LCP values
- Burrows-Wheeler Transform: the basis of compressed text indexes (used in DNA assemblers and file compressors like bzip2)
The suffix array is one of the most versatile string data structures in existence — and unlike the conceptually equivalent suffix tree, it's just an integer array you can build in a few lines of code.