Suffix Array & LCP Array
Take the string "banana". It has 6 suffixes: "banana", "anana", "nana", "ana", "na", "a". Sort them alphabetically and record their starting positions. That sorted list of starting positions is the suffix array.
For "banana":
- 5 →
"a" - 3 →
"ana" - 1 →
"anana" - 0 →
"banana" - 4 →
"na" - 2 →
"nana"
The suffix array is just the array of starting indices: [5, 3, 1, 0, 4, 2]. The actual suffixes are never stored — you reconstruct them on the fly from the original string. This keeps space at \(O(n)\).
← → arrow keys to step · click elements to interact
Building with prefix doubling — \(O(n \log n)\)
The naive approach sorts all n suffixes using string comparison — \(O(n \log n)\) comparisons each taking up to \(O(n)\) characters = \(O(n^2 \log n)\). Too slow for large strings.
Prefix doubling (Manber & Myers, 1990) builds the suffix array in \(O(n \log n)\). The idea:
- Start by ranking each suffix by its first 1 character.
- Double the comparison length each round: rank by 2, then 4, then 8 characters.
- After \(\log_2(n)\) rounds, all suffixes are distinguished and the array is complete.
Each round uses radix sort on the current ranks, making each round \(O(n)\). Total: \(O(n \log n)\) rounds × \(O(n)\) = \(O(n \log n)\).
LCP Array — Longest Common Prefix
The LCP array is built alongside the suffix array. lcp[i] = the length of the longest common prefix between the suffix at position i and the suffix at position i-1 in the sorted order.
For "banana": LCP = [-, 1, 3, 0, 0, 2] (positions 0 is undefined; "a" and "ana" share 1 char; "ana" and "anana" share 3 chars, etc.)
Using suffix array + LCP for queries
- Pattern search: binary search on the suffix array to find where the pattern would appear. \(O(m \log n)\) where m is the pattern length.
- Longest repeated substring: the maximum value in the LCP array — that prefix length appears at two adjacent sorted suffixes, meaning it repeats in the string.
- Number of distinct substrings: n(n+1)/2 minus the sum of the LCP array.