Advanced / Specialized7 min read

Suffix Array & LCP Array

Sort every suffix of a string — unlock powerful substring queries
build (naive): \(O(n^2 \log n)\)build (prefix doubling): \(O(n \log n)\)pattern search: \(O(m \log n)\)space: \(O(n)\)

Take the string "banana". It has 6 suffixes: "banana", "anana", "nana", "ana", "na", "a". Sort them alphabetically and record their starting positions. That sorted list of starting positions is the suffix array.

For "banana":

  • 5 → "a"
  • 3 → "ana"
  • 1 → "anana"
  • 0 → "banana"
  • 4 → "na"
  • 2 → "nana"

The suffix array is just the array of starting indices: [5, 3, 1, 0, 4, 2]. The actual suffixes are never stored — you reconstruct them on the fly from the original string. This keeps space at \(O(n)\).

Explore suffix array

← → arrow keys to step · click elements to interact

Building with prefix doubling — \(O(n \log n)\)

The naive approach sorts all n suffixes using string comparison — \(O(n \log n)\) comparisons each taking up to \(O(n)\) characters = \(O(n^2 \log n)\). Too slow for large strings.

Prefix doubling (Manber & Myers, 1990) builds the suffix array in \(O(n \log n)\). The idea:

  1. Start by ranking each suffix by its first 1 character.
  2. Double the comparison length each round: rank by 2, then 4, then 8 characters.
  3. After \(\log_2(n)\) rounds, all suffixes are distinguished and the array is complete.

Each round uses radix sort on the current ranks, making each round \(O(n)\). Total: \(O(n \log n)\) rounds × \(O(n)\) = \(O(n \log n)\).

LCP Array — Longest Common Prefix

The LCP array is built alongside the suffix array. lcp[i] = the length of the longest common prefix between the suffix at position i and the suffix at position i-1 in the sorted order.

For "banana": LCP = [-, 1, 3, 0, 0, 2] (positions 0 is undefined; "a" and "ana" share 1 char; "ana" and "anana" share 3 chars, etc.)

Using suffix array + LCP for queries

  • Pattern search: binary search on the suffix array to find where the pattern would appear. \(O(m \log n)\) where m is the pattern length.
  • Longest repeated substring: the maximum value in the LCP array — that prefix length appears at two adjacent sorted suffixes, meaning it repeats in the string.
  • Number of distinct substrings: n(n+1)/2 minus the sum of the LCP array.
Note: Suffix arrays use less memory than suffix trees (another structure for the same purpose, related to tries) while supporting the same queries. A suffix tree for 'banana' stores nodes and edges explicitly; the suffix array stores just n integers.

Complexity

Build (prefix doubling)log n doubling rounds, each \(O(n)\) with radix sort
Fast\(O(n \log n)\)
Pattern searchBinary search on suffix array; m = pattern length
Fast\(O(m \log n)\)
Longest repeated substringMax value in the LCP array
Linear (after build)\(O(n)\)
SpaceStore only starting positions, not the suffixes themselves
Linear\(O(n)\)

Build Suffix Array (Naive \(O(n^2 \log n)\))

def build_suffix_array(s):
n = len(s)
suffixes = sorted(range(n), key=lambda i: s[i:])
return suffixes
sa = build_suffix_array('banana')
print(sa) # [5, 3, 1, 0, 4, 2]
# Sorted suffixes:
for i in sa:
print(i, '->', 'banana'[i:])
Output
[5, 3, 1, 0, 4, 2]
5 -> a
3 -> ana
1 -> anana
0 -> banana
4 -> na
2 -> nana

Pattern Search Using Suffix Array

def build_suffix_array(s):
return sorted(range(len(s)), key=lambda i: s[i:])
def search(s, sa, pattern):
"""Binary search on suffix array — O(m log n)."""
lo, hi = 0, len(sa)
m = len(pattern)
while lo < hi:
mid = (lo + hi) // 2
if s[sa[mid]:sa[mid]+m] < pattern:
lo = mid + 1
else:
hi = mid
# Check if found
if lo < len(sa) and s[sa[lo]:sa[lo]+m] == pattern:
return sa[lo] # starting index
return -1
s = 'banana'
sa = build_suffix_array(s)
print(search(s, sa, 'ana')) # 3 (or 1, first occurrence)
print(search(s, sa, 'xyz')) # -1
Output
3
-1
Challenge

Quick check

The suffix array for 'banana' is [5,3,1,0,4,2]. What does the value 3 at position 1 mean?

Continue reading

Trie (Prefix Tree)
Store strings by their prefixes — autocomplete in \(O(L)\) time
Binary Trees
A family tree where every parent has at most two children
Sparse Table
Answer range minimum queries in \(O(1)\) after \(O(n \log n)\) setup