Advanced / Specialized7 min read

Suffix Array & LCP Array

Sort every suffix of a string β€” unlock powerful substring queries
build (naive): \(O(n^2 \log n)\)build (prefix doubling): \(O(n \log n)\)pattern search: \(O(m \log n)\)space: \(O(n)\)

Take the string "banana". It has 6 suffixes: "banana", "anana", "nana", "ana", "na", "a". Sort them alphabetically and record their starting positions. That sorted list of starting positions is the suffix array.

For "banana":

  • 5 β†’ "a"
  • 3 β†’ "ana"
  • 1 β†’ "anana"
  • 0 β†’ "banana"
  • 4 β†’ "na"
  • 2 β†’ "nana"

The suffix array is just the array of starting indices: [5, 3, 1, 0, 4, 2]. The actual suffixes are never stored β€” you reconstruct them on the fly from the original string. This keeps space at \(O(n)\).

β—ˆ Explore suffix array

← β†’ arrow keys to step Β· click elements to interact

Building with prefix doubling β€” \(O(n \log n)\)

The naive approach sorts all n suffixes using string comparison β€” \(O(n \log n)\) comparisons each taking up to \(O(n)\) characters = \(O(n^2 \log n)\). Too slow for large strings.

Prefix doubling (Manber & Myers, 1990) builds the suffix array in \(O(n \log n)\). The idea:

  1. Start by ranking each suffix by its first 1 character.
  2. Double the comparison length each round: rank by 2, then 4, then 8 characters.
  3. After \(\log_2(n)\) rounds, all suffixes are distinguished and the array is complete.

Each round uses radix sort on the current ranks, making each round \(O(n)\). Total: \(O(n \log n)\) rounds Γ— \(O(n)\) = \(O(n \log n)\).

LCP Array β€” Longest Common Prefix

The LCP array is built alongside the suffix array. lcp[i] = the length of the longest common prefix between the suffix at position i and the suffix at position i-1 in the sorted order.

For "banana": LCP = [-, 1, 3, 0, 0, 2] (positions 0 is undefined; "a" and "ana" share 1 char; "ana" and "anana" share 3 chars, etc.)

Using suffix array + LCP for queries

  • Pattern search: binary search on the suffix array to find where the pattern would appear. \(O(m \log n)\) where m is the pattern length.
  • Longest repeated substring: the maximum value in the LCP array β€” that prefix length appears at two adjacent sorted suffixes, meaning it repeats in the string.
  • Number of distinct substrings: n(n+1)/2 minus the sum of the LCP array.
Note: Suffix arrays use less memory than suffix trees (another structure for the same purpose, related to tries) while supporting the same queries. A suffix tree for 'banana' stores nodes and edges explicitly; the suffix array stores just n integers.

Complexity

Build (prefix doubling)
log n doubling rounds, each \(O(n)\) with radix sort
Fast \(O(n \log n)\)
Pattern search
Binary search on suffix array; m = pattern length
Fast \(O(m \log n)\)
Longest repeated substring
Max value in the LCP array
Linear (after build) \(O(n)\)
Space
Store only starting positions, not the suffixes themselves
Linear \(O(n)\)

Build Suffix Array (Naive \(O(n^2 \log n)\))

def build_suffix_array(s):
n = len(s)
suffixes = sorted(range(n), key=lambda i: s[i:])
return suffixes
sa = build_suffix_array('banana')
print(sa) # [5, 3, 1, 0, 4, 2]
# Sorted suffixes:
for i in sa:
print(i, '->', 'banana'[i:])
Output
[5, 3, 1, 0, 4, 2]
5 -> a
3 -> ana
1 -> anana
0 -> banana
4 -> na
2 -> nana

Pattern Search Using Suffix Array

def build_suffix_array(s):
return sorted(range(len(s)), key=lambda i: s[i:])
def search(s, sa, pattern):
"""Binary search on suffix array β€” O(m log n)."""
lo, hi = 0, len(sa)
m = len(pattern)
while lo < hi:
mid = (lo + hi) // 2
if s[sa[mid]:sa[mid]+m] < pattern:
lo = mid + 1
else:
hi = mid
# Check if found
if lo < len(sa) and s[sa[lo]:sa[lo]+m] == pattern:
return sa[lo] # starting index
return -1
s = 'banana'
sa = build_suffix_array(s)
print(search(s, sa, 'ana')) # 3 (or 1, first occurrence)
print(search(s, sa, 'xyz')) # -1
Output
3
-1
Challenge

Quick check

The suffix array for 'banana' is [5,3,1,0,4,2]. What does the value 3 at position 1 mean?

Continue reading