Suffix Array & LCP Array

Take the string "banana". It has 6 suffixes: "banana", "anana", "nana", "ana", "na", "a". Sort them alphabetically and record their starting positions. That sorted list of starting positions is the suffix array.

For "banana":

5 → "a"
3 → "ana"
1 → "anana"
0 → "banana"
4 → "na"
2 → "nana"

The suffix array is just the array of starting indices: [5, 3, 1, 0, 4, 2]. The actual suffixes are never stored — you reconstruct them on the fly from the original string. This keeps space at \(O(n)\).

Building with prefix doubling — \(O(n \log n)\)

The naive approach sorts all n suffixes using string comparison — \(O(n \log n)\) comparisons each taking up to \(O(n)\) characters = \(O(n^2 \log n)\). Too slow for large strings.

Prefix doubling (Manber & Myers, 1990) builds the suffix array in \(O(n \log n)\). The idea:

Start by ranking each suffix by its first 1 character.
Double the comparison length each round: rank by 2, then 4, then 8 characters.
After \(\log_2(n)\) rounds, all suffixes are distinguished and the array is complete.

Each round uses radix sort on the current ranks, making each round \(O(n)\). Total: \(O(n \log n)\) rounds × \(O(n)\) = \(O(n \log n)\).

LCP Array — Longest Common Prefix

The LCP array is built alongside the suffix array. lcp[i] = the length of the longest common prefix between the suffix at position i and the suffix at position i-1 in the sorted order.

For "banana": LCP = [-, 1, 3, 0, 0, 2] (positions 0 is undefined; "a" and "ana" share 1 char; "ana" and "anana" share 3 chars, etc.)

Using suffix array + LCP for queries

Pattern search: binary search on the suffix array to find where the pattern would appear. \(O(m \log n)\) where m is the pattern length.
Longest repeated substring: the maximum value in the LCP array — that prefix length appears at two adjacent sorted suffixes, meaning it repeats in the string.
Number of distinct substrings: n(n+1)/2 minus the sum of the LCP array.

Complexity

Build (prefix doubling)

log n doubling rounds, each \(O(n)\) with radix sort

Fast \(O(n \log n)\)

Pattern search

Binary search on suffix array; m = pattern length

Fast \(O(m \log n)\)

Longest repeated substring

Max value in the LCP array

Linear (after build) \(O(n)\)

Space

Store only starting positions, not the suffixes themselves

Linear \(O(n)\)

Build Suffix Array (Naive \(O(n^2 \log n)\))

def build_suffix_array(s):
    n = len(s)
    suffixes = sorted(range(n), key=lambda i: s[i:])
    return suffixes

sa = build_suffix_array('banana')
print(sa)  # [5, 3, 1, 0, 4, 2]
# Sorted suffixes:
for i in sa:
    print(i, '->', 'banana'[i:])

Output

[5, 3, 1, 0, 4, 2]
5 -> a
3 -> ana
1 -> anana
0 -> banana
4 -> na
2 -> nana

Pattern Search Using Suffix Array

def build_suffix_array(s):
    return sorted(range(len(s)), key=lambda i: s[i:])

def search(s, sa, pattern):
    """Binary search on suffix array — O(m log n)."""
    lo, hi = 0, len(sa)
    m = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid]+m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # Check if found
    if lo < len(sa) and s[sa[lo]:sa[lo]+m] == pattern:
        return sa[lo]  # starting index
    return -1

s = 'banana'
sa = build_suffix_array(s)
print(search(s, sa, 'ana'))  # 3 (or 1, first occurrence)
print(search(s, sa, 'xyz'))  # -1