String Algorithms9 min read

Suffix Array Construction

All suffixes sorted — \(O(m \log n)\) substring search
naive build:Slow · \(O(n^2 \log n)\)efficient build:Fast · \(O(n \log n)\)search:\(O(m \log n)\)LCP array:\(O(n)\) via Kasai

Take the word "banana". Now write out every possible suffix — every way to chop the beginning off:

  • "banana"
  • "anana"
  • "nana"
  • "ana"
  • "na"
  • "a"

Now sort them alphabetically:

  • "a" (starts at index 5)
  • "ana" (starts at index 3)
  • "anana" (starts at index 1)
  • "banana" (starts at index 0)
  • "na" (starts at index 4)
  • "nana" (starts at index 2)

The suffix array is just the list of starting indices in sorted order: [5, 3, 1, 0, 4, 2]. That's it! But this simple sorted list unlocks incredibly powerful operations.

Why Is Sorting Suffixes Useful?

Because sorted things are binary searchable! If you want to find every occurrence of pattern "ana" in "banana", you binary search the sorted suffix list for entries that start with "ana". You find them in \(O(m \log n)\) — no full scan of the text required.

Compare this to naively scanning the text character by character: \(O(nm)\). The suffix array turns text search into a sorted-lookup problem.

Building It Efficiently: Prefix Doubling

Naively sorting n suffixes takes \(O(n^2 \log n)\) — comparing two suffixes can take \(O(n)\) each, and you do \(O(n \log n)\) comparisons. For a 1 MB file, that's way too slow.

The clever \(O(n \log n)\) approach is prefix doubling:

  • Round 0: Rank each suffix by just its first character.
  • Round 1: Rank each suffix by its first 2 characters, using the round-0 ranks as keys.
  • Round 2: Rank by first 4 characters using round-1 ranks. Each round doubles the comparison window.

After log n rounds, the window covers the entire suffix and you have the final sorted order. Each round uses radix sort (\(O(n)\)), so the total is \(O(n \log n)\).

The LCP Array: Bonus Superpower

Once you have the suffix array, you can compute the LCP array (Longest Common Prefix). LCP[i] tells you how many characters suffix SA[i] and suffix SA[i-1] share at the start — how long they overlap.

The LCP array is like reading the suffix array with X-ray vision. It reveals the longest repeated substring in \(O(n)\), the number of distinct substrings, and powers efficient range queries. Kasai's algorithm computes the entire LCP array in \(O(n)\).

Note: Suffix arrays replaced suffix trees in most practical applications. A suffix tree (closely related to a trie) is theoretically equivalent but requires complex pointer-heavy code and large constant factors. A suffix array is just an integer array — cache-friendly, simple to build, and tiny in memory.

Suffix Array + Binary Search

def build_suffix_array(s):
n = len(s)
# Start: rank by first character
sa = sorted(range(n), key=lambda i: s[i:])
return sa
def search_suffix_array(text, sa, pattern):
"""Find all occurrences of pattern in text using binary search."""
m = len(pattern)
# Binary search for leftmost match
lo, hi = 0, len(sa)
while lo < hi:
mid = (lo + hi) // 2
if text[sa[mid]:sa[mid]+m] < pattern:
lo = mid + 1
else:
hi = mid
left = lo
# Binary search for rightmost match
lo, hi = left, len(sa)
while lo < hi:
mid = (lo + hi) // 2
if text[sa[mid]:sa[mid]+m] <= pattern:
lo = mid + 1
else:
hi = mid
# Collect all matches in [left, lo)
return sorted(sa[left:lo])
text = "banana"
sa = build_suffix_array(text)
print("SA:", sa) # [5, 3, 1, 0, 4, 2]
print("Suffixes:", [text[i:] for i in sa])
print(search_suffix_array(text, sa, "ana")) # [1, 3]
Output
SA: [5, 3, 1, 0, 4, 2]
Suffixes: ['a', 'ana', 'anana', 'banana', 'na', 'nana']
[1, 3]

What You Can Do With a Suffix Array

  • Pattern search: binary search for all occurrences of a pattern in \(O(m \log n)\)
  • Longest repeated substring: the maximum value in the LCP array gives the length — just scan LCP once
  • Count distinct substrings: n(n+1)/2 minus the sum of all LCP values
  • Burrows-Wheeler Transform: the basis of compressed text indexes (used in DNA assemblers and file compressors like bzip2)

The suffix array is one of the most versatile string data structures in existence — and unlike the conceptually equivalent suffix tree, it's just an integer array you can build in a few lines of code.

Complexity

🐌 Naive constructionSort n suffixes with up-to-\(O(n)\) string comparison each — fine for n < 10,000
Infeasible for large strings\(O(n^2 \log n)\)
⚡ Prefix doublinglog n rounds of \(O(n)\) radix sort — SA-IS achieves \(O(n)\) but is complex to implement
Efficient for large strings\(O(n \log n)\)
🔍 Pattern searchBinary search over n sorted suffixes with \(O(m)\) string comparison per step
Log-linear in text + pattern\(O(m \log n)\)
📏 LCP array via KasaiLCP can decrease by at most 1 per step — total increments bounded by \(O(n)\) amortized
One pass through text\(O(n)\)

Quick check

The suffix array of 'banana' is [5, 3, 1, 0, 4, 2]. What does SA[0] = 5 mean?

Continue reading

KMP Algorithm
When searching for a word, never go backwards — use what you already know
Aho-Corasick
Find all patterns at once — one pass, \(O(n + matches)\)
Rabin-Karp
Find a word in a book by its fingerprint — check character-by-character only when fingerprints match