Suffix Array Construction and Applications

A suffix array is a compact representation of all suffixes of a string, sorted lexicographically. Constructing this array efficiently is foundational for various string processing tasks.

Construction via Prefix Doubling

To construct the suffix array in $O(n \log n)$ time, we use a prefix doubling strategy cobmined with Radix Sort. We iteratively sort suffixes based on their prefixes of length $2^k$.

#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <cstring>

const int MAXN = 1000005;
int n, m = 128;
char str[MAXN];
int sa[MAXN], rank_arr[MAXN], tmp[MAXN], cnt[MAXN];

void radix_sort() {
    for (int i = 0; i <= m; i++) cnt[i] = 0;
    for (int i = 1; i <= n; i++) cnt[rank_arr[i]]++;
    for (int i = 1; i <= m; i++) cnt[i] += cnt[i - 1];
    for (int i = n; i >= 1; i--) sa[cnt[rank_arr[tmp[i]]]--] = tmp[i];
}

void build_sa() {
    for (int i = 1; i <= n; i++) rank_arr[i] = str[i], tmp[i] = i;
    radix_sort();
    for (int len = 1; len < n; len <<= 1) {
        int p = 0;
        for (int i = n - len + 1; i <= n; i++) tmp[++p] = i;
        for (int i = 1; i <= n; i++) if (sa[i] > len) tmp[++p] = sa[i] - len;
        radix_sort();
        std::swap(rank_arr, tmp);
        p = 1; rank_arr[sa[1]] = 1;
        for (int i = 2; i <= n; i++) 
            rank_arr[sa[i]] = (tmp[sa[i]] == tmp[sa[i - 1]] && tmp[sa[i] + len] == tmp[sa[i - 1] + len]) ? p : ++p;
        if (p >= n) break;
        m = p;
    }
}

The Height Array

The height[i] array stores the Longest Common Prefix (LCP) between the $i$-th and $(i-1)$-th lexicographically smallest suffixes. Using the property height[rank[i]] >= height[rank[i-1]] - 1, we can compute this array in linear time.

int height[MAXN];
void build_height() {
    int k = 0;
    for (int i = 1; i <= n; i++) {
        if (rank_arr[i] == 1) continue;
        if (k) k--;
        int j = sa[rank_arr[i] - 1];
        while (i + k <= n && j + k <= n && str[i + k] == str[j + k]) k++;
        height[rank_arr[i]] = k;
    }
}

Key Applications

  • Counting Distinct Substrings: A string of length $n$ has $\frac{n(n+1)}{2}$ total substrings. Subtracting the sum of all values in the height array gives the number of unique substrings.
  • Longest Common Prefix of Substrings: The LCP of two suffixes $i$ and $j$ in the sorted array is the minimum value in the height array within the range $[\text{rank}[i]+1, \text{rank}[j]]$. This can be solved using Sparse Table for RMQ (Range Minimum Query).
  • Pattern Matching: Since suffixes are sorted, one can binary search for a pattern string $P$ in $O(|P| \log n)$ time.
  • Repeaetd Substrings: The longest substring that appears at least $k$ times can be found by looking at the sliding window minimum of length $k-1$ in the height array.

Tags: algorithm string suffix-array cpp

Posted on Sat, 16 May 2026 18:59:23 +0000 by eaglelegend