A suffix array is a compact representation of all suffixes of a string, sorted lexicographically. Constructing this array efficiently is foundational for various string processing tasks.
Construction via Prefix Doubling
To construct the suffix array in $O(n \log n)$ time, we use a prefix doubling strategy cobmined with Radix Sort. We iteratively sort suffixes based on their prefixes of length $2^k$.
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <cstring>
const int MAXN = 1000005;
int n, m = 128;
char str[MAXN];
int sa[MAXN], rank_arr[MAXN], tmp[MAXN], cnt[MAXN];
void radix_sort() {
for (int i = 0; i <= m; i++) cnt[i] = 0;
for (int i = 1; i <= n; i++) cnt[rank_arr[i]]++;
for (int i = 1; i <= m; i++) cnt[i] += cnt[i - 1];
for (int i = n; i >= 1; i--) sa[cnt[rank_arr[tmp[i]]]--] = tmp[i];
}
void build_sa() {
for (int i = 1; i <= n; i++) rank_arr[i] = str[i], tmp[i] = i;
radix_sort();
for (int len = 1; len < n; len <<= 1) {
int p = 0;
for (int i = n - len + 1; i <= n; i++) tmp[++p] = i;
for (int i = 1; i <= n; i++) if (sa[i] > len) tmp[++p] = sa[i] - len;
radix_sort();
std::swap(rank_arr, tmp);
p = 1; rank_arr[sa[1]] = 1;
for (int i = 2; i <= n; i++)
rank_arr[sa[i]] = (tmp[sa[i]] == tmp[sa[i - 1]] && tmp[sa[i] + len] == tmp[sa[i - 1] + len]) ? p : ++p;
if (p >= n) break;
m = p;
}
}
The Height Array
The height[i] array stores the Longest Common Prefix (LCP) between the $i$-th and $(i-1)$-th lexicographically smallest suffixes. Using the property height[rank[i]] >= height[rank[i-1]] - 1, we can compute this array in linear time.
int height[MAXN];
void build_height() {
int k = 0;
for (int i = 1; i <= n; i++) {
if (rank_arr[i] == 1) continue;
if (k) k--;
int j = sa[rank_arr[i] - 1];
while (i + k <= n && j + k <= n && str[i + k] == str[j + k]) k++;
height[rank_arr[i]] = k;
}
}
Key Applications
- Counting Distinct Substrings: A string of length $n$ has $\frac{n(n+1)}{2}$ total substrings. Subtracting the sum of all values in the height array gives the number of unique substrings.
- Longest Common Prefix of Substrings: The LCP of two suffixes $i$ and $j$ in the sorted array is the minimum value in the height array within the range $[\text{rank}[i]+1, \text{rank}[j]]$. This can be solved using Sparse Table for RMQ (Range Minimum Query).
- Pattern Matching: Since suffixes are sorted, one can binary search for a pattern string $P$ in $O(|P| \log n)$ time.
- Repeaetd Substrings: The longest substring that appears at least $k$ times can be found by looking at the sliding window minimum of length $k-1$ in the height array.