Definition
A suffix automaton (SAM) for a string (s) is the minimal deterministic finite automaton (DFA) that accepts all suffixes of (s). Formally:
- A SAM is a directed acyclic graph (DAG) where nodes represent states and edges represent transitions.
- The source node (t_0) serves as the initial state. All states are reachable from (t_0).
- Each transition is labeled with a character, and outgoing transitions from any node have distinct labels.
- Certain states are designated as terminal states. Paths from (t_0) to terminal states form suffixes of (s), and every suffix of (s) corresponds to such a path.
- Among all automata satisfying the above conditions, the SAM has the minimum number of states.
Substring Property
The SAM encodes all substring information of (s). Every path starting from (t_0) corresponds to a substring of (s), and each substring corresponds to a unique path.
Each state represents a set of strings (all substrings sharing the same endpos equivalence class).
Endpos Equivalence Classes
For any non-empty substring (t) of (s), define (\operatorname{endpos}(t)) as the set of all ending positions of (t) in (s).
For example, in string abcbc, (\operatorname{endpos}(bc) = {3, 5}).
Two substrings (t_1) and (t_2) may have identical endpos sets. All non-empty substrings partition into equivalence classes:
- Each SAM state corresponds to all substrings with identical endpos sets.
- The number of states equals the number of equivalence classes plus the initial state.
Key properties:
-
For two non-empty substrings (u) and (w) of (s) (assuming (|u| \le |w|)), (\operatorname{endpos}(u) = \operatorname{endpos}(w)) if and only if every occurrence of (u) in (s) appears as a suffix of some occurrence of (w).
-
For such (u) and (w), if (u) is a suffix of (w), then (\operatorname{endpos}(w) \subseteq \operatorname{endpos}(u)); otherwise the two sets are disjoint.
-
Within the same equivalence class, the shorter string is always a suffix of the longer one. The lengths of strings in an equivalence class form a continuous interval.
Suffix Links
For any state (v \neq t_0), it represents an equivalence class of substrings with identical endpos sets. Let (w) be the longest string in this class; all other strings are suffixes of (w).
Sorting these suffixes by length in descending order, the first several belong to this equivalence class, while the remaining suffixes (including the empty string) belong to other classes. Let (t) be the longest string among the latter group; define (\operatorname{link}(v) = t).
In other words, (\operatorname{link}(v)) corresponds to the state representing the equivalence class of the longest proper suffix of (w).
For (t_0), its equivalence class contains only itself, with (\operatorname{endpos}(t_0) = {-1, 0, \dots, |s|-1}).
-
All suffix links form a rooted tree with (t_0) as the root. The linked state always represents strictly shorter strings.
-
The tree constructed via endpos sets and the tree constructed via suffix links are identical.
Construction Algorithm
SAM construction runs in (O(n)) time.
Process the string from left to right, incrementally adding new states.
Maintain a pointer last representing the state corresponding to the entire processed prefix.
-
Read character (c), create new state
curwith (\operatorname{len}(cur) = \operatorname{len}(last) + 1). This represents the longest substring ending at the new position. -
Starting from
last, follow suffix links upward to find a statepthat either has a transition labeledcor reaches (t_0). -
If such a transition exists from
pto stateq, check whether (\operatorname{len}(p) + 1 = \operatorname{len}(q)):-
If true, set (\operatorname{link}(cur) = q) and add the transition from
ptocurlabeledc. -
If false,
qrepresents substrings of multiple lengths. Cloneqinto a new stateclonewith (\operatorname{len}(clone) = \operatorname{len}(p) + 1). Redirect all transitions fromp(and its ancestors along suffix links) that previously pointed toqtoclone. Set (\operatorname{link}(q) = \operatorname{link}(cur) = clone).
-
-
If
p = t_0and no transition exists, set (\operatorname{link}(cur) = t_0) and add the transision. -
Update
last = curand continue scanning.
With constant alphabet size, construction is linear. Otherwise, complexity becomes (O(n \log |\Sigma|)).
Implementation
Memory usage is (O(n|\Sigma|)). Use 1-based indexing and allocate (2n) states (upper bound is (2n - 1) for (n \ge 2)).
Transition count upper bound: (3n - 4) for (n \ge 3).
Using map-based transitions:
struct State {
int length;
int link;
int occurrence;
unordered_map<char, int> next;
};
class SuffixAutomaton {
static const int MAXN = 2000005;
State st[MAXN];
int nodeCount;
int lastState;
public:
void initialize() {
st[1].length = 0;
st[1].link = 0;
st[1].occurrence = 0;
nodeCount = 1;
lastState = 1;
}
void extend(char character) {
int current = ++nodeCount;
st[current].length = st[lastState].length + 1;
st[current].occurrence = 1;
st[current].next.clear();
int position = lastState;
while (position && !st[position].next.count(character)) {
st[position].next[character] = current;
position = st[position].link;
}
if (!position) {
st[current].link = 1;
} else {
int destination = st[position].next[character];
if (st[position].length + 1 == st[destination].length) {
st[current].link = destination;
} else {
int cloned = ++nodeCount;
st[cloned].length = st[position].length + 1;
st[cloned].next = st[destination].next;
st[cloned].link = st[destination].link;
st[cloned].occurrence = 0;
while (position && st[position].next[character] == destination) {
st[position].next[character] = cloned;
position = st[position].link;
}
st[destination].link = st[current].link = cloned;
}
}
lastState = current;
}
};
Using array-based transitions (for small alphabets):
struct State {
int length;
int link;
int count;
int transitions[26];
} st[N];
int nodeCount;
int lastState;
void buildAutomaton() {
st[1].length = 0;
st[1].link = 0;
st[1].count = 0;
memset(st[1].transitions, 0, sizeof(st[1].transitions));
nodeCount = lastState = 1;
}
void insertCharacter(int c) {
int current = ++nodeCount;
st[current].length = st[lastState].length + 1;
st[current].count = 1;
memset(st[current].transitions, 0, sizeof(st[current].transitions));
int parent = lastState;
while (parent && !st[parent].transitions[c]) {
st[parent].transitions[c] = current;
parent = st[parent].link;
}
if (!parent) {
st[current].link = 1;
} else {
int target = st[parent].transitions[c];
if (st[parent].length + 1 == st[target].length) {
st[current].link = target;
} else {
int clone = ++nodeCount;
st[clone] = st[target];
st[clone].count = 0;
st[clone].length = st[parent].length + 1;
while (parent && st[parent].transitions[c] == target) {
st[parent].transitions[c] = clone;
parent = st[parent].link;
}
st[target].link = st[current].link = clone;
}
}
lastState = current;
}
Computing Substring Occurrences
Let (\operatorname{cnt}(v) = |\operatorname{endpos}(v)|), representing the number of occurrences in the string.
For any state (u):
(\displaystyle \operatorname{cnt}(u) = \sum_{\operatorname{link}(v) = u} \operatorname{cnt}(v))
Compute by topological sorting states based on length in descending order, then propagate counts through suffix links.
Longest Common Substring
Problem: Given strings (S) and (T), find the length of their longest common substring.
- Build SAM for (S).
- Traverse (T) on the automaton, maintaining current state
vand matched lengthl. - When processing character (T_i):
- If transition exists, move to that state and increment (l).
- Otherwise, follow suffix links until a valid transition exists or reach (t_0). Update (l = \operatorname{len}(v)) accordingly.
- If reaching (t_0) with no valid transition, reset (v = t_0) and (l = 0).
- Track maximum (l) throughout traversal.
Time complexity: (O(|T|)), since each step either increments (l) or decreases it via suffix link jumps.
Applications
Finding Top Substring by Occurrence × Length
For all substrings with occurrences > 1, maximize (\operatorname{len}(t) \times \operatorname{occur}(t)).
For each state with (\operatorname{cnt}(v) = k > 1), it contributes to lengths in interval ((\operatorname{len}(\operatorname{link}(v)), \operatorname{len}(v)]). Use difference array technique.
Counting Distinct Substrings Online
When adding a character with current state cur, the number of new distinct substrings is:
(\Delta = \operatorname{len}(cur) - \operatorname{len}(\operatorname{link}(cur)))
K-th Smallest Substring
To find the (k)-th smallest substring (with or without duplicates):
- Compute for each state the total number of substrings in its subtree (either counting all paths or only unique paths based on requirement).
- Traverse transitions in lexicographic order, subtracting counts until finding the state containing position (k).
Complexity: (O(|\Sigma| \cdot |answer|))
Generalized SAM
For multiple strings (s_1, s_2, \dots, s_n), build SAM on a Trie containing all strings. When inserting, set current state to the parent's SAM state. This allows:
- Counting distinct substrings across all strings
- Building minimal DFA accepting all suffixes
Substring Queries on Intervals
Precompute for each position (i) in string (s) the maximum length len_i such that the suffix of (s[1..i]of lengthlen_i` appears in pattern string (t). For query interval ([l, r]), answer is:
(\max_{i=l}^{r} \min(len_i, i - l + 1))
Use binary search on breakpoints of the piecewise function.
Tree-Based Problems
For problems on trees (where leaf count is limited), build SAM rooted at each leaf and merge results.
Substring Set Queries
Build generalized SAM for pattern strings. For each query string, its answer equals the number of pattern strings in the subtree of the final state. Use segment tree merging for online queries.
Finding Shortest Unique Common Substring
For two strings, find the shortest substring appearing exactly once in each. Build SAM for both strings and enumerate starting positions, tracking usage counts via two-way mechanism.
Multiple String Matching
For queries about (A[a:a+L-1] = B[b:b+L-1] = C[c:c+L-1]) across three strings, directly use generalized SAM and count occurrences of each state.
Complexity Summary
- State count: (O(n)), upper bound (2n - 1)
- Transition count: (O(n)), upper bound (3n - 4)
- Construction: (O(n)) with constant alphabet, (O(n \log |\Sigma|)) otherwise
- Memory: (O(n|\Sigma|))