Suffix Automaton: Definition, Construction, and Applications

Definition

A suffix automaton (SAM) for a string (s) is the minimal deterministic finite automaton (DFA) that accepts all suffixes of (s). Formally:

  • A SAM is a directed acyclic graph (DAG) where nodes represent states and edges represent transitions.
  • The source node (t_0) serves as the initial state. All states are reachable from (t_0).
  • Each transition is labeled with a character, and outgoing transitions from any node have distinct labels.
  • Certain states are designated as terminal states. Paths from (t_0) to terminal states form suffixes of (s), and every suffix of (s) corresponds to such a path.
  • Among all automata satisfying the above conditions, the SAM has the minimum number of states.

Substring Property

The SAM encodes all substring information of (s). Every path starting from (t_0) corresponds to a substring of (s), and each substring corresponds to a unique path.

Each state represents a set of strings (all substrings sharing the same endpos equivalence class).

Endpos Equivalence Classes

For any non-empty substring (t) of (s), define (\operatorname{endpos}(t)) as the set of all ending positions of (t) in (s).

For example, in string abcbc, (\operatorname{endpos}(bc) = {3, 5}).

Two substrings (t_1) and (t_2) may have identical endpos sets. All non-empty substrings partition into equivalence classes:

  • Each SAM state corresponds to all substrings with identical endpos sets.
  • The number of states equals the number of equivalence classes plus the initial state.

Key properties:

  1. For two non-empty substrings (u) and (w) of (s) (assuming (|u| \le |w|)), (\operatorname{endpos}(u) = \operatorname{endpos}(w)) if and only if every occurrence of (u) in (s) appears as a suffix of some occurrence of (w).

  2. For such (u) and (w), if (u) is a suffix of (w), then (\operatorname{endpos}(w) \subseteq \operatorname{endpos}(u)); otherwise the two sets are disjoint.

  3. Within the same equivalence class, the shorter string is always a suffix of the longer one. The lengths of strings in an equivalence class form a continuous interval.

Suffix Links

For any state (v \neq t_0), it represents an equivalence class of substrings with identical endpos sets. Let (w) be the longest string in this class; all other strings are suffixes of (w).

Sorting these suffixes by length in descending order, the first several belong to this equivalence class, while the remaining suffixes (including the empty string) belong to other classes. Let (t) be the longest string among the latter group; define (\operatorname{link}(v) = t).

In other words, (\operatorname{link}(v)) corresponds to the state representing the equivalence class of the longest proper suffix of (w).

For (t_0), its equivalence class contains only itself, with (\operatorname{endpos}(t_0) = {-1, 0, \dots, |s|-1}).

  1. All suffix links form a rooted tree with (t_0) as the root. The linked state always represents strictly shorter strings.

  2. The tree constructed via endpos sets and the tree constructed via suffix links are identical.

Construction Algorithm

SAM construction runs in (O(n)) time.

Process the string from left to right, incrementally adding new states.

Maintain a pointer last representing the state corresponding to the entire processed prefix.

  1. Read character (c), create new state cur with (\operatorname{len}(cur) = \operatorname{len}(last) + 1). This represents the longest substring ending at the new position.

  2. Starting from last, follow suffix links upward to find a state p that either has a transition labeled c or reaches (t_0).

  3. If such a transition exists from p to state q, check whether (\operatorname{len}(p) + 1 = \operatorname{len}(q)):

    • If true, set (\operatorname{link}(cur) = q) and add the transition from p to cur labeled c.

    • If false, q represents substrings of multiple lengths. Clone q into a new state clone with (\operatorname{len}(clone) = \operatorname{len}(p) + 1). Redirect all transitions from p (and its ancestors along suffix links) that previously pointed to q to clone. Set (\operatorname{link}(q) = \operatorname{link}(cur) = clone).

  4. If p = t_0 and no transition exists, set (\operatorname{link}(cur) = t_0) and add the transision.

  5. Update last = cur and continue scanning.

With constant alphabet size, construction is linear. Otherwise, complexity becomes (O(n \log |\Sigma|)).

Implementation

Memory usage is (O(n|\Sigma|)). Use 1-based indexing and allocate (2n) states (upper bound is (2n - 1) for (n \ge 2)).

Transition count upper bound: (3n - 4) for (n \ge 3).

Using map-based transitions:

struct State {
    int length;
    int link;
    int occurrence;
    unordered_map<char, int> next;
};

class SuffixAutomaton {
    static const int MAXN = 2000005;
    State st[MAXN];
    int nodeCount;
    int lastState;

public:
    void initialize() {
        st[1].length = 0;
        st[1].link = 0;
        st[1].occurrence = 0;
        nodeCount = 1;
        lastState = 1;
    }

    void extend(char character) {
        int current = ++nodeCount;
        st[current].length = st[lastState].length + 1;
        st[current].occurrence = 1;
        st[current].next.clear();

        int position = lastState;
        while (position && !st[position].next.count(character)) {
            st[position].next[character] = current;
            position = st[position].link;
        }

        if (!position) {
            st[current].link = 1;
        } else {
            int destination = st[position].next[character];
            if (st[position].length + 1 == st[destination].length) {
                st[current].link = destination;
            } else {
                int cloned = ++nodeCount;
                st[cloned].length = st[position].length + 1;
                st[cloned].next = st[destination].next;
                st[cloned].link = st[destination].link;
                st[cloned].occurrence = 0;

                while (position && st[position].next[character] == destination) {
                    st[position].next[character] = cloned;
                    position = st[position].link;
                }

                st[destination].link = st[current].link = cloned;
            }
        }
        lastState = current;
    }
};

Using array-based transitions (for small alphabets):

struct State {
    int length;
    int link;
    int count;
    int transitions[26];
} st[N];

int nodeCount;
int lastState;

void buildAutomaton() {
    st[1].length = 0;
    st[1].link = 0;
    st[1].count = 0;
    memset(st[1].transitions, 0, sizeof(st[1].transitions));
    nodeCount = lastState = 1;
}

void insertCharacter(int c) {
    int current = ++nodeCount;
    st[current].length = st[lastState].length + 1;
    st[current].count = 1;
    memset(st[current].transitions, 0, sizeof(st[current].transitions));

    int parent = lastState;
    while (parent && !st[parent].transitions[c]) {
        st[parent].transitions[c] = current;
        parent = st[parent].link;
    }

    if (!parent) {
        st[current].link = 1;
    } else {
        int target = st[parent].transitions[c];
        if (st[parent].length + 1 == st[target].length) {
            st[current].link = target;
        } else {
            int clone = ++nodeCount;
            st[clone] = st[target];
            st[clone].count = 0;
            st[clone].length = st[parent].length + 1;

            while (parent && st[parent].transitions[c] == target) {
                st[parent].transitions[c] = clone;
                parent = st[parent].link;
            }

            st[target].link = st[current].link = clone;
        }
    }
    lastState = current;
}

Computing Substring Occurrences

Let (\operatorname{cnt}(v) = |\operatorname{endpos}(v)|), representing the number of occurrences in the string.

For any state (u):

(\displaystyle \operatorname{cnt}(u) = \sum_{\operatorname{link}(v) = u} \operatorname{cnt}(v))

Compute by topological sorting states based on length in descending order, then propagate counts through suffix links.

Longest Common Substring

Problem: Given strings (S) and (T), find the length of their longest common substring.

  1. Build SAM for (S).
  2. Traverse (T) on the automaton, maintaining current state v and matched length l.
  3. When processing character (T_i):
    • If transition exists, move to that state and increment (l).
    • Otherwise, follow suffix links until a valid transition exists or reach (t_0). Update (l = \operatorname{len}(v)) accordingly.
    • If reaching (t_0) with no valid transition, reset (v = t_0) and (l = 0).
  4. Track maximum (l) throughout traversal.

Time complexity: (O(|T|)), since each step either increments (l) or decreases it via suffix link jumps.

Applications

Finding Top Substring by Occurrence × Length

For all substrings with occurrences > 1, maximize (\operatorname{len}(t) \times \operatorname{occur}(t)).

For each state with (\operatorname{cnt}(v) = k > 1), it contributes to lengths in interval ((\operatorname{len}(\operatorname{link}(v)), \operatorname{len}(v)]). Use difference array technique.

Counting Distinct Substrings Online

When adding a character with current state cur, the number of new distinct substrings is:

(\Delta = \operatorname{len}(cur) - \operatorname{len}(\operatorname{link}(cur)))

K-th Smallest Substring

To find the (k)-th smallest substring (with or without duplicates):

  1. Compute for each state the total number of substrings in its subtree (either counting all paths or only unique paths based on requirement).
  2. Traverse transitions in lexicographic order, subtracting counts until finding the state containing position (k).

Complexity: (O(|\Sigma| \cdot |answer|))

Generalized SAM

For multiple strings (s_1, s_2, \dots, s_n), build SAM on a Trie containing all strings. When inserting, set current state to the parent's SAM state. This allows:

  • Counting distinct substrings across all strings
  • Building minimal DFA accepting all suffixes

Substring Queries on Intervals

Precompute for each position (i) in string (s) the maximum length len_i such that the suffix of (s[1..i]of lengthlen_i` appears in pattern string (t). For query interval ([l, r]), answer is:

(\max_{i=l}^{r} \min(len_i, i - l + 1))

Use binary search on breakpoints of the piecewise function.

Tree-Based Problems

For problems on trees (where leaf count is limited), build SAM rooted at each leaf and merge results.

Substring Set Queries

Build generalized SAM for pattern strings. For each query string, its answer equals the number of pattern strings in the subtree of the final state. Use segment tree merging for online queries.

Finding Shortest Unique Common Substring

For two strings, find the shortest substring appearing exactly once in each. Build SAM for both strings and enumerate starting positions, tracking usage counts via two-way mechanism.

Multiple String Matching

For queries about (A[a:a+L-1] = B[b:b+L-1] = C[c:c+L-1]) across three strings, directly use generalized SAM and count occurrences of each state.

Complexity Summary

  • State count: (O(n)), upper bound (2n - 1)
  • Transition count: (O(n)), upper bound (3n - 4)
  • Construction: (O(n)) with constant alphabet, (O(n \log |\Sigma|)) otherwise
  • Memory: (O(n|\Sigma|))

Tags: suffix automaton string algorithms automaton Data Structures algorithm

Posted on Fri, 08 May 2026 15:39:54 +0000 by james13009