Longest Common Substring Using Hashing

The problem involves finding the longest common substring between two strings composed solely of lowercase letters. The string lengths can reach up to 250,000.

The solution uses binary search combined with hashing techniques to efficiently determine the maximum length of the shared substring.

Let's denote the two input strings as $ s_1 $ and $ s_2 $, where $ s_2 $ is assumed to be the longer one. The answer lies within the range [0, length of $ s_1 $]. Binary search is applied to this range, contributing an $ O(\log n) $ factor to the overall complexity.

For each candidate length $ k $ during the binary search, we need to verify if there exists a common substring of that length. To do so:

Extract all substrings of length $ k $ from both strings.
Compute their hash values for fast comparison.
Compare these hashes to detect any matches.

A naive approach would involve comparing every hash from the first string with every hash from the second string, resulting in $ O(n^2) $ comparisons per binary search step — leading to an overall time comlpexity of $ O(n^2 \log n) $. This is inefficient.

To improve performance, we store the hashes of all substrings from $ s_1 $ into a hash table (or unordered map). Then, we iterate over the hashes of substrings from $ s_2 $, checking whether any of them exist in our hash table. This reduces the comparison phase to expected $ O(n) $ time per check.

However, due to potential worst-case behavior of hash tables, this method may still degrade to $ O(n^2 \log n) $ in some scenarios. An alternative approach uses sorting followed by a two-pointer technique:

Collect all substring hashes from both strings.
Sort both lists of hashes.
Apply a two-pointer algorithm to find matching elements.

This method achieves a complexity of $ O(n \log^2 n) $ due to sorting steps, which remains acceptable for large inputs.

Here’s the implementation using standard sorting and two pointers:

#pragma GCC optimize("3,Ofast,unroll-loops")
#include <iostream>
#include <cstring>
#include <algorithm>
#include <vector>
#define ull unsigned long long
using namespace std;

const int N = 250005;

int n1, n2;
char s1[N], s2[N];

namespace Hasher {
    const ull base = 1013;
    ull power[N];
    ull prefix1[N], prefix2[N];

    inline ull multiply(ull a, ull b) {
        return a * b;
    }

    inline void preprocess() {
        if (strlen(s1 + 1) > strlen(s2 + 1))
            swap(s1, s2);
        n1 = strlen(s1 + 1), n2 = strlen(s2 + 1);

        power[0] = 1;
        for (int i = 1; i <= n2; ++i)
            power[i] = multiply(power[i - 1], base);

        for (int i = 1; i <= n1; ++i)
            prefix1[i] = multiply(prefix1[i - 1], base) + s1[i];

        for (int i = 1; i <= n2; ++i)
            prefix2[i] = multiply(prefix2[i - 1], base) + s2[i];
    }

    inline ull getHash1(int l, int r) {
        return prefix1[r] - multiply(prefix1[l - 1], power[r - l + 1]);
    }

    inline ull getHash2(int l, int r) {
        return prefix2[r] - multiply(prefix2[l - 1], power[r - l + 1]);
    }
} using namespace Hasher;

inline bool isValid(int len) {
    vector<ull> hashes1, hashes2;
    for (int i = 1; i + len - 1 <= n1; ++i)
        hashes1.push_back(getHash1(i, i + len - 1));

    for (int i = 1; i + len - 1 <= n2; ++i)
        hashes2.push_back(getHash2(i, i + len - 1));

    sort(hashes1.begin(), hashes1.end());
    sort(hashes2.begin(), hashes2.end());

    int idx1 = 0, idx2 = 0;
    while (idx1 < hashes1.size() && idx2 < hashes2.size()) {
        while (hashes1[idx1] < hashes2[idx2] && idx1 < hashes1.size())
            ++idx1;
        if (idx1 == hashes1.size()) return false;
        while (hashes1[idx1] > hashes2[idx2] && idx2 < hashes2.size())
            ++idx2;
        if (idx2 == hashes2.size()) return false;
        if (hashes1[idx1] == hashes2[idx2])
            return true;
    }
    return false;
}

signed main() {
    ios::sync_with_stdio(false);
    cin.tie(0);
    cin >> (s1 + 1) >> (s2 + 1);
    preprocess();
    int left = 0, right = n1;
    while (left < right) {
        int mid = (left + right + 1) >> 1;
        if (isValid(mid))
            left = mid;
        else
            right = mid - 1;
    }
    cout << right << "\n";
    return 0;
}

An optimized version replaces standard sorting with radix sort to further reduce complexity to $ O(n \log n) $:

#pragma GCC optimize("3,Ofast,unroll-loops")
#include 
#include 
#include 
#include 
#define ull unsigned long long
using namespace std;

const int N = 250005;
const int BITS = 8;
const int RADIX = 1  strlen(s2 + 1))
            swap(s1, s2);
        n1 = strlen(s1 + 1), n2 = strlen(s2 + 1);

        power[0] = 1;
        for (int i = 1; i = 0; --i) {
            ull x = arr[i];
            temp[--count[(x >> shift) & MASK]] = x;
        }
        swap(arr, temp);
    }
}

inline bool isValid(int len) {
    vector hashes1, hashes2;
    for (int i = 1; i + len - 1 > (s1 + 1) >> (s2 + 1);
    preprocess();
    int left = 0, right = n1;
    while (left < right) {
        int mid = (left + right + 1) >> 1;
        if (isValid(mid))
            left = mid;
        else
            right = mid - 1;
    }
    cout

Tags: string-algorithm Hashing substring algorithm binary-search

Posted on Sat, 06 Jun 2026 17:48:41 +0000 by HuggyBear

Freaks City

Longest Common Substring Using Hashing

Hot Tags