Java-Based High-Performance Sensitive Word Detection Framework with Advanced Features

Introduction to sensitive-word: A Robust Text Filtering Solution for Java Applications

The sensitive-word project is an open-source, high-performance Java library designed for efficient detection and handling of sensitive content in text. Built on the Deterministic Finite Automaton (DFA) algorithm, it delivers fast matching performance while supporting advanced features such as multi-level word tagging, character normalization, fuzzy matching, and extensible rule configurations.

GitHub Repository: https://github.com/houbb/sensitive-word

Core Features

  • High Efficiency: Utilizes DFA-based scanning, so matching time grows linearly, O(n), with the length of the input text
  • Flexible Tagging System: Supports categorized and hierarchical classification of sensitive terms
  • Character Normalization: Handles full-width/half-width characters, simplified/traditional Chinese conversion
  • Fuzzy Matching: Detects variations including pinyin transcriptions, visually similar characters, and stylized text
  • Extensible Architecture: Allows custom dictionaries from databases or APIs
  • Built-in Validation Rules: Includes support for detecting emails, URLs, IP addresses, and numeric patterns
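
To make the DFA idea above concrete, here is a minimal sketch of dictionary matching over a character trie, where each node acts as one automaton state. The class name TrieScanner and its methods are hypothetical and are not taken from the library; the project's real implementation may differ considerably.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the sensitive-word API.
class TrieScanner {
    // One automaton state: outgoing transitions plus an "accepting" flag.
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean endOfWord;
    }

    private final Node root = new Node();

    // Insert a word by walking or creating one state per character.
    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.endOfWord = true;
    }

    // Return the longest match at each position, left to right, without overlaps.
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            Node node = root;
            int matchEnd = -1;
            for (int j = i; j < text.length(); j++) {
                node = node.next.get(text.charAt(j));
                if (node == null) break;               // no transition: stop this attempt
                if (node.endOfWord) matchEnd = j + 1;  // remember the longest hit so far
            }
            if (matchEnd > 0) {
                hits.add(text.substring(i, matchEnd));
                i = matchEnd;                          // continue after the match
            } else {
                i++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TrieScanner scanner = new TrieScanner();
        scanner.add("gamble");
        scanner.add("gambling");
        // Prints [gambling, gamble]
        System.out.println(scanner.findAll("no gambling here, do not gamble"));
    }
}
```

Each attempt walks at most as many characters as the longest dictionary word, so the cost of this sketch is bounded by the text length times that word length. That is effectively linear when dictionary words are short.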

Quick Start Guide

Add the Maven dependency:

<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>sensitive-word</artifactId>
    <version>0.29.2</version>
</dependency>

Use the SensitiveWordHelper utility class for basic operations:

import com.github.houbb.sensitive.word.core.SensitiveWordHelper;

import java.util.List;

public class DetectionExample {
    public static void main(String[] args) {
        String input = "This text contains gambling and drug-related prohibited content";

        // Check presence
        boolean hasProhibited = SensitiveWordHelper.contains(input);
        System.out.println("Contains restricted terms: " + hasProhibited);

        // Extract all matches
        List<String> foundTerms = SensitiveWordHelper.findAll(input);
        System.out.println("Detected terms: " + foundTerms);

        // Replace with default mask
        String cleaned = SensitiveWordHelper.replace(input);
        System.out.println("Masked output: " + cleaned);

        // Custom replacement character
        String masked = SensitiveWordHelper.replace(input, '*');
        System.out.println("Custom masked: " + masked);
    }
}

Advanced Text Processing Capabilities

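Before matching, the framework normalizes several textual variants, including letter case, character width, numeral styles, Chinese script, and stylized letters, so that obfuscated spellings are still detected.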

Case Insensitive Matching

String text = "fUcK offensive language here";
String detected = SensitiveWordHelper.findFirst(text); // Returns "fUcK"

Full/Half Width Character Handling

String wideText = "ｆｕｃｋ this message"; // full-width Latin letters
List<String> results = SensitiveWordHelper.findAll(wideText); // Detects "fuck"
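
As a rough sketch of what width normalization involves, the helper below folds full-width ASCII variants onto their half-width forms. It relies on the fixed offset between the Unicode block U+FF01..U+FF5E and ASCII U+0021..U+007E. WidthNormalizer is a hypothetical name, and the library's own handling may work differently.

```java
// Hypothetical helper for illustration; not the library's implementation.
class WidthNormalizer {
    // Map full-width ASCII variants (U+FF01..U+FF5E) and the ideographic
    // space (U+3000) onto their half-width counterparts.
    static String toHalfWidth(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\u3000') {
                out.append(' ');
            } else if (c >= '\uFF01' && c <= '\uFF5E') {
                out.append((char) (c - 0xFEE0)); // constant offset to ASCII
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The escaped characters spell the word in full-width letters.
        System.out.println(toHalfWidth("\uFF46\uFF55\uFF43\uFF4B this message"));
    }
}
```

After this step, a dictionary that stores only half-width spellings can match both forms.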

Numerical Pattern Recognition

String mixedNum = "My WeChat: 9⓿二肆⁹₈③⑸⒋➃㈤㊄";
List<String> numbers = SensitiveWordBs.newInstance()
    .enableNumCheck(true)
    .init()
    .findAll(mixedNum); // Identifies complex number format
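
One way to picture numeral-style normalization is a per-character mapping onto ASCII digits. The sketch below covers a few Unicode ranges (circled, parenthesized, subscript, and superscript digits) plus a tiny hand-picked table. DigitNormalizer and its table are assumptions made for illustration; the library's actual mapping is likely much larger.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the sensitive-word API.
class DigitNormalizer {
    // Hand-picked extras; a complete table would cover many more characters.
    private static final Map<Character, Character> EXTRA = new HashMap<>();
    static {
        EXTRA.put('\u24FF', '0'); // negative circled digit zero
        EXTRA.put('\u4E8C', '2'); // CJK numeral two
        EXTRA.put('\u8086', '4'); // CJK financial numeral four
    }

    static char normalize(char c) {
        if (c >= '\u2460' && c <= '\u2468') return (char) ('1' + (c - '\u2460')); // circled 1-9
        if (c >= '\u2474' && c <= '\u247C') return (char) ('1' + (c - '\u2474')); // parenthesized 1-9
        if (c >= '\u2080' && c <= '\u2089') return (char) ('0' + (c - '\u2080')); // subscript 0-9
        if (c >= '\u2074' && c <= '\u2079') return (char) ('4' + (c - '\u2074')); // superscript 4-9
        if (Character.isDigit(c)) return (char) ('0' + Character.digit(c, 10));   // other decimal digits
        return EXTRA.getOrDefault(c, c);                                          // fall back to the table
    }

    static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            out.append(normalize(text.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Input mixes an ASCII digit with circled, CJK, superscript and subscript forms.
        System.out.println(normalize("9\u24FF\u4E8C\u8086\u2079\u2088\u2462\u2478"));
    }
}
```

Once every character is an ASCII digit, a simple length check is enough to flag long digit runs such as contact numbers.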

Traditional/Simplified Chinese Support

String zhText = "我爱五星紅旗";
List<String> flags = SensitiveWordHelper.findAll(zhText); // Matches "五星紅旗"

Stylized Letter Detection

String fancyText = "Ⓕⓤc⒦ bad words";
List<String> styled = SensitiveWordHelper.findAll(fancyText); // Captures "Ⓕⓤc⒦"
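
Stylized letters can be folded in a similar way. The sketch below maps three Unicode ranges (circled capitals, circled small letters, and parenthesized small letters) back to plain Latin letters. StyleNormalizer is a hypothetical name; the set of styles the library actually recognizes is not specified here and may be broader.

```java
// Hypothetical helper for illustration; not part of the sensitive-word API.
class StyleNormalizer {
    static char normalize(char c) {
        if (c >= '\u24B6' && c <= '\u24CF') return (char) ('A' + (c - '\u24B6')); // circled A-Z
        if (c >= '\u24D0' && c <= '\u24E9') return (char) ('a' + (c - '\u24D0')); // circled a-z
        if (c >= '\u249C' && c <= '\u24B5') return (char) ('a' + (c - '\u249C')); // parenthesized a-z
        return c;
    }

    static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            out.append(normalize(text.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Circled F, circled u, plain c, parenthesized k.
        System.out.println(normalize("\u24BB\u24E4c\u24A6 bad words"));
    }
}
```

Combining this with case folding lets one dictionary entry cover all of these spellings.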

Duplicate Character Tolerance

String repeated = "ⒻⒻⒻfⓤuⓤ⒰cⓒ⒦ repeated abuse";
List<String> duplicates = SensitiveWordBs.newInstance()
    .ignoreRepeat(true)
    .init()
    .findAll(repeated);
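
The simplest way to think about repeat tolerance is collapsing runs of the same character before matching. The sketch below does exactly that and assumes the text has already been normalized to plain letters. RepeatCollapser is a hypothetical name, and the library may instead skip repeats during matching.

```java
// Hypothetical helper for illustration; not part of the sensitive-word API.
class RepeatCollapser {
    // Keep a character only when it differs from the one directly before it.
    static String collapse(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (i == 0 || c != text.charAt(i - 1)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapse("fffuuuck repeated abuse")); // prints "fuck repeated abuse"
    }
}
```

Note that a naive pre-pass like this also shortens legitimate double letters ("book" becomes "bok"), which is one reason a matcher might prefer to tolerate repeats while walking the dictionary.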

Built-in Detection Strategies

Beyond plain keywords, the library can also detect structured data such as email addresses, numeric sequences, URLs, and IPv4 addresses.

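// Package names follow the library's published layout; verify them against your version.
import com.github.houbb.sensitive.word.bs.SensitiveWordBs;
import com.github.houbb.sensitive.word.support.check.WordChecks;

import java.util.List;

public class StructuredDataDetection {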
    public static void main(String[] args) {
        // Email detection
        String emailText = "Contact me at admin@sensitive.com";
        List<String> emails = SensitiveWordBs.newInstance()
            .enableEmailCheck(true)
            .init()
            .findAll(emailText);

        // Numeric sequence detection
        String numText = "Secret code: 12345678";
        List<String> digits = SensitiveWordBs.newInstance()
            .enableNumCheck(true)
            .init()
            .findAll(numText);

        // URL recognition
        String linkText = "Visit https://example.com or example.org";
        List<String> urls = SensitiveWordBs.newInstance()
            .enableUrlCheck(true)
            .wordCheckUrl(WordChecks.urlNoPrefix())
            .init()
            .findAll(linkText);

        // IPv4 address detection
        String ipText = "Server available at 192.168.1.1";
        List<String> ips = SensitiveWordBs.newInstance()
            .enableIpv4Check(true)
            .init()
            .findAll(ipText);
    }
}
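
For readers curious how structured detection of this kind can work, the sketch below uses java.util.regex to find IPv4 addresses and email-like tokens. The patterns and the PatternDetectors class are illustrative assumptions only. They are not the rules the library applies, and the email pattern is deliberately loose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the sensitive-word API.
class PatternDetectors {
    // One octet in the range 0-255.
    private static final String OCTET = "(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)";

    // Four dot-separated octets; \b keeps matches from starting inside a longer token.
    static final Pattern IPV4 =
        Pattern.compile("\\b" + OCTET + "(?:\\." + OCTET + "){3}\\b");

    // Loose email shape: local part, '@', then a dotted domain.
    static final Pattern EMAIL =
        Pattern.compile("[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+");

    // Collect every non-overlapping match of the pattern, in order.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll(IPV4, "Server available at 192.168.1.1"));
        System.out.println(findAll(EMAIL, "Contact me at admin@sensitive.com"));
    }
}
```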

Fluent API Configuration

Configure the scanner using a builder pattern for clean, readable setup:

SensitiveWordBs scanner = SensitiveWordBs.newInstance()
    .ignoreCase(true)
    .ignoreWidth(true)
    .ignoreNumStyle(true)
    .ignoreChineseStyle(true)
    .ignoreEnglishStyle(true)
    .enableNumCheck(false)
    .enableEmailCheck(true)
    .enableUrlCheck(true)
    .enableIpv4Check(true)
    .numCheckLen(8)
    .charIgnore(SensitiveWordCharIgnores.defaults())
    .init();

String content = "Political symbols displayed publicly";
System.out.println(scanner.contains(content));
System.out.println(scanner.findFirst(content));

Dynamic Dictionary Management

The system supports runtime modification of the vocabulary set.

Add/Remove Terms Dynamically

SensitiveWordBs dynamicScanner = SensitiveWordBs.newInstance()
    .wordAllow(WordAllows.empty())
    .wordDeny(WordDenys.empty())
    .init();

// Initially empty
Assert.assertTrue(dynamicScanner.findAll("test add remove").isEmpty());

// Add individual words
dynamicScanner.addWord("dynamic");
dynamicScanner.addWord("runtime");

// Remove terms
dynamicScanner.removeWord("runtime");

// Bulk operations
dynamicScanner.addWord(Arrays.asList("batch", "update"));
dynamicScanner.removeWord("batch", "update");
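
Conceptually, runtime updates only require a dictionary structure that can be changed in place. The sketch below is a minimal trie that supports adding and removing words, with removal done by clearing the accepting flag. MutableDictionary is a hypothetical class, and the coarse synchronized methods are a simplifying assumption rather than a description of the library's internals.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of the sensitive-word API.
class MutableDictionary {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean endOfWord;
    }

    private final Node root = new Node();

    // Add a word; empty strings are ignored so the root never becomes accepting.
    synchronized void addWord(String word) {
        if (word.isEmpty()) return;
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.endOfWord = true;
    }

    // Remove a word by clearing its accepting flag; shared prefixes stay intact.
    synchronized void removeWord(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.next.get(c);
            if (node == null) return; // word was never present
        }
        node.endOfWord = false;
    }

    // True if any dictionary word occurs somewhere in the text.
    synchronized boolean contains(String text) {
        for (int i = 0; i < text.length(); i++) {
            Node node = root;
            for (int j = i; j < text.length(); j++) {
                node = node.next.get(text.charAt(j));
                if (node == null) break;
                if (node.endOfWord) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        MutableDictionary dict = new MutableDictionary();
        dict.addWord("dynamic");
        dict.addWord("runtime");
        dict.removeWord("runtime");
        System.out.println(dict.contains("changed at runtime")); // prints false
        System.out.println(dict.contains("a dynamic list"));     // prints true
    }
}
```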

Custom Dictionary Sources

Implement the IWordDeny interface to load deny-list words from external sources such as databases:

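import com.github.houbb.sensitive.word.api.IWordDeny;
import com.github.houbb.sensitive.word.bs.SensitiveWordBs;

import java.util.Arrays;
import java.util.List;

public class DatabaseBackedFilter {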
    public static void main(String[] args) {
        SensitiveWordBs dbScanner = SensitiveWordBs.newInstance()
            .wordDeny(new IWordDeny() {
                @Override
                public List<String> deny() {
                    // Load from database query
                    return fetchFromDatabase();
                }
            })
            .init();
    }

    private static List<String> fetchFromDatabase() {
        // Simulated DB access
        return Arrays.asList("blockedTerm", "restrictedPhrase");
    }
}
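
When words come from several places, it can help to think of each source as a supplier of strings that gets merged into one effective list. The sketch below merges any number of sources and then subtracts an allow list. It assumes that allow-list entries simply override deny-list entries. WordSources and its methods are hypothetical names and do not mirror the library's interfaces.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical helper for illustration; not part of the sensitive-word API.
class WordSources {
    // Merge all sources into one de-duplicated set, preserving first-seen order.
    @SafeVarargs
    static Set<String> merge(Supplier<List<String>>... sources) {
        Set<String> merged = new LinkedHashSet<>();
        for (Supplier<List<String>> source : sources) {
            merged.addAll(source.get());
        }
        return merged;
    }

    // Assumption: an allowed word is removed from the deny list outright.
    static Set<String> effectiveDenyList(Set<String> deny, Set<String> allow) {
        Set<String> result = new LinkedHashSet<>(deny);
        result.removeAll(allow);
        return result;
    }

    public static void main(String[] args) {
        Set<String> deny = merge(
            () -> Arrays.asList("blockedTerm", "restrictedPhrase"), // stand-in for a database query
            () -> Arrays.asList("restrictedPhrase", "fromApi"));    // stand-in for an API call
        Set<String> allow = merge(() -> Arrays.asList("fromApi"));
        System.out.println(effectiveDenyList(deny, allow));
    }
}
```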

Spring Boot Integration

The scanner can be registered as a Spring bean through a configuration class:

@Configuration
public class SensitiveWordConfig {

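    // Application-defined bean, assumed to implement IWordAllow
    @Autowired
    private CustomWhitelistProvider whitelist;

    // Application-defined bean, assumed to implement IWordDeny
    @Autowired
    private CustomBlacklistProvider blacklist;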

    @Bean
    public SensitiveWordBs sensitiveWordEngine() {
        return SensitiveWordBs.newInstance()
            .wordAllow(WordAllows.chains(WordAllows.defaults(), whitelist))
            .wordDeny(blacklist)
            .ignoreCase(true)
            .init();
    }
}

Administrative Console (sensitive-word-admin)

A companion web application for managing sensitive word policies:

  • Frontend: Vue.js with Element UI components
  • Backend: Spring Boot with JWT authentication
  • Security: Role-based access control using Spring Security
  • Data Store: Redis for caching, relational DB for persistence
  • Development Tools: Code generator for rapid CRUD scaffolding

This console enables non-developers to manage word lists, configure rules, and monitor detection statistics through a user-friendly interface.

Tags: DFA algorithm, text filtering, Java library, sensitive word detection, Spring Boot integration

Posted on Sun, 28 Jun 2026 17:39:45 +0000 by cajun225