Introduction to sensitive-word: A Robust Text Filtering Solution for Java Applications
The sensitive-word project is an open-source, high-performance Java library designed for efficient detection and handling of sensitive content in text. Built on the Deterministic Finite Automaton (DFA) algorithm, it delivers fast matching performance while supporting advanced features such as multi-level word tagging, character normalization, fuzzy matching, and extensible rule configurations.
GitHub Repository: https://github.com/houbb/sensitive-word
Core Features
- High Efficiency: Uses DFA-based scanning, so matching cost grows roughly linearly with input length (O(n))
- Flexible Tagging System: Supports categorized and hierarchical classification of sensitive terms
- Character Normalization: Handles full-width/half-width characters, simplified/traditional Chinese conversion
- Fuzzy Matching: Detects variations including pinyin transcriptions, visually similar characters, and stylized text
- Extensible Architecture: Allows custom dictionaries from databases or APIs
- Built-in Validation Rules: Includes support for detecting emails, URLs, IP addresses, and numeric patterns
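To make the DFA idea above concrete, here is a minimal sketch of trie-based keyword scanning using only the JDK. It is not the library's implementation: the class `TrieScanSketch` and its methods are hypothetical names chosen for illustration. Each trie node acts as one automaton state, and the scanner keeps the longest match found from each starting position. The per-position walk is bounded by the longest dictionary word, so for short terms the total cost stays close to linear in the input length.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of trie-based (DFA-style) scanning; not the library's code. */
final class TrieScanSketch {

    /** One automaton state: outgoing transitions plus an end-of-word flag. */
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean terminal;
    }

    private final Node root = new Node();

    /** Adds a word by walking or creating one state per character. */
    void addWord(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.terminal = true;
    }

    /** Returns the end index of the longest word starting at {@code start}, or {@code start} if none. */
    private int longestMatchEnd(String text, int start) {
        Node node = root;
        int end = start;
        for (int j = start; j < text.length(); j++) {
            node = node.next.get(text.charAt(j));
            if (node == null) {
                break;
            }
            if (node.terminal) {
                end = j + 1;
            }
        }
        return end;
    }

    /** Collects non-overlapping longest matches, scanning left to right. */
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = longestMatchEnd(text, i);
            if (end > i) {
                hits.add(text.substring(i, end));
                i = end;
            } else {
                i++;
            }
        }
        return hits;
    }

    /** Overwrites every matched character with the mask character. */
    String replace(String text, char mask) {
        StringBuilder out = new StringBuilder(text);
        int i = 0;
        while (i < text.length()) {
            int end = longestMatchEnd(text, i);
            if (end > i) {
                for (int k = i; k < end; k++) {
                    out.setCharAt(k, mask);
                }
                i = end;
            } else {
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TrieScanSketch scanner = new TrieScanSketch();
        scanner.addWord("gambling");
        scanner.addWord("drug");
        String input = "This text contains gambling and drug-related prohibited content";
        System.out.println(scanner.findAll(input));      // prints [gambling, drug]
        System.out.println(scanner.replace(input, '*'));
    }
}
```

The sketch matches exact characters only. Features such as case folding or width normalization would be applied to both the dictionary and the input before the walk.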
Quick Start Guide
Add the Maven dependency:
<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>sensitive-word</artifactId>
    <version>0.29.2</version>
</dependency>
Using the utility class SensitiveWordHelper for basic operations:
import com.github.houbb.sensitive.word.core.SensitiveWordHelper;

import java.util.List;

public class DetectionExample {
    public static void main(String[] args) {
        String input = "This text contains gambling and drug-related prohibited content";

        // Check whether any dictionary term is present
        boolean hasProhibited = SensitiveWordHelper.contains(input);
        System.out.println("Contains restricted terms: " + hasProhibited);

        // Extract all matches
        List<String> foundTerms = SensitiveWordHelper.findAll(input);
        System.out.println("Detected terms: " + foundTerms);

        // Replace matches with the default mask
        String cleaned = SensitiveWordHelper.replace(input);
        System.out.println("Masked output: " + cleaned);

        // Replace matches with a custom mask character
        String masked = SensitiveWordHelper.replace(input, '*');
        System.out.println("Custom masked: " + masked);
    }
}
Advanced Text Processing Capabilities
The framework normalizes various textual representations before analysis to maximize detection accuracy.
Case Insensitive Matching
String text = "fUcK offensive language here";
String detected = SensitiveWordHelper.findFirst(text); // Returns "fUcK"
Full/Half Width Character Handling
String wideText = "ｆｕｃｋ this message"; // full-width Latin letters
List<String> results = SensitiveWordHelper.findAll(wideText); // Detects the full-width "ｆｕｃｋ"
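As an illustration of what width normalization involves, the sketch below folds full-width ASCII variants to their half-width forms before matching. It is not taken from the library: `WidthNormalizeSketch` and `toHalfWidth` are hypothetical names. It relies on the Unicode layout in which U+FF01 through U+FF5E mirror ASCII U+0021 through U+007E at a fixed offset of 0xFEE0.

```java
/** Hypothetical sketch of full-width to half-width folding; not the library's code. */
final class WidthNormalizeSketch {

    /** Folds full-width ASCII variants and the ideographic space to half-width forms. */
    static String toHalfWidth(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '\uFF01' && c <= '\uFF5E') {
                // Full-width forms sit exactly 0xFEE0 above their ASCII counterparts
                out.append((char) (c - 0xFEE0));
            } else if (c == '\u3000') {
                // The ideographic space has no fixed-offset twin, so map it directly
                out.append(' ');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Escapes spell the full-width letters f, u, c, k
        System.out.println(toHalfWidth("\uFF46\uFF55\uFF43\uFF4B this message")); // prints "fuck this message"
    }
}
```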
Numerical Pattern Recognition
String mixedNum = "My WeChat: 9⓿二肆⁹₈③⑸⒋➃㈤㊄";
List<String> numbers = SensitiveWordBs.newInstance()
        .enableNumCheck(true)
        .init()
        .findAll(mixedNum); // Identifies the mixed-style digit sequence
Traditional/Simplified Chinese Support
String zhText = "我爱五星紅旗";
List<String> flags = SensitiveWordHelper.findAll(zhText); // Matches "五星紅旗"
Stylized Letter Detection
String fancyText = "Ⓕⓤc⒦ bad words";
List<String> styled = SensitiveWordHelper.findAll(fancyText); // Captures "Ⓕⓤc⒦"
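One way to picture stylized-letter handling is a fold from enclosed alphanumerics back to plain Latin letters. The sketch below is an assumption about the general technique, not the library's code, and `StylizedLetterSketch` is a hypothetical name. It covers three contiguous Unicode ranges: circled capitals (U+24B6 to U+24CF), circled small letters (U+24D0 to U+24E9), and parenthesized small letters (U+249C to U+24B5).

```java
/** Hypothetical sketch of folding enclosed letters to plain Latin; not the library's code. */
final class StylizedLetterSketch {

    /** Replaces circled and parenthesized Latin letters with their plain forms. */
    static String toPlainLetters(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '\u24B6' && c <= '\u24CF') {
                out.append((char) ('A' + (c - '\u24B6'))); // circled capital letters
            } else if (c >= '\u24D0' && c <= '\u24E9') {
                out.append((char) ('a' + (c - '\u24D0'))); // circled small letters
            } else if (c >= '\u249C' && c <= '\u24B5') {
                out.append((char) ('a' + (c - '\u249C'))); // parenthesized small letters
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Circled F, circled u, plain c, parenthesized k
        String fancy = "\u24BB\u24E4" + "c" + "\u24A6 bad words";
        System.out.println(toPlainLetters(fancy)); // prints "Fuck bad words"
    }
}
```

Combined with case folding, the folded text can then be matched against an ordinary lowercase dictionary.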
Duplicate Character Tolerance
String repeated = "ⒻⒻⒻfⓤuⓤ⒰cⓒ⒦ repeated abuse";
List<String> duplicates = SensitiveWordBs.newInstance()
        .ignoreRepeat(true)
        .init()
        .findAll(repeated);
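The simplest form of repeat tolerance is to collapse runs of the same character before matching. The sketch below shows that idea with a JDK regular expression. It is an assumption about the general technique rather than the library's algorithm, and `RepeatCollapseSketch` is a hypothetical name.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of collapsing repeated characters; not the library's code. */
final class RepeatCollapseSketch {

    // A character followed by one or more copies of itself
    private static final Pattern RUN = Pattern.compile("(.)\\1+");

    /** Reduces each run of identical characters to a single character. */
    static String collapse(String text) {
        return RUN.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("fuuuuck repeated abuse")); // prints "fuck repeated abuse"
    }
}
```

This blunt version also shortens legitimate double letters ("book" becomes "bok"), so dictionary terms would need the same collapsing applied, or the matcher would have to skip repeats only while it is inside a candidate match.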
Built-in Detection Strategies
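Beyond plain keywords, the library can also identify structured data such as email addresses, numeric sequences, URLs, and IPv4 addresses. Each check is enabled explicitly on the builder.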
import com.github.houbb.sensitive.word.bs.SensitiveWordBs;
// Package path assumed from the 0.29.x source layout; adjust if your version differs
import com.github.houbb.sensitive.word.support.check.WordChecks;

import java.util.List;

public class StructuredDataDetection {
    public static void main(String[] args) {
        // Email detection
        String emailText = "Contact me at admin@sensitive.com";
        List<String> emails = SensitiveWordBs.newInstance()
                .enableEmailCheck(true)
                .init()
                .findAll(emailText);

        // Numeric sequence detection
        String numText = "Secret code: 12345678";
        List<String> digits = SensitiveWordBs.newInstance()
                .enableNumCheck(true)
                .init()
                .findAll(numText);

        // URL recognition, including URLs without a scheme prefix
        String linkText = "Visit https://example.com or example.org";
        List<String> urls = SensitiveWordBs.newInstance()
                .enableUrlCheck(true)
                .wordCheckUrl(WordChecks.urlNoPrefix())
                .init()
                .findAll(linkText);

        // IPv4 address detection
        String ipText = "Server available at 192.168.1.1";
        List<String> ips = SensitiveWordBs.newInstance()
                .enableIpv4Check(true)
                .init()
                .findAll(ipText);
    }
}
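For intuition about what a structured-data check does, here is a small JDK-only sketch of IPv4 detection with a regular expression. It is not the library's checker: `Ipv4DetectSketch` and its pattern are hypothetical. Each octet is limited to 0 through 255, and lookarounds stop the pattern from matching inside longer dotted number sequences.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of regex-based IPv4 detection; not the library's code. */
final class Ipv4DetectSketch {

    // One octet in the range 0..255
    private static final String OCTET = "(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)";

    // Four octets, not preceded or followed by further digits or dotted digits
    private static final Pattern IPV4 = Pattern.compile(
            "(?<!\\d)(?<!\\d\\.)" + OCTET + "(?:\\." + OCTET + "){3}(?!\\d)(?!\\.\\d)");

    /** Returns every IPv4-looking token in the text, in order of appearance. */
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = IPV4.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(find("Server available at 192.168.1.1")); // prints [192.168.1.1]
        System.out.println(find("Build 999.1.1.1 is not an address")); // prints []
    }
}
```

The lookarounds are a deliberate trade-off: a sentence-ending period after an address is accepted, while a five-part sequence such as 1.2.3.4.5 is rejected outright.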
Fluent API Configuration
Configure the scanner using a builder pattern for clean, readable setup:
SensitiveWordBs scanner = SensitiveWordBs.newInstance()
        .ignoreCase(true)
        .ignoreWidth(true)
        .ignoreNumStyle(true)
        .ignoreChineseStyle(true)
        .ignoreEnglishStyle(true)
        .enableNumCheck(false)
        .enableEmailCheck(true)
        .enableUrlCheck(true)
        .enableIpv4Check(true)
        .numCheckLen(8)
        .charIgnore(SensitiveWordCharIgnores.defaults())
        .init();
String content = "Political symbols displayed publicly";
System.out.println(scanner.contains(content));
System.out.println(scanner.findFirst(content));
Dynamic Dictionary Management
The system supports runtime modification of the vocabulary set.
Add/Remove Terms Dynamically
SensitiveWordBs dynamicScanner = SensitiveWordBs.newInstance()
        .wordAllow(WordAllows.empty())
        .wordDeny(WordDenys.empty())
        .init();
// Initially empty
Assert.assertTrue(dynamicScanner.findAll("test add remove").isEmpty());
// Add individual words
dynamicScanner.addWord("dynamic");
dynamicScanner.addWord("runtime");
// Remove terms
dynamicScanner.removeWord("runtime");
// Bulk operations
dynamicScanner.addWord(Arrays.asList("batch", "update"));
dynamicScanner.removeWord("batch", "update");
Custom Dictionary Sources
Implement the IWordDeny interface to load terms from an external source such as a database:
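import com.github.houbb.sensitive.word.api.IWordDeny;
import com.github.houbb.sensitive.word.bs.SensitiveWordBs;

import java.util.Arrays;
import java.util.List;

public class DatabaseBackedFilter {
    public static void main(String[] args) {
        SensitiveWordBs dbScanner = SensitiveWordBs.newInstance()
                .wordDeny(new IWordDeny() {
                    @Override
                    public List<String> deny() {
                        // Load the deny list from a database query
                        return fetchFromDatabase();
                    }
                })
                .init();
    }

    private static List<String> fetchFromDatabase() {
        // Simulated DB access; replace with a real query
        return Arrays.asList("blockedTerm", "restrictedPhrase");
    }
}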
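When several sources feed one scanner, a common approach is to merge all deny lists and then subtract the allow lists. The sketch below shows that resolution step with plain JDK types. It is an assumption about how such merging can work, not a description of the library's internals: `WordSourceSketch`, `resolve`, and the use of `Supplier` in place of the library's interfaces are all hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical sketch of merging deny and allow sources; not the library's code. */
final class WordSourceSketch {

    /** Unions every deny source, then removes every allowed word. */
    static Set<String> resolve(List<Supplier<List<String>>> denySources,
                               List<Supplier<List<String>>> allowSources) {
        Set<String> words = new LinkedHashSet<>(); // keeps first-seen order, drops duplicates
        for (Supplier<List<String>> source : denySources) {
            words.addAll(source.get());
        }
        for (Supplier<List<String>> source : allowSources) {
            words.removeAll(source.get());
        }
        return words;
    }

    public static void main(String[] args) {
        // Stand-ins for a database query, a bundled file, and a whitelist
        Supplier<List<String>> database = () -> Arrays.asList("blockedTerm", "restrictedPhrase");
        Supplier<List<String>> bundledFile = () -> Arrays.asList("restrictedPhrase", "bannedWord");
        Supplier<List<String>> whitelist = () -> Arrays.asList("bannedWord");

        Set<String> effective = resolve(Arrays.asList(database, bundledFile), Arrays.asList(whitelist));
        System.out.println(effective); // prints [blockedTerm, restrictedPhrase]
    }
}
```

Because each source is invoked lazily through `get()`, the same resolution step can be re-run whenever the external data changes.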
Spring Boot Integration
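In a Spring application, register the engine as a bean in a configuration class. In the example below, CustomWhitelistProvider and CustomBlacklistProvider are application-defined beans, assumed to implement IWordAllow and IWordDeny respectively: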
@Configuration
public class SensitiveWordConfig {

    @Autowired
    private CustomWhitelistProvider whitelist;

    @Autowired
    private CustomBlacklistProvider blacklist;

    @Bean
    public SensitiveWordBs sensitiveWordEngine() {
        return SensitiveWordBs.newInstance()
                .wordAllow(WordAllows.chains(WordAllows.defaults(), whitelist))
                .wordDeny(blacklist)
                .ignoreCase(true)
                .init();
    }
}
Administrative Console (sensitive-word-admin)
A companion web application for managing sensitive word policies:
- Frontend: Vue.js with Element UI components
- Backend: Spring Boot with JWT authentication
- Security: Role-based access control using Spring Security
- Data Store: Redis for caching, relational DB for persistence
- Development Tools: Code generator for rapid CRUD scaffolding
This console enables non-developers to manage word lists, configure rules, and monitor detection statistics through a user-friendly interface.