When processing user input or data from external sources, special characters can cause unexpected mismatches. This article presents a practical approach to normalize and compare strings that contain such characters.
Converting Special Characters to Hexadecimal
The bin2hex() function can be used to inspect the raw bytes of a string:
$string = '1—1 12 (223), 【33】';
echo bin2hex($string);
From this output, we can identify the byte sequences:
- — (em dash) → e28094 (hex), which corresponds to bytes 226, 128, 148 in decimal.
- Full-width space (U+3000) → e38080 (hex), bytes 227, 128, 128.
- Regular ASCII characters map to a single byte (e.g., 1 is 0x31).
- CRLF (\r\n) is 0d0a in hex.
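To obtain the decimal values that chr() expects, the hex output can be split into individual bytes. The following is a minimal sketch: the helper name byteValues() is hypothetical and not part of the article's code, and the "\u{...}" escape syntax assumes PHP 7.0 or later.

```php
<?php
// Hypothetical helper: list each byte of a string as a decimal value.
function byteValues(string $s): array {
    // unpack('C*') yields unsigned bytes in a 1-indexed array; reindex from 0.
    return array_values(unpack('C*', $s));
}

$emDash = "\u{2014}"; // em dash, written as a Unicode escape

echo bin2hex($emDash), "\n";                  // e28094
echo implode(',', byteValues($emDash)), "\n"; // 226,128,148
```

The decimal values printed by the second line are the ones that would be passed to chr() when building a replacement pattern.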
Normalizing Strings for Comparison
The goal is to transform a string so that equivalent characters (e.g., different dash styles, full-width vs. half-width punctuation, whitespace variants) become identical. We can achieve this by replacing known byte sequences with a canonical form.
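To see why a direct comparison fails, consider two strings that look alike but differ at the byte level. This is only an illustrative sketch; the variable names are arbitrary and the "\u{...}" escapes assume PHP 7.0 or later.

```php
<?php
$withEmDash = "1\u{2014}1"; // "1", em dash, "1": the dash takes 3 bytes in UTF-8
$withHyphen = '1-1';        // ASCII hyphen: 1 byte

var_dump($withEmDash === $withHyphen); // bool(false)
var_dump(strlen($withEmDash));         // int(5)
var_dump(strlen($withHyphen));         // int(3)

// Replacing the em dash bytes with a hyphen makes the two strings identical.
$canonical = str_replace("\u{2014}", '-', $withEmDash);
var_dump($canonical === $withHyphen);  // bool(true)
```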
Example Requirements
- Remove all types of spaces (normal, full-width, non-breaking).
- Replace em dash — with hyphen -.
- Replace full-width parentheses （） with half-width ().
- Replace full-width slash ／ with half-width /.
- Replace full-width square brackets 【】 with half-width [].
- Convert all ASCII letters to uppercase.
- Remove carriage returns and newlines.
Implementation
<?php
function normalizeString(string $input): string {
    // The string type declaration already guarantees $input is a string.
    $output = $input;
    // Remove various spaces
    $output = str_replace(chr(194) . chr(160), '', $output);            // non-breaking space (U+00A0)
    $output = str_replace(chr(227) . chr(128) . chr(128), '', $output); // full-width space (U+3000)
    $output = str_replace(chr(32), '', $output);                        // normal space
    // Remove tab
    $output = str_replace(chr(9), '', $output);
    // Replace em dash (U+2014) with hyphen
    $output = str_replace(chr(226) . chr(128) . chr(148), '-', $output);
    // Replace full-width parentheses
    $output = str_replace(chr(239) . chr(188) . chr(136), '(', $output); // full-width ( (U+FF08)
    $output = str_replace(chr(239) . chr(188) . chr(137), ')', $output); // full-width ) (U+FF09)
    // Replace full-width slash (U+FF0F)
    $output = str_replace(chr(239) . chr(188) . chr(143), '/', $output);
    // Replace full-width square brackets
    $output = str_replace(chr(227) . chr(128) . chr(144), '[', $output); // 【 (U+3010)
    $output = str_replace(chr(227) . chr(128) . chr(145), ']', $output); // 】 (U+3011)
    // Remove carriage returns and newlines, whether paired (CRLF) or on their own
    $output = str_replace([chr(13), chr(10)], '', $output);
    // Convert ASCII letters to uppercase (locale-independent as of PHP 8.2)
    $output = strtoupper($output);
    return $output;
}
function compareNormalized(string $a, string $b): bool {
    return normalizeString($a) === normalizeString($b);
}
// Example usage
$keyword = '1—1 12 (223), 【33】';
echo "Original: $keyword\n";
$normalized = normalizeString($keyword);
echo "Normalized: $normalized\n";
$candidate = '1-1 12 (223), [33]';
if (compareNormalized($keyword, $candidate)) {
    echo "Strings match after normalization.\n";
} else {
    echo "Strings do not match.\n";
}
?>
Explanation of the Code
- Each replacement uses the exact byte sequence obtained from ord() or bin2hex().
- chr() converts a decimal byte value to its character representation.
- Normalization removes insignificant whitespace, standardizes punctuation, and enforces case uniformity.
- The compareNormalized() function allows a direct equality check after normalization.
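The same replacements could also be collected into a single lookup table and applied with strtr(). This is a sketch of an alternative, not the article's implementation: the function name normalizeWithMap() is hypothetical, the map covers only the characters listed in the requirements, and the "\u{...}" escapes assume PHP 7.0 or later.

```php
<?php
// Hypothetical alternative to normalizeString(): one lookup table, one pass.
function normalizeWithMap(string $input): string {
    static $map = [
        "\u{00A0}" => '',  // non-breaking space
        "\u{3000}" => '',  // full-width (ideographic) space
        ' '        => '',  // normal space
        "\t"       => '',  // tab
        "\r"       => '',  // carriage return
        "\n"       => '',  // newline
        "\u{2014}" => '-', // em dash
        "\u{FF08}" => '(', // full-width left parenthesis
        "\u{FF09}" => ')', // full-width right parenthesis
        "\u{FF0F}" => '/', // full-width slash
        "\u{3010}" => '[', // left bracket 【
        "\u{3011}" => ']', // right bracket 】
    ];
    // strtr() with an array replaces all keys in a single pass and does not
    // rescan text it has already substituted.
    return strtoupper(strtr($input, $map));
}

echo normalizeWithMap("1\u{2014}1\u{3000}12 \u{FF08}223\u{FF09}, \u{3010}ab\u{3011}\r\n"), "\n";
// 1-112(223),[AB]
```

Keeping the pairs in one array makes it easier to see which characters are covered and to add new ones, at the cost of hiding the raw byte values that chr() makes explicit.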
This method is efficient when the set of characters to normalize is known in advance. For more complex scenarios, encoding-aware mb_* functions (for UTF-8 input, for example) may be necessary.
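As a rough illustration of an encoding-aware approach, the sketch below uses PCRE Unicode properties together with mb_strtoupper(). It rests on several assumptions: the input is valid UTF-8, the function name normalizeUnicode() is hypothetical, and the mbstring extension may or may not be installed (the code falls back to strtoupper() if it is not). Unlike the byte-based version, it maps every dash-like character to a hyphen and uppercases non-ASCII letters too, so it is a broader rule set rather than a drop-in replacement.

```php
<?php
// Hypothetical Unicode-aware variant; assumes the input is valid UTF-8.
function normalizeUnicode(string $input): string {
    // \p{Z} covers Unicode separators (space, NBSP, U+3000, ...);
    // \s adds tab, CR and LF. The /u modifier makes PCRE treat the subject as UTF-8.
    $output = preg_replace('/[\p{Z}\s]+/u', '', $input);
    if ($output === null) {
        // preg_replace() returns null on error, e.g. malformed UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    // Map any dash punctuation (\p{Pd}: hyphen, en dash, em dash, ...) to "-".
    $output = preg_replace('/\p{Pd}/u', '-', $output);
    // mb_strtoupper() also uppercases non-ASCII letters when mbstring is available.
    return function_exists('mb_strtoupper')
        ? mb_strtoupper($output, 'UTF-8')
        : strtoupper($output);
}

// An en dash and a non-breaking space on one side, ASCII equivalents on the other.
var_dump(normalizeUnicode("a\u{2013}b\u{00A0}c") === normalizeUnicode("A-B C")); // bool(true)
```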