Handling and Comparing Special Characters in PHP

When processing user input or data from external sources, special characters can cause unexpected mismatches between strings that look the same. This article presents a practical approach to normalizing and comparing strings that contain such characters.

Converting Special Characters to Hexadecimal

The bin2hex() function can be used to inspect the raw bytes of a string:

$string = '1—1 12 (223), 【33】';
echo bin2hex($string);

From this output, we can identify the byte sequences:

  • — (em dash, U+2014) → e28094 (hex), which corresponds to bytes 226, 128, 148 in decimal.
  • U+3000 (full-width space) → e38080 (hex), bytes 227, 128, 128.
  • Regular ASCII characters map to a single byte (e.g., 1 is 0x31).
  • CRLF (\r\n) is 0d0a in hex.
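
To read off both the hex and the decimal form of a character in one step, a small helper along these lines can be used. This is a minimal sketch; the name describeBytes() is hypothetical and not part of the article's code.

```php
<?php

// Hypothetical helper: return the raw bytes of a string as a hex string
// and as a list of decimal values.
function describeBytes(string $s): array {
    return [
        'hex'     => bin2hex($s),
        // unpack() returns a 1-indexed array, so reindex it from 0.
        'decimal' => array_values(unpack('C*', $s)),
    ];
}

$info = describeBytes("\u{2014}"); // em dash
echo $info['hex'], "\n";                    // e28094
echo implode(', ', $info['decimal']), "\n"; // 226, 128, 148
```

The decimal values are the ones that would be passed to chr() when building a replacement.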

Normalizing Strings for Comparison

The goal is to transform a string so that equivalent characters (e.g., different dash styles, full-width vs. half-width punctuation, whitespace variants) become identical. We can achieve this by replacing known byte sequences with a canonical form.
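
As a small illustration of replacing a byte sequence with a canonical form, the sketch below shows three equivalent ways to write the em dash bytes in PHP and then swaps them for an ASCII hyphen. The variable names are only for illustration.

```php
<?php

// Three spellings of the same three bytes (the UTF-8 encoding of U+2014).
$fromChr     = chr(226) . chr(128) . chr(148);
$fromHex     = "\xE2\x80\x94";
$fromUnicode = "\u{2014}";

var_dump($fromChr === $fromHex && $fromHex === $fromUnicode); // bool(true)

// Replace the byte sequence with its canonical form, here an ASCII hyphen.
$canonical = str_replace($fromChr, '-', "1\u{2014}1");
echo $canonical, "\n"; // 1-1
```

Which spelling to use is a matter of taste; the "\u{...}" escape (PHP 7.0 and later) keeps the code point visible, while chr() matches the decimal values listed earlier.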

Example Requirements

  1. Remove all types of spaces (normal, full-width, non-breaking) and tabs.
  2. Replace em dash with hyphen -.
  3. Replace full-width parentheses （） with half-width ().
  4. Replace full-width slash ／ with half-width /.
  5. Replace full-width lenticular brackets 【】 with half-width square brackets [].
  6. Convert all ASCII letters to uppercase.
  7. Remove carriage returns and newlines.

Implementation

<?php

function normalizeString(string $input): string {
    $output = $input;

    // Remove various spaces
    $output = str_replace(chr(194) . chr(160), '', $output); // non-breaking space
    $output = str_replace(chr(227) . chr(128) . chr(128), '', $output); // full-width space (U+3000)
    $output = str_replace(chr(32), '', $output); // normal space

    // Remove tab
    $output = str_replace(chr(9), '', $output);

    // Replace em dash with hyphen
    $output = str_replace(chr(226) . chr(128) . chr(148), '-', $output);

    // Replace full-width parentheses
    $output = str_replace(chr(239) . chr(188) . chr(136), '(', $output); // full-width (
    $output = str_replace(chr(239) . chr(188) . chr(137), ')', $output); // full-width )

    // Replace full-width slash
    $output = str_replace(chr(239) . chr(188) . chr(143), '/', $output);

    // Replace lenticular brackets 【】 (U+3010, U+3011) with square brackets
    $output = str_replace(chr(227) . chr(128) . chr(144), '[', $output); // 【
    $output = str_replace(chr(227) . chr(128) . chr(145), ']', $output); // 】

    // Remove carriage returns and newlines, whether paired (CRLF) or alone
    $output = str_replace([chr(13), chr(10)], '', $output);

    // Convert ASCII letters to uppercase (strtoupper() ignores the locale
    // as of PHP 8.2; non-ASCII letters are left unchanged)
    $output = strtoupper($output);

    return $output;
}

function compareNormalized(string $a, string $b): bool {
    return normalizeString($a) === normalizeString($b);
}

// Example usage
$keyword = '1—1 12 (223), 【33】';
echo "Original: $keyword\n";
$normalized = normalizeString($keyword);
echo "Normalized: $normalized\n";

$candidate = '1-1 12 (223), [33]';
if (compareNormalized($keyword, $candidate)) {
    echo "Strings match after normalization.\n";
} else {
    echo "Strings do not match.\n";
}

?>

Explanation of the Code

  • Each replacement uses the exact byte sequence obtained from ord() or bin2hex().
  • chr() converts a decimal byte value to its character representation.
  • Normalization removes insignificant whitespace, standardizes punctuation, and enforces case uniformity.
  • The compareNormalized() function allows a direct equality check after normalization.
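
The same rules can also be written as a single lookup table passed to strtr(), which keeps every mapping in one place. The following is an alternative sketch, not the implementation above; the function name normalizeWithTable() is hypothetical, and it assumes UTF-8 input.

```php
<?php

// Alternative sketch: the normalization rules as one lookup table.
function normalizeWithTable(string $input): string {
    static $map = [
        "\u{00A0}" => '',  // non-breaking space
        "\u{3000}" => '',  // full-width space
        ' '        => '',  // normal space
        "\t"       => '',  // tab
        "\r"       => '',  // carriage return
        "\n"       => '',  // newline
        "\u{2014}" => '-', // em dash
        "\u{FF08}" => '(', // full-width (
        "\u{FF09}" => ')', // full-width )
        "\u{FF0F}" => '/', // full-width slash
        "\u{3010}" => '[', // left lenticular bracket
        "\u{3011}" => ']', // right lenticular bracket
    ];

    // strtr() applies all pairs in one pass; an empty value deletes the key.
    return strtoupper(strtr($input, $map));
}

echo normalizeWithTable("1\u{2014}1 12\u{3000}\u{FF08}223\u{FF09}, \u{3010}33\u{3011}"), "\n";
// 1-112(223),[33]
```

Adding another character then means adding one row to the table rather than another str_replace() call.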

This method works well for a small, known set of characters. For more complex scenarios, encoding-aware tools such as the mb_* functions (with the encoding, e.g., UTF-8, given explicitly) might be necessary.
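
As a rough sketch of what an encoding-aware variant might look like, the example below uses PCRE's /u modifier with Unicode property classes instead of fixed byte sequences, and mb_strtoupper() when the mbstring extension is available. The function name normalizeUnicodeAware() is hypothetical, the input is assumed to be valid UTF-8, and this version is broader than the rules above: it matches every Unicode separator and every dash-like character, not only the ones listed.

```php
<?php

// Hypothetical sketch: normalization driven by Unicode properties.
function normalizeUnicodeAware(string $input): string {
    // \p{Z} matches all Unicode separators (space, NBSP, U+3000, ...);
    // \s adds tab, CR, LF and the other ASCII whitespace characters.
    // preg_replace() returns null on invalid UTF-8, so fall back to the input.
    $output = preg_replace('/[\p{Z}\s]+/u', '', $input) ?? $input;

    // \p{Pd} matches dash punctuation (hyphen, en dash, em dash, ...).
    $output = preg_replace('/\p{Pd}/u', '-', $output) ?? $output;

    // Prefer mb_strtoupper() when mbstring is loaded; otherwise ASCII only.
    return function_exists('mb_strtoupper')
        ? mb_strtoupper($output, 'UTF-8')
        : strtoupper($output);
}

echo normalizeUnicodeAware("a\u{2014}b\u{3000}c\u{00A0}d\t\r\n"), "\n"; // A-BCD
```

The trade-off is precision: property classes catch variants nobody thought to list, but they may also merge characters that a particular application wants to keep distinct.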

Tags: PHP string normalization special characters character comparison ASCII

Posted on Tue, 19 May 2026 08:29:26 +0000 by marty_arl