Decoding CJK String Length and Display Width in Java

Java exposes strings as sequences of UTF-16 code units, so String.length() returns the number of 16-bit code units rather than the number of characters. Most common CJK characters occupy a single code unit, but in monospaced terminals and grid interfaces each one typically spans two horizontal cells. For layout purposes, one Chinese character therefore takes the same width as two ASCII (half-width) characters. Slicing by raw char index can split a surrogate pair, and padding or truncating by length() misaligns columns. Correct text processing keeps logical character indices separate from rendering metrics.
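The gap between code units and code points only appears for supplementary characters such as the CJK Extension B ideographs. The following is a minimal sketch of that difference; the class name CodeUnitDemo and its two helper methods are hypothetical names chosen for illustration and are not part of the program discussed in this post.

```java
class CodeUnitDemo {
    // Counts Unicode code points instead of UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // True if cutting s at char index `cut` would separate a surrogate pair.
    static boolean splitsSurrogatePair(String s, int cut) {
        return cut > 0 && cut < s.length()
                && Character.isHighSurrogate(s.charAt(cut - 1))
                && Character.isLowSurrogate(s.charAt(cut));
    }

    public static void main(String[] args) {
        String bmp = "算法";                     // two ideographs in the Basic Multilingual Plane
        String supplementary = "\uD840\uDC00";   // U+20000, one ideograph stored as a surrogate pair

        System.out.println(bmp.length() + " units, " + codePoints(bmp) + " code points");
        // prints "2 units, 2 code points"
        System.out.println(supplementary.length() + " units, " + codePoints(supplementary) + " code points");
        // prints "2 units, 1 code points"
        System.out.println(splitsSurrogatePair(supplementary, 1));
        // prints "true": substring(0, 1) would leave a lone high surrogate
    }
}
```

A check like splitsSurrogatePair is one way to guard an index-based cut; iterating by code point, as the program below does, avoids the problem altogether.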

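public class CharSpaceMapper {
    public static void main(String[] argv) {
        String data = "Algorithm算法";

        // Count code points, not UTF-16 code units, so a surrogate pair counts once.
        System.out.println("Logical count: " + data.codePointCount(0, data.length()));
        System.out.println("Terminal width: " + estimateTerminalWidth(data));

        splitByCellWidth(data, 3);
    }

    // Sums display cells: 2 for each wide code point, 1 for everything else.
    private static long estimateTerminalWidth(String payload) {
        long cells = 0;
        int pos = 0;
        while (pos < payload.length()) {
            int cp = payload.codePointAt(pos);
            cells += isWideChar(cp) ? 2 : 1;
            // Advance by 1 or 2 chars depending on whether cp is supplementary.
            pos += Character.charCount(cp);
        }
        return cells;
    }

    // Approximate check: CJK Unified Ideographs (U+4E00..U+9FFF) plus
    // CJK punctuation and kana (U+3000..U+30FF). This is not the full
    // East Asian Width table; Hangul and fullwidth forms are not covered.
    private static boolean isWideChar(int codepoint) {
        return (codepoint >= 0x4E00 && codepoint <= 0x9FFF) ||
               (codepoint >= 0x3000 && codepoint <= 0x30FF);
    }

    // Prints the input in bracketed chunks of at most maxWidth cells each.
    // A character wider than maxWidth is still emitted, alone in its own chunk.
    private static void splitByCellWidth(String input, int maxWidth) {
        StringBuilder buffer = new StringBuilder();
        long currentCells = 0;
        int pos = 0;
        int len = input.length();

        while (pos < len) {
            int cp = input.codePointAt(pos);
            long charCells = isWideChar(cp) ? 2 : 1;

            // Flush before appending if this character would overflow the chunk.
            // The buffer check avoids printing an empty "[]" when maxWidth < charCells.
            if (currentCells + charCells > maxWidth && buffer.length() > 0) {
                System.out.println("[" + buffer + "]");
                buffer.setLength(0);
                currentCells = 0;
            }

            buffer.appendCodePoint(cp);
            currentCells += charCells;
            pos += Character.charCount(cp);
        }
        // Emit whatever is left after the last flush.
        if (buffer.length() > 0) {
            System.out.println("[" + buffer + "]");
        }
    }
}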

The implementation replaces direct char indexing with codePointAt and Character.charCount, so a surrogate pair is always read and advanced over as a whole. The isWideChar predicate identifies CJK ideographs, CJK punctuation, and kana, assigning them a weight of 2 during layout evaluation; every other code point counts as 1. The chunking loop accumulates cells as it goes and flushes the buffer when the next character would push the chunk past the specified width, so a chunk never ends in the middle of a character. With a limit of 3, "Algorithm算法" is split into [Alg], [ori], [thm], [算], and [法]. The predicate is only an approximation: adjusting or extending the hexadecimal range constants adapts it to other scripts that render at double width.
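As an illustration of extending the range constants, here is a sketch that moves them into a sorted table and adds Hangul, fullwidth forms, and the supplementary ideographic planes. The class WideRanges and its method names are hypothetical. The ranges are a hand-picked subset of the characters that Unicode's East Asian Width property marks as Wide or Fullwidth, not the complete table, and zero-width combining marks are not handled.

```java
class WideRanges {
    // Inclusive {start, end} code point pairs, sorted by start.
    // Assumption: a partial list, chosen for illustration only.
    private static final int[][] WIDE = {
        {0x1100, 0x115F},   // Hangul Jamo leading consonants
        {0x3000, 0x303E},   // CJK symbols and punctuation
        {0x3041, 0x30FF},   // Hiragana and Katakana
        {0x3400, 0x4DBF},   // CJK Unified Ideographs Extension A
        {0x4E00, 0x9FFF},   // CJK Unified Ideographs
        {0xAC00, 0xD7A3},   // Hangul syllables
        {0xFF01, 0xFF60},   // Fullwidth ASCII variants
        {0xFFE0, 0xFFE6},   // Fullwidth currency and other signs
        {0x20000, 0x2FFFD}, // Supplementary Ideographic Plane
    };

    // Linear scan over the sorted table; stops early once cp is below a range start.
    static boolean isWide(int cp) {
        for (int[] range : WIDE) {
            if (cp < range[0]) {
                return false; // table is sorted, so no later range can match
            }
            if (cp <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // Estimated cell width: 2 per wide code point, 1 per other code point.
    static int width(String s) {
        return s.codePoints().map(cp -> isWide(cp) ? 2 : 1).sum();
    }

    public static void main(String[] args) {
        System.out.println(width("Algorithm算法")); // prints 13
        System.out.println(width("한글"));          // prints 4
        System.out.println(width("\uD840\uDC00"));  // prints 2 (U+20000)
    }
}
```

Keeping the ranges in one table means adding a script is a one-line change, and a binary search could replace the linear scan if the table grows.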

Tags: java String Manipulation unicode CJK Character Encoding

Posted on Fri, 05 Jun 2026 18:01:01 +0000 by asolell
