Rounding and Saturation Truncation for Fixed-Point Data in Verilog

In digital signal processing implementations, data quantization and bit truncation are common operations. This article covers two critical techniques: rounding to preserve precision during truncation, and saturation handling when results exceed representable ranges.

Data Format Notation

Fixed-point data formats follow the mQn convention where m represents total bit width (including sign bit) and n represents fractional bit width. For example, a 16Q13 format indicates a 16-bit signed number with 13 fractional bits.

Key characteristics of mQn format:

Signed representation with MSB as sign bit
Total width equals m bits
Fractional width equals n bits

Signed Number Fundamentals

Signed vs Unsigned Representation

The fundamental difference lies in the MSB weight. Consider a 4-bit value 4'b1011:

Unsigned interpretation: 1×2³ + 0×2² + 1×2¹ + 1×2⁰ = 11
Signed interpretation: 1×(-2³) + 0×2² + 1×2¹ + 1×2⁰ = -5

Data ranges differ accordingly:

4-bit unsigned: 0 to 15
4-bit signed: -8 to 7

General ranges:

Unsigned (m bits): 0 to 2^m - 1
Signed (m bits): -2^(m-1) to 2^(m-1) - 1
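
The two interpretations differ only in the weight given to the MSB. A minimal Python sketch reproduces the 4'b1011 example and the general ranges; the helper names `as_unsigned`, `as_signed`, and `int_range` are illustrative and not part of the original design:

```python
def as_unsigned(bits: str) -> int:
    """Read a bit string with every bit weighted positively."""
    return int(bits, 2)


def as_signed(bits: str) -> int:
    """Read a bit string as two's complement: the MSB weighs -2^(m-1)."""
    m = len(bits)
    value = int(bits, 2)
    return value - (1 << m) if bits[0] == "1" else value


def int_range(m: int, signed: bool) -> tuple:
    """Smallest and largest integer representable in m bits."""
    if signed:
        return -(1 << (m - 1)), (1 << (m - 1)) - 1
    return 0, (1 << m) - 1


print(as_unsigned("1011"), as_signed("1011"))   # 11 -5
print(int_range(4, False), int_range(4, True))  # (0, 15) (-8, 7)
```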

Sign Extension for Signed Integers

Extending a signed integer requires duplicating the sign bit:

4'b0101 (positive, +5) extended to 6 bits: 6'b000101
4'b1011 (negative, -5) extended to 6 bits: 6'b111011

Verification for negative case:

4'b1011 = 1×(-2³) + 0×2² + 1×2¹ + 1×2⁰ = -5
6'b111011 = 1×(-2⁵) + 1×2⁴ + 1×2³ + 0×2² + 1×2¹ + 1×2⁰ = -5
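
Sign extension can be checked mechanically: repeating the sign bit on the left must leave the two's complement value unchanged. The sketch below works on bit strings for readability; `sign_extend` and `to_int` are hypothetical helper names chosen for this example:

```python
def sign_extend(bits: str, width: int) -> str:
    """Widen a two's complement bit string by repeating its sign bit."""
    if width < len(bits):
        raise ValueError("target width is narrower than the input")
    return bits[0] * (width - len(bits)) + bits


def to_int(bits: str) -> int:
    """Two's complement value of a bit string."""
    value = int(bits, 2)
    return value - (1 << len(bits)) if bits[0] == "1" else value


for original in ("0101", "1011"):
    extended = sign_extend(original, 6)
    # The value must be identical before and after extension.
    print(original, "->", extended, to_int(original), to_int(extended))
```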

Signed Fractional Numbers

For 4Q2 format with value 4'b1011:

4'b10.11 = 1×(-2¹) + 0×2⁰ + 1×2⁻¹ + 1×2⁻² = -2 + 0 + 0.5 + 0.25 = -1.25

mQn format range: -2^(m-n-1) to 2^(m-n-1) - 1/2^n

Fractional extension: integer part uses sign extension, fractional part appends zeros.
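
As a quick check of the value, range, and extension rules for mQn numbers, here is a small Python sketch. The function names (`q_value`, `q_range`, `q_extend`) are assumptions made for illustration only:

```python
def q_value(bits: str, n: int) -> float:
    """Value of a signed fixed-point bit string with n fractional bits."""
    raw = int(bits, 2)
    if bits[0] == "1":
        raw -= 1 << len(bits)
    return raw / (1 << n)


def q_range(m: int, n: int) -> tuple:
    """Minimum and maximum value representable in mQn."""
    return -(2.0 ** (m - n - 1)), 2.0 ** (m - n - 1) - 2.0 ** (-n)


def q_extend(bits: str, n: int, new_m: int, new_n: int) -> str:
    """Extend mQn to new_m Q new_n: sign bits on the left, zeros on the right."""
    frac_pad = new_n - n
    int_pad = (new_m - new_n) - (len(bits) - n)
    if frac_pad < 0 or int_pad < 0:
        raise ValueError("target format is narrower than the source")
    return bits[0] * int_pad + bits + "0" * frac_pad


print(q_value("1011", 2))                    # -1.25
print(q_range(4, 2))                         # (-2.0, 1.75)
wide = q_extend("1011", 2, 6, 3)
print(wide, q_value(wide, 3))                # 110110 -1.25
```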

Arithmetic Operations

Addition with Sign Extension

When adding two signed numbers, first align the binary points (append zeros to the fractional part and sign-extend the integer part), then extend both operands by one more sign bit so the sum cannot overflow.

Example: Add 5Q2 value 5'b100.01 with 4Q3 value 4'b1.011

Extend 5'b100.01 to 6Q3: 6'b100.010
Sign-extend 4'b1.011 to 6Q3: 6'b111.011
Extend both to 7Q3, then add: 7'b1100.010 + 7'b1111.011 = 7'b1011.101 (-3.75 + -0.625 = -4.375)
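
The alignment steps can be modelled with plain integers. This is a sketch under the assumption that operands are passed as signed raw integers (the stored bit pattern read as two's complement); `add_q` is a hypothetical helper, not part of the Verilog design:

```python
def add_q(a_raw, a_m, a_n, b_raw, b_m, b_n):
    """Add two signed fixed-point numbers; returns (raw, m, n) of the sum."""
    n = max(a_n, b_n)                      # common fractional width
    int_bits = max(a_m - a_n, b_m - b_n)   # common integer width (with sign)
    a_aligned = a_raw << (n - a_n)         # append zeros to the fraction
    b_aligned = b_raw << (n - b_n)
    m = int_bits + n + 1                   # one extra bit guards the carry
    return a_aligned + b_aligned, m, n


# 5'b100.01 is raw -15 in 5Q2 (-3.75); 4'b1.011 is raw -5 in 4Q3 (-0.625)
raw, m, n = add_q(-15, 5, 2, -5, 4, 3)
print(raw, m, n)                               # -35 7 3
print(format(raw & ((1 << m) - 1), f"0{m}b"))  # 1011101
print(raw / (1 << n))                          # -4.375
```

Python integers are unbounded, so the sign extension is implicit; the returned width `m` records how many bits the hardware adder would need.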

Multiplication Result Width

For two signed operands mQn and aQb, the product requires (m+a) total bits and (n+b) fractional bits.

Example: Two 4Q2 values multiplied yield an 8Q4 result.
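
The width rule can be confirmed by brute force for small formats. The following sketch (with the illustrative helper `mul_q`) multiplies raw signed integers and checks that every 4Q2 by 4Q2 product fits in 8 signed bits:

```python
def mul_q(a_raw, a_m, a_n, b_raw, b_m, b_n):
    """Multiply two signed fixed-point numbers; returns (raw, m, n)."""
    return a_raw * b_raw, a_m + b_m, a_n + b_n


# Worst case for 4Q2: (-2.0) * (-2.0) = +4.0
raw, m, n = mul_q(-8, 4, 2, -8, 4, 2)
print(raw, m, n, raw / (1 << n))   # 64 8 4 4.0

# Exhaustive check that all 4-bit signed products fit in 8 signed bits
lo, hi = -(1 << 7), (1 << 7) - 1
fits = all(lo <= a * b <= hi for a in range(-8, 8) for b in range(-8, 8))
print(fits)                        # True
```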

Rounding Implementation

When truncating fractional bits, rounding improves accuracy over simple truncation.

Rounding Logic

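For a non-negative number, round up (add 1 after truncation) if the MSB of the truncated bits is 1. For a negative number, add 1 only if the truncated MSB is 1 and at least one lower truncated bit is also 1, so that an exact half rounds away from zero for both signs.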

Verilog rounding implementation:

module rounding_unit #(
    parameter WIDTH_IN  = 9,
    parameter WIDTH_OUT = 6,
    parameter FRAC_IN   = 6,
    parameter FRAC_OUT  = 3
) (
    input  signed [WIDTH_IN-1:0]  i_data,
    output signed [WIDTH_OUT-1:0] o_data
);

    // Number of fractional LSBs to drop (must be at least 2 for the
    // part-select below; WIDTH_OUT is assumed to be WIDTH_IN - TRUNC_BITS)
    localparam TRUNC_BITS = FRAC_IN - FRAC_OUT;
    
    // Determine carry bit for rounding
    // For positive: carry = truncated_msb
    // For negative: carry = truncated_msb & (|truncated_rest)
    wire carry;
    assign carry = i_data[WIDTH_IN-1] 
                   ? (i_data[TRUNC_BITS-1] & (|i_data[TRUNC_BITS-2:0]))
                   : i_data[TRUNC_BITS-1];
    
    // Drop the truncated bits, extend the sign bit by one, and add the
    // carry at the new LSB position
    wire signed [WIDTH_IN-TRUNC_BITS:0] sum;
    assign sum = {i_data[WIDTH_IN-1], i_data[WIDTH_IN-1:TRUNC_BITS]} + carry;
    
    // No saturation here: rounding up the largest positive value wraps
    assign o_data = sum[WIDTH_OUT-1:0];
endmodule

Rounding for Negative Numbers

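Truncating a two's complement value always rounds toward negative infinity. For a negative number with exactly half an LSB discarded, the truncated result is therefore already the value farther from zero, and no carry is needed. When more than half an LSB is discarded, adding 1 moves the result up to the nearest representable value.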
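
A bit-accurate software model makes the carry rule easy to test. The Python sketch below mirrors the carry logic of the module (truncated MSB for non-negative inputs, truncated MSB AND any lower bit for negative inputs); the function name `round_half_away` is an assumption for this example:

```python
def round_half_away(raw: int, trunc_bits: int) -> int:
    """Drop trunc_bits LSBs of a signed integer, rounding to nearest with
    ties away from zero."""
    if trunc_bits < 1:
        raise ValueError("trunc_bits must be at least 1")
    floored = raw >> trunc_bits               # arithmetic shift: toward -inf
    dropped = raw & ((1 << trunc_bits) - 1)   # discarded low bits, unsigned
    half = 1 << (trunc_bits - 1)
    msb = (dropped & half) != 0               # MSB of the discarded bits
    rest = (dropped & (half - 1)) != 0        # any lower discarded bit set
    carry = (msb and rest) if raw < 0 else msb
    return floored + int(carry)


# 9Q6 -> 6Q3 drops 3 bits, so the inputs below are in units of 1/8 output LSB
print([round_half_away(r, 3) for r in (12, 11, -11, -12, -13)])
# [2, 1, -1, -2, -2]
```

Both 12/8 = 1.5 and -12/8 = -1.5 are exact ties, and they land on 2 and -2 respectively, which is the symmetric behaviour described above.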

Saturation Truncation

Saturation clamps values that exceed the target range to the minimum or maximum representable value.

Example: Convert 6Q3 value 6'b011.111 (3.875) to 4Q2 format (max: 1.75)

Since 3.875 > 1.75, saturation clips to 4'b01.11 (1.75).

Example: Convert 6Q3 value 6'b100.111 (-3.125) to 4Q2 format (min: -2.0)

Since -3.125 < -2.0, saturation clips to 4'b10.00 (-2.0).
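
Both examples can be reproduced with a clamp on the raw integer. This sketch assumes the extra fractional bit is simply truncated before clamping (the examples saturate either way); `saturate`, `q6_3_to_q4_2`, and `bits` are hypothetical helper names:

```python
def saturate(raw: int, m: int) -> int:
    """Clamp a signed integer to the range of an m-bit two's complement word."""
    lo = -(1 << (m - 1))
    hi = (1 << (m - 1)) - 1
    return max(lo, min(hi, raw))


def q6_3_to_q4_2(raw_6q3: int) -> int:
    """Convert a 6Q3 raw value to 4Q2: drop one fractional bit, then clamp."""
    truncated = raw_6q3 >> 1   # plain truncation, no rounding in this sketch
    return saturate(truncated, 4)


def bits(raw: int, m: int) -> str:
    """m-bit two's complement bit string of a signed integer."""
    return format(raw & ((1 << m) - 1), f"0{m}b")


print(bits(q6_3_to_q4_2(31), 4))    # 6'b011.111 (3.875)  -> 0111 (1.75)
print(bits(q6_3_to_q4_2(-25), 4))   # 6'b100.111 (-3.125) -> 1000 (-2.0)
print(bits(q6_3_to_q4_2(6), 4))     # 6'b000.110 (0.75)   -> 0011 (0.75)
```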

Complete Design Example: S = A + B × C

Requirements

Input A: 16Q14 format
Input B: 16Q14 format
Input C: 16Q15 format
Output S: 16Q14 format
Apply rounding during truncation
Apply saturation when overflow occurs

Bit-Width Analysis

Multiplication: B(16Q14) × C(16Q15) = 32Q29
Addition: Extend A to 32Q29, then both operands to 33Q29 for safe addition
Final truncation: 33Q29 → 16Q14 with rounding and saturation
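
Before writing RTL, the bit-width analysis can be captured in a short integer model. The Python sketch below follows the three steps listed above for the specific formats in the requirements; it is a reference model written for this article's example, with illustrative function names, and not a substitute for the Verilog:

```python
def round_half_away(raw: int, trunc_bits: int) -> int:
    """Drop trunc_bits LSBs, rounding to nearest with ties away from zero."""
    floored = raw >> trunc_bits
    dropped = raw & ((1 << trunc_bits) - 1)
    half = 1 << (trunc_bits - 1)
    msb = (dropped & half) != 0
    rest = (dropped & (half - 1)) != 0
    carry = (msb and rest) if raw < 0 else msb
    return floored + int(carry)


def saturate(raw: int, m: int) -> int:
    """Clamp to the m-bit two's complement range."""
    return max(-(1 << (m - 1)), min((1 << (m - 1)) - 1, raw))


def mac_16q14(a: int, b: int, c: int) -> int:
    """S = A + B*C with A, B in 16Q14 and C in 16Q15 (signed raw integers).
    Returns S as a signed 16Q14 raw integer."""
    product = b * c                  # 16Q14 * 16Q15 = 32Q29
    a_aligned = a << (29 - 14)       # 16Q14 -> Q29: append 15 zeros
    total = product + a_aligned      # fits in 33Q29
    rounded = round_half_away(total, 29 - 14)   # back to 14 fractional bits
    return saturate(rounded, 16)


print(mac_16q14(16384, 8192, 16384))     # 1.0 + 0.5*0.5 = 1.25 -> 20480
print(mac_16q14(32767, -32768, -32768))  # overflows high -> 32767
print(mac_16q14(-32768, 32767, -32768))  # overflows low  -> -32768
```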

Verilog Implementation

module saturated_multiply_add #(
    parameter A_WIDTH = 16,
    parameter B_WIDTH = 16,
    parameter C_WIDTH = 16,
    parameter A_FRAC  = 14,
    parameter B_FRAC  = 14,
    parameter C_FRAC  = 15,
    parameter S_FRAC  = 14
) (
    input                           clk,
    input                           rst_n,
    input  signed [A_WIDTH-1:0]      i_a,
    input  signed [B_WIDTH-1:0]      i_b,
    input  signed [C_WIDTH-1:0]      i_c,
    output signed [A_WIDTH-1:0]      o_s
);

    // Registered inputs
    reg signed [A_WIDTH-1:0] r_a;
    reg signed [B_WIDTH-1:0] r_b;
    reg signed [C_WIDTH-1:0] r_c;
    
    // Multiplication result: (16+16)Q(14+15) = 32Q29
    wire signed [31:0] mult_result;
    
    // Extended A for alignment: 32Q29
    wire signed [31:0] a_extended;
    
    // Addition with sign extension: 33Q29
    wire signed [32:0] sum_result;
    
    // Rounding intermediate: 19Q14
    wire               carry_bit;
    wire signed [18:0] rounded_value;
    
    // Pipeline registers
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            r_a <= 'd0;
            r_b <= 'd0;
            r_c <= 'd0;
        end else begin
            r_a <= i_a;
            r_b <= i_b;
            r_c <= i_c;
        end
    end
    
    // Step 1: Multiplication
    assign mult_result = r_b * r_c;
    
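    // Step 2: Extend A to match fractional precision (16Q14 -> 32Q29).
    // 29 - 14 = 15 zeros are appended; this equals C_FRAC only because A_FRAC == B_FRAC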
    assign a_extended = {r_a[A_WIDTH-1], r_a, {C_FRAC{1'b0}}};
    
    // Step 3: Add with sign extension
    assign sum_result = {mult_result[31], mult_result} 
                      + {a_extended[31], a_extended};
    
    // Step 4: Calculate rounding carry bit
    // Truncate 29 - 14 = 15 fractional bits (positions 14:0); the truncated MSB
    // is bit 14, which coincides with S_FRAC for these formats only
    assign carry_bit = sum_result[32]
                       ? (sum_result[S_FRAC] & (|sum_result[S_FRAC-1:0]))
                       : sum_result[S_FRAC];
    
    // Step 5: Apply rounding with sign extension
    assign rounded_value = {sum_result[32], sum_result[32:S_FRAC+1]} + carry_bit;
    
    // Step 6: Saturation check and final output
    // Check if truncation would lose non-sign bits
    wire [3:0] sign_check;
    assign sign_check = rounded_value[18:15];
    
    wire overflow;
    assign overflow = (sign_check != 4'b0000) && (sign_check != 4'b1111);
    
    assign o_s = overflow
                 ? (rounded_value[18] ? 16'b1000000000000000 : 16'b0111111111111111)
                 : rounded_value[15:0];

endmodule

Testbench with Matlab Verification

Testbench Structure

`timescale 1ns / 1ps

module tb_saturated_ma;
    reg        clk;
    reg        rst_n;
    reg [15:0] a_data;
    reg [15:0] b_data;
    reg [15:0] c_data;
    wire [15:0] s_out;
    
    parameter DATA_COUNT = 4096;
    
    // Memory arrays for test vectors
    reg [15:0] mem_a [0:DATA_COUNT-1];
    reg [15:0] mem_b [0:DATA_COUNT-1];
    reg [15:0] mem_c [0:DATA_COUNT-1];
    
    reg [13:0] addr;
    reg        data_valid;
    
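    // Explicit named connections: the testbench signal names differ from the
    // DUT port names, so an implicit .* connection would not bind them
    saturated_multiply_add u_dut (
        .clk   (clk),
        .rst_n (rst_n),
        .i_a   (a_data),
        .i_b   (b_data),
        .i_c   (c_data),
        .o_s   (s_out)
    );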
    
    initial begin
        clk = 0;
        rst_n = 0;
        #67 rst_n = 1;
    end
    
    always #5 clk = ~clk;
    
    initial begin
        $readmemh("stimuli/a_16Q14.txt", mem_a);
        $readmemh("stimuli/b_16Q14.txt", mem_b);
        $readmemh("stimuli/c_16Q15.txt", mem_c);
    end
    
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            a_data <= 'd0;
            b_data <= 'd0;
            c_data <= 'd0;
            addr <= 'd0;
            data_valid <= 1'b0;
        end else if (addr == DATA_COUNT) begin
            addr <= addr;
            data_valid <= 1'b0;
        end else begin
            a_data <= mem_a[addr];
            b_data <= mem_b[addr];
            c_data <= mem_c[addr];
            addr <= addr + 1'b1;
            data_valid <= 1'b1;
        end
    end
    
    integer fid;
    initial begin
        fid = $fopen("results/s_vivado.txt", "w");
        if (!fid) begin
            $display("File open failed");
            $finish;
        end
    end
    
    reg data_valid_d1;
    always @(posedge clk) data_valid_d1 <= data_valid;
    
    always @(posedge clk) begin
        if (data_valid_d1)
            $fdisplay(fid, "%d", $signed(s_out));
        else if (addr == DATA_COUNT) begin
            $fclose(fid);
            $finish;
        end
    end
endmodule

Matlab Test Vector Generation

clear; clc;

num_samples = 4096;

% Define input ranges based on fixed-point formats
a_range = [-2, 2 - 2^(-14)];   % 16Q14
b_range = [-2, 2 - 2^(-14)];   % 16Q14
c_range = [-1, 1 - 2^(-15)];   % 16Q15

% Generate random test data
a_rand = a_range(1) + diff(a_range) * rand(1, num_samples - 8);
b_rand = b_range(1) + diff(b_range) * rand(1, num_samples - 8);
c_rand = c_range(1) + diff(c_range) * rand(1, num_samples - 8);

% Boundary cases: all combinations of min/max values
a_bdry = [a_range(1) a_range(1) a_range(1) a_range(1) ...
          a_range(2) a_range(2) a_range(2) a_range(2)];
b_bdry = [b_range(1) b_range(1) b_range(2) b_range(2) ...
          b_range(1) b_range(1) b_range(2) b_range(2)];
c_bdry = [c_range(1) c_range(2) c_range(1) c_range(2) ...
          c_range(1) c_range(2) c_range(1) c_range(2)];

% Combine boundary and random cases
a = [a_bdry, a_rand];
b = [b_bdry, b_rand];
c = [c_bdry, c_rand];

% Quantization patterns: fixed-point, round to nearest, saturate
q_16Q14 = quantizer('fixed', 'round', 'saturate', [16, 14]);
q_16Q15 = quantizer('fixed', 'round', 'saturate', [16, 15]);

% Apply quantization
a_q = quantize(q_16Q14, a);
b_q = quantize(q_16Q14, b);
c_q = quantize(q_16Q15, c);

% Compute reference result
s_ref = a_q + b_q .* c_q;
s_q = quantize(q_16Q14, s_ref);

% Convert to integer representation for file output
a_int = a_q * 2^14;
b_int = b_q * 2^14;
c_int = c_q * 2^15;
s_int = s_q * 2^14;

% Convert to two's complement bit patterns for hex file output
% (cast to int16 first so negative values survive, then reinterpret the bits)
a_hex = typecast(int16(a_int), 'uint16');
b_hex = typecast(int16(b_int), 'uint16');
c_hex = typecast(int16(c_int), 'uint16');

% Write test vectors
fid_a = fopen('stimuli/a_16Q14.txt', 'w');
fprintf(fid_a, '%04x\n', uint16(a_hex));
fclose(fid_a);

fid_b = fopen('stimuli/b_16Q14.txt', 'w');
fprintf(fid_b, '%04x\n', uint16(b_hex));
fclose(fid_b);

fid_c = fopen('stimuli/c_16Q15.txt', 'w');
fprintf(fid_c, '%04x\n', uint16(c_hex));
fclose(fid_c);

fid_s = fopen('results/s_matlab.txt', 'w');
fprintf(fid_s, '%d\n', s_int);
fclose(fid_s);
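
For readers without Matlab, the quantize-and-encode step can be sketched in Python. This is only an approximation of the flow above: it assumes round-to-nearest with ties away from zero plus saturation, and the helper names `quantize_raw` and `to_hex16` are invented for this example. The key point is that a negative value must be written as its 16-bit two's complement pattern for $readmemh to load it correctly:

```python
import math


def quantize_raw(x: float, m: int, n: int) -> int:
    """Quantize a real value to a signed mQn raw integer
    (round to nearest, ties away from zero, then saturate)."""
    scaled = x * (1 << n)
    rounded = int(math.copysign(math.floor(abs(scaled) + 0.5), scaled))
    lo, hi = -(1 << (m - 1)), (1 << (m - 1)) - 1
    return max(lo, min(hi, rounded))


def to_hex16(raw: int) -> str:
    """16-bit two's complement pattern as four hex digits."""
    if not -32768 <= raw <= 32767:
        raise ValueError("value does not fit in 16 signed bits")
    return format(raw & 0xFFFF, "04x")


for value in (0.5, -0.5, -2.0, 2.0):
    raw = quantize_raw(value, 16, 14)
    print(value, raw, to_hex16(raw))
# 0.5 8192 2000
# -0.5 -8192 e000
# -2.0 -32768 8000
# 2.0 32767 7fff
```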

Result Verification

clear; clc;

% Read results from Matlab and simulator
fid_mat = fopen('results/s_matlab.txt', 'r');
s_mat = cell2mat(textscan(fid_mat, '%d'));
fclose(fid_mat);

fid_sim = fopen('results/s_vivado.txt', 'r');
s_sim = cell2mat(textscan(fid_sim, '%d'));
fclose(fid_sim);

% Bit-accurate comparison
pass_count = sum(s_sim == s_mat);
disp(['Matching samples: ', num2str(pass_count), ' / ', num2str(length(s_mat))]);

if pass_count == length(s_mat)
    disp('Verification PASSED');
else
    disp('Verification FAILED');
end

Code Coverage Verification

Using Synopsys VCS, ensure condition coverage reaches 100%. The key conditions to verify:

Sign bit states in multiplication
Rounding carry generation for both positive and negative inputs
Saturation detection for overflow/underflow scenarios
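
Random stimulus rarely hits exact rounding ties, so a few directed vectors help close condition coverage. The sketch below is one possible way to pick them: it classifies an input triple by the sign of the 33Q29 sum, the rounding carry, and the saturation flag, using the same arithmetic as the design. The function name `classify` and the chosen vectors are assumptions for illustration:

```python
def classify(a: int, b: int, c: int) -> tuple:
    """Return (negative, carry, overflow) for signed raw inputs
    a, b in 16Q14 and c in 16Q15."""
    total = b * c + (a << 15)             # 33Q29 sum
    negative = total < 0
    dropped = total & 0x7FFF              # the 15 truncated bits
    msb = (dropped & 0x4000) != 0
    rest = (dropped & 0x3FFF) != 0
    carry = (msb and rest) if negative else msb
    rounded = (total >> 15) + int(carry)  # 19Q14 intermediate
    overflow = not -32768 <= rounded <= 32767
    return negative, carry, overflow


directed = [
    (0, 0, 0),                # positive, no carry, no saturation
    (0, 1, 16384),            # positive tie: carry set
    (0, -1, 16384),           # negative tie: carry suppressed
    (0, -1, 16383),           # negative, more than half dropped: carry set
    (32767, -32768, -32768),  # positive saturation
    (-32768, 32767, -32768),  # negative saturation
]
for vec in directed:
    print(vec, classify(*vec))
```

Appending vectors like these to the random set exercises each branch of the carry and saturation logic at least once.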

All 4096 test vectors pass verification when compared against Matlab reference results, confirming correct implementation of rounding and saturation logic.

Tags: Verilog FPGA Digital Signal Processing Fixed-Point Arithmetic Rounding

Posted on Sun, 10 May 2026 04:08:23 +0000 by johnska7
