In digital signal processing implementations, data quantization and bit truncation are common operations. This article covers two critical techniques: rounding to preserve precision during truncation, and saturation handling when results exceed representable ranges.
Data Format Notation
Fixed-point data formats follow the mQn convention where m represents total bit width (including sign bit) and n represents fractional bit width. For example, a 16Q13 format indicates a 16-bit signed number with 13 fractional bits.
Key characteristics of mQn format:
- Signed representation with MSB as sign bit
- Total width equals m bits
- Fractional width equals n bits
Signed Number Fundamentals
Signed vs Unsigned Representation
The fundamental difference lies in the MSB weight. Consider a 4-bit value 4'b1011:
- Unsigned interpretation: 1×2³ + 0×2² + 1×2¹ + 1×2⁰ = 11
- Signed interpretation: 1×(-2³) + 0×2² + 1×2¹ + 1×2⁰ = -5
Data ranges differ accordingly:
- 4-bit unsigned: 0 to 15
- 4-bit signed: -8 to 7
General ranges:
- Unsigned (m bits): 0 to 2^m - 1
- Signed (m bits): -2^(m-1) to 2^(m-1) - 1
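The two interpretations and their ranges can be checked with a small Python model. This is only an illustrative sketch: the helper names (`as_unsigned`, `as_signed`, `unsigned_range`, `signed_range`) are hypothetical and not part of the design code.

```python
def as_unsigned(bits: int, width: int) -> int:
    """Interpret the low `width` bits as an unsigned integer."""
    return bits & ((1 << width) - 1)


def as_signed(bits: int, width: int) -> int:
    """Interpret the low `width` bits as two's complement (MSB weight is -2^(width-1))."""
    bits &= (1 << width) - 1
    msb = bits >> (width - 1)
    return bits - (msb << width)


def unsigned_range(width: int) -> tuple:
    """Smallest and largest unsigned values for the given width."""
    return (0, (1 << width) - 1)


def signed_range(width: int) -> tuple:
    """Smallest and largest two's complement values for the given width."""
    return (-(1 << (width - 1)), (1 << (width - 1)) - 1)


print(as_unsigned(0b1011, 4), as_signed(0b1011, 4))  # 11 -5
print(unsigned_range(4), signed_range(4))            # (0, 15) (-8, 7)
```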
Sign Extension for Signed Integers
Extending a signed integer requires duplicating the sign bit:
- 4'b0101 (positive, +5) extended to 6 bits: 6'b000101
- 4'b1011 (negative, -5) extended to 6 bits: 6'b111011
Verification for negative case:
- 4'b1011 = 1×(-2³) + 0×2² + 1×2¹ + 1×2⁰ = -5
- 6'b111011 = 1×(-2⁵) + 1×2⁴ + 1×2³ + 0×2² + 1×2¹ + 1×2⁰ = -5
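As a quick cross-check of the two extensions above, the following Python sketch replicates the sign bit into the new upper positions. The function name `sign_extend` is an assumption made for illustration; in Verilog the same effect comes from concatenating copies of the MSB.

```python
def sign_extend(bits: int, from_width: int, to_width: int) -> int:
    """Return the to_width-bit pattern obtained by copying the sign bit upward."""
    bits &= (1 << from_width) - 1
    if bits >> (from_width - 1):
        # Set every bit between from_width and to_width - 1
        bits |= ((1 << to_width) - 1) & ~((1 << from_width) - 1)
    return bits


print(format(sign_extend(0b0101, 4, 6), "06b"))  # 000101
print(format(sign_extend(0b1011, 4, 6), "06b"))  # 111011
```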
Signed Fractional Numbers
For 4Q2 format with value 4'b1011:
4'b10.11 = 1×(-2¹) + 0×2⁰ + 1×2⁻¹ + 1×2⁻² = -2 + 0 + 0.5 + 0.25 = -1.25
mQn format range: -2^(m-n-1) to 2^(m-n-1) - 1/2^n
Fractional extension: integer part uses sign extension, fractional part appends zeros.
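The value and range formulas can be sketched in Python as follows. The helpers `q_value` and `q_range` are hypothetical names for this illustration; they treat an mQn word as a signed integer scaled by 2^-n.

```python
def q_value(bits: int, m: int, n: int) -> float:
    """Decode the low m bits as a signed mQn number."""
    bits &= (1 << m) - 1
    if bits >> (m - 1):
        bits -= 1 << m
    return bits / (1 << n)


def q_range(m: int, n: int) -> tuple:
    """Return (min, max) of mQn: -2^(m-n-1) and 2^(m-n-1) - 2^-n."""
    lo = -(1 << (m - 1)) / (1 << n)
    hi = ((1 << (m - 1)) - 1) / (1 << n)
    return lo, hi


print(q_value(0b1011, 4, 2))  # -1.25
print(q_range(4, 2))          # (-2.0, 1.75)
```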
Arithmetic Operations
Addition with Sign Extension
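When adding two signed numbers, first align the binary points so that both operands share the same format, then sign-extend both operands by one more bit before the addition so that the carry out of the sum cannot overflow.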
Example: Add 5Q2 value 5'b100.01 with 4Q3 value 4'b1.011
- Extend 5'b100.01 to 6Q3: 6'b100.010
- Sign-extend 4'b1.011 to 6Q3: 6'b111.011
- Extend both to 7Q3, then add
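The alignment steps of this example can be traced with a short Python sketch. The helper names `to_signed` and `add_q` are assumptions for illustration. The model works on signed integers, so sign extension is implicit and only the zero-padding of the fractional part is explicit.

```python
def to_signed(bits: int, width: int) -> int:
    """Interpret the low `width` bits as two's complement."""
    bits &= (1 << width) - 1
    return bits - ((bits >> (width - 1)) << width)


def add_q(x_bits: int, m: int, n: int, y_bits: int, a: int, b: int) -> tuple:
    """Add an mQn and an aQb operand; return (sum_bits, width, frac)."""
    frac = max(n, b)
    x = to_signed(x_bits, m) << (frac - n)  # append zeros to the fraction
    y = to_signed(y_bits, a) << (frac - b)
    int_bits = max(m - n, a - b)            # integer bits, sign included
    width = int_bits + frac + 1             # one extra bit for the carry
    return (x + y) & ((1 << width) - 1), width, frac


bits, width, frac = add_q(0b10001, 5, 2, 0b1011, 4, 3)
print(f"{bits:0{width}b} as {width}Q{frac}")  # 1011101 as 7Q3
print(to_signed(bits, width) / (1 << frac))   # -4.375
```

The printed value matches the hand calculation: -3.75 + (-0.625) = -4.375.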
Multiplication Result Width
For two signed operands mQn and aQb, the product requires (m+a) total bits and (n+b) fractional bits.
Example: Two 4Q2 values multiplied yield an 8Q4 result.
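A small Python sketch can confirm that the full product fits in the stated width, including the most demanding case of minimum times minimum. The names `to_signed` and `mul_q` are hypothetical helpers, not part of the design.

```python
def to_signed(bits: int, width: int) -> int:
    """Interpret the low `width` bits as two's complement."""
    bits &= (1 << width) - 1
    return bits - ((bits >> (width - 1)) << width)


def mul_q(x_bits: int, m: int, n: int, y_bits: int, a: int, b: int) -> tuple:
    """Multiply an mQn and an aQb operand; return (product_bits, m + a, n + b)."""
    prod = to_signed(x_bits, m) * to_signed(y_bits, a)
    width, frac = m + a, n + b
    return prod & ((1 << width) - 1), width, frac


# 4'b10.11 (-1.25) times 4'b01.10 (1.5)
bits, width, frac = mul_q(0b1011, 4, 2, 0b0110, 4, 2)
print(f"{bits:0{width}b} as {width}Q{frac}")  # 11100010 as 8Q4
print(to_signed(bits, width) / (1 << frac))   # -1.875

# Minimum times minimum: (-2.0) * (-2.0) = 4.0 still fits in 8Q4
bits, width, frac = mul_q(0b1000, 4, 2, 0b1000, 4, 2)
print(to_signed(bits, width) / (1 << frac))   # 4.0
```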
Rounding Implementation
When truncating fractional bits, rounding improves accuracy over simple truncation.
Rounding Logic
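Truncating a two's complement value always moves it toward negative infinity, so the correction depends on the sign. For a positive number, add 1 if the MSB of the truncated bits is 1, meaning the discarded fraction is at least half an LSB. For a negative number, add 1 only if the MSB of the truncated bits is 1 and at least one of the remaining truncated bits is also 1, meaning the discarded fraction is more than half an LSB. The result is round-to-nearest with ties rounded away from zero.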
Verilog rounding implementation:
module rounding_unit #(
parameter WIDTH_IN = 9,
parameter WIDTH_OUT = 6,
parameter FRAC_IN = 6,
parameter FRAC_OUT = 3
) (
input signed [WIDTH_IN-1:0] i_data,
output signed [WIDTH_OUT-1:0] o_data
);
localparam TRUNC_BITS = FRAC_IN - FRAC_OUT;
// Determine carry bit for rounding
// For positive: carry = truncated_msb
// For negative: carry = truncated_msb & (|truncated_rest)
wire carry;
assign carry = i_data[WIDTH_IN-1]
? (i_data[TRUNC_BITS-1] & (|i_data[TRUNC_BITS-2:0]))
: i_data[TRUNC_BITS-1];
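// Drop the TRUNC_BITS LSBs, extend the sign by one bit, and add the carry
// at the new LSB position (assumes WIDTH_OUT == WIDTH_IN - TRUNC_BITS)
wire signed [WIDTH_IN-TRUNC_BITS:0] sum;
assign sum = {i_data[WIDTH_IN-1], i_data[WIDTH_IN-1:TRUNC_BITS]} + carry;
// Keep the low WIDTH_OUT bits; overflow is left to a separate saturation stage
assign o_data = sum[WIDTH_OUT-1:0];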
endmodule
Rounding for Negative Numbers
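In two's complement, plain truncation already pushes a negative value away from zero, so adding 1 pulls it back toward zero. That correction is needed only when the truncated bits represent more than half an LSB. For example, with one integer LSB as the target, -1.5 truncates to -2 and stays there (a tie, rounded away from zero), while -1.375 truncates to -2 and is corrected to -1. The negative branch of the carry logic encodes exactly this condition.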
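The carry rule can be modeled in Python and compared against a direct "round half away from zero" formula. This is a sketch under the assumption that the value is given as a signed integer and that `trunc_bits` is at least 1. The function name `round_half_away` is hypothetical.

```python
def round_half_away(value: int, trunc_bits: int) -> int:
    """Drop trunc_bits LSBs of a signed integer using the sign-dependent carry rule."""
    msb = (value >> (trunc_bits - 1)) & 1           # MSB of the truncated bits
    rest = value & ((1 << (trunc_bits - 1)) - 1)    # remaining truncated bits
    if value < 0:
        carry = 1 if (msb and rest != 0) else 0     # more than half an LSB
    else:
        carry = msb                                 # at least half an LSB
    # Python's >> on a negative int floors, like truncating two's complement
    return (value >> trunc_bits) + carry


# 9Q6 -> 6Q3 drops 3 bits; raw inputs are listed in units of 2^-6
for raw in (12, 11, -11, -12, -13):
    print(raw / 8, "->", round_half_away(raw, 3))
# 1.5 -> 2
# 1.375 -> 1
# -1.375 -> -1
# -1.5 -> -2
# -1.625 -> -2
```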
Saturation Truncation
Saturation clamps values that exceed the target range to the minimum or maximum representable value.
Example: Convert 6Q3 value 6'b011.111 (3.875) to 4Q2 format (max: 1.75)
Since 3.875 > 1.75, saturation clips to 4'b01.11 (1.75).
Example: Convert 6Q3 value 6'b100.111 (-3.125) to 4Q2 format (min: -2.0)
Since -3.125 < -2.0, saturation clips to 4'b10.00 (-2.0).
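Both examples can be reproduced with a minimal Python sketch. The helper names `saturate` and `convert_6q3_to_4q2` are assumptions for illustration. To keep the focus on clamping, the single dropped fractional bit is simply truncated here rather than rounded.

```python
def saturate(value: int, width: int) -> int:
    """Clamp a signed integer to the range of a `width`-bit two's complement word."""
    lo = -(1 << (width - 1))
    hi = (1 << (width - 1)) - 1
    return max(lo, min(hi, value))


def convert_6q3_to_4q2(raw_6q3: int) -> int:
    """Convert a raw 6Q3 integer to raw 4Q2: drop one fractional bit, then clamp."""
    return saturate(raw_6q3 >> 1, 4)


print(convert_6q3_to_4q2(31) / 4)   # 1.75  (3.875 clips to the maximum)
print(convert_6q3_to_4q2(-25) / 4)  # -2.0  (-3.125 clips to the minimum)
print(convert_6q3_to_4q2(5) / 4)    # 0.5   (0.625 is in range, no clipping)
```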
Complete Design Example: S = A + B × C
Requirements
- Input A: 16Q14 format
- Input B: 16Q14 format
- Input C: 16Q15 format
- Output S: 16Q14 format
- Apply rounding during truncation
- Apply saturation when overflow occurs
Bit-Width Analysis
- Multiplication: B(16Q14) × C(16Q15) = 32Q29
- Addition: Extend A to 32Q29, then both operands to 33Q29 for safe addition
- Final truncation: 33Q29 → 16Q14 with rounding and saturation
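Before writing RTL, the three steps above can be captured in a bit-accurate Python reference model. This is a sketch under stated assumptions: inputs are raw signed integers (A and B scaled by 2^14, C scaled by 2^15), rounding is half away from zero, and the function name `mac_16q14` is hypothetical.

```python
def mac_16q14(a: int, b: int, c: int) -> int:
    """Compute S = A + B*C on raw integers: A, B in 16Q14, C in 16Q15, S in 16Q14."""
    prod = b * c                   # 32Q29 product
    acc = (a << 15) + prod         # A aligned to Q29, sum fits in 33Q29
    trunc = 15                     # 29 - 14 fractional bits to drop
    msb = (acc >> (trunc - 1)) & 1
    rest = acc & ((1 << (trunc - 1)) - 1)
    if acc < 0:
        carry = 1 if (msb and rest != 0) else 0
    else:
        carry = msb
    rounded = (acc >> trunc) + carry
    return max(-32768, min(32767, rounded))  # saturate to 16 bits


print(mac_16q14(16384, 16384, 16384))     # 24576  (1.0 + 1.0 * 0.5 = 1.5)
print(mac_16q14(32767, 32767, 32767))     # 32767  (positive saturation)
print(mac_16q14(-32768, 32767, -32768))   # -32768 (negative saturation)
print(mac_16q14(0, 1, 16384))             # 1      (tie rounds away from zero)
print(mac_16q14(0, -1, 16384))            # -1     (tie rounds away from zero)
```

Because Python integers are unbounded, the model needs no explicit sign extension; the widths in the comments only record what the hardware must provide.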
Verilog Implementation
module saturated_multiply_add #(
parameter A_WIDTH = 16,
parameter B_WIDTH = 16,
parameter C_WIDTH = 16,
parameter A_FRAC = 14,
parameter B_FRAC = 14,
parameter C_FRAC = 15,
parameter S_FRAC = 14
) (
input clk,
input rst_n,
input signed [A_WIDTH-1:0] i_a,
input signed [B_WIDTH-1:0] i_b,
input signed [C_WIDTH-1:0] i_c,
output signed [A_WIDTH-1:0] o_s
);
// Registered inputs
reg signed [A_WIDTH-1:0] r_a;
reg signed [B_WIDTH-1:0] r_b;
reg signed [C_WIDTH-1:0] r_c;
// Multiplication result: (16+16)Q(14+15) = 32Q29
wire signed [31:0] mult_result;
// Extended A for alignment: 32Q29
wire signed [31:0] a_extended;
// Addition with sign extension: 33Q29
wire signed [32:0] sum_result;
// Rounding intermediate: 19Q14
wire carry_bit;
wire signed [18:0] rounded_value;
// Pipeline registers
always @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
r_a <= 'd0;
r_b <= 'd0;
r_c <= 'd0;
end else begin
r_a <= i_a;
r_b <= i_b;
r_c <= i_c;
end
end
// Step 1: Multiplication
assign mult_result = r_b * r_c;
// Step 2: Extend A to match fractional precision
assign a_extended = {r_a[A_WIDTH-1], r_a, {C_FRAC{1'b0}}};
// Step 3: Add with sign extension
assign sum_result = {mult_result[31], mult_result}
+ {a_extended[31], a_extended};
// Step 4: Calculate rounding carry bit
// Truncate B_FRAC + C_FRAC - S_FRAC = 15 fractional bits (positions 14:0)
localparam TRUNC_BITS = B_FRAC + C_FRAC - S_FRAC;
assign carry_bit = sum_result[32]
? (sum_result[TRUNC_BITS-1] & (|sum_result[TRUNC_BITS-2:0]))
: sum_result[TRUNC_BITS-1];
// Step 5: Apply rounding with sign extension
assign rounded_value = {sum_result[32], sum_result[32:TRUNC_BITS]} + carry_bit;
// Step 6: Saturation check and final output
// Check if truncation would lose non-sign bits
wire [3:0] sign_check;
assign sign_check = rounded_value[18:15];
wire overflow;
assign overflow = (sign_check != 4'b0000) && (sign_check != 4'b1111);
assign o_s = overflow
? (rounded_value[18] ? 16'b1000000000000000 : 16'b0111111111111111)
: rounded_value[15:0];
endmodule
Testbench with Matlab Verification
Testbench Structure
`timescale 1ns / 1ps
module tb_saturated_ma;
reg clk;
reg rst_n;
reg [15:0] a_data;
reg [15:0] b_data;
reg [15:0] c_data;
wire [15:0] s_out;
parameter DATA_COUNT = 4096;
// Memory arrays for test vectors
reg [15:0] mem_a [0:DATA_COUNT-1];
reg [15:0] mem_b [0:DATA_COUNT-1];
reg [15:0] mem_c [0:DATA_COUNT-1];
reg [13:0] addr;
reg data_valid;
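// Connect by name: the testbench signal names differ from the DUT port names
saturated_multiply_add u_dut (
.clk(clk), .rst_n(rst_n),
.i_a(a_data), .i_b(b_data), .i_c(c_data),
.o_s(s_out)
);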
initial begin
clk = 0;
rst_n = 0;
#67 rst_n = 1;
end
always #5 clk = ~clk;
initial begin
$readmemh("stimuli/a_16Q14.txt", mem_a);
$readmemh("stimuli/b_16Q14.txt", mem_b);
$readmemh("stimuli/c_16Q15.txt", mem_c);
end
always @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
a_data <= 'd0;
b_data <= 'd0;
c_data <= 'd0;
addr <= 'd0;
data_valid <= 1'b0;
end else if (addr == DATA_COUNT) begin
addr <= addr;
data_valid <= 1'b0;
end else begin
a_data <= mem_a[addr];
b_data <= mem_b[addr];
c_data <= mem_c[addr];
addr <= addr + 1'b1;
data_valid <= 1'b1;
end
end
integer fid;
initial begin
fid = $fopen("results/s_vivado.txt", "w");
if (!fid) begin
$display("File open failed");
$finish;
end
end
reg data_valid_d1;
always @(posedge clk) data_valid_d1 <= data_valid;
always @(posedge clk) begin
if (data_valid_d1)
$fdisplay(fid, "%d", $signed(s_out));
else if (addr == DATA_COUNT) begin
$fclose(fid);
$finish;
end
end
endmodule
Matlab Test Vector Generation
clear; clc;
num_samples = 4096;
% Define input ranges based on fixed-point formats
a_range = [-2, 2 - 2^(-14)]; % 16Q14
b_range = [-2, 2 - 2^(-14)]; % 16Q14
c_range = [-1, 1 - 2^(-15)]; % 16Q15
% Generate random test data
a_rand = a_range(1) + diff(a_range) * rand(1, num_samples - 8);
b_rand = b_range(1) + diff(b_range) * rand(1, num_samples - 8);
c_rand = c_range(1) + diff(c_range) * rand(1, num_samples - 8);
% Boundary cases: all combinations of min/max values
a_bdry = [a_range(1) a_range(1) a_range(1) a_range(1) ...
a_range(2) a_range(2) a_range(2) a_range(2)];
b_bdry = [b_range(1) b_range(1) b_range(2) b_range(2) ...
b_range(1) b_range(1) b_range(2) b_range(2)];
c_bdry = [c_range(1) c_range(2) c_range(1) c_range(2) ...
c_range(1) c_range(2) c_range(1) c_range(2)];
% Combine boundary and random cases
a = [a_bdry, a_rand];
b = [b_bdry, b_rand];
c = [c_bdry, c_rand];
% Quantization patterns: fixed-point, round to nearest, saturate
q_16Q14 = quantizer('fixed', 'round', 'saturate', [16, 14]);
q_16Q15 = quantizer('fixed', 'round', 'saturate', [16, 15]);
% Apply quantization
a_q = quantize(q_16Q14, a);
b_q = quantize(q_16Q14, b);
c_q = quantize(q_16Q15, c);
% Compute reference result
s_ref = a_q + b_q .* c_q;
s_q = quantize(q_16Q14, s_ref);
% Convert to integer representation for file output
a_int = a_q * 2^14;
b_int = b_q * 2^14;
c_int = c_q * 2^15;
s_int = s_q * 2^14;
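% Reinterpret the signed integers as 16-bit two's complement patterns for hex output
% (casting negatives with uint16() directly would saturate them to 0)
a_hex = typecast(int16(a_int), 'uint16');
b_hex = typecast(int16(b_int), 'uint16');
c_hex = typecast(int16(c_int), 'uint16');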
% Write test vectors
fid_a = fopen('stimuli/a_16Q14.txt', 'w');
fprintf(fid_a, '%04x\n', uint16(a_hex));
fclose(fid_a);
fid_b = fopen('stimuli/b_16Q14.txt', 'w');
fprintf(fid_b, '%04x\n', uint16(b_hex));
fclose(fid_b);
fid_c = fopen('stimuli/c_16Q15.txt', 'w');
fprintf(fid_c, '%04x\n', uint16(c_hex));
fclose(fid_c);
fid_s = fopen('results/s_matlab.txt', 'w');
fprintf(fid_s, '%d\n', s_int);
fclose(fid_s);
Result Verification
clear; clc;
% Read results from Matlab and simulator, closing each file after reading
fid_mat = fopen('results/s_matlab.txt', 'r');
s_mat = cell2mat(textscan(fid_mat, '%d'));
fclose(fid_mat);
fid_sim = fopen('results/s_vivado.txt', 'r');
s_sim = cell2mat(textscan(fid_sim, '%d'));
fclose(fid_sim);
% Bit-accurate comparison
pass_count = sum(s_sim == s_mat);
disp(['Matching samples: ', num2str(pass_count), ' / ', num2str(length(s_mat))]);
if pass_count == length(s_mat)
disp('Verification PASSED');
else
disp('Verification FAILED');
end
Code Coverage Verification
Using Synopsys VCS, confirm that condition coverage reaches 100%. The key conditions to cover are:
- Sign bit states in multiplication
- Rounding carry generation for both positive and negative inputs
- Saturation detection for overflow/underflow scenarios
All 4096 test vectors pass verification when compared against Matlab reference results, confirming correct implementation of rounding and saturation logic.