Strategies for Rapid Generation of Million-Scale Test Datasets

When validating application performance and data-intensive features, engineers frequently require substantial datasets within their testing environments. Scenarios demanding high-volume data include stress-testing database query performance, verifying search algorithm accuracy across diverse records, validating ETL pipeline consistency, and confirming statistical aggregation correctness in reporting dashboards. The scale of required data determines the optimal generation strategy.

Direct Interface Manipulation

For scenarios requiring fewer than one hundred records, manual data entry through the application interface remains viable. This approach requires no technical infrastructure or scripting knowledge. However, this method scales linearly with human effort, making it impractical for volume testing or regression suites.

Batch API Request Automation

When constructing hundreds or thousands of records, automating HTTP requests provides significant efficiency gains over manual entry. Testing frameworks such as JMeter, Postman (utilizing CSV data sources), or Python's httpx library enable concurrent request execution. While effective, this method requires managing authentication tokens, handling rate limiting, and working around potential API validation constraints or back end latency issues that emerge under sustained load.

Bulk Import via Structured Files

For datasets in the thousands, preparing data in spreadsheet applications like Excel or LibreOffice Calc allows rapid generation through fill-handle duplication. After preparing the file, database administration tools such as Navicat, DBeaver, or native LOAD DATA INFILE SQL commands can import the structured data efficiently. This approach hits limitations when reaching tens of thousands of rows due to file size constraints and memory limitations in spreadsheet software.

Server-Side Stored Procedures

To generate millions of records without network overhead, implement database-native stored procedures. These server-side scripts execute directly within the database engine, eliminating client-server latency:

DELIMITER //

DROP PROCEDURE IF EXISTS sp_seed_user_base //

CREATE PROCEDURE sp_seed_user_base(IN record_target INT)
BEGIN
    DECLARE counter INT DEFAULT 1;
    DECLARE var_username VARCHAR(64);
    DECLARE var_contact CHAR(11);
    DECLARE base_sequence BIGINT DEFAULT 13900000000;
    
    WHILE counter <= record_target DO
        SET var_username = CONCAT('auto_gen_', UUID_SHORT(), '_', counter);
        SET var_contact = CAST(base_sequence + counter AS CHAR);
        
        INSERT INTO user_directory (username, mobile, created_timestamp) 
        VALUES (var_username, var_contact, NOW());
        
        SET counter = counter + 1;
    END WHILE;
END //

DELIMITER ;

-- Execute with: CALL sp_seed_user_base(1000000);

This method leverages the database's computational resources and minimizes transaction overhead by keeping operations local to the server instance.

Programmatic Batch Insertion

For environments requiring complex data relationships or external data sources, Python provides flexible bulk insertion capabilities with proper connection pooling and error handling:

import mysql.connector
from mysql.connector import Error
import secrets
import string

def bulk_generate_records(target_volume=1000000, batch_size=10000):
    connection = None
    try:
        connection = mysql.connector.connect(
            host='database.production.local',
            database='app_data',
            user='etl_user',
            password='secure_credential',
            autocommit=False
        )
        
        cursor = connection.cursor()
        cursor.execute("SELECT COALESCE(MAX(account_id), 0) FROM user_accounts")
        current_max = cursor.fetchone()[0]
        
        for offset in range(0, target_volume, batch_size):
            record_batch = []
            for _ in range(batch_size):
                current_max += 1
                account_id = current_max
                login_name = ''.join(secrets.choice(string.ascii_lowercase) for _ in range(10))
                phone_num = f"1{secrets.randbelow(90000000000) + 10000000000}"
                record_batch.append((account_id, login_name, phone_num))
            
            insert_query = """
                INSERT INTO user_accounts (account_id, username, phone) 
                VALUES (%s, %s, %s)
            """
            cursor.executemany(insert_query, record_batch)
            connection.commit()
            
    except Error as err:
        print(f"Data generation failed: {err}")
        if connection:
            connection.rollback()
    finally:
        if connection and connection.is_connected():
            cursor.close()
            connection.close()

if __name__ == "__main__":
    bulk_generate_records()

This implementation utilizes the secrets module for cryptographically secure random generation, implements transaction batching to optimize commit frequency, and includes proper resource cleanup to prevent connection leaks during large-scale operations.

Tags: test data generation database testing Performance Testing sql python

Posted on Thu, 28 May 2026 17:46:34 +0000 by madonnazz

Freaks City

Strategies for Rapid Generation of Million-Scale Test Datasets

Hot Tags