Data Warehouse Testing - Data Quality Validation Approaches

Data Validation Testing

1. Black Box Testing

Black box data testing validates data based on expected results. The following aspects require verification:

  • Table schema alignment with design specifications
  • Primary key uniqueness constraints
  • NOT NULL field valiadtions
  • Invalid input handling
  • Enumerated value accuracy
  • Negative value checks for monetary fields
  • Data validity and reasonableness

Sample SQL queries for validation scenarios

Uniqueness Check

-- Verify product ID uniqueness, no duplicates allowed
SELECT  product_id
        ,COUNT(1) AS record_count
FROM    sales.ads_product_summary
WHERE   partition_date = '20221211'
GROUP BY product_id
HAVING  COUNT(1) > 1
;

NULL Value Detection

-- NULL value validation
SELECT  product_id, product_name 
FROM    sales.ads_product_summary
WHERE   partition_date = '20221211'
AND     (product_id IS NULL OR product_name IS NULL)
;

Empty String Verification

SELECT product_id, product_name
FROM sales.ads_product_summary
WHERE partition_date = '20221211'
AND (product_id = '' OR product_name = '')
;

Negative Value Analysis

SELECT sale_amount
FROM sales.ads_product_summary
WHERE partition_date = '20221211'
AND sale_amount < 0
;

Enumeration Validation

SELECT  DISTINCT product_category
FROM    sales.ads_product_summary
WHERE   partition_date = '20221211'
;

2. White Box Testing

White box testing involves code review and business logic verification. Testers must prepare validation scripts or identify reference datasets for metric reconciliation. This requires accumulated domain knowledge of metric definitions. When discrepancies arise, testers collaborate with developers to trace root causes, often navigating multiple data warehouse layers or investigating unclear metric definitions. Post-validation, DQC scripts should be deployed for production monitoring. Proficiency in SQL is essential for this approach.

Key validation areas include:

  • Field length, min/max values, outliers, boundary conditions
  • Consistent calculation units across datasets
  • Date function precision (e.g., DATEADD time offset calculations)
  • Scheduling dependency configuration completeness

Practical Example - Date Function Adjustment

When using DATEADD for 6-day rolling window statistics with 5-day offset, the condition parameter requires adjustment to '-5':

WHERE record_date BETWEEN TO_CHAR(
    DATEADD(TO_DATE('${bizdate}','yyyyMMdd'), -6,'dd'),
    'yyyyMMdd'
)
AND '${bizdate}'

Practical Example - Default Value Handling

Numeric fields should default to 0, string fields to empty string:

SELECT 
    NVL(t22.bid_count_30d, 0)          AS bid_count_30d,     -- 30-day bidding product count
    NVL(t17.active_products_td, 0)     AS active_products_td, -- Daily active products
    NVL(t22.in_stock_7d, 0)            AS in_stock_7d,       -- 7-day in-stock products
    t22.in_stock_30d                   AS in_stock_30d       -- 30-day in-stock products
FROM ...

Common Testing Approaches

DQC Validation

Migration and refactoring projects typically maintain identical metrics and definitions. DQC validation provides efficient verification in such scenarios.

SELECT  t1.metric_id AS metric_identifier
        ,t1.report_month AS reporting_period
        ,t1.conversion_rate AS current_rate
        ,t2.conversion_rate AS baseline_rate
FROM    (
            SELECT  metric_id
                    ,report_month
                    ,conversion_rate
            FROM    metrics.refined_table
            WHERE   partition_date = '20221011'
        ) t1 
INNER JOIN (
            SELECT  metric_id
                    ,report_time
                    ,conversion_rate
            FROM    metrics.legacy_table
            WHERE   partition_date = '20221011'
        ) t2
ON      t1.metric_id = t2.metric_id
AND     t1.report_month = t2.report_time
WHERE   t1.conversion_rate <> t2.conversion_rate
;

Lineage-Based Cross-Validation

During testing, data mismatches or incorrect enumerations often surface. Lineage analysis helps trace field origins and identify upstream issues.

When merchant order data contains errors, SQL queries can trace field provenance through the lineage graph. This enables systematic root cause analysis by exaimning how data transforms across tables.

Establishing Quality Monitoring

Critical tables require monitoring to ensure data quality across dimensions: completeness, consistency, timeliness, and accuracy.

Completeness ensures all records and field values exist without gaps. Missing records or null field values compromise analytical accuracy. Completeness forms the foundation of data quality assurance.

Accuracy validates whether recorded information correctly reflects reality and contains no anomalies or errors. For instance, incorrect buyer information in orders represents problematic data. Maintaining record accuracy is essential for reliable analytics.

Consistency is particularly important in large-scale data warehouses with multiple branches. The same data entity must maintain consistent representation across all warehouse layers. For example, customer identifiers must use identical types and lengths from operational databases through analytical layers to downstream applications.

Tags: etl-testing sql-validation data-quality data-warehouse big-data-testing

Posted on Sun, 24 May 2026 19:21:29 +0000 by deansatch