Data Validation Testing
1. Black Box Testing
Black box data testing validates data based on expected results. The following aspects require verification:
- Table schema alignment with design specifications
- Primary key uniqueness constraints
- NOT NULL field valiadtions
- Invalid input handling
- Enumerated value accuracy
- Negative value checks for monetary fields
- Data validity and reasonableness
Sample SQL queries for validation scenarios
Uniqueness Check
-- Verify product ID uniqueness, no duplicates allowed
SELECT product_id
,COUNT(1) AS record_count
FROM sales.ads_product_summary
WHERE partition_date = '20221211'
GROUP BY product_id
HAVING COUNT(1) > 1
;
NULL Value Detection
-- NULL value validation
SELECT product_id, product_name
FROM sales.ads_product_summary
WHERE partition_date = '20221211'
AND (product_id IS NULL OR product_name IS NULL)
;
Empty String Verification
SELECT product_id, product_name
FROM sales.ads_product_summary
WHERE partition_date = '20221211'
AND (product_id = '' OR product_name = '')
;
Negative Value Analysis
SELECT sale_amount
FROM sales.ads_product_summary
WHERE partition_date = '20221211'
AND sale_amount < 0
;
Enumeration Validation
SELECT DISTINCT product_category
FROM sales.ads_product_summary
WHERE partition_date = '20221211'
;
2. White Box Testing
White box testing involves code review and business logic verification. Testers must prepare validation scripts or identify reference datasets for metric reconciliation. This requires accumulated domain knowledge of metric definitions. When discrepancies arise, testers collaborate with developers to trace root causes, often navigating multiple data warehouse layers or investigating unclear metric definitions. Post-validation, DQC scripts should be deployed for production monitoring. Proficiency in SQL is essential for this approach.
Key validation areas include:
- Field length, min/max values, outliers, boundary conditions
- Consistent calculation units across datasets
- Date function precision (e.g., DATEADD time offset calculations)
- Scheduling dependency configuration completeness
Practical Example - Date Function Adjustment
When using DATEADD for 6-day rolling window statistics with 5-day offset, the condition parameter requires adjustment to '-5':
WHERE record_date BETWEEN TO_CHAR(
DATEADD(TO_DATE('${bizdate}','yyyyMMdd'), -6,'dd'),
'yyyyMMdd'
)
AND '${bizdate}'
Practical Example - Default Value Handling
Numeric fields should default to 0, string fields to empty string:
SELECT
NVL(t22.bid_count_30d, 0) AS bid_count_30d, -- 30-day bidding product count
NVL(t17.active_products_td, 0) AS active_products_td, -- Daily active products
NVL(t22.in_stock_7d, 0) AS in_stock_7d, -- 7-day in-stock products
t22.in_stock_30d AS in_stock_30d -- 30-day in-stock products
FROM ...
Common Testing Approaches
DQC Validation
Migration and refactoring projects typically maintain identical metrics and definitions. DQC validation provides efficient verification in such scenarios.
SELECT t1.metric_id AS metric_identifier
,t1.report_month AS reporting_period
,t1.conversion_rate AS current_rate
,t2.conversion_rate AS baseline_rate
FROM (
SELECT metric_id
,report_month
,conversion_rate
FROM metrics.refined_table
WHERE partition_date = '20221011'
) t1
INNER JOIN (
SELECT metric_id
,report_time
,conversion_rate
FROM metrics.legacy_table
WHERE partition_date = '20221011'
) t2
ON t1.metric_id = t2.metric_id
AND t1.report_month = t2.report_time
WHERE t1.conversion_rate <> t2.conversion_rate
;
Lineage-Based Cross-Validation
During testing, data mismatches or incorrect enumerations often surface. Lineage analysis helps trace field origins and identify upstream issues.
When merchant order data contains errors, SQL queries can trace field provenance through the lineage graph. This enables systematic root cause analysis by exaimning how data transforms across tables.
Establishing Quality Monitoring
Critical tables require monitoring to ensure data quality across dimensions: completeness, consistency, timeliness, and accuracy.
Completeness ensures all records and field values exist without gaps. Missing records or null field values compromise analytical accuracy. Completeness forms the foundation of data quality assurance.
Accuracy validates whether recorded information correctly reflects reality and contains no anomalies or errors. For instance, incorrect buyer information in orders represents problematic data. Maintaining record accuracy is essential for reliable analytics.
Consistency is particularly important in large-scale data warehouses with multiple branches. The same data entity must maintain consistent representation across all warehouse layers. For example, customer identifiers must use identical types and lengths from operational databases through analytical layers to downstream applications.