Data Quality Checks (DQC): SQL-Driven Automated Validation

Your data pipeline is running — but is the data actually correct? Are row counts consistent? Do critical fields contain null values? Are aggregated metrics reasonable? These are questions every data engineer faces every day. Singdata Lakehouse's Data Quality Check (DQC) uses pure SQL to implement automated validation, integrating quality monitoring into the data pipeline so that problems are caught before they impact downstream consumers.

This article uses the NHL Medallion architecture as an example to demonstrate how to build a complete DQC framework across the Bronze → Silver → Gold three-layer model.


DQC Core Concepts

Data Pipeline DQC Gate ──────────── ───────── Bronze (raw data) ── after load ──→ row count + freshness │ ▼ Silver (clean DT) ── after refresh ──→ null rate + uniqueness + value range │ ▼ Gold (aggregate DT) ── after refresh ──→ aggregation consistency + volatility │ ▼ BI / Apps ←── consume PASS only

DQC is not a one-time activity — it is an automated process embedded in the pipeline. After each data refresh, checks run automatically and results are written to a dqc_results table, with anomalies exposed through a monitoring DT.


SQL Commands Used

Command / FunctionPurposeUse case
CREATE SCHEMACreate a dedicated DQC layerIsolate quality check tables
CREATE TABLECreate a DQC results tableStore the history of each check run
INSERT INTO ... SELECTWrite DQC check resultsOne record per check
CASE WHENDetermine PASS/WARN/FAILCore logic for all check rules
COUNT(*) / SUM(CASE WHEN)Row count, conditional countRow count validation, null rate, uniqueness
MIN / MAXValue range upper and lower boundsValue range checks, freshness checks
CREATE DYNAMIC TABLECreate a DQC dashboard DTAuto-refresh quality status summary
REFRESH DYNAMIC TABLEManually trigger DT refreshInitialize data after first creation

DQC Check Types

TypeDescriptionExample
Row count consistencyWhether upstream and downstream row counts matchBronze team_info(33) = Silver dim_team(33)
Null rateProportion of NULL values in critical fieldsgoals null rate should be 0%
UniquenessWhether ID fields contain duplicatesplayer_id should be unique
Value rangeWhether numeric values fall within reasonable boundsgoals >= 0, save_pct in [0,1]
FreshnessWhether data has been updated to the latestLatest season >= 2019
Aggregation consistencyWhether summary metrics are self-consistentwins + losses = games played
Referential integrityJOIN key match rateskater_stats.player_id exists in player_info

Creating the DQC Results Table

CREATE SCHEMA IF NOT EXISTS dqc COMMENT 'Data Quality Check layer'; CREATE TABLE dqc.dqc_results ( check_id STRING COMMENT 'Check ID, e.g. DQC-001', check_name STRING COMMENT 'Check type: row_match/null_rate/uniqueness...', layer STRING COMMENT 'Data layer: bronze/silver/gold', metric STRING COMMENT 'Metric name', expected STRING COMMENT 'Expected value or range', actual STRING COMMENT 'Actual value', status STRING COMMENT 'PASS / WARN / FAIL', detail STRING COMMENT 'Check description', checked_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() ) COMMENT 'DQC check results table';


Defining Check Rules

Row Count Consistency

Verify that the Silver layer dimension table row count matches the Bronze source table:

INSERT INTO dqc.dqc_results (check_id, check_name, layer, metric, expected, actual, status, detail) SELECT 'DQC-001', 'dim_row_match', 'silver', 'dim_team_rows', CAST((SELECT COUNT(*) FROM nhl_game_data.team_info) AS STRING), CAST((SELECT COUNT(*) FROM silver.dim_team) AS STRING), CASE WHEN (SELECT COUNT(*) FROM nhl_game_data.team_info) = (SELECT COUNT(*) FROM silver.dim_team) THEN 'PASS' ELSE 'FAIL' END, 'Bronze team_info row count should match Silver dim_team';

Null Rate

INSERT INTO dqc.dqc_results (check_id, check_name, layer, metric, expected, actual, status, detail) SELECT 'DQC-003', 'null_rate', 'silver', 'skater_goals_null_pct', '=0', CAST(ROUND(SUM(CASE WHEN goals IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS STRING), CASE WHEN SUM(CASE WHEN goals IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) = 0 THEN 'PASS' ELSE 'WARN' END, 'Silver fact_skater_stats.goals should have no NULLs' FROM silver.fact_skater_stats;

Uniqueness

INSERT INTO dqc.dqc_results (check_id, check_name, layer, metric, expected, actual, status, detail) SELECT 'DQC-005', 'uniqueness', 'silver', 'dim_player_id_unique', 'TRUE', CAST(CASE WHEN COUNT(*) = COUNT(DISTINCT player_id) THEN 'TRUE' ELSE 'FALSE' END AS STRING), CASE WHEN COUNT(*) = COUNT(DISTINCT player_id) THEN 'PASS' ELSE 'FAIL' END, 'Silver dim_player.player_id should be unique' FROM silver.dim_player;

Value Range

INSERT INTO dqc.dqc_results (check_id, check_name, layer, metric, expected, actual, status, detail) SELECT 'DQC-006', 'value_range', 'silver', 'skater_goals_positive', '>=0', CAST(MIN(goals) AS STRING), CASE WHEN MIN(goals) >= 0 THEN 'PASS' ELSE 'FAIL' END, 'Silver fact_skater_stats.goals should not be negative' FROM silver.fact_skater_stats;

Freshness

INSERT INTO dqc.dqc_results (check_id, check_name, layer, metric, expected, actual, status, detail) SELECT 'DQC-008', 'freshness', 'bronze', 'max_season', '>=2019', CAST(MAX(season) AS STRING), CASE WHEN MAX(season) >= 2019 THEN 'PASS' ELSE 'WARN' END, 'Bronze latest season should not be earlier than 2019' FROM nhl_game_data.game;


DQC Dashboard

Aggregate dqc_results into a Dynamic Table dashboard for a one-stop view of quality status across layers:

CREATE OR REPLACE DYNAMIC TABLE dqc.dqc_dashboard REFRESH INTERVAL 1 DAY VCLUSTER DEFAULT COMMENT 'DQC Dashboard - quality status summary by layer' AS SELECT layer, COUNT(*) AS total_checks, SUM(CASE WHEN status = 'PASS' THEN 1 ELSE 0 END) AS pass_cnt, SUM(CASE WHEN status = 'WARN' THEN 1 ELSE 0 END) AS warn_cnt, SUM(CASE WHEN status = 'FAIL' THEN 1 ELSE 0 END) AS fail_cnt, ROUND(SUM(CASE WHEN status = 'PASS' THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS pass_rate FROM dqc.dqc_results GROUP BY layer;

View the dashboard:

SELECT * FROM dqc.dqc_dashboard ORDER BY layer;


Validation Results

Running 10 DQC checks on the NHL Medallion architecture (Bronze: 10 tables → Silver: 4 DTs → Gold: 5 DTs):

IDTypeLayerMetricExpectedActualStatus
DQC-001Row countsilverdim_team_rows3333PASS
DQC-002Row countsilverdim_player_rows39253925PASS
DQC-003Nullsilvergoals_null_pct=00.00%PASS
DQC-004Nullsilverplayer_name_null_pct<1%0.00%PASS
DQC-005Uniquenesssilverplayer_id_uniqueTRUETRUEPASS
DQC-006Value rangesilvergoals >= 0>=00PASS
DQC-007Value rangesilverpoints >= 0>=00PASS
DQC-008Freshnessbronzemax_season>=20192020PASS
DQC-009Aggregationgoldunique_seasons>019PASS
DQC-010Aggregationgoldwins >= 0>=00PASS

All PASS, 100% pass rate.


Integrating with the Data Pipeline

Option 1: Manual trigger (suitable for development validation)

-- Run all DQC checks, then view results SELECT check_id, status, metric, actual FROM dqc.dqc_results WHERE status != 'PASS'; -- show anomalies only

Option 2: Dynamic Table automated execution

Wrap DQC check logic in a Dynamic Table to automatically re-run after each source table refresh:

-- DQC check DT: null rate monitoring CREATE OR REPLACE DYNAMIC TABLE dqc.skater_null_monitor REFRESH INTERVAL 1 DAY VCLUSTER DEFAULT COMMENT 'Silver layer player stats null rate monitoring' AS SELECT 'DQC-003' AS check_id, 'null_rate' AS check_name, 'silver' AS layer, 'skater_goals_null_pct' AS metric, '=0' AS expected, CAST(ROUND(SUM(CASE WHEN goals IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS STRING) AS actual, CASE WHEN SUM(CASE WHEN goals IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) = 0 THEN 'PASS' ELSE 'WARN' END AS status FROM silver.fact_skater_stats;

Option 3: Studio task scheduling

Create a DQC task in Studio with a Cron schedule and dependency on the ETL task:

00_sync (Cron 02:00) ↓ 04_etl (Cron 02:30, depends on 00) ↓ 05_dqc (Cron 03:00, depends on 04) ← DQC runs after ETL completes

If DQC finds a FAIL, you can configure Studio alert rules to send notifications.


DQC Checklist

LayerMust check after loadRecommended checks
BronzeRow count >= source, latest data date_op distribution (I/U/D), file count
SilverRow count <= Bronze, critical field NULL < 1%, ID uniqueLEFT JOIN match rate, value range, type conversion success rate
GoldAggregation results non-null, metrics >= 0Period-over-period change < 20%, TOP N results reasonable

Integrating with Alerts

-- Query all FAIL checks SELECT * FROM dqc.dqc_results WHERE status = 'FAIL'; -- Query anomaly summary for the current check run SELECT layer, SUM(CASE WHEN status = 'FAIL' THEN 1 ELSE 0 END) AS fails, SUM(CASE WHEN status = 'WARN' THEN 1 ELSE 0 END) AS warns FROM dqc.dqc_results WHERE checked_at > CURRENT_TIMESTAMP() - INTERVAL 1 DAY GROUP BY layer HAVING SUM(CASE WHEN status = 'FAIL' THEN 1 ELSE 0 END) > 0;

You can configure in Studio: any FAIL in DQC task results → trigger WeCom/DingTalk/Feishu notification.


Notes

NoteDescription
DQC results table should be a regular tableRetaining historical records enables trend analysis; DTs overwrite history
WARN does not block, FAIL should blockWARN means "needs attention"; FAIL means "cannot be published"
DQC checks have their own costEach check is a full table scan; keep the number of checks manageable (3-5 per layer recommended)
Thresholds need business calibrationNULL tolerance varies across business domains; use historical data to establish a baseline first
Mind the timezone in freshness checksCURRENT_TIMESTAMP() is UTC, which may differ from the business timezone