Data Quality Checks (DQC): SQL-Driven Automated Validation
Your data pipeline is running — but is the data actually correct? Are row counts consistent? Do critical fields contain null values? Are aggregated metrics reasonable? These are questions every data engineer faces every day. Singdata Lakehouse's Data Quality Check (DQC) uses pure SQL to implement automated validation, integrating quality monitoring into the data pipeline so that problems are caught before they impact downstream consumers.
This article uses the NHL Medallion architecture as an example to demonstrate how to build a complete DQC framework across the Bronze → Silver → Gold three-layer model.
DQC Core Concepts
DQC is not a one-time activity — it is an automated process embedded in the pipeline. After each data refresh, checks run automatically and results are written to a dqc_results table, with anomalies exposed through a monitoring DT.
SQL Commands Used
| Command / Function | Purpose | Use case |
|---|---|---|
CREATE SCHEMA | Create a dedicated DQC layer | Isolate quality check tables |
CREATE TABLE | Create a DQC results table | Store the history of each check run |
INSERT INTO ... SELECT | Write DQC check results | One record per check |
CASE WHEN | Determine PASS/WARN/FAIL | Core logic for all check rules |
COUNT(*) / SUM(CASE WHEN) | Row count, conditional count | Row count validation, null rate, uniqueness |
MIN / MAX | Value range upper and lower bounds | Value range checks, freshness checks |
CREATE DYNAMIC TABLE | Create a DQC dashboard DT | Auto-refresh quality status summary |
REFRESH DYNAMIC TABLE | Manually trigger DT refresh | Initialize data after first creation |
DQC Check Types
| Type | Description | Example |
|---|---|---|
| Row count consistency | Whether upstream and downstream row counts match | Bronze team_info(33) = Silver dim_team(33) |
| Null rate | Proportion of NULL values in critical fields | goals null rate should be 0% |
| Uniqueness | Whether ID fields contain duplicates | player_id should be unique |
| Value range | Whether numeric values fall within reasonable bounds | goals >= 0, save_pct in [0,1] |
| Freshness | Whether data has been updated to the latest | Latest season >= 2019 |
| Aggregation consistency | Whether summary metrics are self-consistent | wins + losses = games played |
| Referential integrity | JOIN key match rate | skater_stats.player_id exists in player_info |
Creating the DQC Results Table
Defining Check Rules
Row Count Consistency
Verify that the Silver layer dimension table row count matches the Bronze source table:
Null Rate
Uniqueness
Value Range
Freshness
DQC Dashboard
Aggregate dqc_results into a Dynamic Table dashboard for a one-stop view of quality status across layers:
View the dashboard:
Validation Results
Running 10 DQC checks on the NHL Medallion architecture (Bronze: 10 tables → Silver: 4 DTs → Gold: 5 DTs):
| ID | Type | Layer | Metric | Expected | Actual | Status |
|---|---|---|---|---|---|---|
| DQC-001 | Row count | silver | dim_team_rows | 33 | 33 | PASS |
| DQC-002 | Row count | silver | dim_player_rows | 3925 | 3925 | PASS |
| DQC-003 | Null | silver | goals_null_pct | =0 | 0.00% | PASS |
| DQC-004 | Null | silver | player_name_null_pct | <1% | 0.00% | PASS |
| DQC-005 | Uniqueness | silver | player_id_unique | TRUE | TRUE | PASS |
| DQC-006 | Value range | silver | goals >= 0 | >=0 | 0 | PASS |
| DQC-007 | Value range | silver | points >= 0 | >=0 | 0 | PASS |
| DQC-008 | Freshness | bronze | max_season | >=2019 | 2020 | PASS |
| DQC-009 | Aggregation | gold | unique_seasons | >0 | 19 | PASS |
| DQC-010 | Aggregation | gold | wins >= 0 | >=0 | 0 | PASS |
All PASS, 100% pass rate.
Integrating with the Data Pipeline
Option 1: Manual trigger (suitable for development validation)
Option 2: Dynamic Table automated execution
Wrap DQC check logic in a Dynamic Table to automatically re-run after each source table refresh:
Option 3: Studio task scheduling
Create a DQC task in Studio with a Cron schedule and dependency on the ETL task:
If DQC finds a FAIL, you can configure Studio alert rules to send notifications.
DQC Checklist
| Layer | Must check after load | Recommended checks |
|---|---|---|
| Bronze | Row count >= source, latest data date | _op distribution (I/U/D), file count |
| Silver | Row count <= Bronze, critical field NULL < 1%, ID unique | LEFT JOIN match rate, value range, type conversion success rate |
| Gold | Aggregation results non-null, metrics >= 0 | Period-over-period change < 20%, TOP N results reasonable |
Integrating with Alerts
You can configure in Studio: any FAIL in DQC task results → trigger WeCom/DingTalk/Feishu notification.
Notes
| Note | Description |
|---|---|
| DQC results table should be a regular table | Retaining historical records enables trend analysis; DTs overwrite history |
| WARN does not block, FAIL should block | WARN means "needs attention"; FAIL means "cannot be published" |
| DQC checks have their own cost | Each check is a full table scan; keep the number of checks manageable (3-5 per layer recommended) |
| Thresholds need business calibration | NULL tolerance varies across business domains; use historical data to establish a baseline first |
| Mind the timezone in freshness checks | CURRENT_TIMESTAMP() is UTC, which may differ from the business timezone |
Related Documentation
- Medallion Pure-SQL DT Architecture — Bronze → Silver → Gold three-layer modeling
- Volume + Pipe Data Lake Acceleration — Data ingestion pipeline
- AI-Enhanced Data Analysis — Calling LLMs in SQL for intelligent analysis
- Studio Task Scheduling — DQC task Cron configuration
- Monitoring and Alerting — Configuring DQC anomaly notifications
