Singdata Lakehouse Medallion Architecture in Practice: Pure SQL Dynamic Table Approach
The Medallion architecture (Bronze β Silver β Gold) is a data lake organization pattern popularized by Databricks. On Singdata Lakehouse, beyond implementing it with ZettaPark Python API, there is a cleaner alternative: building all three layers declaratively using SQL Dynamic Tablesβno Python code required, no scheduling platform configuration needed, and all three layers automatically refresh incrementally based on dependency chains.
This article uses the NHL (National Hockey League) real-world dataset (10 tables, ~14 million rows) to fully demonstrate this approach.
Data Lake Acceleration Overview: Where This Article Fits
A typical data lake acceleration pipeline looks like: Object storage files β Volume (mount) β Pipe (continuous ingestion) β Target table β Dynamic Table (incremental aggregation). The first two steps handle "automatic data loading," while this article focuses on the final stepβcleansing, modeling, and aggregation after data is loaded, using Dynamic Tables to declaratively build the Bronze β Silver β Gold three-layer pipeline.
If you have not set up data ingestion yet, start with Volume + Pipe End-to-End Practice to get file auto-loading working first. If your data is already in Lakehouse tables (like the NHL dataset in this article), start directly here.
Why Use Dynamic Tables to Build Medallion
Traditional Medallion architecture typically relies on scheduling platforms (Airflow/Databricks Workflows) to execute Python Notebooks or SQL scripts sequentially. Dynamic Tables offer a different paradigm:
| Dimension | Traditional ETL Scheduling | Dynamic Table Approach |
|---|---|---|
| Coding style | Python/ZettaPark or SQL scripts | Pure SQL (CREATE DYNAMIC TABLE ... AS SELECT) |
| Scheduling config | Requires DAG and Cron configuration | Declarative REFRESH INTERVAL, system auto-schedules |
| Incremental computation | Manual incremental logic required | System CBO automatically detects incremental changes |
| Dependency management | Manual orchestration of upstream/downstream order | DT automatically determines refresh order by reference |
| Data lineage | Requires additional tools to track | SHOW DYNAMIC TABLE REFRESH HISTORY built-in |
| Code as assets | Notebooks/scripts scattered across management | Centralized in Studio, searchable, comparable, reusable |
The core difference: you do not need to worry about "when to run" or "what to run"βyou only need to declare "what result you want". The system handles computation orchestration, incremental detection, and parallel scheduling.
Dataset Overview
NHL hockey data from the nhl_game_data schema (Bronze layer, already loaded):
| Table | Rows | Description |
|---|---|---|
game | 26,305 | Main game table (matchups, scores, venues, seasons) |
player_info | 3,925 | Player profiles (name, nationality, position, height/weight) |
team_info | 33 | Team information (name, abbreviation) |
game_skater_stats | 945,830 | Skater stats (goals, assists, shots, hits, +/-, etc.) |
game_goalie_stats | 56,656 | Goalie stats (saves, goals against, save percentage) |
game_goals | 148,992 | Goal details |
game_plays | 5,050,529 | Game events (play-by-play) |
game_plays_players | 7,586,604 | Player participation details per event |
game_penalties | 247,828 | Penalty records |
game_teams_stats | 52,610 | Team game-level statistics |
Data relationships: game is the core fact table, linked to other tables via game_id, player_id, and team_id. Covers 10 seasons from 2010 to 2020.
Architecture Design
Three-layer responsibilities:
| Layer | Schema | Table Type | Responsibility |
|---|---|---|---|
| Bronze | nhl_game_data | Regular table | Raw data, no transformation |
| Silver | silver | Dynamic Table | JOIN dimension tables for names, cleanse field types (STRINGβINT), standardize |
| Gold | gold | Dynamic Table | Aggregated metrics: top scorers, team records, goalie rankings, career stats |
Implementation Steps
Prerequisites
- Virtual Cluster available (use
DEFAULT, GP type, Serverless on-demand wake-up) - Bronze data loaded (
nhl_game_data.*10 tables) - Permissions for CREATE SCHEMA / CREATE DYNAMIC TABLE
Step 1: Create Schemas
Use separate schemas to physically isolate each layer:
Step 2: Silver Layer β Dimension Tables
The simplest DT: directly filter/transform columns from Bronze tables. These two tables are small (33 rows and 3,925 rows), so even FULL refreshes are effortless.
Step 3: Silver Layer β Fact Tables
The core work of fact tables: JOIN dimension tables to resolve names + type standardization. Using skater stats as an example:
Goalie stats fact table follows the same pattern, with additional save percentage calculation:
Step 4: Initial Refresh of Silver Layer
After DT creation, only the computation logic is definedβthere is no data yet. You need to manually trigger the first refresh:
Step 5: Gold Layer β Aggregated Metrics
The Gold layer reads data from the Silver layer and uses aggregate functions to generate business metrics. All tables use a 1 DAY refresh interval (T+1 scenario).
Top Scorers: TOP 20 Scorers Per Season
Use the RANK() window function to rank by season:
Validation results (2019-20 season):
| rank | player | team | goals | assists | points |
|---|---|---|---|---|---|
| 1 | Nikita Kucherov | TBL | 160 | 316 | 476 |
| 2 | Nathan MacKinnon | COL | 176 | 296 | 472 |
| 3 | Leon Draisaitl | EDM | 181 | 274 | 455 |
| 4 | David Pastrnak | BOS | 204 | 216 | 420 |
| 5 | Connor McDavid | EDM | 153 | 262 | 415 |
Team Season Records
Bronze data only has a home/away team perspective per game. Each game needs to be expanded into two rows (one for home team, one for away team), then aggregated by team and season. This is implemented with UNION ALL + CASE WHEN:
Validation results (2019-20 season TOP 5):
| team | games | wins | losses | points |
|---|---|---|---|---|
| Lightning (TBL) | 190 | 122 | 68 | 244 |
| Stars (DAL) | 192 | 104 | 88 | 208 |
| Golden Knights (VGK) | 182 | 102 | 80 | 204 |
| Avalanche (COL) | 170 | 102 | 68 | 204 |
| Flyers (PHI) | 170 | 102 | 68 | 204 |
Goalie Season Rankings + Player Career Stats + Home/Away Split
Full DDL is in the appendix. The core pattern is the same: aggregate from Silver layer β RANK() OVER (PARTITION BY season ...) β filter TOP N.
Step 6: Validate the Full Pipeline
Complete validation results:
| Layer | Table | Rows | Refresh Mode | Status |
|---|---|---|---|---|
| Silver | dim_team | 33 | FULL | β Matches Bronze |
| Silver | dim_player | 3,925 | FULL | β Matches Bronze |
| Silver | fact_skater_stats | 1,130,682 | FULL | β Includes player_name/team_name/points |
| Silver | fact_goalie_stats | 67,642 | FULL | β Includes computed save_pct |
| Gold | scoring_leaders | 399 | FULL | β TOP 20 per season |
| Gold | player_career_stats | 3,353 | FULL | β Career summary |
| Gold | team_season_summary | 580 | FULL | β 33 teams Γ 18 seasons |
| Gold | goalie_season_rankings | 294 | FULL | β TOP 15 per season |
| Gold | team_home_away_split | 580 | FULL | β Home/away split |
Design Principles
1. Cross-Layer Reference Rules
| Reference Direction | Allowed | Example |
|---|---|---|
| Silver β Bronze | β | FROM nhl_game_data.game |
| Gold β Silver | β | FROM silver.fact_skater_stats |
| Gold β Bronze | β οΈ Not recommended | Should access indirectly through Silver layer |
| Gold β Gold | β οΈ Use with caution | Only for multi-level aggregation |
| Bronze β Silver | β Forbidden | Lower layers should not depend on upper layers |
2. LEFT JOIN Filter Conditions Must Go in ON Clause
3. First Refresh Baseline Time
REFRESH INTERVAL 1 DAY calculates the next trigger based on creation time and does not align to clock hours. It is recommended to immediately execute REFRESH DYNAMIC TABLE after creation to manually trigger the first refresh and reset the baseline time:
4. String Cleansing
When raw data comes from external systems, numeric fields may contain non-standard characters:
Three-step cleansing: remove commas β NULLIF empty string β CAST to target type. NULLIF prevents CAST failures caused by empty strings.
Cost Analysis
| Layer | DT Count | Refresh Frequency | Estimated CRU |
|---|---|---|---|
| Silver | 4 | 1 DAY | Low (full refresh, but small data volume) |
| Gold | 5 | 1 DAY | Medium (involves aggregation, ~14M row scan) |
All use GP type Virtual Cluster (DEFAULT), Serverless on-demand billing. In T+1 scenarios with only one refresh per day, this is lower cost than traditional hourly ETL.
Comparison with ZettaPark Approach
| ZettaPark Approach | Pure SQL DT Approach (this article) | |
|---|---|---|
| Target audience | Python developers, Data Scientists | SQL developers, Data Analysts |
| Code volume | Python scripts + Spark API | Pure SQL (DDL) |
| Scheduling | Requires external scheduling (Studio/Notebook) | DT auto-refresh, no scheduling needed |
| Incremental computation | Manual CDC management required | System handles automatically |
| Flexibility | High (Python can call any library) | Medium (within SQL expression capabilities) |
| Learning curve | Pandas/PySpark/ZettaPark | Pure SQL |
| Use cases | Complex transformations, ML feature engineering, external API calls | Standard ETL, aggregation, JOINs, window functions |
Both approaches coexist without conflict: use ZettaPark for complex cleansing, use DT for aggregated metrics, leveraging the strengths of each within the same Medallion architecture.
Notes
| Note | Description |
|---|---|
| Bronze data changes trigger DT automatically | All 9 DTs in the pipeline refresh in dependency order, no manual trigger needed |
| DT does not support ALTER to modify SQL | Use CREATE OR REPLACE to rebuild |
| Virtual Cluster must be GP type | AP type does not support small file merging, queries slow down over time |
| Silver fact tables reference Silver dimension tables | System automatically ensures dimension tables refresh first |
| String numeric fields need cleansing | Remove commas β NULLIF β CAST, three steps |
| UNION ALL row expansion requires careful business logic | When splitting home/away teams, win/loss determination must cross-match side and outcome |
| Manual REFRESH required after initial creation | REFRESH INTERVAL does not immediately trigger the first computation |
Appendix: Complete Gold Layer DDL
Player Career Stats
Goalie Season Rankings
Home/Away Split
Related Documents
Complete data lake acceleration pipeline: Volume mount β Pipe ingestion β Dynamic Table modeling. The following documents cover each stage:
- Volume + Pipe Data Lake Acceleration β File auto-ingestion, the upstream step for this article
- Multi-Cloud Unified Data Lake Acceleration β Same SQL runs on Alibaba Cloud/Tencent Cloud/AWS
- Dynamic Table Introduction β Incremental computation mechanism and scheduling principles
- CREATE DYNAMIC TABLE β Complete DDL syntax
- Incremental Computing Overview β DT incremental refresh support matrix
- Medallion from Scratch (ZettaPark Approach) β Python API version covering the same topic
