Singdata Lakehouse Medallion Architecture in Practice: Pure SQL Dynamic Table Approach

The Medallion architecture (Bronze β†’ Silver β†’ Gold) is a data lake organization pattern popularized by Databricks. On Singdata Lakehouse, beyond implementing it with ZettaPark Python API, there is a cleaner alternative: building all three layers declaratively using SQL Dynamic Tablesβ€”no Python code required, no scheduling platform configuration needed, and all three layers automatically refresh incrementally based on dependency chains.

This article uses the NHL (National Hockey League) real-world dataset (10 tables, ~14 million rows) to fully demonstrate this approach.

Data Lake Acceleration Overview: Where This Article Fits

A typical data lake acceleration pipeline looks like: Object storage files β†’ Volume (mount) β†’ Pipe (continuous ingestion) β†’ Target table β†’ Dynamic Table (incremental aggregation). The first two steps handle "automatic data loading," while this article focuses on the final stepβ€”cleansing, modeling, and aggregation after data is loaded, using Dynamic Tables to declaratively build the Bronze β†’ Silver β†’ Gold three-layer pipeline.

If you have not set up data ingestion yet, start with Volume + Pipe End-to-End Practice to get file auto-loading working first. If your data is already in Lakehouse tables (like the NHL dataset in this article), start directly here.


Why Use Dynamic Tables to Build Medallion

Traditional Medallion architecture typically relies on scheduling platforms (Airflow/Databricks Workflows) to execute Python Notebooks or SQL scripts sequentially. Dynamic Tables offer a different paradigm:

DimensionTraditional ETL SchedulingDynamic Table Approach
Coding stylePython/ZettaPark or SQL scriptsPure SQL (CREATE DYNAMIC TABLE ... AS SELECT)
Scheduling configRequires DAG and Cron configurationDeclarative REFRESH INTERVAL, system auto-schedules
Incremental computationManual incremental logic requiredSystem CBO automatically detects incremental changes
Dependency managementManual orchestration of upstream/downstream orderDT automatically determines refresh order by reference
Data lineageRequires additional tools to trackSHOW DYNAMIC TABLE REFRESH HISTORY built-in
Code as assetsNotebooks/scripts scattered across managementCentralized in Studio, searchable, comparable, reusable

The core difference: you do not need to worry about "when to run" or "what to run"β€”you only need to declare "what result you want". The system handles computation orchestration, incremental detection, and parallel scheduling.


Dataset Overview

NHL hockey data from the nhl_game_data schema (Bronze layer, already loaded):

TableRowsDescription
game26,305Main game table (matchups, scores, venues, seasons)
player_info3,925Player profiles (name, nationality, position, height/weight)
team_info33Team information (name, abbreviation)
game_skater_stats945,830Skater stats (goals, assists, shots, hits, +/-, etc.)
game_goalie_stats56,656Goalie stats (saves, goals against, save percentage)
game_goals148,992Goal details
game_plays5,050,529Game events (play-by-play)
game_plays_players7,586,604Player participation details per event
game_penalties247,828Penalty records
game_teams_stats52,610Team game-level statistics

Data relationships: game is the core fact table, linked to other tables via game_id, player_id, and team_id. Covers 10 seasons from 2010 to 2020.


Architecture Design

Bronze (nhl_game_data.*) Silver (silver.*) DT Gold (gold.*) DT ═══════════════════════ ══════════════════ ══════════════════ Raw data, zero transformation Cleansed + dimension joins Business metrics game ─────────┐ β”Œβ”€ dim_team (33) β”Œβ”€ scoring_leaders team_info ───── LEFT JOIN β”€β”€β†’β”œβ”€ dim_player (3,925) β”œβ”€ player_career_stats player_info β”€β”€β”˜ β”œβ”€ fact_skater_stats β”œβ”€ team_season_summary skater_stats ── LEFT JOIN ──→ └─ fact_goalie_stats β”œβ”€ goalie_season_rankings goalie_stats ── LEFT JOIN ──→ └─ team_home_away_split

Three-layer responsibilities:

LayerSchemaTable TypeResponsibility
Bronzenhl_game_dataRegular tableRaw data, no transformation
SilversilverDynamic TableJOIN dimension tables for names, cleanse field types (STRING→INT), standardize
GoldgoldDynamic TableAggregated metrics: top scorers, team records, goalie rankings, career stats

Implementation Steps

Prerequisites

  • Virtual Cluster available (use DEFAULT, GP type, Serverless on-demand wake-up)
  • Bronze data loaded (nhl_game_data.* 10 tables)
  • Permissions for CREATE SCHEMA / CREATE DYNAMIC TABLE

Step 1: Create Schemas

Use separate schemas to physically isolate each layer:

CREATE SCHEMA IF NOT EXISTS silver COMMENT 'Medallion Silver cleansed layer'; CREATE SCHEMA IF NOT EXISTS gold COMMENT 'Medallion Gold aggregated metrics layer';

Step 2: Silver Layer β€” Dimension Tables

The simplest DT: directly filter/transform columns from Bronze tables. These two tables are small (33 rows and 3,925 rows), so even FULL refreshes are effortless.

-- Team dimension CREATE OR REPLACE DYNAMIC TABLE silver.dim_team REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Silver team dimension table' AS SELECT team_id, franchiseid, shortname, teamname, abbreviation, link FROM nhl_game_data.team_info; -- Player dimension (standardized + full name column added) CREATE OR REPLACE DYNAMIC TABLE silver.dim_player REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Silver player dimension table β€” standardized fields + full name' AS SELECT player_id, firstname, lastname, CONCAT(firstname, ' ', lastname) AS full_name, nationality, birthcity, primaryposition AS position, birthdate, height, height_cm, CAST(NULLIF(REGEXP_REPLACE(weight, ',', ''), '') AS INT) AS weight_kg, shootscatches FROM nhl_game_data.player_info;

Step 3: Silver Layer β€” Fact Tables

The core work of fact tables: JOIN dimension tables to resolve names + type standardization. Using skater stats as an example:

CREATE OR REPLACE DYNAMIC TABLE silver.fact_skater_stats REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Silver skater stats fact table β€” joined with player name + team name + season' AS SELECT s.game_id, s.player_id, p.full_name AS player_name, p.position, s.team_id, t.teamname AS team_name, t.abbreviation AS team_abbr, g.season, g.date_time_gmt AS game_date, s.timeonice, s.goals, s.assists, s.goals + s.assists AS points, -- computed field: points s.shots, CAST(NULLIF(REGEXP_REPLACE(s.hits, ',', ''), '') AS INT) AS hits, s.powerplaygoals, s.penaltyminutes, s.plusminus, s.eventimeonice, s.powerplaytimeonice FROM nhl_game_data.game_skater_stats s LEFT JOIN nhl_game_data.game g ON s.game_id = g.game_id LEFT JOIN silver.dim_player p ON s.player_id = p.player_id LEFT JOIN silver.dim_team t ON s.team_id = t.team_id;

Goalie stats fact table follows the same pattern, with additional save percentage calculation:

CREATE OR REPLACE DYNAMIC TABLE silver.fact_goalie_stats REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Silver goalie stats fact table β€” includes save percentage calculation' AS SELECT gs.game_id, gs.player_id, p.full_name AS player_name, t.teamname AS team_name, t.abbreviation AS team_abbr, g.season, g.date_time_gmt AS game_date, gs.timeonice, gs.shots AS shots_faced, gs.saves, CASE WHEN gs.shots > 0 THEN ROUND(gs.saves * 1.0 / gs.shots, 3) ELSE NULL END AS save_pct, -- computed field: save percentage gs.decision FROM nhl_game_data.game_goalie_stats gs LEFT JOIN nhl_game_data.game g ON gs.game_id = g.game_id LEFT JOIN silver.dim_player p ON gs.player_id = p.player_id LEFT JOIN silver.dim_team t ON gs.team_id = t.team_id;

Step 4: Initial Refresh of Silver Layer

After DT creation, only the computation logic is definedβ€”there is no data yet. You need to manually trigger the first refresh:

REFRESH DYNAMIC TABLE silver.dim_team; REFRESH DYNAMIC TABLE silver.dim_player; REFRESH DYNAMIC TABLE silver.fact_skater_stats; REFRESH DYNAMIC TABLE silver.fact_goalie_stats;

Step 5: Gold Layer β€” Aggregated Metrics

The Gold layer reads data from the Silver layer and uses aggregate functions to generate business metrics. All tables use a 1 DAY refresh interval (T+1 scenario).

Top Scorers: TOP 20 Scorers Per Season

Use the RANK() window function to rank by season:

CREATE OR REPLACE DYNAMIC TABLE gold.scoring_leaders REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Gold top 20 scorers per season β€” ranked by points (goals + assists)' AS SELECT season, rank, player_id, player_name, position, team_abbr, games_played, goals, assists, points, ROUND(points * 1.0 / games_played, 2) AS pts_per_game FROM ( SELECT season, player_id, player_name, position, team_abbr, COUNT(*) AS games_played, SUM(goals) AS goals, SUM(assists) AS assists, SUM(points) AS points, RANK() OVER (PARTITION BY season ORDER BY SUM(points) DESC) AS rank FROM silver.fact_skater_stats GROUP BY season, player_id, player_name, position, team_abbr ) t WHERE rank <= 20;

Validation results (2019-20 season):

rankplayerteamgoalsassistspoints
1Nikita KucherovTBL160316476
2Nathan MacKinnonCOL176296472
3Leon DraisaitlEDM181274455
4David PastrnakBOS204216420
5Connor McDavidEDM153262415

Team Season Records

Bronze data only has a home/away team perspective per game. Each game needs to be expanded into two rows (one for home team, one for away team), then aggregated by team and season. This is implemented with UNION ALL + CASE WHEN:

CREATE OR REPLACE DYNAMIC TABLE gold.team_season_summary REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Gold team season records β€” wins/losses/goals/goals-against/points' AS SELECT g.season, g.team_id, t.teamname AS team_name, t.abbreviation AS team_abbr, COUNT(*) AS games_played, SUM(CASE WHEN g.side = 'home' AND g.outcome LIKE 'home win%' THEN 1 WHEN g.side = 'away' AND g.outcome LIKE 'away win%' THEN 1 ELSE 0 END) AS wins, SUM(CASE WHEN g.side = 'home' AND g.outcome LIKE 'away win%' THEN 1 WHEN g.side = 'away' AND g.outcome LIKE 'home win%' THEN 1 ELSE 0 END) AS losses, SUM(CASE WHEN g.side = 'home' THEN g.home_goals ELSE g.away_goals END) AS goals_for, SUM(CASE WHEN g.side = 'home' THEN g.away_goals ELSE g.home_goals END) AS goals_against, SUM(CASE WHEN g.side = 'home' AND g.outcome LIKE 'home win%' THEN 2 WHEN g.side = 'away' AND g.outcome LIKE 'away win%' THEN 2 ELSE 0 END) AS points FROM ( SELECT season, home_team_id AS team_id, outcome, home_goals, away_goals, 'home' AS side FROM nhl_game_data.game UNION ALL SELECT season, away_team_id AS team_id, outcome, home_goals, away_goals, 'away' AS side FROM nhl_game_data.game ) g LEFT JOIN silver.dim_team t ON g.team_id = t.team_id GROUP BY g.season, g.team_id, t.teamname, t.abbreviation;

Validation results (2019-20 season TOP 5):

teamgameswinslossespoints
Lightning (TBL)19012268244
Stars (DAL)19210488208
Golden Knights (VGK)18210280204
Avalanche (COL)17010268204
Flyers (PHI)17010268204

Goalie Season Rankings + Player Career Stats + Home/Away Split

Full DDL is in the appendix. The core pattern is the same: aggregate from Silver layer β†’ RANK() OVER (PARTITION BY season ...) β†’ filter TOP N.

Step 6: Validate the Full Pipeline

-- Row count comparison across layers SELECT 'Bronze game' AS layer, COUNT(*) FROM nhl_game_data.game UNION ALL SELECT 'Silver dim_team', COUNT(*) FROM silver.dim_team UNION ALL SELECT 'Silver fact_skater', COUNT(*) FROM silver.fact_skater_stats UNION ALL SELECT 'Gold scoring_leaders', COUNT(*) FROM gold.scoring_leaders UNION ALL SELECT 'Gold team_season', COUNT(*) FROM gold.team_season_summary; -- View DT refresh history SHOW DYNAMIC TABLE REFRESH HISTORY WHERE name = 'scoring_leaders';

Complete validation results:

LayerTableRowsRefresh ModeStatus
Silverdim_team33FULLβœ… Matches Bronze
Silverdim_player3,925FULLβœ… Matches Bronze
Silverfact_skater_stats1,130,682FULLβœ… Includes player_name/team_name/points
Silverfact_goalie_stats67,642FULLβœ… Includes computed save_pct
Goldscoring_leaders399FULLβœ… TOP 20 per season
Goldplayer_career_stats3,353FULLβœ… Career summary
Goldteam_season_summary580FULLβœ… 33 teams Γ— 18 seasons
Goldgoalie_season_rankings294FULLβœ… TOP 15 per season
Goldteam_home_away_split580FULLβœ… Home/away split

Design Principles

1. Cross-Layer Reference Rules

Reference DirectionAllowedExample
Silver β†’ Bronzeβœ…FROM nhl_game_data.game
Gold β†’ Silverβœ…FROM silver.fact_skater_stats
Gold β†’ Bronze⚠️ Not recommendedShould access indirectly through Silver layer
Gold β†’ Gold⚠️ Use with cautionOnly for multi-level aggregation
Bronze β†’ Silver❌ ForbiddenLower layers should not depend on upper layers

2. LEFT JOIN Filter Conditions Must Go in ON Clause

-- ❌ Wrong: WHERE filter degrades LEFT JOIN to INNER JOIN SELECT * FROM skater_stats s LEFT JOIN team_info t ON s.team_id = t.team_id WHERE t.abbreviation = 'TBL'; -- βœ… Correct: filter condition in ON clause SELECT * FROM skater_stats s LEFT JOIN team_info t ON s.team_id = t.team_id AND t.abbreviation = 'TBL';

3. First Refresh Baseline Time

REFRESH INTERVAL 1 DAY calculates the next trigger based on creation time and does not align to clock hours. It is recommended to immediately execute REFRESH DYNAMIC TABLE after creation to manually trigger the first refresh and reset the baseline time:

CREATE DYNAMIC TABLE gold.scoring_leaders ...; REFRESH DYNAMIC TABLE gold.scoring_leaders;

4. String Cleansing

When raw data comes from external systems, numeric fields may contain non-standard characters:

CAST(NULLIF(REGEXP_REPLACE(hits, ',', ''), '') AS INT)

Three-step cleansing: remove commas β†’ NULLIF empty string β†’ CAST to target type. NULLIF prevents CAST failures caused by empty strings.


Cost Analysis

LayerDT CountRefresh FrequencyEstimated CRU
Silver41 DAYLow (full refresh, but small data volume)
Gold51 DAYMedium (involves aggregation, ~14M row scan)

All use GP type Virtual Cluster (DEFAULT), Serverless on-demand billing. In T+1 scenarios with only one refresh per day, this is lower cost than traditional hourly ETL.


Comparison with ZettaPark Approach

ZettaPark ApproachPure SQL DT Approach (this article)
Target audiencePython developers, Data ScientistsSQL developers, Data Analysts
Code volumePython scripts + Spark APIPure SQL (DDL)
SchedulingRequires external scheduling (Studio/Notebook)DT auto-refresh, no scheduling needed
Incremental computationManual CDC management requiredSystem handles automatically
FlexibilityHigh (Python can call any library)Medium (within SQL expression capabilities)
Learning curvePandas/PySpark/ZettaParkPure SQL
Use casesComplex transformations, ML feature engineering, external API callsStandard ETL, aggregation, JOINs, window functions

Both approaches coexist without conflict: use ZettaPark for complex cleansing, use DT for aggregated metrics, leveraging the strengths of each within the same Medallion architecture.


Notes

NoteDescription
Bronze data changes trigger DT automaticallyAll 9 DTs in the pipeline refresh in dependency order, no manual trigger needed
DT does not support ALTER to modify SQLUse CREATE OR REPLACE to rebuild
Virtual Cluster must be GP typeAP type does not support small file merging, queries slow down over time
Silver fact tables reference Silver dimension tablesSystem automatically ensures dimension tables refresh first
String numeric fields need cleansingRemove commas β†’ NULLIF β†’ CAST, three steps
UNION ALL row expansion requires careful business logicWhen splitting home/away teams, win/loss determination must cross-match side and outcome
Manual REFRESH required after initial creationREFRESH INTERVAL does not immediately trigger the first computation

Appendix: Complete Gold Layer DDL

Player Career Stats

CREATE OR REPLACE DYNAMIC TABLE gold.player_career_stats REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Gold player career overview β€” all-season totals + per-game efficiency' AS SELECT player_id, player_name, position, COUNT(*) AS games_played, SUM(goals) AS total_goals, SUM(assists) AS total_assists, SUM(points) AS total_points, ROUND(SUM(points) * 1.0 / COUNT(*), 2) AS pts_per_game, ROUND(SUM(goals) * 1.0 / NULLIF(SUM(shots), 0), 3) AS shooting_pct, AVG(timeonice) AS avg_timeonice_sec, SUM(penaltyminutes) AS total_pim, AVG(plusminus) AS avg_plusminus FROM silver.fact_skater_stats GROUP BY player_id, player_name, position;

Goalie Season Rankings

CREATE OR REPLACE DYNAMIC TABLE gold.goalie_season_rankings REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Gold goalie season rankings TOP 15 β€” ranked by wins' AS SELECT season, rank, player_id, player_name, team_abbr, games_played, wins, saves, shots_faced, ROUND(save_pct, 3) AS save_pct FROM ( SELECT season, player_id, player_name, team_abbr, COUNT(*) AS games_played, SUM(CASE WHEN decision = 'W' THEN 1 ELSE 0 END) AS wins, SUM(saves) AS saves, SUM(shots_faced) AS shots_faced, CASE WHEN SUM(shots_faced) > 0 THEN SUM(saves) * 1.0 / SUM(shots_faced) ELSE NULL END AS save_pct, RANK() OVER (PARTITION BY season ORDER BY SUM(CASE WHEN decision = 'W' THEN 1 ELSE 0 END) DESC) AS rank FROM silver.fact_goalie_stats GROUP BY season, player_id, player_name, team_abbr ) t WHERE rank <= 15;

Home/Away Split

CREATE OR REPLACE DYNAMIC TABLE gold.team_home_away_split REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Gold team home vs. away performance β€” home win% vs away win%' AS SELECT g.season, g.team_id, t.teamname AS team_name, t.abbreviation AS team_abbr, COUNT(CASE WHEN g.side = 'home' THEN 1 END) AS home_games, COUNT(CASE WHEN g.side = 'home' AND g.outcome LIKE 'home win%' THEN 1 END) AS home_wins, COUNT(CASE WHEN g.side = 'away' THEN 1 END) AS away_games, COUNT(CASE WHEN g.side = 'away' AND g.outcome LIKE 'away win%' THEN 1 END) AS away_wins, ROUND( COUNT(CASE WHEN g.side = 'home' AND g.outcome LIKE 'home win%' THEN 1 END) * 1.0 / NULLIF(COUNT(CASE WHEN g.side = 'home' THEN 1 END), 0), 3 ) AS home_win_pct, ROUND( COUNT(CASE WHEN g.side = 'away' AND g.outcome LIKE 'away win%' THEN 1 END) * 1.0 / NULLIF(COUNT(CASE WHEN g.side = 'away' THEN 1 END), 0), 3 ) AS away_win_pct FROM ( SELECT season, home_team_id AS team_id, outcome, 'home' AS side FROM nhl_game_data.game UNION ALL SELECT season, away_team_id AS team_id, outcome, 'away' AS side FROM nhl_game_data.game ) g LEFT JOIN silver.dim_team t ON g.team_id = t.team_id GROUP BY g.season, g.team_id, t.teamname, t.abbreviation;


Complete data lake acceleration pipeline: Volume mount β†’ Pipe ingestion β†’ Dynamic Table modeling. The following documents cover each stage: