Building a Medallion Lakehouse from Scratch on Singdata Lakehouse
If you've worked with the Medallion architecture on Databricks, migrating to Singdata Lakehouse is mostly an environment configuration exercise — the modeling logic and code stay the same. Bronze ingestion, Silver cleansing, Gold dimensional modeling — all of that thinking and code carries over directly to ZettaPark.
This article walks through a complete three-layer modeling process using a real migration project: moving a Medallion architecture example built on the Apache Spark stack (Databricks + PySpark) to Singdata Lakehouse. After 22 automated validation checks, 20/22 passed (the 2 warnings stem from source data quality issues, not from the migration itself).
Full code on GitHub: spark2lakehouse-medallion
The Original Project
spark2lakehouse-medallion is adapted from DataWithBaraa/databricks_bootcamp_2026. It demonstrates how to use the Medallion architecture on Databricks to integrate data from two systems — CRM and ERP — and produce a dimensional model ready for BI analysis. The project includes 6 source tables (3 from CRM + 3 from ERP), which are processed through three layers to produce 2 dimension tables (dim_customers, dim_products) and 1 fact table (fact_sales).
The migrated code lives in the 03_lakehouse/ directory. The original Databricks Notebooks are preserved in 01_spark/ for side-by-side comparison.
Technology Stack Comparison
| Original | After Migration | |
|---|---|---|
| Compute engine | Apache Spark (Databricks) | Singdata Lakehouse |
| DataFrame API | PySpark (pyspark.sql) | ZettaPark (clickzetta.zettapark) |
| Development environment | Databricks Notebook | Jupyter Notebook (local) |
| Storage format | Delta Lake | Lakehouse native tables |
| File storage | DBFS / ADLS | Volume (vol://schema.vol/...) |
| Session management | spark (globally injected) | Session.builder.configs({}).create() |
| Method naming | withColumn / withColumnRenamed | with_column / with_column_renamed |
| Write method | df.write.mode("overwrite").saveAsTable(t) | df.write.save_as_table(t, mode="overwrite") |
| Layer isolation | Delta databases | Separate schemas (bronze / silver / gold) |
The main changes are in the runtime environment — switching from Databricks Notebook to local Jupyter, and from DBFS to Volume. The core data processing and modeling logic is completely unchanged: cleansing, deduplication, multi-source JOINs, Window functions, dimensional modeling — all of these work the same way in ZettaPark as in PySpark. Business logic (F.when().otherwise(), Window.partition_by().order_by(), df.join(), df.filter()) is fully compatible — not a single line needs to change.
This project achieves roughly 90% code compatibility. Changes are concentrated in four areas: import paths, Session creation, method naming, and write syntax — all mechanical substitutions. If the original project doesn't rely on Databricks' globally injected spark (i.e., it runs with local PySpark), compatibility can approach 100%. Projects that use Python UDFs or df.write.partitionBy() will see somewhat lower compatibility.
Architecture Overview
The Medallion architecture divides data processing into three layers, each with a clear responsibility boundary:
The three layers map to three separate schemas — physically isolated, with no cross-layer interference. Each layer only reads from the layer above it.
The Medallion Data Model
The Medallion architecture (also called the multi-hop architecture) is the dominant data organization pattern for lakehouse scenarios. Popularized by Databricks, it has become an industry standard.
Traditional data warehouses mix raw and processed data together, making it hard to trace issues or rerun pipelines. The core idea of Medallion is layered isolation: raw data is always preserved in the Bronze layer, each layer does exactly one thing, and if something goes wrong you can rerun from the previous layer without affecting anything else.
This pattern is especially well-suited for multi-source integration scenarios — data quality varies across systems, Bronze captures everything as-is, Silver handles cleansing and standardization, and Gold focuses on business-oriented modeling. Responsibilities are clear and layers don't interfere with each other.
Layer Responsibilities
| Layer | Alias | Responsibility | Data State |
|---|---|---|---|
| Bronze | Raw layer | Ingest as-is, no transformations | Raw data, may contain dirty records, duplicates, inconsistent formats |
| Silver | Cleansed layer | Deduplication, cleansing, standardization, multi-source integration | Clean, trustworthy, business-semantic |
| Gold | Serving layer | Dimensional modeling for analytics and BI | Aggregated wide tables, star schema |
The benefit of physical isolation across layers: if any layer has a problem, you can rerun it from the layer above — no need to re-download raw data, and no impact on other layers.
Star Schema
The Gold layer typically uses a star schema: one fact table at the center, surrounded by dimension tables.
Fact tables store business events (orders, transactions). Each row is one event and contains measures (amounts, quantities) plus foreign keys pointing to dimension tables. Dimension tables store attributes of business entities (customer info, product info) and are joined to the fact table via foreign keys.
This project's Gold layer follows exactly this structure: fact_sales links to dim_customers and dim_products via customer_key and product_key.
Why Surrogate Keys?
Dimension table primary keys come in two flavors: natural keys (from the source system, e.g., customer_id = "C001") and surrogate keys (integers generated internally by the warehouse, e.g., customer_key = 1).
Reasons to use surrogate keys:
- Performance: integer JOINs are faster than string JOINs
- Stability: source system IDs can change due to business events (mergers, renumbering); surrogate keys are unaffected
- Cross-source integration: the same entity from multiple source systems can map to a single surrogate key
This project generates surrogate keys using F.row_number().over(Window.order_by("customer_id")) — simple and effective.
Environment Initialization
Before running any notebook, create the schemas and Volume, then upload the CSV files. This only needs to be done once.
.env configuration:
Bronze Layer: Raw Data Ingestion
The Bronze layer principle is no business transformations — just read the CSVs and write them as tables. The benefit: if something goes wrong downstream, you can rerun from Bronze without re-downloading the raw data.
6 tables, 6 config entries, handled uniformly. mode="overwrite" ensures idempotency — rerunning won't produce duplicate data.
Silver Layer: Cleansing and Standardization
The Silver layer does three things: remove nulls, standardize enum values, rename columns. Each source table has its own notebook with independent, self-contained logic.
Customer Info Cleansing (crm_cust_info → silver.crm_customers)
Note that with_column_renamed follows ZettaPark's snake_case naming convention, corresponding to PySpark's withColumnRenamed — the behavior is identical.
Deduplication: Keep the Latest Record by Priority
When source data contains duplicate IDs, a simple drop_duplicates isn't enough — you need to decide which record to keep based on business rules. Here, row_number() ranks records by creation date descending, keeping only the most recent:
This pattern is common in multi-source integration: the same product may exist in both CRM and ERP, with one system treated as authoritative and the other as supplementary.
Silver Layer Orchestration
6 notebooks run in dependency order:
Gold Layer: Dimensional Modeling
The Gold layer's core work is multi-source integration and surrogate key generation.
Customer Dimension Table (dim_customers)
CRM customer info is the primary source; ERP customer info and location data serve as supplements, integrated via LEFT JOIN:
Why surrogate keys?
customer_id is the natural key from the source system. It may be inconsistent across systems and can change over time due to business events. customer_key is the warehouse's internal integer primary key — stable, unique, and used as the foreign key in fact tables, independent of source system ID formats.
Sales Fact Table (fact_sales)
The fact table joins to dimension tables via surrogate keys and stores no redundant business attributes:
Data Quality Validation
After the pipeline runs, 04_validate.ipynb performs 22 automated checks. The validation logic uses ZettaPark queries directly, with no external framework required:
Actual results: 20/22 passed. 2 warnings:
bronze.crm_cust_infocontains 5 groups of duplicatecustomer_idvalues (source data quality issue)bronze.crm_sales_detailscontains 3 rows with negative sales amounts (returns/reversals — normal business behavior)
Neither issue was introduced by the migration; both exist in the original Databricks project as well.
Key Design Decisions
Why mode="overwrite" instead of MERGE INTO?
This Medallion project has a small data volume (a few thousand rows). Full overwrite is far simpler than incremental MERGE and easier to keep idempotent. MERGE INTO complexity is only worth it when data volumes are large (millions of rows or more) or when you need to preserve historical versions.
Comparison:
| Scenario | Recommended approach | Reason |
|---|---|---|
| Small data, full refresh | save_as_table(mode="overwrite") | Simple, idempotent, no merge key to maintain |
| Large data, incremental updates | MERGE INTO + merge_delta_data() | Avoids full table scans, preserves history |
| Historical snapshots needed | Time Travel + INSERT | Natively supported by Lakehouse |
Why surrogate keys in the Gold layer?
Using natural keys (e.g., customer_id) as foreign keys in the fact table creates two problems:
- Natural keys may be strings, which JOIN more slowly than integers
- Source system IDs can change (mergers, renumbering), breaking historical data associations
The integer key generated by row_number().over(Window.order_by("customer_id")) is stable and unique — the standard approach in dimensional modeling.
Why must you explicitly specify column sources after a multi-table JOIN?
This isn't a ZettaPark limitation — PySpark requires the same treatment after multi-table JOINs. Explicitly specifying the source is good practice: it improves readability and prevents runtime errors from column name ambiguity.
Full Execution Order
Side-by-Side Comparison with PySpark
This project preserves the original Databricks Notebooks in the 01_spark/ directory for file-by-file comparison. There are only 4 core differences:
| Difference | PySpark | ZettaPark |
|---|---|---|
| Method naming | withColumn / withColumnRenamed | with_column / with_column_renamed |
| Session creation | spark (globally injected) | Session.builder.configs({...}).create() |
| File paths | Local path / DBFS | vol://schema.vol/path |
| Write method | df.write.mode("overwrite").saveAsTable(t) | df.write.save_as_table(t, mode="overwrite") |
Business logic (F.when().otherwise(), Window.partition_by().order_by(), df.join(), df.filter()) is fully compatible — not a single line needs to change.
Migration Conclusions
ZettaPark is highly compatible with PySpark's DataFrame API. This project validates the following conclusions:
Fully compatible (no changes needed):
- DataFrame chaining:
filter,select,join,group_by,agg,drop_duplicates - Function library:
F.when().otherwise(),F.trim(),F.upper(),F.coalesce(),F.row_number(),F.isnotnull() - Window functions:
Window.partition_by().order_by(), windowed aggregation behavior is identical - SQL DML: MERGE INTO, INSERT, UPDATE, DELETE
4 changes required (same as the F1 project):
| Difference | PySpark | ZettaPark |
|---|---|---|
| Method naming | withColumn / withColumnRenamed | with_column / with_column_renamed |
| Session creation | spark (globally injected) | Session.builder.configs({...}).create() |
| File paths | Local path / DBFS | vol://schema.vol/path |
| Write method | df.write.mode("overwrite").saveAsTable(t) | df.write.save_as_table(t, mode="overwrite") |
With 20/22 validations passing (the 2 warnings are source data quality issues, not migration artifacts), this project demonstrates that ZettaPark can fully handle PySpark-based Medallion data engineering workloads.
References
- GitHub project: spark2lakehouse-medallion
- Original project: DataWithBaraa/databricks_bootcamp_2026
- PySpark migration in practice (more details on pitfalls): PySpark → ZettaPark Migration in Practice: F1 Racing Data Engineering Project
- ZettaPark API reference: ZettaPark DataFrame API Guide
- Spark SQL syntax migration: Spark SQL Syntax Migration Guide
- Volume usage guide: Volume Usage Guide
- Data type compatibility: Data Type Compatibility Reference
- Custom functions: SQL Function · External Function
- Spark Connector: Using Spark Connector
