PySpark → ZettaPark Migration in Practice: F1 Racing Data Engineering Project
If you're familiar with PySpark, the learning curve for migrating to ZettaPark is much lower than you might expect. ZettaPark's DataFrame API is highly compatible with PySpark — select, filter, join, Window functions, and aggregation functions all use identical syntax. You don't need to learn a new API or rewrite your business logic. The changes are limited to 4 environment configuration points.
This article validates that claim with a real project: a complete migration of an F1 racing data engineering project built on the Apache Spark stack (Azure Databricks + PySpark) to Singdata Lakehouse, passing all 71 automated validation checks.
Full code on GitHub: spark2lakehouse-formula1
Original Project
spark2lakehouse-formula1 is forked from FerhattSimsekk/formula1-data-engineering. The original stack is Azure Databricks + PySpark + Delta Lake. The project implements a complete data engineering pipeline — from API download and data ingestion through multi-table join transformations to analytical layer output — covering all 125 races across the 2018–2023 seasons. It includes 8 core tables: circuits, drivers, constructors, race results, pit stops, lap times, and qualifying results, totaling approximately 140,000 rows.
The migrated code lives in the 03_lakehouse/ directory and can be compared file-by-file with 01_spark/.
Technology Stack Comparison
| Original Project | After Migration | |
|---|---|---|
| Compute Engine | Apache Spark (Azure Databricks) | Singdata Lakehouse |
| DataFrame API | PySpark (pyspark.sql) | ZettaPark (clickzetta.zettapark) |
| Storage Format | Delta Lake | Lakehouse native tables |
| File Storage | ADLS (/mnt/... mount) | Volume (vol://schema.vol/...) |
| Session Management | spark (Databricks global injection) | Session.builder.configs({}).create() |
| SQL Execution | spark.sql(q) executes immediately | session.sql(q).collect() triggers execution |
| Partitioned Write | df.write.partitionBy("col") | PARTITIONED BY (col TYPE) DDL |
| Temporary Views | CREATE TEMP VIEW | df.create_or_replace_temp_view() |
| Runtime Environment | Databricks Notebook | Local Python script / Jupyter |
The differences in the table look significant, but their impact on migration effort is minimal. What changes is the runtime environment and the API package name — the data processing logic itself stays the same. 90% of the code can be reused directly. The 4 required changes are all mechanical replacements with no business logic rewriting. DataFrame operations (select, filter, join, groupBy, Window functions, aggregation functions) have identical syntax and require no changes.
Conclusion First
90% of PySpark code can be reused directly. Changes are limited to 4 areas.
| Change | Effort | Notes |
|---|---|---|
| Import path replacement | Very low | Mechanical replacement, no logic changes |
| Explicit Session creation | Very low | Databricks injects spark globally; ZettaPark requires explicit creation |
Add .collect() to DDL/DML | Low | Lazy execution — without it, nothing runs |
| File path format | Low | /mnt/... → vol://schema.vol/... |
Window functions, aggregations, JOINs, filters, sorting — the core operations of data engineering all have identical syntax and need no changes.
Project Background
The data source is the F1 racing API (Jolpica), covering 125 races across the 2018–2023 seasons. The data architecture has two layers: f1_processed (ingestion layer) and f1_presentation (analytical layer).
| Table | Rows | Description |
|---|---|---|
| circuits | 78 | Circuits across 34 countries |
| races | 125 | Full 2018–2023 seasons |
| drivers | 37 | Race drivers |
| constructors | 15 | Race teams |
| results | 2,500 | Race results |
| pit_stops | 4,294 | Pit stop records |
| lap_times | 134,957 | Lap times |
| qualifying | 2,497 | Qualifying results |
Migration Steps
Step 1: Replace Import Paths
This is the only mechanical replacement — a global find-and-replace in your editor:
Step 2: Replace Session Creation
In Databricks, spark is globally injected and requires no setup. ZettaPark requires explicit Session creation:
Step 3: Add .collect() to DDL/DML
ZettaPark's session.sql() is lazy — it returns a DataFrame object and won't actually execute without an action. This is the easiest thing to miss.
For SELECT queries it's not strictly required, but adding .collect() when you just want to trigger execution (without needing the result) is good practice:
Step 4: Replace File Paths
The original project uses ADLS mount paths. ZettaPark uses Volume paths:
Uploading local files to a Volume:
Fully Compatible APIs
The following APIs were used in this project and require no modifications:
4 Real Gotchas from the Migration
These are issues encountered during the actual migration — not documented anywhere, only discovered by running the code.
Gotcha 1: withColumn Triggers Server-Side Schema Resolution in Older Versions
Symptom: After calling df.withColumn("rank", F.rank().over(window_spec)), column names become internal-prefix formats like r_f6fw_race_id, or the call throws an error outright.
Root cause: In older versions of ZettaPark, withColumn calls self._output, which triggers a server-side schema pre-resolution request. Under certain conditions this returns column names with internal prefixes.
Fix: Use select() + .alias() to handle all column operations in one pass:
Gotcha 2: After Multi-Table JOINs, select() Must Explicitly Specify Column Sources
Symptom: After a three-table JOIN, calling select("race_id", "driver_name", ...) throws an error, or the written table has column names with internal prefixes.
Root cause: In older versions of ZettaPark, after multi-table JOINs, columns with the same name from different sources get internal hash prefixes (e.g., r_f6fw_race_id) to disambiguate them.
Fix: After a JOIN, explicitly specify the source DataFrame and .alias() for every column in select():
Gotcha 3: CREATE TEMP VIEW Is Not Supported
Symptom: session.sql("CREATE TEMP VIEW v AS SELECT ...").collect() throws: CZLH-42000 Syntax error - missing KW_ENDPOINT at 'VIEW'.
Root cause: Singdata Lakehouse SQL does not support the CREATE TEMP VIEW syntax.
Fix: Use the DataFrame's create_or_replace_temp_view() method, which behaves identically to PySpark's createOrReplaceTempView:
Gotcha 4: Column Name Ambiguity in MERGE INTO VALUES Clause
Symptom: In a MERGE INTO statement's WHEN NOT MATCHED THEN INSERT ... VALUES (...), bare column names in the VALUES clause are resolved as columns from the target table (tgt), resulting in null inserts.
Root cause: In Lakehouse SQL, bare column names in the VALUES clause resolve to tgt, not src.
Fix: Prefix all column names in the VALUES clause with src.:
Incremental Write Wrapper: merge_delta_data
The original PySpark project uses overwrite_partition() for incremental writes. The ZettaPark equivalent is merge_delta_data(), which implements upsert semantics using MERGE INTO:
The benefit of this wrapper: the same code creates the table on first run and performs incremental updates on subsequent runs — no need for the caller to check whether the table exists.
Partitioned table creation requires hand-written DDL because ZettaPark's save_as_table does not support a partition_by parameter:
Data Source Compatibility Issues
The original project used the Ergast API (now discontinued). This project switches to the Jolpica API. A few issues here are unrelated to ZettaPark but must be addressed during migration.
Primary Key Types Changed
The original data used integer IDs; the Jolpica API uses string refs (e.g., "hamilton", "mercedes"). JOIN conditions need to be updated accordingly:
Global Endpoint Returns Only 100 Records
Jolpica's /drivers.json?limit=1000 actually returns only 100 records (alphabetically sorted), which means major 2018–2023 participants like Hamilton, Vettel, Mercedes, and Red Bull are all missing.
Fix: use the per-season endpoint and download by season, then merge and deduplicate:
The symptom of this issue is that f1_presentation.race_results contains only 66 rows (should be 2,500), because most driver_id values can't find a matching driver_ref during the JOIN.
End-to-End Validation
After migration, data correctness was verified with 71 automated checks:
Validation covers 5 dimensions:
- Row count checks: All 12 tables have data with expected row counts
- NULL checks: Key columns (primary keys, foreign keys, business fields) have no null values
- Duplicate detection: No duplicate driver per race, no duplicate driver_ref
- Business rules: position is between 1–20, points are non-negative, calculated_points = 11 - position, rank minimum is 1
- Cross-layer consistency:
driver_standings.total_pointsequals the year-aggregated result fromrace_results
Actual result: 71/71 all passed.
Full Compatibility Reference
| API | PySpark | ZettaPark | Status |
|---|---|---|---|
df.select() | ✅ | ✅ | Fully compatible |
df.filter() / df.where() | ✅ | ✅ | Fully compatible |
df.join(other, cond, how) | ✅ | ✅ | Fully compatible |
df.groupBy().agg() | ✅ | ✅ | Fully compatible |
df.sort() / df.orderBy() | ✅ | ✅ | Fully compatible |
df.limit(n) / df.count() | ✅ | ✅ | Fully compatible |
df.show() / df.collect() | ✅ | ✅ | Fully compatible |
df.dropDuplicates(keys) | ✅ | ✅ | Fully compatible |
df.union(other) / df.distinct() | ✅ | ✅ | Fully compatible |
F.col() / F.lit() / F.when().otherwise() | ✅ | ✅ | Fully compatible |
F.sum() / F.count() / F.avg() / F.max() / F.min() | ✅ | ✅ | Fully compatible |
F.rank() / F.dense_rank() / F.row_number() | ✅ | ✅ | Fully compatible |
F.current_timestamp() / F.current_date() | ✅ | ✅ | Fully compatible |
F.concat() / F.trim() / F.upper() / F.lower() | ✅ | ✅ | Fully compatible |
F.isNull() / F.isNotNull() | ✅ | ✅ | Fully compatible |
Window.partitionBy().orderBy() | ✅ | ✅ | Fully compatible |
| MERGE INTO (SQL) | ✅ | ✅ | Fully compatible |
df.withColumn() | ✅ | ✅ Fixed in 0.1.5 | Use select().alias() in older versions |
session.sql() DDL/DML | Immediate | Lazy — requires .collect() | ⚠️ Must change |
df.write.partitionBy() | ✅ | ❌ Not supported | Use PARTITIONED BY DDL instead |
CREATE TEMP VIEW (SQL) | ✅ | ❌ Not supported | Use create_or_replace_temp_view() instead |
| File paths | /mnt/... | vol://schema.vol/... | ⚠️ Different format |
| Import paths | pyspark.sql | clickzetta.zettapark | ⚠️ Mechanical replacement |
Migration Conclusion
ZettaPark is highly compatible with PySpark's DataFrame API. The core migration effort is in environment adaptation, not business logic rewriting.
Fully compatible (no changes needed):
- DataFrame operations:
select,filter,join,groupBy,agg,sort,limit,union,distinct - Function library:
F.col(),F.when().otherwise(),F.sum/count/avg/max/min,F.rank/dense_rank/row_number,F.concat/trim/upper/lower,F.current_timestamp() - Window functions:
Window.partitionBy().orderBy()syntax is identical - SQL syntax: MERGE INTO, UPDATE, DELETE, INSERT, CTE are fully compatible
4 required changes:
| Change | Reason | Effort |
|---|---|---|
Import path pyspark.sql → clickzetta.zettapark | Different package name | Very low — global replace |
| Explicit Session creation | Databricks injects globally; ZettaPark requires explicit creation | Very low — one-time |
Add .collect() to DDL/DML | ZettaPark is lazy — without it, nothing executes | Low — check each statement |
File path /mnt/... → vol://... | Different storage system | Low — format replacement |
This project passed 71/71 validation checks, confirming that data integrity after migration is fully consistent with the original Databricks version.
References
- GitHub project: spark2lakehouse-formula1
- Original project: FerhattSimsekk/formula1-data-engineering
- ZettaPark API reference: ZettaPark DataFrame API Guide
- Spark SQL syntax migration: Spark SQL Syntax Migration Guide
- Medallion three-layer data warehouse: Building a Medallion Architecture from Scratch on Lakehouse
- Volume usage guide: Volume Usage Guide
- Data type compatibility: Data Type Compatibility Reference
- Custom functions: SQL Function · External Function
- Spark Connector: Using Spark Connector
