Migrating Spark Jobs Smoothly to Lakehouse: A Practical Guide
This guide helps teams with existing Spark jobs migrate smoothly to Singdata Lakehouse.
Why Migrate?
What this is: Singdata Lakehouse natively supports Spark-type virtual clusters (VClusters). Your RDD, DataFrame, and UDF logic can run with almost no changes: you only need to switch the data read/write entry points from Hive/Parquet to the Spark Connector.
Migration benefits:
- No logic rewrite: Core compute code (RDD/DataFrame) is 100% compatible.
- Unified entry point: Compute jobs run directly on the Lakehouse architecture, which separates storage from compute, so you get elastic scaling without maintaining a traditional Hadoop cluster.
- Data lake native: Read and write Iceberg/Parquet data directly, with no intermediate transfers.
1. Compatibility Overview
Before migrating, assess whether your code falls within the supported scope:
| Spark Operation | Support Status | Notes and Recommendations |
|---|---|---|
| RDD operations | ✅ Supported | Legacy code and flexible control logic require no changes. |
| DataFrame operations | ✅ Supported | Covers data warehouse task development and large-scale data processing. |
| Read/write warehouse data | ⚠️ Requires adaptation | Not supported: df.write.format("parquet").saveAsTable("hive_db.table"). Must be changed to df.write.format("clickzetta"). |
| SQL operations | ⚠️ Requires adaptation | Not supported: spark.sql("SELECT ... FROM lakehouse_table") direct queries. You must first read data via the Connector and register it as a temporary table. |
| Spark Streaming | ❌ Not yet supported | For real-time stream processing, migrate to Kafka Pipe or real-time sync tasks. |
| Hive metadata access | ❌ Not yet supported | Cannot access Hive tables via enableHiveSupport(). Data must be imported into Lakehouse first. |
2. Core Migration Steps
2.1 Environment Setup
- Create a Spark VCluster: in the Lakehouse SQL window, create a VCluster of type SPARK.
- Download tools:
  - Spark Connector JAR (contact technical support to obtain)
  - spark-submit client
2.2 Code Adaptation: Wrapping Read/Write Methods
Since direct spark.sql operations on Lakehouse tables are not supported, you need to wrap two utility methods:
Read: Load data via the Connector and register it as a temporary view, so subsequent spark.sql calls can operate on it like a regular table.
Write: Use format("clickzetta") to write a DataFrame to Lakehouse.
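The two wrappers might look like the following minimal Scala sketch. It assumes the Connector is on the classpath and is addressed as format("clickzetta"), as stated above. The ConnectorConfig holder, the viewNameFor helper, and the "table" option key are hypothetical: this guide does not list the option names the Connector expects, so take them from the Spark Connector documentation. The default append mode is likewise an assumption.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object LakehouseIO {
  // Hypothetical holder for connection settings. `options` carries whatever
  // key/value pairs the Connector requires; `tableKey` is the option name that
  // selects the table ("table" is an assumption, not confirmed by this guide).
  final case class ConnectorConfig(options: Map[String, String], tableKey: String = "table")

  // Temp view names cannot contain dots, so keep only the last name segment.
  def viewNameFor(table: String): String = table.split('.').last

  // Read: load the table through the Connector and register it as a temporary
  // view so that later spark.sql calls can query it like a regular table.
  def readClickzettaTable(spark: SparkSession, table: String)(implicit cfg: ConnectorConfig): DataFrame = {
    val df = spark.read
      .format("clickzetta")
      .options(cfg.options)
      .option(cfg.tableKey, table)
      .load()
    df.createOrReplaceTempView(viewNameFor(table))
    df
  }

  // Write: send a DataFrame to a Lakehouse table through the Connector.
  // Which save modes the Connector accepts is an assumption to verify.
  def writeClickzettaTable(df: DataFrame, table: String, mode: SaveMode = SaveMode.Append)(
      implicit cfg: ConnectorConfig): Unit =
    df.write
      .format("clickzetta")
      .options(cfg.options)
      .option(cfg.tableKey, table)
      .mode(mode)
      .save()
}
```

With `import LakehouseIO._` and an implicit ConnectorConfig in scope, call sites keep the short form used in this guide: readClickzettaTable(spark, "users") and writeClickzettaTable(df, "result").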
2.3 Code Adaptation Comparison
| Original Spark Code (Hive) | Adapted Code (Lakehouse) |
|---|---|
| spark.sql("SELECT * FROM hive_db.users") | 1. readClickzettaTable(spark, "users"); 2. spark.sql("SELECT * FROM users") |
| df.write.saveAsTable("hive_db.result") | writeClickzettaTable(df, "result") |
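To illustrate the first row of the comparison, the sketch below shows that once a DataFrame is registered as a temporary view, the SQL text only loses its hive_db qualifier. It uses standard Spark calls only. The local[1] session and the in-memory rows are stand-ins chosen for illustration: on Lakehouse the DataFrame would come from the Connector read, and the master would be supplied at submission time.

```scala
import org.apache.spark.sql.SparkSession

object TempViewDemo {
  def activeUserCount(spark: SparkSession): Long = {
    import spark.implicits._

    // Stand-in for the DataFrame that a Connector read of "users" would return.
    val users = Seq((1L, "ann", true), (2L, "bob", false), (3L, "cy", true))
      .toDF("id", "name", "active")

    // Registering the view is what makes the unqualified name resolvable.
    users.createOrReplaceTempView("users")

    // Before migration: SELECT COUNT(*) AS n FROM hive_db.users WHERE active
    spark.sql("SELECT COUNT(*) AS n FROM users WHERE active").first().getLong(0)
  }

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only.
    val spark = SparkSession.builder().master("local[1]").appName("temp-view-demo").getOrCreate()
    try println(activeUserCount(spark))
    finally spark.stop()
  }
}
```

A temporary view lives only for the duration of the SparkSession, so each job has to register the tables it queries before its first spark.sql call.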
3. Job Submission Guide
Upload your packaged JAR to OSS and submit it using the spark-submit client provided by Lakehouse.
Command example:
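As a hedged sketch, the small Scala helper below assembles a spark-submit argument list from the key parameters described in this section. --master, --jars, --conf, and --class are standard spark-submit flags; that the Lakehouse-provided client accepts the spark.cz.* settings through --conf is an assumption. The SubmitParams type, the entry class, and the application JAR location are hypothetical, and every value in main is a placeholder to replace with your own.

```scala
object SubmitCommand {
  final case class SubmitParams(
      endpoint: String,     // --master: Lakehouse API endpoint
      connectorJar: String, // --jars: Spark Connector JAR
      vcluster: String,     // spark.cz.vcluster: a VCluster of type SPARK
      instanceName: String, // spark.cz.instance.name: instance ID
      mainClass: String,    // hypothetical: entry class of your job
      appJar: String        // hypothetical: location of the JAR uploaded to OSS
  )

  // Build the argument vector; each element is one shell argument.
  def build(p: SubmitParams): Seq[String] = Seq(
    "spark-submit",
    "--master", p.endpoint,
    "--jars", p.connectorJar,
    "--conf", s"spark.cz.vcluster=${p.vcluster}",
    "--conf", s"spark.cz.instance.name=${p.instanceName}",
    "--class", p.mainClass,
    p.appJar
  )

  def main(args: Array[String]): Unit = {
    val cmd = build(SubmitParams(
      endpoint = "<lakehouse-api-endpoint>",
      connectorJar = "<path-to-connector-jar>",
      vcluster = "<spark-vcluster-name>",
      instanceName = "<instance-id>",
      mainClass = "<your.main.Class>",
      appJar = "<oss-path-to-your-job-jar>"
    ))
    // Print the command for review before running it.
    println(cmd.mkString(" "))
  }
}
```

Keeping the arguments as a sequence rather than one string avoids quoting mistakes if the command is later run programmatically.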
Key parameter reference:
| Parameter | Description | Where to Find |
|---|---|---|
| --master | Lakehouse API Endpoint | Studio → Management → Workspace → JDBC connection string domain |
| --jars | Required Connector dependency JAR | Spark Connector Overview |
| spark.cz.vcluster | Target compute cluster | Must be a VCluster of type SPARK |
| spark.cz.instance.name | Instance ID | Studio → Management → Workspace → Instance ID |
4. Maven Dependency Configuration
In pom.xml, set Spark core dependencies to <scope>provided</scope>. The Connector can either be supplied at submit time via --jars or, if you bundle it into a fat JAR, be given compile scope.
