Spark SQL Migration Guide
This document provides a complete guide for migrating from Spark SQL to Lakehouse, covering migration assessment, type mapping, syntax differences, function compatibility, and UDF migration.
Migration Complexity Assessment
Lakehouse data types, DDL, DML, SELECT syntax, and the vast majority of high-frequency functions are compatible with Spark. For pure SQL ETL and analytics jobs, migration typically requires only minor changes.
Except for Python/Java UDFs, all migration items are one-time batch replacements or simple rewrites — no logic restructuring required.
The effort for UDF migration depends on the number of functions — each Python/Java UDF needs to be individually packaged and deployed to a cloud function service. This is the only part of the migration that requires significant investment. If UDF logic can be expressed in SQL, rewriting as a SQL Function significantly reduces the workload.
Quick Self-Assessment Checklist
Review your job code against the table below to quickly estimate migration effort before starting.
| Check item (search in codebase) | Lakehouse behavior | Usage frequency impact | Effort |
|---|---|---|---|
| None of the items below | DDL, SELECT, UPDATE/DELETE/MERGE fully compatible | — | None |
spark.udf.register / Python UDF / Java UDF | Must be rewritten as External Function, deployed to cloud function service | More UDFs = linearly more work | High |
CREATE TEMP VIEW / CREATE TEMPORARY VIEW | Session-level temporary views not supported; must change to CTE or persistent VIEW | High usage means more rewrites, but each change is simple | Low–Medium |
aggregate( / reduce( / session_window( / window( | No equivalent functions; must be manually rewritten | Occasional use has limited impact; core logic dependency requires redesign | Medium |
SQL UDF (CREATE FUNCTION) | Calls require Schema prefix, or configure lookup policy | High-frequency calls can be resolved once with udf_first | Low |
LATERAL VIEW posexplode | This syntax not supported; must change to table function syntax | Usually few occurrences; replace one by one | Low |
PARTITIONED BY clause in CTAS | CTAS does not support this clause; must split into CREATE TABLE + INSERT | Usually few occurrences; replace one by one | Low |
| Read/write via Spark Connector | Types auto-mapped; no code changes needed | — | None |
Type Mapping
When reading and writing data via Spark Connector, Spark types are automatically mapped to Lakehouse types — no manual code changes required.
Type Mapping Table
| Spark type | Lakehouse type | Notes |
|---|---|---|
BooleanType | BOOLEAN | Auto-mapped |
ByteType | TINYINT | Auto-mapped |
ShortType | SMALLINT | Auto-mapped |
IntegerType | INT | Auto-mapped |
LongType | BIGINT | Auto-mapped |
FloatType | FLOAT | Auto-mapped |
DoubleType | DOUBLE | Auto-mapped |
DecimalType | DECIMAL(p,s) | Auto-mapped, precision preserved |
StringType | STRING | Auto-mapped |
BinaryType | BINARY | Auto-mapped |
DateType | DATE | Auto-mapped |
TimestampType | TIMESTAMP_LTZ | Timezone-aware type |
TimestampNTZType | TIMESTAMP_NTZ | Timezone-naive type (Spark 3.4+) |
ArrayType | ARRAY<T> | Auto-mapped, element types recursively mapped |
MapType | MAP<K,V> | Auto-mapped, key/value types recursively mapped |
StructType | STRUCT<...> | Auto-mapped, field types recursively mapped |
Timestamp Timezone Notes
Spark's TimestampType is timezone-aware by default, corresponding to Lakehouse's TIMESTAMP_LTZ. If Spark code uses TimestampNTZType (timezone-naive), use TIMESTAMP_NTZ when creating tables in Lakehouse to avoid time value shifts from timezone conversion.
DDL Syntax
CREATE TABLE Syntax
Spark's CREATE TABLE syntax is highly compatible in Lakehouse. The following syntax is all supported and requires no changes during migration:
Partitioned Tables
Spark's partition syntax is fully compatible in Lakehouse — no changes needed during migration.
Lakehouse additionally supports Iceberg-style partition syntax where the partition column appears in the column list, making semantics clearer:
Both syntaxes produce the same result. Keep the original syntax during migration; use Iceberg style for new tables.
CTAS
CTAS (Create Table As Select) syntax is mostly compatible in Lakehouse, with the following differences:
USING keyword is optional
CTAS does not support partition clause
Lakehouse's CTAS syntax does not support the PARTITIONED BY clause. To create a partitioned table, create the table first then insert data:
Bucketed Tables
Bucketed table syntax is identical to Spark — no changes needed:
The USING parquet keyword is also optional in Lakehouse.
Partition Transform Functions
Lakehouse is fully consistent with Spark on partition transform functions — no code changes needed:
| Function | Purpose |
|---|---|
years(ts) | Partition by year |
months(ts) | Partition by month |
days(ts) | Partition by day |
hours(ts) | Partition by hour |
bucket(N, col) | Hash bucketing |
truncate(col, W) | Truncate partitioning |
Hidden Partitioning
Lakehouse uses a hidden partitioning mechanism similar to Apache Iceberg:
- Partition information is stored in metadata, not dependent on file paths
- Partition strategy can be changed at any time without rewriting data
- No limit on the number of partitions
- The optimizer automatically performs partition pruning at query time
Dynamic Partition Limit
Lakehouse supports a maximum of 2048 dynamic partitions per task. If the partition count exceeds this limit, consider writing in batches or using Cluster Key instead of partitioning.
DML Syntax
INSERT
INSERT OVERWRITE
Lakehouse supports three INSERT OVERWRITE modes, but the default semantics differ from Spark — pay attention during migration:
Overwrite the entire table
Static partition overwrite
Dynamic partition overwrite
Without specifying a partition value, the system automatically overwrites all partitions touched by this write; untouched partitions are preserved. This is consistent with Spark's spark.sql.sources.partitionOverwriteMode=dynamic behavior:
UPDATE / DELETE
MERGE INTO
SELECT Queries
Fully Compatible Syntax
The following common Spark SQL syntax is fully compatible in Lakehouse — no changes needed:
| Syntax | Example | Status |
|---|---|---|
| Backtick column references | SELECT `id`, `name` FROM t | ✅ |
| String concatenation | SELECT 'Hello' || ' ' || 'World' | ✅ |
| CASE WHEN | CASE WHEN x > 0 THEN 'positive' END | ✅ |
| GROUP BY positional reference | GROUP BY 1, 2 | ✅ |
| ORDER BY positional reference | ORDER BY 1 DESC | ✅ |
| HAVING with alias | HAVING total > 100 | ✅ |
| LIMIT | LIMIT 10 | ✅ |
| VALUES clause | SELECT * FROM VALUES (1, 'a'), (2, 'b') AS t(id, name) | ✅ |
| JOIN ... USING | JOIN b USING (id) | ✅ |
| CROSS JOIN | CROSS JOIN | ✅ |
| RLIKE / REGEXP | 'abc' RLIKE '[a-z]+$' | ✅ |
| NULLIF / NVL | NULLIF(a, b), NVL(col, 'default') | ✅ |
| EXCEPT / INTERSECT | SELECT ... EXCEPT SELECT ... | ✅ |
| QUALIFY | QUALIFY ROW_NUMBER() = 1 | ✅ |
| Implicit type conversion | WHERE str_col = 123 | ✅ |
| Division by zero | SELECT 1/0 → NULL | ✅ |
Correlated Subqueries
Lakehouse fully supports correlated subqueries, including referencing outer columns in EXISTS/NOT EXISTS. Syntax is identical to Spark — no changes needed:
Window Functions
Lakehouse is fully compatible with Spark SQL window functions — syntax is identical:
Supported window functions: ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE(), NTILE(), SUM() OVER(), AVG() OVER(), and more.
Table-Generating Functions
Lakehouse supports common Spark table-generating functions:
| Function | Supported | Notes |
|---|---|---|
explode() | ✅ | Fully compatible; supports both LATERAL VIEW and table function syntax |
posexplode() | ⚠️ | AS (pos, val) alias syntax supported; LATERAL VIEW ... AS pos, val syntax not supported — see below |
inline() | ✅ | Direct call |
stack() | ✅ | Direct call |
json_tuple() | ✅ | Fully compatible; supports LATERAL VIEW syntax |
JSON Processing Functions
| Spark function | Lakehouse function | Notes |
|---|---|---|
from_json(str, schema) | from_json(str, schema) | ✅ Fully compatible, returns STRUCT type |
to_json(struct) | to_json(expr) | ✅ Fully compatible |
get_json_object(str, path) | get_json_object(str, path) | ✅ Fully compatible |
json_tuple(str, path1, path2) | json_tuple(str, path1, path2) | ✅ Fully compatible |
parse_json(str) | parse_json(str) | ✅ Lakehouse-specific, returns JSON type |
Temporary Views and CTEs
Temporary Views
Lakehouse does not support TEMPORARY VIEW / TEMP VIEW syntax — this is an important difference from Spark.
Spark's TEMP VIEW is a session-level object that is automatically destroyed when the session ends. Lakehouse has no equivalent session-level view. Two options for migration:
Option 1: Change to a persistent VIEW (suitable for logic reused multiple times)
Option 2: Change to a CTE (suitable for temporary logic within a single query)
The CTE approach requires no permissions and creates no persistent objects — it is the simplest replacement.
CTEs (WITH Clause)
CTE syntax is identical to Spark — no changes needed.
Function Compatibility
Fully Compatible Functions
The following common Spark SQL functions are fully compatible in Lakehouse:
Array functions: split, regexp_replace, regexp_extract, concat_ws, size, array_sort, sort_array, array_contains, array_position, slice, sequence, flatten, arrays_zip, array_repeat, array_distinct, array_union, array_intersect, array_except, arrays_overlap, array_min, array_max, array_join, array_remove, cardinality, reverse, element_at
Map functions: map_keys, map_values, map_from_arrays, map_concat, str_to_map, map_filter, transform_keys, transform_values, map_zip_with, map_from_entries, element_at
Higher-order functions: transform, filter, exists, forall, zip_with
Aggregate functions: collect_list, collect_set, first, last, approx_count_distinct, percentile, percentile_approx, corr, covar_pop, covar_samp
String functions: split, regexp_replace, regexp_extract, concat_ws
Date functions: date_format, to_date, current_date, current_timestamp
Conditional functions: CASE WHEN, NULLIF, NVL, IFNULL, COALESCE, TRY_CAST, TRY_ELEMENT_AT
Other functions: typeof, named_struct, monotonically_increasing_id, current_database, current_schema, current_user, version, raise_error, assert_true, aes_encrypt, aes_decrypt
Unsupported Functions
The following Spark SQL functions are not supported in Lakehouse and must be replaced during migration:
| Spark function | Alternative | Notes |
|---|---|---|
aggregate(arr, init, merge) | Use subquery or UDF | Array aggregation |
reduce(arr, init, merge) | Same as above | Array reduction |
nanvl(x, y) | CASE WHEN isnan(x) THEN y ELSE x END | NaN handling |
bin(n) | conv(n, 10, 2) | Binary conversion |
hash(x) | murmurhash3_32(x) or sha2(x, 256) | Hash function |
xxhash64(x) | murmurhash3_32(x) | 64-bit hash |
shuffle(arr) | Not supported | Random array shuffle |
array_sort(arr, comparator) | Single-argument version only | Custom sort |
soundex(str) | Not supported | Phonetic encoding |
levenshtein(s1, s2) | Not supported | Edit distance |
overlay(str, replace, pos) | Not supported | String replacement |
sentences(str) | Not supported | Sentence splitting |
session_window(ts, gap) | Manually compute using LAG/LEAD | Session window |
window(ts, interval) | Not supported | Time window |
width_bucket(v, min, max, n) | Not supported | Bucketing function |
histogram_numeric(col, n) | Not supported | Numeric histogram |
kurtosis(col) | Not supported | Kurtosis |
skewness(col) | Not supported | Skewness |
reflect(class, method, args) | Not supported | Java reflection |
java_method(class, method, args) | Not supported | Java method call |
xpath_string(xml, xpath) | Not supported | XML parsing |
input_file_name() | Not supported | File name |
input_file_block_start() | Not supported | File block start |
spark_partition_id() | Not supported | Spark partition ID |
entries_to_map(keys, values) | map_from_arrays(keys, values) | Map construction |
length(array) | size(array) or cardinality(array) | Array length |
UDF Migration
Spark supports multiple UDF types. Lakehouse provides two corresponding mechanisms: SQL Function (pure SQL logic) and External Function (Python/Java code).
Migration Comparison
| Spark UDF type | Typical syntax | Lakehouse equivalent | Migration complexity |
|---|---|---|---|
| SQL UDF (pure expression) | CREATE FUNCTION f(x INT) RETURNS INT RETURN x * 2 | SQL Function | Low; syntax mostly identical |
| Python UDF (scalar) | spark.udf.register("f", lambda x: x*2, IntegerType()) | External Function (Python, UDF only) | Medium; must deploy to cloud function service |
| Java UDF | Implement UDF1<T,R> etc. or extend Hive GenericUDF/UDF | External Function (Java) | Medium; must package and deploy; Hive API code can be partially reused |
| Java UDAF | Extend Aggregator or UserDefinedAggregateFunction | External Function (Java UDAF) | High; aggregation logic must be rewritten |
| Java UDTF | Extend GenericUDTF | External Function (Java UDTF) | High; table function logic must be rewritten |
| Python UDAF / Python UDTF | pandas_udf aggregation, yield multiple rows | Not supported; must rewrite in Java or change to SQL Function | High |
SQL Function (Pure SQL Logic)
Spark SQL UDFs can be directly migrated to Lakehouse SQL Functions — syntax is highly compatible:
Key difference from Spark: Schema prefix required
Lakehouse SQL Functions are Schema-level objects; calls require a Schema prefix by default:
If the original Spark code calls UDFs without a prefix extensively, enable UDF-first lookup to make prefix-free calls work:
SQL Functions also support table functions (returning multiple rows), corresponding to Spark UDTF scenarios expressible in pure SQL:
External Function (Python/Java Code)
Spark's Python UDFs and Scala/Java UDFs must be migrated to Lakehouse External Functions. External Functions deploy function logic to a cloud function service (Alibaba Cloud FC, Tencent Cloud SCF, or AWS Lambda); Lakehouse calls them via HTTP.
Supported scope:
- Python 3.10: UDF (scalar functions) only
- Java 8: UDF, UDAF (aggregate functions), UDTF (table functions)
Migration steps overview:
- Rewrite UDF code as a cloud function Handler (Python 3.10 or Java 8 Hive-style UDF)
- Package and upload to object storage or a Lakehouse Volume
- Create an API Connection in Lakehouse (stores cloud function service authentication)
- Create an External Function and bind the Connection
Session-Level UDF Registration
Spark supports dynamically registering UDFs on the Driver (spark.udf.register), making them immediately available in the current session. Lakehouse has no equivalent session-level registration mechanism — all functions are persistent Schema objects.
During migration, change spark.udf.register registration logic to CREATE OR REPLACE FUNCTION, executed once in the deployment script or initialization phase.
Configuration Parameters
Spark's query optimization parameters are automatically managed by Lakehouse — no manual configuration needed:
| Spark configuration | Lakehouse behavior |
|---|---|
spark.sql.adaptive.enabled | Adaptive query optimization enabled by default |
spark.sql.shuffle.partitions | Parallelism automatically managed |
spark.sql.broadcastTimeout | Broadcast Join automatically handled |
spark.sql.files.maxPartitionBytes | File splitting automatically optimized |
DataFrame Write Limitations
When writing to Lakehouse via Spark Connector, note the following limitations:
- Must write all columns; partial column writes are not supported
- Writing to tables with primary keys (PK tables) is not supported
- Write mode supports
appendonly;overwritefor single partition is not supported
Migration Checklist
- Confirm
TimestampTypemaps toTIMESTAMP_LTZ;TimestampNTZTypemaps toTIMESTAMP_NTZ - CTAS:
PARTITIONED BYnot supported; split into CREATE TABLE + INSERT -
INSERT OVERWRITE: Lakehouse defaults to dynamic partition overwrite; if original code relies on Spark 2.x static mode (full table overwrite), change toTRUNCATE TABLE+INSERT INTO -
TEMP VIEW: Not supported; change to persistentVIEWor inlineCTE -
posexplodewithLATERAL VIEW ... AS pos, valsyntax not supported; change to table function syntaxFROM t, posexplode(col) AS tmp(pos, val) - Function compatibility: Check for unsupported functions (e.g.,
aggregate,reduce,hash,shuffle,session_window,window,levenshtein,overlay, etc.) — see Function Compatibility section - DataFrame writes: Confirm full-column writes; do not use PK tables
- Dynamic partitions: Confirm single-task partition count does not exceed 2048
- SQL UDF: Calls require Schema prefix, or configure
cz.sql.remote.udf.lookup.policy = udf_firstfor prefix-free compatibility - Python/Scala/Java UDF: Migrate to External Function; deploy to cloud function service and create API Connection
-
spark.udf.registerdynamic registration: Change toCREATE OR REPLACE FUNCTION, executed once in deployment script
Related Documentation
- PySpark → ZettaPark Migration: F1 Racing Data Engineering Project: Complete migration case with 4 required changes and real-world pitfalls, 71/71 verified
- Build a Medallion Three-Layer Warehouse from Scratch on Lakehouse: Bronze → Silver → Gold modeling practice with surrogate key generation and 22 automated validations
- Data Type Compatibility Reference: MySQL/PostgreSQL/Hive/Spark type mapping
- Data Type Conversion: Explicit and implicit conversion rules
- Using Spark Connector: Spark Connector configuration and read/write examples
- Spark Connector Introduction: Spark Connector overview
- SQL Function: Syntax reference for creating pure SQL UDFs, including
cz.sql.remote.udf.lookup.policyconfiguration
