Spark SQL Migration Guide

This document provides a complete guide for migrating from Spark SQL to Lakehouse, covering migration assessment, type mapping, syntax differences, function compatibility, and UDF migration.

Migration Complexity Assessment

Lakehouse data types, DDL, DML, SELECT syntax, and the vast majority of high-frequency functions are compatible with Spark. For pure SQL ETL and analytics jobs, migration typically requires only minor changes.

Except for Python/Java UDFs, all migration items are one-time batch replacements or simple rewrites — no logic restructuring required.

The effort for UDF migration depends on the number of functions — each Python/Java UDF needs to be individually packaged and deployed to a cloud function service. This is the only part of the migration that requires significant investment. If UDF logic can be expressed in SQL, rewriting as a SQL Function significantly reduces the workload.

Quick Self-Assessment Checklist

Review your job code against the table below to quickly estimate migration effort before starting.

Check item (search in codebase)Lakehouse behaviorUsage frequency impactEffort
None of the items belowDDL, SELECT, UPDATE/DELETE/MERGE fully compatibleNone
spark.udf.register / Python UDF / Java UDFMust be rewritten as External Function, deployed to cloud function serviceMore UDFs = linearly more workHigh
CREATE TEMP VIEW / CREATE TEMPORARY VIEWSession-level temporary views not supported; must change to CTE or persistent VIEWHigh usage means more rewrites, but each change is simpleLow–Medium
aggregate( / reduce( / session_window( / window(No equivalent functions; must be manually rewrittenOccasional use has limited impact; core logic dependency requires redesignMedium
SQL UDF (CREATE FUNCTION)Calls require Schema prefix, or configure lookup policyHigh-frequency calls can be resolved once with udf_firstLow
LATERAL VIEW posexplodeThis syntax not supported; must change to table function syntaxUsually few occurrences; replace one by oneLow
PARTITIONED BY clause in CTASCTAS does not support this clause; must split into CREATE TABLE + INSERTUsually few occurrences; replace one by oneLow
Read/write via Spark ConnectorTypes auto-mapped; no code changes neededNone

Type Mapping

When reading and writing data via Spark Connector, Spark types are automatically mapped to Lakehouse types — no manual code changes required.

Type Mapping Table

Spark typeLakehouse typeNotes
BooleanTypeBOOLEANAuto-mapped
ByteTypeTINYINTAuto-mapped
ShortTypeSMALLINTAuto-mapped
IntegerTypeINTAuto-mapped
LongTypeBIGINTAuto-mapped
FloatTypeFLOATAuto-mapped
DoubleTypeDOUBLEAuto-mapped
DecimalTypeDECIMAL(p,s)Auto-mapped, precision preserved
StringTypeSTRINGAuto-mapped
BinaryTypeBINARYAuto-mapped
DateTypeDATEAuto-mapped
TimestampTypeTIMESTAMP_LTZTimezone-aware type
TimestampNTZTypeTIMESTAMP_NTZTimezone-naive type (Spark 3.4+)
ArrayTypeARRAY<T>Auto-mapped, element types recursively mapped
MapTypeMAP<K,V>Auto-mapped, key/value types recursively mapped
StructTypeSTRUCT<...>Auto-mapped, field types recursively mapped

Timestamp Timezone Notes

Spark's TimestampType is timezone-aware by default, corresponding to Lakehouse's TIMESTAMP_LTZ. If Spark code uses TimestampNTZType (timezone-naive), use TIMESTAMP_NTZ when creating tables in Lakehouse to avoid time value shifts from timezone conversion.

DDL Syntax

CREATE TABLE Syntax

Spark's CREATE TABLE syntax is highly compatible in Lakehouse. The following syntax is all supported and requires no changes during migration:

-- Comment syntax CREATE TABLE orders ( id INT COMMENT 'primary key', name STRING COMMENT 'name' ) COMMENT 'orders table'; -- NOT NULL constraint CREATE TABLE users (id INT NOT NULL, name STRING); -- Default values CREATE TABLE tasks (id INT, status STRING DEFAULT 'pending'); -- Generated columns CREATE TABLE order_items ( id INT, price DOUBLE, quantity INT, total DOUBLE GENERATED ALWAYS AS (price * quantity) ); -- Table properties CREATE TABLE events (id INT) TBLPROPERTIES ('key1' = 'value1'); -- Conditional table creation CREATE TABLE IF NOT EXISTS backup_table AS SELECT * FROM original_table;

Partitioned Tables

Spark's partition syntax is fully compatible in Lakehouse — no changes needed during migration.

-- Spark original syntax, fully supported in Lakehouse CREATE TABLE orders ( order_id INT, customer_id INT, amount DOUBLE ) PARTITIONED BY (order_date STRING);

Lakehouse additionally supports Iceberg-style partition syntax where the partition column appears in the column list, making semantics clearer:

-- Lakehouse recommended syntax (Iceberg style) CREATE TABLE orders ( order_id INT, customer_id INT, amount DOUBLE, order_date STRING -- partition column in column list ) PARTITIONED BY (order_date); -- only declare column name here

Both syntaxes produce the same result. Keep the original syntax during migration; use Iceberg style for new tables.

CTAS

CTAS (Create Table As Select) syntax is mostly compatible in Lakehouse, with the following differences:

USING keyword is optional

-- Spark CREATE TABLE orders_ctas USING parquet AS SELECT * FROM orders; -- Lakehouse (USING can be kept or omitted) CREATE TABLE orders_ctas AS SELECT * FROM orders;

CTAS does not support partition clause

Lakehouse's CTAS syntax does not support the PARTITIONED BY clause. To create a partitioned table, create the table first then insert data:

-- Wrong: CTAS does not support PARTITIONED BY CREATE TABLE orders_partitioned PARTITIONED BY (order_date) AS SELECT * FROM orders; -- Correct: create table first, then insert CREATE TABLE orders_partitioned ( order_id INT, customer_id INT, amount DOUBLE, order_date STRING ) PARTITIONED BY (order_date); INSERT INTO orders_partitioned SELECT * FROM orders;

Bucketed Tables

Bucketed table syntax is identical to Spark — no changes needed:

-- Identical syntax in Spark and Lakehouse CREATE TABLE users ( id INT, name STRING ) CLUSTERED BY (id) INTO 16 BUCKETS;

The USING parquet keyword is also optional in Lakehouse.

Partition Transform Functions

Lakehouse is fully consistent with Spark on partition transform functions — no code changes needed:

FunctionPurpose
years(ts)Partition by year
months(ts)Partition by month
days(ts)Partition by day
hours(ts)Partition by hour
bucket(N, col)Hash bucketing
truncate(col, W)Truncate partitioning

Hidden Partitioning

Lakehouse uses a hidden partitioning mechanism similar to Apache Iceberg:

  • Partition information is stored in metadata, not dependent on file paths
  • Partition strategy can be changed at any time without rewriting data
  • No limit on the number of partitions
  • The optimizer automatically performs partition pruning at query time

Dynamic Partition Limit

Lakehouse supports a maximum of 2048 dynamic partitions per task. If the partition count exceeds this limit, consider writing in batches or using Cluster Key instead of partitioning.

DML Syntax

INSERT

-- Fully compatible INSERT INTO orders VALUES (1, 100, '2024-01-15'); INSERT INTO orders SELECT * FROM staging_orders;

INSERT OVERWRITE

Lakehouse supports three INSERT OVERWRITE modes, but the default semantics differ from Spark — pay attention during migration:

Overwrite the entire table

-- Clear the table and write new data INSERT OVERWRITE orders VALUES (1, 100, '2024-01-15');

Static partition overwrite

-- Overwrite only the specified partition; other partitions are unaffected INSERT OVERWRITE orders PARTITION (order_date='2024-01-15') SELECT order_id, customer_id, amount FROM staging WHERE order_date='2024-01-15';

Dynamic partition overwrite

Without specifying a partition value, the system automatically overwrites all partitions touched by this write; untouched partitions are preserved. This is consistent with Spark's spark.sql.sources.partitionOverwriteMode=dynamic behavior:

-- Overwrite only partitions present in the SELECT result; other partitions are preserved INSERT OVERWRITE orders SELECT order_id, customer_id, amount, order_date FROM staging;

UPDATE / DELETE

-- Fully compatible UPDATE orders SET status = 'shipped' WHERE id = 1; DELETE FROM orders WHERE status = 'cancelled';

MERGE INTO

-- Fully compatible MERGE INTO orders AS target USING staging_orders AS source ON target.id = source.id WHEN MATCHED THEN UPDATE SET target.amount = source.amount WHEN NOT MATCHED THEN INSERT (id, amount, order_date) VALUES (source.id, source.amount, source.order_date);

SELECT Queries

Fully Compatible Syntax

The following common Spark SQL syntax is fully compatible in Lakehouse — no changes needed:

SyntaxExampleStatus
Backtick column referencesSELECT `id`, `name` FROM t
String concatenationSELECT 'Hello' || ' ' || 'World'
CASE WHENCASE WHEN x > 0 THEN 'positive' END
GROUP BY positional referenceGROUP BY 1, 2
ORDER BY positional referenceORDER BY 1 DESC
HAVING with aliasHAVING total > 100
LIMITLIMIT 10
VALUES clauseSELECT * FROM VALUES (1, 'a'), (2, 'b') AS t(id, name)
JOIN ... USINGJOIN b USING (id)
CROSS JOINCROSS JOIN
RLIKE / REGEXP'abc' RLIKE '[a-z]+$'
NULLIF / NVLNULLIF(a, b), NVL(col, 'default')
EXCEPT / INTERSECTSELECT ... EXCEPT SELECT ...
QUALIFYQUALIFY ROW_NUMBER() = 1
Implicit type conversionWHERE str_col = 123
Division by zeroSELECT 1/0NULL

Correlated Subqueries

Lakehouse fully supports correlated subqueries, including referencing outer columns in EXISTS/NOT EXISTS. Syntax is identical to Spark — no changes needed:

-- Fully compatible, no rewrite needed SELECT id FROM orders a WHERE EXISTS (SELECT 1 FROM customers b WHERE b.id = a.customer_id); SELECT id FROM orders a WHERE NOT EXISTS (SELECT 1 FROM blacklist b WHERE b.order_id = a.id);

Window Functions

Lakehouse is fully compatible with Spark SQL window functions — syntax is identical:

SELECT id, amount, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) AS rn, SUM(amount) OVER (PARTITION BY user_id ORDER BY created_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total FROM orders;

Supported window functions: ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE(), NTILE(), SUM() OVER(), AVG() OVER(), and more.

Table-Generating Functions

Lakehouse supports common Spark table-generating functions:

FunctionSupportedNotes
explode()Fully compatible; supports both LATERAL VIEW and table function syntax
posexplode()⚠️AS (pos, val) alias syntax supported; LATERAL VIEW ... AS pos, val syntax not supported — see below
inline()Direct call
stack()Direct call
json_tuple()Fully compatible; supports LATERAL VIEW syntax

-- explode: both syntaxes supported SELECT id, item FROM orders LATERAL VIEW explode(items) t AS item; SELECT id, item FROM orders, explode(items) AS t(item); -- posexplode: AS (pos, val) alias syntax supported SELECT pos, val FROM posexplode(ARRAY('a', 'b', 'c')) AS t(pos, val); -- Returns: (0,'a'), (1,'b'), (2,'c') -- posexplode: LATERAL VIEW AS pos, val syntax not supported; must rewrite -- Spark syntax (not supported): -- SELECT pos, val FROM t LATERAL VIEW posexplode(items) tmp AS pos, val -- Lakehouse rewrite: SELECT pos, val FROM t, posexplode(items) AS tmp(pos, val); -- inline/stack/json_tuple: direct calls SELECT inline(ARRAY(named_struct('a', 1), named_struct('a', 2))); SELECT stack(2, 'a', 1, 'b', 2); SELECT a, b FROM json_tuple('{"a":1,"b":2}', 'a', 'b') AS j(a, b);

JSON Processing Functions

Spark functionLakehouse functionNotes
from_json(str, schema)from_json(str, schema)✅ Fully compatible, returns STRUCT type
to_json(struct)to_json(expr)✅ Fully compatible
get_json_object(str, path)get_json_object(str, path)✅ Fully compatible
json_tuple(str, path1, path2)json_tuple(str, path1, path2)✅ Fully compatible
parse_json(str)parse_json(str)✅ Lakehouse-specific, returns JSON type

-- from_json fully compatible SELECT from_json('{"a":1}', 'a INT').a; -- get_json_object fully compatible SELECT get_json_object('{"a":{"b":123}}', '$.a.b'); -- parse_json also available (returns JSON type) SELECT parse_json('{"a":1}')['a'];

Temporary Views and CTEs

Temporary Views

Lakehouse does not support TEMPORARY VIEW / TEMP VIEW syntax — this is an important difference from Spark.

Spark's TEMP VIEW is a session-level object that is automatically destroyed when the session ends. Lakehouse has no equivalent session-level view. Two options for migration:

Option 1: Change to a persistent VIEW (suitable for logic reused multiple times)

-- Spark CREATE OR REPLACE TEMP VIEW temp_orders AS SELECT * FROM orders WHERE status = 'active'; -- Lakehouse: change to a persistent view; requires write permission on the Schema CREATE OR REPLACE VIEW my_schema.temp_orders AS SELECT * FROM orders WHERE status = 'active';

Option 2: Change to a CTE (suitable for temporary logic within a single query)

-- Spark CREATE OR REPLACE TEMP VIEW daily_orders AS SELECT order_date, SUM(amount) AS total FROM orders GROUP BY order_date; SELECT * FROM daily_orders WHERE total > 1000; -- Lakehouse: change to CTE, logic inlined in the query WITH daily_orders AS ( SELECT order_date, SUM(amount) AS total FROM orders GROUP BY order_date ) SELECT * FROM daily_orders WHERE total > 1000;

The CTE approach requires no permissions and creates no persistent objects — it is the simplest replacement.

CTEs (WITH Clause)

CTE syntax is identical to Spark — no changes needed.

Function Compatibility

Fully Compatible Functions

The following common Spark SQL functions are fully compatible in Lakehouse:

Array functions: split, regexp_replace, regexp_extract, concat_ws, size, array_sort, sort_array, array_contains, array_position, slice, sequence, flatten, arrays_zip, array_repeat, array_distinct, array_union, array_intersect, array_except, arrays_overlap, array_min, array_max, array_join, array_remove, cardinality, reverse, element_at

Map functions: map_keys, map_values, map_from_arrays, map_concat, str_to_map, map_filter, transform_keys, transform_values, map_zip_with, map_from_entries, element_at

Higher-order functions: transform, filter, exists, forall, zip_with

Aggregate functions: collect_list, collect_set, first, last, approx_count_distinct, percentile, percentile_approx, corr, covar_pop, covar_samp

String functions: split, regexp_replace, regexp_extract, concat_ws

Date functions: date_format, to_date, current_date, current_timestamp

Conditional functions: CASE WHEN, NULLIF, NVL, IFNULL, COALESCE, TRY_CAST, TRY_ELEMENT_AT

Other functions: typeof, named_struct, monotonically_increasing_id, current_database, current_schema, current_user, version, raise_error, assert_true, aes_encrypt, aes_decrypt

Unsupported Functions

The following Spark SQL functions are not supported in Lakehouse and must be replaced during migration:

Spark functionAlternativeNotes
aggregate(arr, init, merge)Use subquery or UDFArray aggregation
reduce(arr, init, merge)Same as aboveArray reduction
nanvl(x, y)CASE WHEN isnan(x) THEN y ELSE x ENDNaN handling
bin(n)conv(n, 10, 2)Binary conversion
hash(x)murmurhash3_32(x) or sha2(x, 256)Hash function
xxhash64(x)murmurhash3_32(x)64-bit hash
shuffle(arr)Not supportedRandom array shuffle
array_sort(arr, comparator)Single-argument version onlyCustom sort
soundex(str)Not supportedPhonetic encoding
levenshtein(s1, s2)Not supportedEdit distance
overlay(str, replace, pos)Not supportedString replacement
sentences(str)Not supportedSentence splitting
session_window(ts, gap)Manually compute using LAG/LEADSession window
window(ts, interval)Not supportedTime window
width_bucket(v, min, max, n)Not supportedBucketing function
histogram_numeric(col, n)Not supportedNumeric histogram
kurtosis(col)Not supportedKurtosis
skewness(col)Not supportedSkewness
reflect(class, method, args)Not supportedJava reflection
java_method(class, method, args)Not supportedJava method call
xpath_string(xml, xpath)Not supportedXML parsing
input_file_name()Not supportedFile name
input_file_block_start()Not supportedFile block start
spark_partition_id()Not supportedSpark partition ID
entries_to_map(keys, values)map_from_arrays(keys, values)Map construction
length(array)size(array) or cardinality(array)Array length

UDF Migration

Spark supports multiple UDF types. Lakehouse provides two corresponding mechanisms: SQL Function (pure SQL logic) and External Function (Python/Java code).

Migration Comparison

Spark UDF typeTypical syntaxLakehouse equivalentMigration complexity
SQL UDF (pure expression)CREATE FUNCTION f(x INT) RETURNS INT RETURN x * 2SQL FunctionLow; syntax mostly identical
Python UDF (scalar)spark.udf.register("f", lambda x: x*2, IntegerType())External Function (Python, UDF only)Medium; must deploy to cloud function service
Java UDFImplement UDF1<T,R> etc. or extend Hive GenericUDF/UDFExternal Function (Java)Medium; must package and deploy; Hive API code can be partially reused
Java UDAFExtend Aggregator or UserDefinedAggregateFunctionExternal Function (Java UDAF)High; aggregation logic must be rewritten
Java UDTFExtend GenericUDTFExternal Function (Java UDTF)High; table function logic must be rewritten
Python UDAF / Python UDTFpandas_udf aggregation, yield multiple rowsNot supported; must rewrite in Java or change to SQL FunctionHigh

SQL Function (Pure SQL Logic)

Spark SQL UDFs can be directly migrated to Lakehouse SQL Functions — syntax is highly compatible:

-- Spark SQL UDF CREATE FUNCTION multiply(x INT, y INT) RETURNS INT RETURN x * y; -- Lakehouse SQL Function (same syntax) CREATE FUNCTION my_schema.multiply(x INT, y INT) RETURNS INT RETURN x * y;

Key difference from Spark: Schema prefix required

Lakehouse SQL Functions are Schema-level objects; calls require a Schema prefix by default:

-- Error: function not found SELECT multiply(3, 4); -- Correct: add Schema prefix SELECT my_schema.multiply(3, 4);

If the original Spark code calls UDFs without a prefix extensively, enable UDF-first lookup to make prefix-free calls work:

-- After enabling, prefix-free function names are looked up in user UDFs first, then built-in functions SET cz.sql.remote.udf.lookup.policy = udf_first; -- Can then call without prefix SELECT multiply(3, 4);

SQL Functions also support table functions (returning multiple rows), corresponding to Spark UDTF scenarios expressible in pure SQL:

-- SQL Function returning multiple rows (table function) CREATE FUNCTION my_schema.get_employees(dept INT) RETURNS TABLE(name STRING) RETURN SELECT name FROM employee WHERE deptno = dept; SELECT * FROM my_schema.get_employees(10);

External Function (Python/Java Code)

Spark's Python UDFs and Scala/Java UDFs must be migrated to Lakehouse External Functions. External Functions deploy function logic to a cloud function service (Alibaba Cloud FC, Tencent Cloud SCF, or AWS Lambda); Lakehouse calls them via HTTP.

Supported scope:

  • Python 3.10: UDF (scalar functions) only
  • Java 8: UDF, UDAF (aggregate functions), UDTF (table functions)

Migration steps overview:

  1. Rewrite UDF code as a cloud function Handler (Python 3.10 or Java 8 Hive-style UDF)
  2. Package and upload to object storage or a Lakehouse Volume
  3. Create an API Connection in Lakehouse (stores cloud function service authentication)
  4. Create an External Function and bind the Connection

-- Step 4: Create External Function (Java UDF example) CREATE EXTERNAL FUNCTION my_schema.my_upper AS 'com.example.GenericUdfUpper' USING ARCHIVE 'volume://fc_volume/udfs/my_upper.zip' CONNECTION my_fc_conn WITH PROPERTIES ('remote.udf.api' = 'java8.hive2.v0'); -- Call the same way as a regular function (Schema prefix required) SELECT my_schema.my_upper(name) FROM users;

-- Python UDF example CREATE EXTERNAL FUNCTION my_schema.clean_phone AS 'handler.clean_phone' USING FILE 'volume:user://~/clean_phone.zip' CONNECTION my_fc_conn WITH PROPERTIES ('remote.udf.api' = 'python3.mc.v0'); SELECT my_schema.clean_phone(phone_number) FROM users;

Session-Level UDF Registration

Spark supports dynamically registering UDFs on the Driver (spark.udf.register), making them immediately available in the current session. Lakehouse has no equivalent session-level registration mechanism — all functions are persistent Schema objects.

During migration, change spark.udf.register registration logic to CREATE OR REPLACE FUNCTION, executed once in the deployment script or initialization phase.

Configuration Parameters

Spark's query optimization parameters are automatically managed by Lakehouse — no manual configuration needed:

Spark configurationLakehouse behavior
spark.sql.adaptive.enabledAdaptive query optimization enabled by default
spark.sql.shuffle.partitionsParallelism automatically managed
spark.sql.broadcastTimeoutBroadcast Join automatically handled
spark.sql.files.maxPartitionBytesFile splitting automatically optimized

DataFrame Write Limitations

When writing to Lakehouse via Spark Connector, note the following limitations:

  • Must write all columns; partial column writes are not supported
  • Writing to tables with primary keys (PK tables) is not supported
  • Write mode supports append only; overwrite for single partition is not supported

Migration Checklist

  • Confirm TimestampType maps to TIMESTAMP_LTZ; TimestampNTZType maps to TIMESTAMP_NTZ
  • CTAS: PARTITIONED BY not supported; split into CREATE TABLE + INSERT
  • INSERT OVERWRITE: Lakehouse defaults to dynamic partition overwrite; if original code relies on Spark 2.x static mode (full table overwrite), change to TRUNCATE TABLE + INSERT INTO
  • TEMP VIEW: Not supported; change to persistent VIEW or inline CTE
  • posexplode with LATERAL VIEW ... AS pos, val syntax not supported; change to table function syntax FROM t, posexplode(col) AS tmp(pos, val)
  • Function compatibility: Check for unsupported functions (e.g., aggregate, reduce, hash, shuffle, session_window, window, levenshtein, overlay, etc.) — see Function Compatibility section
  • DataFrame writes: Confirm full-column writes; do not use PK tables
  • Dynamic partitions: Confirm single-task partition count does not exceed 2048
  • SQL UDF: Calls require Schema prefix, or configure cz.sql.remote.udf.lookup.policy = udf_first for prefix-free compatibility
  • Python/Scala/Java UDF: Migrate to External Function; deploy to cloud function service and create API Connection
  • spark.udf.register dynamic registration: Change to CREATE OR REPLACE FUNCTION, executed once in deployment script