2025-03-03: Lakehouse Platform 1.0 Product Update Release Notes

In this release, we have introduced a series of new features, enhancements, and fixes. These updates will be rolled out in phases to the following regions and are expected to be completed within one to two weeks from the release date, depending on your specific region.

  • Alibaba Cloud Shanghai Region
  • Tencent Cloud Shanghai Region
  • Tencent Cloud Beijing Region
  • Tencent Cloud Guangzhou Region
  • Amazon Beijing Region
  • Alibaba Cloud Singapore Region (International Site)
  • AWS Singapore Region (International Site)

New Features and Enhancements

Federated Query Update [Preview Release]

  • Architecture Extension: Cloud Lakehouse now supports mapping and mirroring external data sources at the Catalog level through External Schema, enabling federated queries against Hive. (Previously, only the Hive-on-object-storage architecture was supported; this release adds support for Hive on HDFS.) For details, see EXTERNAL SCHEMA.
  • Enhanced Delta Lake Format Reading: When creating external tables in the Delta Lake format, schema inference is automatic: the system parses the table metadata, so you no longer need to declare field information manually. For details, see Delta Lake External Tables.
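
As a rough illustration of the External Schema mapping described above (the schema, connection, and option names below are hypothetical; see EXTERNAL SCHEMA for the actual syntax):

```sql
-- Sketch: map a Hive Metastore (HDFS-backed) source as an external schema.
-- `hive_conn` and the OPTIONS keys are placeholders, not confirmed syntax.
CREATE EXTERNAL SCHEMA my_catalog.hive_mirror
  WITH CONNECTION hive_conn
  OPTIONS ('schema' = 'default');

-- Tables in the mapped schema can then be queried in place:
SELECT * FROM my_catalog.hive_mirror.orders LIMIT 10;
```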

Import and Export Updates

COPY Command Enhancement

The COPY command now supports two-character CSV delimiters (e.g., ||), removing the previous single-character limitation and improving compatibility with complex data. For details, see COPY INTO Table.
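
A minimal sketch of loading with a two-character delimiter (the table name, path, and option spelling are illustrative; see COPY INTO Table for the exact option list):

```sql
-- Sketch: load a CSV whose fields are separated by "||".
-- Table and bucket names are placeholders.
COPY INTO my_table
FROM 'oss://my_bucket/data/'
FILE_FORMAT = (TYPE = csv, FIELD_DELIMITER = '||');
```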

Pipe Function Enhancement

  1. Direct import from Kafka via Table Stream is now supported. Table Stream provides Exactly-Once semantics, and connection information can be stored in a Connection.

  2. Pipe SQL Command Optimization

    • Obtain the DDL statement of a Pipe using SHOW CREATE PIPE pipe_name
    • Optimized DESC PIPE output. DESC PIPE pipe_name now displays result fields, including the input and output object names of the Pipe task, and adds Kafka consumption information.
    • Modify Pipe import parameters using ALTER commands, such as changing the compute cluster.
  3. Pipe can import data from object storage. Files or directories whose names start with . or _temporary can be filtered out using the parameter IGNORE_TMP_FILE=TRUE|FALSE. Examples of filtered paths:

    s3://my_bucket/a/b/.SUCCESS
    oss://my_bucket/a/b/_temporary
    oss://my_bucket/a/b/_temporary_123/
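
The Pipe commands above can be sketched together as follows. SHOW CREATE PIPE and DESC PIPE are as named in this release; the ALTER parameter name and the CREATE PIPE shape are illustrative placeholders:

```sql
-- Inspect and adjust an existing pipe (pipe/cluster names are placeholders).
SHOW CREATE PIPE my_pipe;   -- returns the pipe's DDL statement
DESC PIPE my_pipe;          -- input/output object names, Kafka consumption info

-- Modify import parameters, e.g. switch the compute cluster
-- (parameter name is illustrative):
ALTER PIPE my_pipe SET VIRTUAL_CLUSTER = 'ingest_vc';

-- Sketch: import from object storage while skipping files or directories
-- starting with "." or "_temporary"; exact CREATE PIPE syntax may differ.
CREATE PIPE my_pipe
  VIRTUAL_CLUSTER = 'ingest_vc'
  IGNORE_TMP_FILE = TRUE
  AS COPY INTO my_table FROM 'oss://my_bucket/a/b/';
```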

Compute Clusters

Fine-Grained Resource Control

Added the GP Compute Cluster single-job resource ratio configuration, which caps the maximum resource usage of a single job (for example, at 10% of the cluster). This prevents large queries from preempting resources and improves cluster stability:

ALTER VCLUSTER sample_vc SET QUERY_RESOURCE_LIMIT_RATIO='0.1';

SQL Syntax

  1. Support for creating SQL FUNCTIONS. This feature enables users to define custom SQL functions using SQL DDL statements, enhancing flexibility in data processing and analysis.
  2. Lakehouse officially launched the Column-level Security feature, supporting fine-grained control of sensitive data through Dynamic Data Masking. Administrators can dynamically mask, partially display, or replace sensitive information in columns (e.g., ID numbers, credit card numbers) based on user roles or attributes, effectively protecting data privacy. Users can achieve dynamic data masking using the following SQL statement:
ALTER TABLE <table_name> MODIFY COLUMN <column_name> SET MASKING POLICY <policy_name>;

The masking policy is applied to every occurrence of the column during query execution, based on the policy conditions and the roles or users in the SQL execution context.
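
A sketch of the two features above, a SQL-defined function and a column masking policy (the function body syntax, table, column, and policy names are hypothetical; only the ALTER TABLE ... SET MASKING POLICY form is from this release):

```sql
-- Sketch: a scalar function defined in SQL DDL (exact syntax may differ).
CREATE FUNCTION full_name(first STRING, last STRING)
  RETURNS STRING
  AS $$ concat(first, ' ', last) $$;

SELECT full_name('Ada', 'Lovelace');

-- Attach a masking policy to a sensitive column
-- (`users`, `id_number`, and `id_mask` are placeholders):
ALTER TABLE users MODIFY COLUMN id_number SET MASKING POLICY id_mask;
```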

  3. Lakehouse Data Storage Optimization: users can merge small files using advanced parameters of the OPTIMIZE command.
  4. DML Enhancement: UPDATE supports ORDER BY ... LIMIT syntax.
  5. [Preview Release] Multi-Dialect Compatibility: partial syntax support for the PostgreSQL/MySQL/Hive/Presto dialects is enabled via SQLGlot integration.
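
The storage optimization and UPDATE enhancements above can be sketched as follows (table and column names are placeholders; OPTIMIZE's advanced parameters are omitted since their names are documented separately):

```sql
-- Sketch: merge small files for a table via the OPTIMIZE command.
OPTIMIZE my_table;

-- UPDATE with ORDER BY ... LIMIT: update only the N oldest matching rows.
UPDATE events
SET processed = TRUE
WHERE processed = FALSE
ORDER BY event_time ASC
LIMIT 1000;
```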

UDF Features

Added the cz.sql.remote.udf.lookup.policy configuration parameter, which supports dynamic switching of UDF and built-in function resolution priority.

-- Strategy 1: Prioritize built-in functions (compatible with traditional OLAP system behavior)
SET cz.sql.remote.udf.lookup.policy = builtin_first;


-- Strategy 2: Prioritize UDFs (suitable for MC/Spark job scenarios)
SET cz.sql.remote.udf.lookup.policy = udf_first;


-- Default Strategy: Require UDFs to have a schema prefix (maintain historical compatibility)
SET cz.sql.remote.udf.lookup.policy = schema_only;

Permission Management [Preview Release]

Added Instance-level Roles and Cross-Workspace Authorization Capabilities: Supports creating roles at the instance level and granting global permissions, enabling unified permission management across workspaces and meeting the needs of fine-grained access control in multi-team collaboration scenarios.
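
For illustration, an instance-level role granted permissions across two workspaces (the role, workspace, and user names, and the exact GRANT syntax, are hypothetical):

```sql
-- Sketch: create a role at the instance level and grant it
-- permissions spanning multiple workspaces.
CREATE ROLE data_auditor;
GRANT SELECT ON ALL TABLES IN WORKSPACE ws_finance TO ROLE data_auditor;
GRANT SELECT ON ALL TABLES IN WORKSPACE ws_sales   TO ROLE data_auditor;
GRANT ROLE data_auditor TO USER alice;
```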

Functions

List of New Functions:

  • collect_list_on_array: Collects elements from an input array into a new array and returns the new array.
  • collect_set_on_array: Extracts unique elements from an input array expression and forms a new array.
  • str_to_date_mysql: Converts a string to a date; compatible with the str_to_date function in MySQL.
  • make_date: Constructs a date type from year, month, and day.
  • to_start_of_interval: Truncates timestamp ts according to interval. Note that when interval is in minutes, it must divide evenly into 1 day.
  • json_remove: Removes elements from jsonObject that match the jsonPath and returns the remaining elements.
  • element_at: Extracts elements at specified positions or keys from arrays or maps.
  • map_from_arrays: Creates a map from two arrays, with keys and values in the map corresponding to the order of elements in the parameter arrays.
  • endswith: Determines whether a string or binary expression ends with another specified string or binary expression. Returns TRUE if matched, otherwise FALSE. Supports string and binary data.
  • format_string: Formats a string based on a printf-style format string.
  • is_ascii: Checks if str contains only ASCII-encoded characters.
  • is_utf8: Checks if str contains only UTF-8 encoded characters.
  • regexp_extract_all: Extracts all substrings from a string that match a regular expression.
  • sha1: Calculates the SHA1 hash value of a given string.
  • startswith: Checks if a string starts with another specified string. Returns TRUE if matched, otherwise FALSE. Supports string and binary data.
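
A few of the new functions in use (results follow the descriptions above; the array() constructor and 1-based element_at indexing are assumptions, matching Spark-style semantics):

```sql
SELECT
  startswith('lakehouse', 'lake'),    -- TRUE
  endswith('report.csv', '.csv'),     -- TRUE
  make_date(2025, 3, 3),              -- the date 2025-03-03
  element_at(array(10, 20, 30), 2),   -- 20, assuming 1-based indexing
  regexp_extract_all('a1b2c3', '[0-9]');  -- all digit matches
```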

SDK

JDBC

Internal Endpoint Optimization: added the use_oss_internal_endpoint=true URL parameter. If the service you use supports querying via the Alibaba Cloud OSS internal endpoint, this parameter enforces use of that internal endpoint. For details, see JDBC Driver.
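
For illustration, the parameter is appended to the JDBC URL's query string (the overall URL shape here is a sketch; only the use_oss_internal_endpoint parameter is from this release; see JDBC Driver for the exact format):

```
jdbc:clickzetta://<instance>.<endpoint>/<workspace>?use_oss_internal_endpoint=true
```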

Java SDK

The real-time write interface now fully supports vector data types, mapped to array types in the Java client, meeting vector-retrieval needs in AI scenarios. Requires a clickzetta-java version later than 2.0.0. For details, see Java SDK.

Python SDK

  1. Real-time write support: Provides the clickzetta-ingestion-python-v2 module (pip install clickzetta-ingestion-python-v2), enabling real-time data write to Lakehouse storage.
  2. Asynchronous submission support: The clickzetta-connector-python module supports asynchronous SQL query execution using the execute_async() method, suitable for long-running queries.
  3. Parameter binding support: The clickzetta-connector-python module supports qmark and pyformat-style parameter binding using the execute() method for more flexible queries.

For details, see Python SDK Reference.

Bug Fixes

  • Fixed incorrect filesize values in the results of the SQL command SHOW PARTITION EXTENDED.
  • Optimized compatibility for generated columns: resolved validation errors that occurred in earlier versions when Bulkload wrote to generated columns.
  • Fixed the issue where the quote parameter did not take effect when exporting data with the COPY command in the specified CSV file format.
  • Fixed the issue where the Schema specified in the Options of an External Schema did not take effect in federated queries.
  • Resolved the issue where regular-expression matching did not work in Volume queries.

Behavioral Changes

  • Default data retention period adjustment: The default value of data_retention_days has been changed from 7 days to 1 day.
  • To enhance development flexibility and data management efficiency, Lakehouse has introduced SQL write support for primary key tables in this version. You can now directly manipulate tables with primary keys using standard SQL statements (INSERT/UPDATE/DELETE).