September 26, 2024 Lakehouse Platform Release Notes

This release (Release 2024.09.26) introduces new features, enhancements, and fixes. The updates will be rolled out gradually to the following regions and will be completed within one to two weeks of the release date, depending on your region.

Alibaba Cloud Shanghai Region
Tencent Cloud Shanghai Region
Tencent Cloud Beijing Region
Tencent Cloud Guangzhou Region
Amazon Beijing Region
International Site - Alibaba Cloud - Singapore Region
International Site - AWS - Singapore Region

New Features

[Preview] Support for Clone Function

The Lakehouse CREATE CLONE feature allows you to create a copy of an existing object without actually copying the data. This zero-copy cloning uses metadata to reference the original data, so clones can be created quickly without consuming additional storage space. Cloning is supported for both regular tables and dynamic tables.
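
The idea behind zero-copy cloning can be illustrated with a toy model. This is a sketch of the general technique only, not Lakehouse's implementation: the Table class, its files list, and the clone and append methods are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Toy table: its metadata is just a list of immutable data file names."""
    name: str
    files: list = field(default_factory=list)

    def clone(self, new_name):
        # Copy only the metadata (the list of file references).
        # No data file is read or rewritten.
        return Table(new_name, list(self.files))

    def append(self, file_name):
        # New writes add new files; existing files are never modified.
        self.files.append(file_name)


source = Table("orders", ["part-0001", "part-0002"])
copy = source.clone("orders_clone")

# Right after cloning, both tables reference the same underlying files.
shared = set(source.files) & set(copy.files)

# A later write to the clone does not affect the source.
copy.append("part-0003")
```

In this model the clone costs only a metadata copy, and the two tables diverge only as new files are written to either one.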

Federated Query Updates

[Preview] External Tables Now Support Reading Hudi Format

This release supports creating and using external tables in Hudi format. When creating an external table, you can use the CONNECTION object to define the service connection information for the external table.

[Preview] Hive Federation Supports Writing

Support for Writing: The Hive federation feature now supports write operations, allowing users to write data to different data storage systems. With Hive federation, users can access and write to multiple data sources using a unified SQL interface, simplifying data management and operations.

External Schema Supports Using Connection

Previously, the connection parameters for an External Schema had to be specified in PROPERTIES. With this enhancement, they can be specified in a Connection instead. With a Connection, users do not need to expose authentication information in plain text, which protects the credentials. Once created, a Connection can be used by multiple downstream objects, making the configuration easier to reuse and manage.

Import and Export Updates

COPY INTO <location> Function Enhancement

Support for the PURGE=TRUE parameter: When PURGE=TRUE is set, files that have been loaded successfully are deleted as part of the loading process.
Support for the OVERWRITE parameter: With OVERWRITE, existing table data is cleared before the new data is imported. OVERWRITE is atomic: the existing data is cleared and the new data written only if the import succeeds.
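
The atomic behavior of OVERWRITE can be pictured with a small sketch. It models the semantics described above on an in-memory list and is not how the engine is implemented; overwrite_load and the two loader functions are hypothetical names used only for illustration.

```python
def overwrite_load(table_rows, load_fn):
    """Return the new table contents, or the old ones if loading fails.

    table_rows: current rows of the (toy) table.
    load_fn: hypothetical callable that reads the source files and
             returns the new rows, raising an exception on failure.
    """
    try:
        staged = list(load_fn())   # build the new data off to the side
    except Exception:
        return table_rows          # import failed: existing data is kept
    return staged                  # import succeeded: new data replaces old


def good_load():
    return [("a", 1), ("b", 2)]


def bad_load():
    raise IOError("source file unreadable")


existing = [("old", 0)]
after_success = overwrite_load(existing, good_load)
after_failure = overwrite_load(existing, bad_load)
```

A failed import leaves the table exactly as it was, which is the property the OVERWRITE description promises.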

Pipe Function Enhancement

New Mode: Pipe can detect new files and synchronize them even without object storage message service notifications.
Import History: Pipes that read from object storage also support the load_history function, which records the files that were imported successfully.
Pipe Filtering: Users can now filter Pipe job history by query tag.
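
One way to detect new files without notifications is to list the storage location periodically and compare the listing against the import history. The sketch below assumes that approach; the release notes do not say how Pipe does it internally, and find_new_files, run_scan, and load_file are hypothetical names.

```python
def find_new_files(listed_files, load_history):
    """Return files present in storage that have not been loaded yet."""
    return sorted(f for f in listed_files if f not in load_history)


def run_scan(listed_files, load_history, load_file):
    """One polling pass: load each new file and record the successes."""
    for name in find_new_files(listed_files, load_history):
        if load_file(name):            # hypothetical loader, True on success
            load_history.add(name)     # only successful imports are recorded
    return load_history


history = set()

# First pass: a.csv loads, b.csv fails and is therefore not recorded.
run_scan(["a.csv", "b.csv"], history, lambda name: name != "b.csv")

# Next pass sees b.csv again (it is retried) plus the newly arrived c.csv.
pending = find_new_files(["a.csv", "b.csv", "c.csv"], history)
```

Because only successful imports enter the history, a file that failed to load is picked up again on the next pass.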

Virtual Compute Clusters

GP Type Dynamic Scaling Capability: GP virtual clusters (VCs) can now scale dynamically when tasks are waiting or queuing for resources. This addresses the limitation of not being able to scale vertically (increase the CPU and memory of a single cluster instance). Dynamic scaling helps GP VCs handle large load fluctuations, so peak demand can be met without the inefficiency of purchasing VC resources sized for peak load.
VC SIZE Setting: The VC size can now be set as a CRU value. CRU is a unit of measurement for computing resources: 1 CRU is equivalent to the computing power provided by 8 cores of a cloud vendor's server resources running for 1 hour.
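
As a worked example of the CRU definition above, the following sketch converts cores and running time into CRU. It applies only the stated ratio (8 cores for 1 hour = 1 CRU); the function name is hypothetical, and actual metering or billing rules may differ.

```python
CORES_PER_CRU = 8  # from the definition above: 1 CRU = 8 cores for 1 hour


def cru_consumed(cores, hours):
    """CRU consumed by `cores` cores running for `hours` hours."""
    return cores / CORES_PER_CRU * hours


one_hour = cru_consumed(8, 1)        # 8 cores for 1 hour
half_hour = cru_consumed(16, 0.5)    # 16 cores for 30 minutes
three_hours = cru_consumed(32, 3)    # 32 cores for 3 hours
```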

Incremental Computing Updates

TABLE_CHANGES Function Enhancement

Default Behavior: The TABLE_CHANGES function returns changes caused by transactions committed after the specified offset and before the current time, but does not return specific operation details. For example, if an insert followed by a delete occurs between two points, the default result is empty.
Enhanced Functionality: By adding the map('TABLE_STREAM_MODE', 'ORIGINAL') parameter, the TABLE_CHANGES function will return all change details between two points. For example, both the insert and delete operations in the above case will be returned.
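
The difference between the two modes can be sketched with a toy change log. This models only the insert-then-delete case from the example above and is not the engine's algorithm; the table_changes function, its mode argument, and the tuple layout are assumptions made for illustration.

```python
def table_changes(log, mode="DEFAULT"):
    """Toy model of the two modes described above.

    log: ordered list of (operation, row_id) tuples, where operation
         is "INSERT" or "DELETE".
    Only the cancellation of an insert by a later delete is modeled.
    """
    if mode == "ORIGINAL":
        return list(log)               # every individual change
    net = {}
    for op, row_id in log:
        if op == "DELETE" and net.get(row_id) == "INSERT":
            del net[row_id]            # insert followed by delete cancels out
        else:
            net[row_id] = op
    return [(op, row_id) for row_id, op in net.items()]


log = [("INSERT", 1), ("DELETE", 1)]
default_result = table_changes(log)
original_result = table_changes(log, mode="ORIGINAL")
mixed_result = table_changes([("INSERT", 1), ("INSERT", 2), ("DELETE", 2)])
```

In the default mode the row that was inserted and then deleted leaves no trace, while the ORIGINAL mode returns both operations.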

SQL Query Optimization

Small File Optimization

Automatic Small File Merging: Running set cz.sql.compaction.after.commit=true; before write operations (such as INSERT, UPDATE, DELETE, or COPY statements) automatically triggers small file merging. This improves the efficiency of subsequent queries by avoiding reads of a large number of small files. Note that enabling this setting may increase the execution time of INSERT, UPDATE, DELETE, and COPY statements.

Preload Cache

AP Type Compute Clusters (ANALYTICS PURPOSE VIRTUAL CLUSTER) now support viewing the cache status.

SQL Capability Updates

Support for Specifying Default Values and Generated Columns When Creating Tables

When creating internal tables, you can now specify default values for columns or define generated columns to fill in and process data automatically. This makes data management more efficient, especially for partitioned tables. Real-time write interfaces do not currently support this feature: if a real-time write does not specify a value for such a column, the value is NULL and is not filled in automatically.
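
The behavior described above can be sketched as follows. This is a toy model, not Lakehouse syntax or internals: the status and event_date columns, the DEFAULTS and GENERATED dictionaries, and the fill_row function are all hypothetical.

```python
from datetime import datetime

# Hypothetical column definitions, for illustration only.
DEFAULTS = {"status": "new"}
GENERATED = {"event_date": lambda row: row["event_time"].date().isoformat()}


def fill_row(row, real_time=False):
    """Return a complete row for a toy table with the columns above."""
    filled = dict(row)
    for column, value in DEFAULTS.items():
        if column not in filled:
            # Real-time writes leave unspecified columns as NULL (None).
            filled[column] = None if real_time else value
    for column, expr in GENERATED.items():
        if real_time:
            filled.setdefault(column, None)   # not filled automatically
        else:
            filled[column] = expr(filled)     # computed from other columns
    return filled


row = {"event_time": datetime(2024, 9, 26, 10, 30)}
sql_write = fill_row(row)
real_time_write = fill_row(row, real_time=True)
```

A generated date column like the hypothetical event_date is the kind of value that is convenient to use as a partition key.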

Support for Vector Indexing

Vector indexes can now be built using the HNSW (Hierarchical Navigable Small World) algorithm to accelerate vector retrieval.
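
To show what a vector index is meant to speed up, the sketch below performs an exact nearest-neighbor search by scanning every stored vector. This is the brute-force baseline, not HNSW itself; HNSW is generally described as an approximate, graph-based method that avoids the full scan. The function names and the choice of Euclidean distance are assumptions made for this example.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, vectors, k=1):
    """Exact k-nearest-neighbor search by scanning every stored vector."""
    ranked = sorted(vectors.items(), key=lambda item: l2_distance(query, item[1]))
    return [name for name, _ in ranked[:k]]


stored = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
top2 = nearest([0.9, 1.2], stored, k=2)
```

The scan touches every vector on every query, which is the cost that grows with table size and that an index is intended to reduce.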

Data Types

Array type: Arrays can be defined using a constant format such as [1, 2, 3]. Values written in this format are recognized as array types.

Vector type: Lakehouse provides the VECTOR type to store data that has been transformed into vectors. Building indexes on vector columns can improve vector search performance.

Function Support

Function Name	Description
median	Calculates the median of the group's values.
group_concat	Concatenates the values in the group into a string.
multiif	Used to write CASE operators more compactly in queries.
ceiling	Returns the smallest integer greater than or equal to the specified number. Equivalent to ceil.
log	Calculates the logarithm of the specified number.
current_timezone	Returns the current time zone.
power	Calculates the power of the specified number.
if	Now supports a two-parameter form. If the third parameter is omitted, it defaults to NULL.
nvl2	If the first parameter is not NULL, returns the second parameter; otherwise, returns the third parameter.
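
The semantics of a few of these functions, as described in the table, can be modeled in a short sketch. These are illustrative Python equivalents with hypothetical names (sql_if, nvl2, ceiling), not the engine's implementation, and they represent NULL as None.

```python
import math


def sql_if(condition, then_value, else_value=None):
    """if with an optional third argument that defaults to NULL (None)."""
    return then_value if condition else else_value


def nvl2(value, if_not_null, if_null):
    """Second argument when value is not NULL, otherwise the third."""
    return if_not_null if value is not None else if_null


def ceiling(x):
    """Smallest integer greater than or equal to x."""
    return math.ceil(x)


two_arg_true = sql_if(True, "yes")
two_arg_false = sql_if(False, "yes")
zero_is_not_null = nvl2(0, "set", "missing")
null_case = nvl2(None, "set", "missing")
```

Note that nvl2 tests for NULL rather than for truthiness, so a value such as 0 still selects the second argument.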

Behavior Changes

Data Types

varchar data type change: The varchar length limit has increased from 65535 to 1048576.

SDK Interface

In versions after JDBC 2.0.0, the local COPY command has been deprecated. We recommend using the PUT method to upload data to the volume and then using the server-side COPY command to import it.