September 26, 2024 Lakehouse Platform Release Notes
This release (Release 2024.09.26) introduces new features, enhancements, and fixes. The updates will be rolled out gradually to the following regions and will be completed within one to two weeks of the release date, depending on your region.
- Alibaba Cloud Shanghai Region
- Tencent Cloud Shanghai Region
- Tencent Cloud Beijing Region
- Tencent Cloud Guangzhou Region
- Amazon Beijing Region
- International Site - Alibaba Cloud - Singapore Region
- International Site - AWS - Singapore Region
New Features
[Preview] Support for Clone Function
The Lakehouse CREATE CLONE feature allows you to create a copy of an existing object without actually copying the data. This zero-copy cloning uses metadata to reference the original data, so clones are created quickly and consume no additional storage space. Both regular tables and dynamic tables can be cloned.
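The idea of zero-copy cloning can be sketched with a toy model in Python. This is not the Lakehouse implementation: the `Table` class, the `clone` helper, and the file names are hypothetical, and the sketch assumes data files are immutable and that later writes add new files, which the release notes do not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Toy table: a name plus a list of references to immutable data files."""
    name: str
    files: list = field(default_factory=list)


def clone(source: Table, new_name: str) -> Table:
    # Copy only the metadata (the list of file references).
    # No data file is duplicated, so the clone is cheap to create.
    return Table(name=new_name, files=list(source.files))


orders = Table("orders", ["part-0001.parquet", "part-0002.parquet"])
orders_clone = clone(orders, "orders_clone")

# A later write to the clone adds a file reference on the clone only.
orders_clone.files.append("part-0003.parquet")

shared = set(orders.files) & set(orders_clone.files)
print(sorted(shared))      # files still referenced by both tables
print(len(orders.files))   # the source table's metadata is unchanged
```

In this model the clone costs one small metadata copy regardless of how much data the source table holds.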
Federated Query Updates
[Preview] External Tables Now Support Reading Hudi Format
This release supports creating and using external tables in Hudi format. When creating an external table, you can use the CONNECTION object to define the service connection information for the external table.
[Preview] Hive Federation Supports Writing
The Hive federation feature now supports write operations, allowing users to write data to different data storage systems. With Hive federation, users can read from and write to multiple data sources through a unified SQL interface, simplifying data management and operations.
External Schema Supports Using Connection
Previously, the connection parameters for an External Schema had to be specified in PROPERTIES. With this enhancement, they can be specified in a Connection instead. Using a Connection means authentication information no longer has to be exposed in plain text, which improves data security. Once a Connection is created, it can be used by multiple downstream objects, making configuration easier to reuse and manage.
Import and Export Updates
COPY INTO <location> Function Enhancement
- Support for PURGE=TRUE parameter: When PURGE=TRUE is set, successfully loaded files are deleted during the loading process.
- Support for OVERWRITE parameter: When the OVERWRITE parameter is used, existing table data is cleared before new data is imported. OVERWRITE is atomic, meaning the table data is cleared and the new data written only if the import succeeds.
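The atomicity described for OVERWRITE can be illustrated with a minimal Python sketch. The function `copy_overwrite` and the two sample sources are hypothetical and only model the behavior (stage first, swap only on success); they say nothing about how the Lakehouse implements it.

```python
def copy_overwrite(table: list, load_rows) -> bool:
    """Replace the contents of `table` with the rows produced by `load_rows`.

    New rows are staged first; the table is swapped only if loading succeeds.
    """
    try:
        staged = list(load_rows())   # may raise part-way through
    except Exception:
        return False                 # existing table data is left untouched
    table[:] = staged                # clear and write in a single step
    return True


def good_source():
    yield from [("a", 1), ("b", 2)]


def bad_source():
    yield ("c", 3)
    raise IOError("simulated read failure")


table = [("old", 0)]

failed = copy_overwrite(table, bad_source)
after_failure = list(table)          # still the old data

succeeded = copy_overwrite(table, good_source)
print(failed, after_failure, succeeded, table)
```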
Pipe Function Enhancement
- New Mode: Pipe can detect new files and synchronize them even without object storage message service notifications.
- Import History: The Pipe's ability to read object storage also supports the load_history function, which records successfully imported files.
- Pipe Filtering: A new Pipe filtering feature allows users to filter Pipe job history using query tag.
Virtual Compute Clusters
- GP Type Dynamic Scaling Capability: GP VCs now scale dynamically when tasks are waiting or queuing for resources. This compensates for the inability to scale vertically (increase CPU and memory for a single cluster instance). It helps GP VCs handle large load fluctuations and meet peak demand without the inefficiency of purchasing VC resources sized for peak load.
- VC SIZE Setting: Supports setting the VC size as a CRU value. CRU is a unit of measurement for computing resources, with 1 CRU equivalent to the computing power provided by 8 cores of a cloud vendor's server resources running for 1 hour.
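As a worked example of the CRU definition above, the following Python sketch converts cores and running time into CRU. The function name `cru_consumed` is hypothetical, and linear scaling with cores and hours is an assumption drawn from the stated definition rather than a documented billing formula.

```python
CORES_PER_CRU = 8  # from the definition above: 1 CRU = 8 cores running for 1 hour


def cru_consumed(cores: int, hours: float) -> float:
    """CRU consumed by `cores` cores running for `hours` hours (assumed linear)."""
    return cores * hours / CORES_PER_CRU


print(cru_consumed(8, 1))    # 1.0, the definition itself
print(cru_consumed(16, 3))   # 6.0
```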
Incremental Computing Updates
TABLE_CHANGES Function Enhancement
- Default Behavior: The TABLE_CHANGES function returns changes caused by transactions committed after the specified offset and before the current time, but does not return specific operation details. For example, if an insert followed by a delete occurs between two points, the default result is empty.
- Enhanced Functionality: By adding the map('TABLE_STREAM_MODE', 'ORIGINAL') parameter, the TABLE_CHANGES function returns all change details between the two points. In the example above, both the insert and the delete operations are returned.
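The difference between the two modes can be sketched in Python. This models only the semantics described above. The change-log format, the `net_changes` and `table_changes` helpers, and the `"DEFAULT"` mode label are hypothetical and are not the Lakehouse API.

```python
def net_changes(log):
    """Collapse a change log to its net effect per key (models the default mode)."""
    inserted, deleted = {}, {}
    for op, key, value in log:
        if op == "insert":
            inserted[key] = value
        elif op == "delete":
            if key in inserted:
                del inserted[key]      # insert followed by delete cancels out
            else:
                deleted[key] = value   # row existed before the offset
    return ([("insert", k, v) for k, v in inserted.items()]
            + [("delete", k, v) for k, v in deleted.items()])


def table_changes(log, mode="DEFAULT"):
    # 'ORIGINAL' returns every recorded operation; otherwise only the net effect.
    if mode == "ORIGINAL":
        return list(log)
    return net_changes(log)


log = [("insert", 1, "alice"), ("delete", 1, "alice")]
print(table_changes(log))               # net effect of insert + delete
print(table_changes(log, "ORIGINAL"))   # both operations
```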
SQL Query Optimization
Small File Optimization
- Automatic Small File Merging: Running set cz.sql.compaction.after.commit=true; before write operations (such as INSERT, UPDATE, DELETE, COPY statements) automatically triggers small file merging. This improves the efficiency of subsequent queries by avoiding reads of a large number of small files. Note that enabling this setting may increase the execution time of INSERT, UPDATE, DELETE, COPY statements.
Preload Cache
- AP Type Compute Clusters (ANALYTICS PURPOSE VIRTUAL CLUSTER) support viewing cache status
SQL Capability Updates
Support for Specifying Default Values and Generated Columns When Creating Tables
- When creating internal tables, you can now specify default values for columns or define generated columns to automate data filling and processing. This makes data management more efficient, especially for partitioned tables. Real-time write interfaces do not currently support this feature: if a value is not specified through the real-time interface, the column is set to NULL and is not filled automatically.
Support for Vector Indexing
Vector indexes can now be built using the HNSW (Hierarchical Navigable Small World) algorithm to accelerate vector retrieval.
Data Types
Array type: Arrays can be defined using a constant format such as [1, 2, 3]. Text written in this format is recognized as an array type.
Vector type: Lakehouse provides the VECTOR type to store vector data. Building indexes on vector columns can improve vector search performance.
Function Support
Function Name | Description |
---|---|
median | Calculates the median of the group's values. |
group_concat | Concatenates the values in the group into a string. |
multiif | Used to write CASE operators more compactly in queries. |
ceiling | Returns the smallest integer greater than or equal to the specified number. Equivalent to ceil. |
log | Calculates the logarithm of the specified number. |
current_timezone | Returns the current time zone. |
power | Calculates the power of the specified number. |
if | Now supports being called with two parameters. If the third parameter is omitted, it defaults to NULL. |
nvl2 | If the first parameter is not NULL, returns the second parameter; otherwise, returns the third parameter. |
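The conditional functions in the table can be modeled in Python to make their semantics concrete. These are illustrative stand-ins, not the SQL functions themselves: Python's None plays the role of NULL, the names `if_` and `multiif` are local helpers, and the argument order assumed for multiif (condition/value pairs followed by a default) is an assumption that the release notes do not confirm.

```python
import math


def nvl2(expr, if_not_null, if_null):
    """Return the second argument when expr is not NULL, else the third."""
    return if_not_null if expr is not None else if_null


def if_(condition, then_value, else_value=None):
    """Two-parameter form: the omitted third parameter defaults to NULL (None)."""
    return then_value if condition else else_value


def multiif(*args):
    """multiif(cond1, val1, cond2, val2, ..., default); argument order is assumed."""
    *pairs, default = args
    for cond, value in zip(pairs[0::2], pairs[1::2]):
        if cond:
            return value   # first matching branch wins, like CASE WHEN
    return default


score = 72
print(nvl2(None, "has value", "is null"))
print(if_(score > 90, "top"))                       # no third parameter
print(multiif(score >= 90, "A", score >= 70, "B", "C"))
print(math.ceil(2.1))                               # ceiling / ceil
```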
Behavior Changes
Data Types
varchar data type change: The maximum length of varchar has increased from 65535 to 1048576.
SDK Interface
In versions after JDBC 2.0.0, the local COPY command has been deprecated. We recommend using the PUT method to upload data to the volume and then using the server-side COPY command to import it.