May 24, 2024 Lakehouse Platform Release Notes
In this release (2024.05.24), we have introduced new features, enhancements, and fixes. These updates will be rolled out gradually to the following regions over one to two weeks from the release date; exact availability depends on your region.
- Alibaba Cloud Shanghai Region
- Tencent Cloud Shanghai Region
- Alibaba Cloud Singapore Region
- Tencent Cloud Beijing Region
- Amazon Web Services (AWS) Beijing Region
New Features
[Preview] Support for Creating Inverted Indexes to Accelerate Search Analysis
This release introduces support for inverted indexes. An inverted index tokenizes text and stores the mapping between tokens and the records that contain them. When users perform keyword-based text searches, the query engine locates matching records through the index instead of scanning every row, significantly accelerating text search performance.
This release supports creating inverted indexes for string-type fields and provides a set of built-in functions to express text matching conditions and rules.
For more information, please refer to the Inverted Index documentation.
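A minimal sketch of how this might look (syntax is illustrative; the table, index, and column names are hypothetical, and the exact DDL is defined in the Inverted Index documentation):

```sql
-- Hypothetical table holding free-text log messages.
CREATE TABLE app_logs (
    id      BIGINT,
    message STRING
);

-- Create an inverted index on the string column; the index stores the
-- token-to-record mapping used to answer keyword searches.
CREATE INVERTED INDEX idx_logs_message ON TABLE app_logs (message);

-- Express the text condition with one of the new built-in match
-- functions; the engine resolves it through the index rather than
-- scanning every row.
SELECT id, message
FROM app_logs
WHERE MATCH_ANY(message, 'timeout error');
```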
New Internal Volume Object Types for Data Lake Storage Volume
In addition to external Volumes, two internal Volume types have been added: Table Volume and User Volume. Both are pre-defined and created by the system by default. Internal Volume data is stored in the Lakehouse managed storage area, so unstructured data can be managed and used without connecting to external storage services. Internal Volumes simplify the management and use of file data in scenarios such as UDF resource file management, temporary storage for data import/export, and development testing.
For more information, please refer to the Volume documentation.
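For illustration, the pre-created internal volumes might be used like this (the command forms are sketched and may differ from the actual syntax; the file path and table name are hypothetical):

```sql
-- User Volume: per-user managed storage, pre-created by the system.
-- Upload a local UDF resource file, then list the volume's contents.
PUT '/tmp/my_udf.jar' TO USER VOLUME;
LIST USER VOLUME;

-- Table Volume: managed storage attached to a specific table, useful
-- for staging import/export files for that table.
LIST TABLE VOLUME my_table;
```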
[Preview] External Function Supports UDAF and UDTF
External Function previously supported only scalar UDFs. This release extends support to UDAF and UDTF custom functions, which can be developed using the Hive UDF API.
The documentation has been updated to include development examples for UDF, UDAF, and UDTF.
For more information, please refer to the Java UDF Development Guide documentation.
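As a sketch, a UDAF developed against the Hive UDF API might be registered and invoked as follows (the function name, class name, jar location, and registration clauses are hypothetical; see the Java UDF Development Guide for the actual syntax):

```sql
-- Register a UDAF implemented with the Hive UDF API and packaged into
-- a jar that has been uploaded to managed storage.
CREATE EXTERNAL FUNCTION my_collect_set
    AS 'com.example.udaf.CollectSet'
    USING JAR 'volume://user/my_udaf.jar';

-- A UDAF aggregates over groups like any built-in aggregate function.
SELECT dept, my_collect_set(employee_name)
FROM employees
GROUP BY dept;
```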
External Table: Support for Delta Lake External Tables
A new external table object type has been added, allowing direct access to data in external storage without importing it. This release supports creating and using external tables in the Delta Lake format. When creating an external table, you can use a CONNECTION object to define the service connection information for the external storage.
For more information, please refer to the External Table documentation.
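A sketch of the flow described above, assuming hypothetical object names and illustrative syntax (consult the External Table documentation for the exact clauses):

```sql
-- Create a Delta Lake external table, using a CONNECTION object to
-- supply the storage service's connection information.
CREATE EXTERNAL TABLE sales_delta
    USING DELTA
    LOCATION 'oss://my-bucket/delta/sales/'
    CONNECTION my_storage_conn;

-- Query the external data in place; nothing is imported.
SELECT * FROM sales_delta LIMIT 10;
```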
[Preview] PIPE Pipeline Task Supports Real-time Data Import from Kafka
A new PIPE object type has been added, allowing you to create real-time tasks that continuously import data from external streaming data sources into target tables. Because PIPE tasks run directly in the SQL engine, no third-party ETL tool or engine is required; this removes unnecessary intermediate storage and format-conversion overhead, significantly improving real-time import efficiency, increasing throughput, and reducing import costs.
For more information, please refer to the PIPE Pipeline documentation.
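A sketch of a Kafka import pipeline, combining PIPE with the new READ_KAFKA function (the topic, broker, and column names are hypothetical, and the parameter form is illustrative):

```sql
-- Continuously copy messages from a Kafka topic into a target table;
-- the SQL engine runs the task directly, with no external ETL engine.
CREATE PIPE kafka_events_pipe AS
COPY INTO events (event_time, payload)
FROM (
    SELECT CAST(`timestamp` AS TIMESTAMP), CAST(`value` AS STRING)
    FROM READ_KAFKA(
        'bootstrap.servers' = 'broker1:9092',
        'topic'             = 'events'
    )
);
```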
Data Lake Updates
Service Connection Object Supports GCS Object Storage
STORAGE CONNECTION can manage the connection and authentication information of object storage services. In addition to supporting Alibaba Cloud OSS and Tencent Cloud COS, support for Google Cloud GCS object storage service has been added. STORAGE CONNECTION allows for permission control of storage service connections, enabling different tasks and users to reuse defined connection objects.
For more information, please refer to the CONNECTION documentation.
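For example, a GCS connection might be defined once and then reused (the parameter names and credential form are illustrative):

```sql
-- Define connection and authentication information for GCS once;
-- tasks and users with permission can then reuse the object.
CREATE STORAGE CONNECTION my_gcs_conn
    TYPE gcs
    CREDENTIAL '<service-account-key>';

-- Reuse the connection, e.g. for an external volume over a bucket.
CREATE EXTERNAL VOLUME gcs_vol
    LOCATION 'gs://my-bucket/data/'
    USING CONNECTION my_gcs_conn;
```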
External Schema Adds Access to HMS-managed GCS Data
External Schema maps a Schema (or, in a two-tier structure, a database) managed by an external metadata service, enabling access to the data objects under it. This upgrade adds access to HMS-managed data stored in Google Cloud GCS.
For more information, please refer to the External Schema documentation.
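Illustratively (the connection and object names are hypothetical, and the clause form is a sketch):

```sql
-- Map a database managed by an external Hive Metastore whose data
-- files live in Google Cloud GCS.
CREATE EXTERNAL SCHEMA hive_sales
    WITH METASTORE CONNECTION my_hms_conn;

-- Objects under the external schema are then directly queryable.
SELECT * FROM hive_sales.orders LIMIT 10;
```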
Real-time Incremental Computing
Dynamic Table: View Dynamic Table Refresh History
A new SHOW DYNAMIC TABLE REFRESH HISTORY command has been added, allowing you to view the refresh history of dynamic tables. The refresh history shows each refresh job's execution status, runtime duration, refresh type (incremental or full), and the number of records processed per refresh task (including write and delete types). Use it to monitor dynamic table refresh operations (especially periodic ones) and adjust the scheduling cycle or resource size based on these metrics to meet business SLA requirements.
Documentation update: Added Dynamic Table Best Practices introduction.
For more information, please refer to the Dynamic Table documentation.
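The command can be run as shown below (any filtering options are described in the Dynamic Table documentation):

```sql
-- Inspect recent refreshes: status, duration, incremental vs. full,
-- and records written/deleted per refresh task.
SHOW DYNAMIC TABLE REFRESH HISTORY;
```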
[Preview] Dynamic Table: Support for Incremental Processing of Dynamic Tables Using Custom Functions
The definition of dynamic tables supports the use of custom functions created through External Function (including UDF, UDAF, UDTF). When a dynamic table defined with custom functions is refreshed, the system automatically optimizes incremental processing, further expanding the scope of incremental processing for dynamic tables.
This feature is not enabled by default during the preview period and needs to be enabled through specific parameters.
For more information, please refer to the Using UDF in Dynamic Table document.
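A sketch of a dynamic table defined with a custom function (my_udf is a hypothetical External Function; the parameter that enables this preview feature is described in the linked document and not shown here):

```sql
-- A dynamic table whose definition calls a custom function; on
-- refresh, the system optimizes incremental processing automatically.
CREATE DYNAMIC TABLE dt_scored AS
SELECT id, my_udf(payload) AS score
FROM raw_events;
```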
Information Schema Updates
New Volume and Connection Object Views in Information Schema
New views for data lake storage VOLUMES and CONNECTIONS have been added. You can query the corresponding views under INFORMATION_SCHEMA to obtain information about data lake storage Volumes and external service Connections.
For more information, please refer to the INFORMATION_SCHEMA document.
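For example:

```sql
-- Inspect data lake storage volumes and external service connections.
SELECT * FROM INFORMATION_SCHEMA.VOLUMES;
SELECT * FROM INFORMATION_SCHEMA.CONNECTIONS;
```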
SQL Capability Updates
Support for Using SYNONYM as an Alias to Access Existing Objects
By creating a SYNONYM, you can reference an existing data object under an alternative name that wraps its location and name, simplifying access to the object or adding a layer of indirection that enhances access security.
For more information, please refer to the SYNONYM document.
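A sketch (the object names are hypothetical, and the exact clause form is defined in the SYNONYM document):

```sql
-- Wrap a fully qualified object behind a short, stable alias.
CREATE SYNONYM sales FOR my_db.my_schema.sales_fact_2024;

-- Callers use the alias without knowing the object's location.
SELECT COUNT(*) FROM sales;
```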
Support for TIMESTAMP_NTZ Type
In addition to the existing TIMESTAMP_LTZ type, the TIMESTAMP_NTZ type has been added to the time data types. TIMESTAMP_NTZ (timestamp without time zone) stores date and time values without time zone information and is unaffected by time zone changes. Unlike time-zone-aware types such as TIMESTAMP_LTZ, it requires no time zone conversion, greatly simplifying the handling of timestamp data in multi-time-zone environments and cross-system data transfer scenarios.
For more information, please refer to the TIMESTAMP_NTZ Type document.
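For illustration, using the TO_TIMESTAMP_NTZ function added in this release (the table name is hypothetical):

```sql
-- An NTZ column stores the wall-clock value with no time zone, so it
-- reads back identically regardless of the session time zone.
CREATE TABLE meter_readings (ts TIMESTAMP_NTZ, val DOUBLE);

INSERT INTO meter_readings
VALUES (TO_TIMESTAMP_NTZ('2024-05-24 10:00:00'), 1.5);
```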
[Preview] Use IDENTITY to Set Auto-Increment Properties for Fields
When creating a table, you can specify auto-increment columns using the IDENTITY column attribute.
For more information, please refer to the IDENTITY Auto-Increment Column document.
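A sketch (the exact IDENTITY clause form may differ; see the linked document):

```sql
-- id values are generated automatically on insert.
CREATE TABLE orders (
    id   BIGINT IDENTITY,
    item STRING
);

INSERT INTO orders (item) VALUES ('apple'), ('pear');
```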
Built-in Functions
The following built-in functions have been added in this release:
| Function | Description |
|---|---|
| READ_KAFKA | Reads Kafka messages according to parameter configuration |
| TO_TIMESTAMP_NTZ | Converts a string to a TIMESTAMP_NTZ value |
| LOCALTIMESTAMP | Returns the current date and time |
| L2_DISTANCE | Calculates the L2 (Euclidean) distance between two vectors |
| L2_NORM | Calculates the L2 norm of a vector |
| L2_NORMALIZE | Performs L2 normalization on a vector |
| COSINE_DISTANCE | Calculates the cosine distance between two vectors |
| DOT_PRODUCT | Calculates the dot product of two vectors |
| MATCH_PHRASE | Matches a complete phrase in a string |
| MATCH_PHRASE_PREFIX | Matches a phrase in a string, treating its last token as a prefix |
| MATCH_REGEXP | Matches a regular expression in a string |
| MATCH_ALL | Matches strings containing all of the given tokens |
| MATCH_ANY | Matches strings containing at least one of the given tokens |
| TOKENIZE | Tokenizes a string |
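For illustration, the vector functions follow their standard mathematical definitions (the array literal syntax shown here is an assumption):

```sql
-- L2 distance between (1,2) and (4,6) is sqrt(3^2 + 4^2) = 5.
SELECT L2_DISTANCE(ARRAY(1.0, 2.0), ARRAY(4.0, 6.0));

-- Cosine distance of parallel vectors is 0.
SELECT COSINE_DISTANCE(ARRAY(1.0, 0.0), ARRAY(2.0, 0.0));
```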
Ecosystem & Development Interfaces
Java SDK Supports timestamp_ntz
The Java SDK now supports the timestamp_ntz type, allowing it to be mapped to the time-zone-free timestamp types of source databases in scenarios such as data synchronization, simplifying time zone handling.
Bug Fixes
- Dynamic Table: Added a check for the latest read position during dynamic table refresh submission to avoid data duplication during concurrent refreshes.
- Information_Schema: Fixed an issue where the filesize field in the TABLES view incorrectly returned -1.
Behavior Changes
Added Maximum Write Length Constraint for STRING/JSON/BINARY Types
The maximum write length for STRING, JSON, and BINARY fields in data tables is now 16 MB. Field length validation is performed during both batch and real-time imports.