May 24, 2024 Lakehouse Platform Release Notes

In this release (Release 2024.05.24), we have introduced a series of new features, enhancements, and fixes. These updates will be rolled out gradually to the following regions over one to two weeks from the release date; the exact timing depends on your region.

  • Alibaba Cloud Shanghai Region
  • Tencent Cloud Shanghai Region
  • Alibaba Cloud Singapore Region
  • Tencent Cloud Beijing Region
  • Amazon Beijing Region

New Features

Support for Inverted Indexes

This release introduces support for inverted indexes. An inverted index tokenizes text and stores the mapping between tokens and the records that contain them. When users perform keyword-based text searches, the query engine locates matching records quickly through the index data, significantly accelerating text search performance.

This release supports creating inverted indexes for string-type fields and provides a set of built-in functions to express text matching conditions and rules.
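
The sketch below illustrates the intended flow. The index DDL shape is an assumption (only the MATCH_ALL function is confirmed by this release's built-in function list), so consult the Inverted Index documentation for the exact syntax:

    -- Assumed DDL shape; see the Inverted Index documentation for the real syntax.
    CREATE TABLE app_logs (
        log_id  BIGINT,
        message STRING
    );

    -- Build an inverted index on the text column so keyword searches can use it.
    CREATE INDEX idx_message ON app_logs (message) INVERTED;

    -- Keyword matching is expressed with the new built-in match functions:
    SELECT log_id, message
    FROM app_logs
    WHERE MATCH_ALL(message, 'timeout retry');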

For more information, please refer to the Inverted Index documentation.

New Internal Volume Object Types for Data Lake Storage Volume

In addition to external Volumes, two new internal Volume types have been added: Table Volume and User Volume. Both are pre-defined and created by the system by default. Data in internal Volumes is stored in the Lakehouse managed storage area, so unstructured data can be managed and used quickly without connecting to external storage services. Internal Volumes simplify the management and use of file data in scenarios such as UDF resource file management, temporary storage of data import/export files, and development testing.
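
As a rough sketch of how the pre-created internal Volumes might be used for file staging (the commands and volume addressing below are assumptions, not confirmed syntax; see the Volume documentation):

    -- Hypothetical file operations; the actual commands may differ.
    -- Stage a UDF resource file in your personal User Volume:
    PUT '/local/path/my_udf_libs.jar' TO USER VOLUME;

    -- Inspect files staged in the Table Volume attached to table sales:
    SHOW FILES IN TABLE VOLUME sales;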

For more information, please refer to the Volume documentation.

[Preview] External Function Supports UDAF and UDTF

External Function previously supported only scalar UDFs. This upgrade extends support to UDAF and UDTF custom functions, which you can develop using the Hive UDF API.
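
Registering and calling a Hive-API-based UDAF might look roughly like the sketch below; the class name, resource location, and clause spellings are illustrative assumptions (see the Java UDF Development Guide for working examples):

    -- Illustrative registration sketch; exact clauses may differ.
    CREATE EXTERNAL FUNCTION my_collect_list
        AS 'com.example.udf.MyCollectListUDAF'       -- a class built with the Hive UDF API
        USING ARCHIVE 'volume://user/my_udf_libs.jar';

    -- Once registered, a UDAF is used like any built-in aggregate:
    SELECT dept, my_collect_list(employee_name)
    FROM employees
    GROUP BY dept;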

The documentation has been updated to include development examples for UDF, UDAF, and UDTF.

For more information, please refer to the Java UDF Development Guide documentation.

External Table: Support for Delta Lake External Tables

A new external table object type has been added, allowing direct access to data in external storage without importing it. This release supports creating and using external tables in the Delta Lake format. When creating an external table, you can use the CONNECTION object to define the table's service connection information.
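
The expected flow is sketched below; the clause spellings and object names are assumptions (see the External Table documentation for the exact DDL):

    -- Hypothetical example: a CONNECTION object supplies the storage access
    -- information, and the external table reads the Delta Lake data in place.
    CREATE EXTERNAL TABLE delta_orders
        CONNECTION my_storage_conn
        LOCATION 'oss://my-bucket/warehouse/orders'
        FORMAT DELTA_LAKE;                           -- format keyword assumed

    -- Query directly, with no import step:
    SELECT COUNT(*) FROM delta_orders;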

For more information, please refer to the External Table documentation.

[Preview] PIPE Pipeline Task Supports Real-time Data Import from Kafka

A new PIPE object type has been added, allowing you to create real-time tasks that continuously import data from external streaming sources into target tables. PIPE tasks run directly in the SQL engine, eliminating the need for third-party ETL tools or engines and avoiding unnecessary intermediate storage and format-conversion overhead. This significantly improves real-time import efficiency, increases import throughput, and reduces import costs.
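
A rough sketch combining a PIPE with the READ_KAFKA built-in function added in this release (READ_KAFKA's parameter style and the COPY clause shape are assumptions; see the PIPE Pipeline documentation):

    -- Hypothetical shape; exact options may differ.
    CREATE PIPE kafka_events_pipe
    AS COPY INTO events_target
    FROM (
        SELECT key, value
        FROM READ_KAFKA('bootstrap.servers=broker1:9092', 'topic=events')
    );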

For more information, please refer to the PIPE Pipeline documentation.

Data Lake Updates

Service Connection Object Supports GCS Object Storage

STORAGE CONNECTION manages the connection and authentication information of object storage services. In addition to Alibaba Cloud OSS and Tencent Cloud COS, support for the Google Cloud GCS object storage service has been added. STORAGE CONNECTION provides permission control over storage service connections, enabling different tasks and users to reuse defined connection objects.
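
A sketch of what defining a GCS connection might look like; the authentication clause is an assumption (see the CONNECTION documentation):

    -- Hypothetical GCS connection definition.
    CREATE STORAGE CONNECTION gcs_conn
        TYPE GCS
        CREDENTIALS '<google-service-account-key>';  -- auth clause assumed

    -- The connection object can then be reused by different tasks and users,
    -- subject to the permission control applied to the connection itself.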

For more information, please refer to the CONNECTION documentation.

External Schema Adds Access to HMS-managed GCS Data

External Schema maps a schema (or, in a two-tier structure, a database) under an external metadata service, enabling access to the data objects it contains. This upgrade adds the ability to access data that is managed by HMS and stored in Google Cloud GCS.
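
A sketch of the mapping flow; names and clause spellings are assumptions (see the External Schema documentation):

    -- Hypothetical example: map an HMS database, then query its tables directly,
    -- including tables whose data files live in Google Cloud GCS.
    CREATE EXTERNAL SCHEMA hms_sales
        CONNECTION my_hms_conn     -- connection to the Hive Metastore service
        DATABASE 'sales';          -- the HMS database being mapped

    SELECT * FROM hms_sales.orders LIMIT 10;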

For more information, please refer to the External Schema documentation.

Real-time Incremental Computing

Dynamic Table: View Dynamic Table Refresh History

A new SHOW DYNAMIC TABLE REFRESH HISTORY command has been added, allowing you to view the refresh history of dynamic tables. The refresh history shows each refresh job's execution status, runtime duration, refresh type (incremental or full), and number of records processed (including write and delete counts). It enables monitoring of dynamic table refresh operations (especially periodic ones), so you can adjust the scheduling cycle or resource size based on these metrics to meet business SLA requirements.
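
The command name is as given in this release; any filtering or scoping clauses beyond the plain form shown below are not confirmed here:

    -- List refresh jobs with status, duration, refresh type (incremental or full),
    -- and processed record counts:
    SHOW DYNAMIC TABLE REFRESH HISTORY;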

Documentation update: Added Dynamic Table Best Practices introduction.

For more information, please refer to the Dynamic Table documentation.

[Preview] Dynamic Table: Support for Incremental Processing of Dynamic Tables Using Custom Functions

Dynamic table definitions now support custom functions created through External Function (including UDF, UDAF, and UDTF). When a dynamic table defined with custom functions is refreshed, the system automatically optimizes incremental processing, further expanding the scope of incremental processing for dynamic tables.

This feature is not enabled by default during the preview period and needs to be enabled through specific parameters.
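
A sketch of what enabling and using the feature might look like; the parameter name below is a placeholder, not the real switch, and the dynamic table's refresh options are omitted (see the Using UDF in Dynamic Table document):

    -- Placeholder parameter name; look up the actual setting in the documentation.
    SET enable_dynamic_table_udf_incremental = true;

    -- A dynamic table whose definition calls an External Function
    -- (my_udf_score is a hypothetical scalar UDF registered earlier):
    CREATE DYNAMIC TABLE dt_scored
    AS SELECT order_id, my_udf_score(amount) AS score
    FROM orders;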

For more information, please refer to the Using UDF in Dynamic Table document.

Information Schema Updates

New Volume and Connection Object Views in Information Schema

New views for data lake storage VOLUMES and CONNECTIONS have been added. You can query the corresponding views under INFORMATION_SCHEMA to obtain information about data lake storage Volumes and external service Connections.
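
For example (the view names VOLUMES and CONNECTIONS are as stated in this release; their columns are defined in the INFORMATION_SCHEMA document):

    -- List data lake storage Volumes and external service Connections:
    SELECT * FROM INFORMATION_SCHEMA.VOLUMES;
    SELECT * FROM INFORMATION_SCHEMA.CONNECTIONS;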

For more information, please refer to the INFORMATION_SCHEMA document.

SQL Capability Updates

Support for Using SYNONYM as an Alias to Access Existing Objects

By creating a SYNONYM, you can wrap the location and name of an existing data object behind an alias, simplifying access to the object and enhancing the security of data object access.
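
A sketch assuming a conventional CREATE SYNONYM syntax (the exact form is in the SYNONYM document):

    -- Expose a stable alias for an object whose real location stays hidden:
    CREATE SYNONYM orders_latest FOR warehouse_2024.sales.orders;

    -- Consumers query the alias; the underlying object can be moved or renamed
    -- without breaking them:
    SELECT COUNT(*) FROM orders_latest;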

For more information, please refer to the SYNONYM document.

Support for TIMESTAMP_NTZ Type

In addition to the existing TIMESTAMP_LTZ type, the TIMESTAMP_NTZ type has been added to the time data types. TIMESTAMP_NTZ (timestamp without time zone) stores date and time values without time zone information and is unaffected by time zone changes. Unlike time-zone-aware timestamp types such as TIMESTAMP_LTZ, TIMESTAMP_NTZ requires no time zone conversion, which greatly simplifies the handling of timestamp data in multi-time-zone environments and cross-system data transmission scenarios.
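
For example (TO_TIMESTAMP_NTZ is among the built-in functions added in this release; its exact signature is an assumption here):

    CREATE TABLE events (
        id         BIGINT,
        created_at TIMESTAMP_NTZ      -- stored without time zone information
    );

    -- The value is stored and returned as-is, with no time zone conversion:
    INSERT INTO events VALUES (1, TO_TIMESTAMP_NTZ('2024-05-24 10:30:00'));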

For more information, please refer to the TIMESTAMP_NTZ Type document.

[Preview] Use IDENTITY to Set Auto-Increment Properties for Fields

When creating a table, you can specify auto-increment columns using the IDENTITY column attribute.
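
A sketch of the expected DDL; the exact IDENTITY clause spelling may differ (see the IDENTITY Auto-Increment Column document):

    CREATE TABLE users (
        user_id BIGINT IDENTITY,      -- values generated automatically on insert
        name    STRING
    );

    -- user_id is filled in by the system:
    INSERT INTO users (name) VALUES ('alice'), ('bob');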

For more information, please refer to the IDENTITY Auto-Increment Column document.

Built-in Functions

The following built-in functions have been added in this release:

    Function Name         Description
    READ_KAFKA            Read Kafka messages based on parameter configuration
    TO_TIMESTAMP_NTZ      Convert a string to an NTZ timestamp
    LOCALTIMESTAMP        Return the current date and time
    L2_DISTANCE           Calculate the L2 distance between two vectors
    L2_NORM               Calculate the L2 norm of a vector
    L2_NORMALIZE          Perform L2 normalization on a vector
    COSINE_DISTANCE       Calculate the cosine distance between two vectors
    DOT_PRODUCT           Calculate the dot product of two vectors
    MATCH_PHRASE          Match complete phrases in two strings
    MATCH_PHRASE_PREFIX   Match complete phrases in two strings, ignoring prefixes
    MATCH_REGEXP          Match regular expressions in a string
    MATCH_ALL             Match all occurrences of substrings in a string
    MATCH_ANY             Match substrings that appear at least once in a string
    TOKENIZE              Tokenization function
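
As a quick illustration of a few of the new vector functions (the array literal syntax is an assumption):

    SELECT
        L2_DISTANCE(ARRAY(1.0, 2.0), ARRAY(4.0, 6.0)),      -- 5.0
        COSINE_DISTANCE(ARRAY(1.0, 0.0), ARRAY(0.0, 1.0)),  -- 1.0 (orthogonal vectors)
        DOT_PRODUCT(ARRAY(1.0, 2.0), ARRAY(3.0, 4.0));      -- 11.0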

Ecosystem & Development Interfaces

Java SDK Supports timestamp_ntz

The Java SDK now supports the timestamp_ntz type. In scenarios such as data synchronization, timestamp_ntz can be mapped to the source database's time-zone-free timestamp type, simplifying time zone handling.

Bug Fixes

  • Dynamic Table: Added a check for the latest read position during dynamic table refresh submission to avoid data duplication during concurrent refreshes.
  • Information_Schema: Fixed an issue where the filesize field in the TABLES view returned an abnormal value of -1.

Behavior Changes

Added Maximum Write Length Constraint for STRING/JSON/BINARY Types

The maximum write length for STRING, JSON, and BINARY fields in a table is 16 MB. Field length validation is performed during both batch and real-time imports.