July 22, 2024 Lakehouse Platform Release Notes

This release (Release 2024.07.22) introduces a series of new features, enhancements, and fixes. Note that these updates will roll out gradually to the following regions over one to two weeks from the release date; exact timing depends on your region.

  • Alibaba Cloud Shanghai Region
  • Tencent Cloud Shanghai Region
  • Tencent Cloud Beijing Region
  • Tencent Cloud Guangzhou Region
  • AWS Beijing Region
  • International Site - Alibaba Cloud - Singapore Region
  • International Site - AWS - Singapore Region

New Features

[Preview] External Catalog, Supports Catalog-Level Federated Query

Building on External Table and External Schema, Singdata Lakehouse now supports External Catalog, which maps and mirrors external data sources at the catalog level. This release supports connecting an External Catalog to a Hive Metastore and mapping multiple databases from it, simplifying the querying and analysis of Hive Metastore-managed data in Singdata Lakehouse.
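
As a rough illustration, creating and querying an External Catalog backed by a Hive Metastore might look like the following; the exact DDL keywords and the connection reference are assumptions for illustration, not confirmed syntax:

```sql
-- Hypothetical sketch: DDL keywords and names are illustrative assumptions.
CREATE EXTERNAL CATALOG hive_catalog
  CONNECTION hms_conn;  -- a Catalog Connection pointing at the Hive Metastore

-- Databases and tables managed by the Hive Metastore can then be addressed
-- with three-part names, without defining each table individually.
SELECT COUNT(*) FROM hive_catalog.sales_db.orders;
```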

[Preview] Kafka External Table, Supports SQL Direct Reading of Kafka

This release adds a new external table type for Kafka services, allowing you to query Kafka message data directly with SQL.
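
A minimal sketch of what this could look like; the table options (broker list, topic) and their key names are assumptions, not confirmed DDL:

```sql
-- Hypothetical sketch: option keys and values are illustrative assumptions.
CREATE EXTERNAL TABLE kafka_events
USING kafka
OPTIONS (
  'bootstrap.servers' = 'broker1:9092,broker2:9092',
  'topic'             = 'app_events'
);

-- Query Kafka message data directly with SQL.
SELECT * FROM kafka_events LIMIT 10;
```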

[Preview] ClickZetta Connector for Spark, Supports Spark Reading and Writing Lakehouse

The ClickZetta Connector for Spark plugin lets existing Spark clusters access Singdata Lakehouse data tables, supporting both reading and writing.
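
A hypothetical Spark SQL sketch of mounting a Lakehouse table through the connector; the data source name and option keys are assumptions:

```sql
-- Hypothetical Spark SQL sketch: source name and options are assumptions.
CREATE TABLE lakehouse_orders
USING clickzetta
OPTIONS (
  'url'   = '<workspace-endpoint>',  -- placeholder endpoint
  'table' = 'sales_db.orders'
);

-- Read from and write to the Lakehouse table through the Spark cluster.
SELECT COUNT(*) FROM lakehouse_orders;
INSERT INTO lakehouse_orders SELECT * FROM staging_orders;
```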

[Preview] MySQL Communication Protocol Support, Supports MySQL Client Connection to Lakehouse

In addition to its custom JDBC driver, Singdata Lakehouse now supports a MySQL 8-compatible communication protocol. You can connect to Lakehouse with MySQL clients and drivers, extending analysis scenarios to MySQL-ecosystem tools. This is especially useful when a client tool (such as Power BI or QuickBI) cannot load Singdata Lakehouse's custom JDBC driver.

[Preview] Logstash ClickZetta Output Plugin, Real-time Log Data Writing to Lakehouse

For log collection and retrieval analysis scenarios, Singdata Lakehouse provides the Logstash ClickZetta Output plugin, which writes log data collected by Logstash into Lakehouse data tables in real time. Combined with inverted indexes on those tables, Lakehouse supports real-time log collection and real-time index building to serve real-time file and log retrieval analysis needs.

Import and Export Updates

[Preview] Automatic Import Service (Pipe) Adds Support for Real-time Import of Alibaba Cloud Object Storage File Data

In addition to real-time import of Kafka data, the automatic import service (Pipe) now supports real-time import of object storage files. The service subscribes to object storage file change events and automatically triggers import tasks when files change, enabling automatic incremental import of rapidly changing object storage data files.
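
A sketch of what defining such a pipe might look like; the pipe DDL shape, source location form, and option names are assumptions for illustration:

```sql
-- Hypothetical sketch: DDL shape and option names are assumptions.
CREATE PIPE orders_pipe AS
  COPY INTO sales_db.orders
  FROM 'oss://my-bucket/orders/'   -- placeholder object storage path
  FILE_FORMAT = (TYPE = CSV);
-- The pipe subscribes to file change events on the path and triggers the
-- COPY automatically as new files arrive, importing data incrementally.
```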

COPY INTO Supports Exporting Table Data in JSON Format

The COPY INTO command extends its export formats, allowing table data to be exported as JSON files via the FILE_FORMAT = (TYPE = JSON) option.
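
For example, a minimal export might look like this; the target path and source table are placeholders:

```sql
-- Export table data as JSON files; path and table name are placeholders.
COPY INTO 'oss://my-bucket/exports/orders/'
FROM sales_db.orders
FILE_FORMAT = (TYPE = JSON);
```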

Federated Query Updates

Extend Catalog Connection Types, Support External Catalog Service Connection Definitions

This release adds the Catalog Connection object type, which stores connection definitions for external catalog services such as Hive Metastore. When creating an External Catalog, External Schema, or External Table, you can reference a Catalog Connection to simplify the definition and keep connection information secure.
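
A sketch of defining and referencing a Catalog Connection; the connection DDL and property names are assumptions for illustration:

```sql
-- Hypothetical sketch: DDL keywords and property names are assumptions.
CREATE CONNECTION hms_conn
  TYPE = 'hive_metastore'
  URI  = 'thrift://hms-host:9083';  -- placeholder Hive Metastore endpoint

-- Reference the connection instead of repeating endpoint and credential
-- details in each External Catalog/Schema/Table definition.
CREATE EXTERNAL CATALOG hive_catalog CONNECTION hms_conn;
```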

External Table Query Adds Metadata Caching Capability

When querying external tables through External Table, External Schema, or External Catalog, Singdata Lakehouse now caches remote metadata locally to accelerate external data query performance.

Virtual Compute Cluster Updates

[Preview] Preload Cache Supports Dynamic Cache Only for Recent Partitions of Partitioned Tables

Preload Cache can now be configured to proactively cache only the most recent partitions of a partitioned table. As partitions change, the system automatically evicts expired partition data from the cache and loads new partition data. This makes effective use of an analytical cluster's local cache for recent hot data, and suits workloads with large amounts of historical data where query analysis focuses on recent data.
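
A heavily hedged sketch of what such a configuration might look like; the statement form and property names here are purely illustrative assumptions:

```sql
-- Hypothetical sketch: statement and property names are assumptions.
ALTER VCLUSTER analytics_vc SET PROPERTIES (
  'preload.tables'            = 'sales_db.orders',
  'preload.recent.partitions' = '7'  -- keep only the 7 newest partitions cached
);
```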

Incremental Computing Updates

Dynamic Table DDL Definition Supports User-Specified Virtual Compute Cluster for Automatic Refresh Jobs

The DDL for CREATE DYNAMIC TABLE now supports the syntax REFRESH INTERVAL [interval_time] VCLUSTER <virtual_cluster_name> to specify which virtual compute cluster runs the automatic refresh job.
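
For example (the interval literal form and the table definition are placeholders around the documented clause):

```sql
-- The REFRESH INTERVAL ... VCLUSTER clause assigns refresh jobs to a cluster.
CREATE DYNAMIC TABLE daily_sales
REFRESH INTERVAL 5 MINUTE VCLUSTER refresh_vc
AS
SELECT order_date, SUM(amount) AS total_amount
FROM sales_db.orders
GROUP BY order_date;
```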

SQL Capability Updates

Support SHOW PARTITIONS Syntax to View Partition Table Information

Lakehouse uses implicit partitioning to define and use partitioned tables. To be compatible with Hive's partition management habits and to strengthen partition-based management and optimization, this release provides the SHOW PARTITIONS syntax for viewing partition information. The SHOW PARTITIONS EXTENDED syntax additionally returns extended information per partition: partition values, record count, data size, creation time, and last modification time. The Lakehouse platform can also leverage this partition metadata for fine-grained management and optimization in scenarios such as historical data archiving and Preload Cache proactive caching.
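
For example (the table name is a placeholder):

```sql
-- List the partitions of a partitioned table.
SHOW PARTITIONS sales_db.orders;

-- Include record count, data size, and create/modify times per partition.
SHOW PARTITIONS EXTENDED sales_db.orders;
```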

SHOW TABLES Command Output Extended to Identify External and Dynamic Tables

The output of the SHOW TABLES command has been extended with the is_external and is_dynamic fields, which indicate whether a table is external or dynamic.
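
For example, the output now looks roughly like this (the column layout shown is illustrative):

```sql
SHOW TABLES;
-- | table_name   | is_external | is_dynamic | ...
-- | orders       | false       | false      | ...
-- | kafka_events | true        | false      | ...
-- | daily_sales  | false       | true       | ...
```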

Compatibility Extension: New MAX_PT Function to Support Viewing the Latest Partition of a Partitioned Table

The max_pt function returns the value of the largest first-level partition of a partitioned table, improving syntax compatibility for existing tasks that use this function.
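
For example, to query only the latest partition of a table; the table name is a placeholder, and the string-argument form is an assumption based on the function this provides compatibility with:

```sql
-- pt is the table's first-level partition column.
SELECT *
FROM sales_db.orders
WHERE pt = MAX_PT('sales_db.orders');
```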

Built-in Functions

The following built-in functions are newly added in this release:

  • CHARACTER_LENGTH: Returns the number of characters in a string.
  • CHAR_LENGTH: Equivalent to CHARACTER_LENGTH; returns the number of characters in a string.
  • LENGTHB: Returns the byte length of the string parameter.
  • PERCENTILE_APPROX: Calculates approximate percentiles; returns the approximate percentile of a specified column's values in a table.
  • PERCENT_RANK: Calculates percentile rank; returns the relative position of a value within a set of values.
  • FORMAT_STRING: Formats strings; generates formatted strings based on printf-style format strings.
  • REGEXP_EXTRACT_ALL: Extracts all substrings that match a regular expression from a string.
  • STR_TO_DATE_MYSQL: Converts strings to dates; compatible with the STR_TO_DATE function in MySQL.
  • MAX_PT: Gets the value of the largest partition in a partitioned table.
  • IS_IP_ADDRESS_IN_RANGE: Determines whether an IP address is within a certain network range.
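
A few quick examples of the new functions; the results shown in the comments assume the common semantics these functions have in other SQL engines:

```sql
SELECT
  CHAR_LENGTH('lakehouse')                AS char_len,   -- 9
  LENGTHB('数据')                          AS byte_len,   -- 6 (UTF-8 bytes)
  FORMAT_STRING('%s-%04d', 'order', 42)   AS formatted,  -- 'order-0042'
  REGEXP_EXTRACT_ALL('a1b22c333', '\\d+') AS numbers;    -- ['1', '22', '333']
```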

Ecosystem & Development Interfaces

ClickZetta Connector for Flink Update

The ClickZetta Connector for Flink plugin adds support for Flink v1.17 and v1.18.

ClickZetta Catalog SDK Update

Flink now supports using the ClickZetta Catalog SDK to read Lakehouse data tables.
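
A hypothetical Flink SQL sketch of registering a Lakehouse catalog; the catalog type name and property keys are assumptions:

```sql
-- Hypothetical Flink SQL sketch: type name and property keys are assumptions.
CREATE CATALOG lakehouse WITH (
  'type' = 'clickzetta',
  'url'  = '<workspace-endpoint>'  -- placeholder endpoint
);

USE CATALOG lakehouse;
SELECT * FROM sales_db.orders LIMIT 10;
```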

dbt-clickzetta Adapter Update

The dbt-clickzetta adapter adds support for Singdata Lakehouse dynamic tables as incremental models, allowing you to use dbt to develop automatically refreshing dynamic table models.
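
A hypothetical dbt model sketch; the materialization name and config keys are assumptions, not confirmed adapter options:

```sql
-- models/daily_sales.sql (hypothetical: config keys are assumptions)
{{ config(
    materialized     = 'dynamic_table',
    refresh_interval = '5 minute'
) }}

SELECT order_date, SUM(amount) AS total_amount
FROM {{ source('sales_db', 'orders') }}
GROUP BY order_date
```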

Bug Fixes

  • Job History & Job Details: Fixed an issue where some jobs (such as desc, show commands) could not display job duration information in the Job History list.

Behavior Changes

  • Virtual Cluster Management (Preload Cache): Previously, a cluster with Preload Cache configured would by default proactively cache the entire preloaded table at startup. The behavior has changed: after the cluster starts, preloaded tables are cached incrementally, meaning that by default only newly changed data is proactively cached.
  • Dynamic Table Auto Refresh: In the DDL definition of CREATE DYNAMIC TABLE, you can set the automatic refresh interval and the compute resources that run the refresh job. Previously, this was configured with the refresh_vc parameter in the dynamic table's PROPERTIES; for example, after setting PROPERTIES('refresh_vc'='vcluster_name'), auto-refresh tasks ran on the vcluster_name cluster. The behavior has changed: use the syntax REFRESH INTERVAL [interval_time] VCLUSTER <virtual_cluster_name> to set the cluster used by auto-refresh jobs. The original PROPERTIES parameter remains supported for backward compatibility.