July 22, 2024 Lakehouse Platform Release Notes
This release (2024.07.22) introduces a series of new features, enhancements, and fixes. These updates will be rolled out gradually to the following regions over one to two weeks from the release date; exact timing depends on your region.
- Alibaba Cloud Shanghai Region
- Tencent Cloud Shanghai Region
- Tencent Cloud Beijing Region
- Tencent Cloud Guangzhou Region
- AWS Beijing Region
- International Site - Alibaba Cloud - Singapore Region
- International Site - AWS - Singapore Region
New Features
[Preview] External Catalog, Supports Catalog-Level Federated Query
Building on External Table and External Schema, Singdata Lakehouse now supports using an External Catalog to map and mirror external data sources at the catalog level. This release supports connecting an External Catalog to Hive Metastore and mapping multiple of its databases, simplifying query and analysis of Hive Metastore-managed data in Singdata Lakehouse.
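As a rough sketch of how this might look in practice (the DDL keywords, options, and object names below are assumptions, not confirmed syntax):

```sql
-- Hypothetical sketch: keywords, options, and names are illustrative.
-- Map an entire Hive Metastore instance as a catalog named hive_prod,
-- using a pre-defined catalog connection (see Federated Query Updates below).
CREATE EXTERNAL CATALOG hive_prod
  CONNECTION hms_conn;

-- Each HMS database is mirrored as a schema under the catalog,
-- so its tables can be queried directly:
SELECT COUNT(*) FROM hive_prod.sales_db.orders;
```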
[Preview] Kafka External Table, Supports SQL Direct Reading of Kafka
A new external table type for Kafka services has been added, supporting SQL queries over Kafka message data.
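A minimal sketch of what querying Kafka through an external table could look like; the DDL keywords and option names here are assumptions:

```sql
-- Hypothetical sketch: connector keywords and options are illustrative.
CREATE EXTERNAL TABLE kafka_events
USING kafka
OPTIONS (
  'bootstrap.servers' = 'broker-1:9092',
  'topic' = 'events'
);

-- Kafka messages can then be read with plain SQL:
SELECT * FROM kafka_events LIMIT 10;
```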
[Preview] ClickZetta Connector for Spark, Supports Spark Reading and Writing Lakehouse
Provides the ClickZetta Connector for Spark plugin, which allows existing Spark clusters to read from and write to Singdata Lakehouse data tables.
[Preview] MySQL Communication Protocol, Supports MySQL Client Connections to Lakehouse
In addition to the custom JDBC driver, Singdata Lakehouse now supports a MySQL 8-compatible communication protocol. You can use MySQL clients and drivers to connect to Lakehouse, extending analysis scenarios to MySQL-ecosystem tools. This is particularly useful when a client tool, such as Power BI or Quick BI, cannot load Singdata Lakehouse's custom JDBC driver.
[Preview] Logstash ClickZetta Output Plugin, Real-time Log Data Writing to Lakehouse
For log collection and retrieval analysis scenarios, Singdata Lakehouse provides the Logstash ClickZetta Output plugin, which writes log data collected by Logstash into Lakehouse data tables in real time. Combined with inverted indexes on those tables, Lakehouse can collect logs and build indexes in real time to support real-time file and log retrieval and analysis.
Import and Export Updates
[Preview] Automatic Import Service (Pipe) Adds Support for Real-time Import of Alibaba Cloud Object Storage File Data
In addition to real-time import of Kafka data, the automatic import service now supports real-time import of object storage files. The service subscribes to object storage file-change events and automatically triggers import tasks when files change, enabling automatic incremental import of rapidly changing object storage data files.
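For illustration, a Pipe over object storage might be declared along these lines (the DDL below is a sketch; the keywords, volume object, and options are assumptions):

```sql
-- Hypothetical sketch: Pipe keywords and options are illustrative.
CREATE PIPE oss_auto_ingest
AS COPY INTO raw_events
FROM VOLUME oss_events_volume        -- an external volume over object storage
FILE_FORMAT = (TYPE = CSV);
-- Once created, the pipe subscribes to file-change events on the volume
-- and triggers incremental import tasks automatically.
```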
COPY INTO Supports Exporting Table Data in JSON Format
The COPY INTO command now supports an additional export format: table data can be exported as JSON files via the FILE_FORMAT = (TYPE = JSON) parameter.
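For example (the FILE_FORMAT = (TYPE = JSON) parameter is from this release; the export target and other clauses shown are illustrative):

```sql
-- Export the orders table as JSON files; the target location is illustrative.
COPY INTO VOLUME export_volume SUBDIRECTORY 'orders_json'
FROM orders
FILE_FORMAT = (TYPE = JSON);
```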
Federated Query Updates
Extended Connection Types, Supports External Catalog Service Connection Definitions
Adds the Catalog Connection object type, which defines a connection to an external catalog service (such as Hive Metastore). When creating an External Catalog, External Schema, or External Table, the Catalog Connection can be referenced to simplify definitions and improve the security of connection information.
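A sketch of the flow (the connection type and option names below are assumptions):

```sql
-- Hypothetical sketch: connection type and option names are illustrative.
CREATE CONNECTION hms_conn
  TYPE = 'hive_metastore'
  URI  = 'thrift://hms-host:9083';

-- The connection can then be referenced instead of repeating
-- endpoint and credential details in each external object:
CREATE EXTERNAL CATALOG hive_prod CONNECTION hms_conn;
```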
External Table Query Adds Metadata Caching Capability
When querying external tables through External Table/External Schema/External Catalog, Singdata Lakehouse now caches remote metadata locally to accelerate external data queries.
Virtual Compute Cluster Updates
[Preview] Preload Cache Supports Dynamically Caching Only Recent Partitions of Partitioned Tables
Supports configuring proactive caching of the most recent partitions of partitioned tables. As partitions change, the system automatically evicts cached data for expired partitions and loads data for new ones. This makes effective use of an analytical cluster's local cache to keep recent hot data warm, and suits workloads with large volumes of historical data where query analysis focuses on recent data.
Incremental Computing Updates
Dynamic Table DDL Definition Supports User-Specified Virtual Compute Cluster for Automatic Refresh Jobs
The DDL for CREATE DYNAMIC TABLE now supports the syntax REFRESH INTERVAL [interval_time] VCLUSTER <virtual_cluster_name> to specify the virtual compute cluster used by automatic refresh jobs.
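For example (the REFRESH INTERVAL ... VCLUSTER clause is from this release; the interval literal, table definition, and names are illustrative):

```sql
-- Refresh jobs for this dynamic table run on the refresh_vc cluster.
CREATE DYNAMIC TABLE daily_sales
REFRESH INTERVAL 5 MINUTES VCLUSTER refresh_vc
AS
SELECT order_date, SUM(amount) AS total_amount
FROM orders
GROUP BY order_date;
```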
SQL Capability Updates
Supports SHOW PARTITIONS Syntax to View Partitioned Table Information
Lakehouse defines and uses partitioned tables through implicit partitioning. To stay compatible with Hive's partition-management conventions and strengthen partition-based management and optimization, this release provides the SHOW PARTITIONS syntax for viewing partitioned table information. The SHOW PARTITIONS EXTENDED syntax additionally returns extended partition information: partition value, partition record count, partition data size, partition creation time, and partition last-modification time. With this, users can see the number, size, and modification history of partitions, and the Lakehouse platform can leverage partition metadata for fine-grained management and optimization in scenarios such as historical data archiving and Preload Cache proactive caching.
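For example (the table name is illustrative, and the position of the EXTENDED keyword is assumed):

```sql
-- List the partitions of a partitioned table:
SHOW PARTITIONS sales_orders;

-- Include extended information: partition value, record count,
-- data size, creation time, and last-modification time:
SHOW PARTITIONS EXTENDED sales_orders;
```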
SHOW TABLES Output Extended to Identify External and Dynamic Tables
The output of the SHOW TABLES command has been extended with is_external and is_dynamic fields to indicate whether a table is an external table or a dynamic table.
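For example (the result layout shown is illustrative; only the is_external and is_dynamic columns are confirmed by this release):

```sql
SHOW TABLES;
-- Example output (layout illustrative):
-- | table_name  | ... | is_external | is_dynamic |
-- | orders      | ... | false       | false      |
-- | hive_orders | ... | true        | false      |
-- | daily_sales | ... | false       | true       |
```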
Compatibility Extension: New MAX_PT Function Returns the Largest Partition of a Partitioned Table
MAX_PT returns the value of the largest first-level partition of a partitioned table, improving syntax compatibility for existing jobs that use this function.
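A typical use is restricting a query to the latest partition (the argument form and partition column name below are assumptions):

```sql
-- Read only the largest (latest) first-level partition of sales_orders.
SELECT *
FROM sales_orders
WHERE pt = MAX_PT('sales_orders');   -- 'pt' is a hypothetical partition column
```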
Built-in Functions
The following built-in functions are newly added in this release:
| Function Name | Description |
| --- | --- |
| CHARACTER_LENGTH | Returns the number of characters in a string. |
| CHAR_LENGTH | Equivalent to CHARACTER_LENGTH; returns the number of characters in a string. |
| LENGTHB | Returns the length of a string in bytes. |
| PERCENTILE_APPROX | Calculates approximate percentiles; returns the approximate percentile of the values in a specified column. |
| PERCENT_RANK | Calculates the percentile rank of a value, i.e. its relative position within a set of values. |
| FORMAT_STRING | Generates a formatted string from a printf-style format string. |
| REGEXP_EXTRACT_ALL | Extracts all substrings of a string that match a regular expression. |
| STR_TO_DATE_MYSQL | Converts strings to dates; compatible with MySQL's STR_TO_DATE function. |
| MAX_PT | Returns the value of the largest partition of a partitioned table. |
| IS_IP_ADDRESS_IN_RANGE | Determines whether an IP address falls within a given network range. |
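A few illustrative calls (argument orders follow common conventions for these function names and are assumptions, as are the results shown in comments; byte lengths assume UTF-8):

```sql
SELECT CHAR_LENGTH('héllo');                        -- 5 (characters)
SELECT LENGTHB('héllo');                            -- 6 (bytes in UTF-8)
SELECT FORMAT_STRING('%s scored %d', 'alice', 42);  -- 'alice scored 42'
SELECT REGEXP_EXTRACT_ALL('a1b2c3', '(\\d+)');      -- ['1', '2', '3']
SELECT IS_IP_ADDRESS_IN_RANGE('192.168.1.10', '192.168.1.0/24');  -- true
```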
Ecosystem & Development Interfaces
ClickZetta Connector for Flink Update
The ClickZetta Connector for Flink plugin adds support for Flink v1.17 and v1.18.
ClickZetta Catalog SDK Update
Flink supports using the ClickZetta Catalog SDK to read Lakehouse data tables.
DBT-CLICKZETTA ADAPTER Update
The dbt-clickzetta adapter adds support for Singdata Lakehouse dynamic table incremental models, allowing the use of dbt to develop automatically refreshing dynamic table models.
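A dbt model using this might look roughly like the following (the materialization name, config keys, and model names are assumptions):

```sql
-- models/daily_sales.sql: a hypothetical dynamic table incremental model.
{{ config(
    materialized='dynamic_table',
    refresh_interval='5 minutes'
) }}

SELECT order_date, SUM(amount) AS total_amount
FROM {{ ref('orders') }}
GROUP BY order_date
```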
Bug Fixes
- Job History & Job Details: Fixed an issue where some jobs (such as DESC and SHOW commands) did not display duration information in the Job History list.
Behavior Changes
- Virtual Cluster Management (Preload Cache): For clusters with Preload Cache configured, the previous behavior was to proactively cache the entire configured table at cluster startup. The new behavior is to cache the configured table incrementally after startup; that is, by default only newly changed data is proactively cached.
- Dynamic Table Auto Refresh: The DDL of CREATE DYNAMIC TABLE lets you set the automatic refresh interval and the compute resources used to run refresh jobs. Previously this was set via the refresh_vc parameter in the dynamic table's PROPERTIES; for example, after setting PROPERTIES('refresh_vc'='vcluster_name'), auto-refresh tasks would run on the vcluster_name cluster. The new behavior uses the syntax REFRESH INTERVAL [interval_time] VCLUSTER <virtual_cluster_name> to set the cluster for auto-refresh jobs. The original PROPERTIES-based behavior remains supported for compatibility.