Partitioning
Partitioning is a method of grouping similar rows together at the time of writing to speed up queries. Using partitioning can achieve data pruning and optimize queries. When querying a table, specify the partition to be queried through the WHERE clause to avoid full table scans, improve processing efficiency, and reduce computing resources.
Lakehouse Partitioning
Lakehouse partitioning is similar to Apache Iceberg's hidden partitioning. For ease of understanding, we refer to Apache Iceberg's partitioning concepts to introduce Lakehouse partitioning. Apache Iceberg is an open-source table format that supports two partitioning methods: identity partitioning and transform partitioning.
- Identity Partitioning: Uses one or more columns of the table as partition keys, dividing the data into different partitions based on the column values. Partition values are generated by obtaining column values and optionally transforming them.
- Transform Partitioning: Uses one or more columns of the table after applying a transformation function as partition keys, dividing the data into different partitions based on the transformed values. For example, if a table has a timestamp column, the
years(timestamp)
function can be used to partition the data by the number of years since 1970.
Partition information and data file paths are stored in metadata. The advantage of this approach is that partitioning strategies can be modified without affecting the data, and partition information can be hidden, eliminating the need to specify partition conditions when writing SQL. Additionally, there is no limit to the number of partitions since the data itself is the partition.
Lakehouse is compatible with some of Apache Iceberg's transform partitioning functions. However, Apache Iceberg's year, month, day, and hour transform partitioning functions conflict with common date function names in databases. Therefore, in Lakehouse, the transform partitioning functions are named years, months, days, and hours. Below are the partition transform functions supported by Lakehouse:
Partition Function Name | Description | Source Type | Result Type |
---|---|---|---|
bucket(numbucket,colNmae) | Hash value of the value, modulo N | int, long, decimal, date, timestamp_ltz, string binary | int |
truncate(colName,W) | Truncate value to width W | int, long, decimal, string | Source Type |
years | Extract the year from date or timestamp, based on 1970 | date, timestamp, timestamptz | int |
months | Extract the month from date or timestamp, based on 1970-01-01 | date, timestamp, timestamptz | int |
days | Extract the day from date or timestamp, based on 1970-01-01 | date, timestamp, timestamptz | int |
hours | Extract the hour from timestamp, based on 1970-01-01 00:00:00 | timestamp, timestamptz | int |
Show Partitions
Lakehouse Partitioning Syntax
Introduction
To accommodate users who are accustomed to using Hive syntax, Lakehouse has implemented some syntactic sugar at the syntax level. It is particularly important to note that Apache Hive requires partition columns to be in the last position, but Apache Iceberg does not have this requirement. Therefore, when adding columns, be especially careful as adding columns in the original Hive way may place the column in the last position. Hence, the position must be specified when adding columns. For example, the syntax for creating partitions in Apache Iceberg is as follows:
The syntax of hive must be like this
lakehouse is compatible with both iceberg syntax and hive syntax, with partition fields and types written in the PARTITIONED BY statement.
Supported Syntax
lakehouse is compatible with Hive's partition syntax and supports the following partition operations:
-
Insert into…partition: Insert data into one or more partitions of the table. If the partition does not exist, it will be automatically created because the partition is the data. The partition clause can be omitted, and the data will be automatically mapped according to the order of the columns, requiring that the partition columns and the table positions should be mapped consistently. In hive, you must specify the partition clause.
-
Insert overwrite…partition: Overwrite data in one or more partitions of the table. If the partition does not exist, it will be automatically created. The partition clause can be omitted, and the data will be automatically mapped according to the order of the columns, requiring that the partition columns and the table positions should be mapped consistently. The overwrite behavior follows Hive's principles: if a partition is specified and it is a static value, it is a static overwrite, meaning only the specified partition is overwritten; if a partition is specified and it is a dynamic value, or if no partition is specified, it is a dynamic overwrite, meaning all matching partitions are overwritten. If the partition in the table is a transformed partition, the partition clause does not support specifying transformation functions, and you can directly use insert overwrite table to insert data.
-
-
Truncate partition: Clear the data in one or more partitions of the table. However, in Hive, the partition values will be retained, while in Lakehouse, they will not be retained because the partition in Lakehouse is the data. Sure, here is the translated content:
-
-
Drop partition: Delete one or more partitions of the table.
-
-
Rename partition: Currently, you can directly update the data.
-
-
Show partition: It is recommended to use SQL, select distinct for statistics, such as
select distinct pt from table_pt;
Supported Partition Data Types
Data Type | Supported |
---|---|
TINYINT | Yes |
SMALLINT | Yes |
INT | Yes |
BIGINT | Yes |
STRING | Yes |
CHAR(n) | Yes |
VARCHAR(n) | Yes |
BOOLEAN | Yes |
BINARY | No |
FLOAT | No |
DOUBLE | No |
DECIMAL(precision,scale) | No |
TIMESTAMP_LTZ | No |
INTERVAL (interval type) | No |
ARRAY | No |
MAP | No |
STRUCT | No |
Write Partition
-
Insert into…partition: Insert data into one or more partitions of the table. If the partition does not exist, it will be automatically created because the partition is the data. The partition clause can be omitted, and the data will be automatically mapped according to the order of the columns. The required partition columns and the table positions should be mapped consistently. In Hive, you must specify the partition clause.
-
Insert overwrite…partition: Overwrite data in one or more partitions of the table. If the partition does not exist, it will be automatically created. The partition clause can be omitted, and the data will be automatically mapped according to the order of the columns. The required partition columns and the table positions should be mapped consistently. The overwrite behavior follows Hive's principles. If a partition is specified and it is a static value, it is a static overwrite, meaning only the specified partition is overwritten. If a partition is specified and it is a dynamic value, or if no partition is specified, it is a dynamic overwrite, meaning all matching partitions are overwritten. If the partition in the table is a transformed partition, the partition clause does not support specifying transformation functions. You can directly use insert overwrite table to insert data.
-
Note:
When writing partitions, a single task is currently limited to 2048 partitions. Exceeding this limit will result in an error: The count of dynamic partitions exceeds the maximum number 2048
. It is recommended to count the number of partitions before inserting, for example: select count(distinct pt) from table
. If you indeed have this many partitions, you can import them in batches. There is no limit to the total number of partitions in the lakehouse, but there is a limit for a single task. If your data volume is small, it is recommended not to set the cluster key and partition key. It is suggested that a single partition and cluster key be in the range of hundreds of MB to GB. For example, a parquet format file compressed to 128MB.
Example
Below are some examples of partition syntax:
Iceberg Syntax to Create Partitions