Partitioning

Partitioning is a method of grouping similar rows together at the time of writing to speed up queries. Using partitioning can achieve data pruning and optimize queries. When querying a table, specify the partition to be queried through the WHERE clause to avoid full table scans, improve processing efficiency, and reduce computing resources.

Lakehouse Partitioning

Lakehouse partitioning is similar to Apache Iceberg's hidden partitioning. For ease of understanding, we refer to Apache Iceberg's partitioning concepts to introduce Lakehouse partitioning. Apache Iceberg is an open-source table format that supports two partitioning methods: identity partitioning and transform partitioning.

  • Identity Partitioning: Uses one or more columns of the table as partition keys, dividing the data into different partitions based on the column values. Partition values are generated by obtaining column values and optionally transforming them.
  • Transform Partitioning: Uses one or more columns of the table after applying a transformation function as partition keys, dividing the data into different partitions based on the transformed values. For example, if a table has a timestamp column, the years(timestamp) function can be used to partition the data by the number of years since 1970.

Partition information and data file paths are stored in metadata. The advantage of this approach is that partitioning strategies can be modified without affecting the data, and partition information can be hidden, eliminating the need to specify partition conditions when writing SQL. Additionally, there is no limit to the number of partitions since the data itself is the partition.

Lakehouse is compatible with some of Apache Iceberg's transform partitioning functions. However, Apache Iceberg's year, month, day, and hour transform partitioning functions conflict with common date function names in databases. Therefore, in Lakehouse, the transform partitioning functions are named years, months, days, and hours. Below are the partition transform functions supported by Lakehouse:

Partition Function NameDescriptionSource TypeResult Type
bucket(numbucket,colNmae)Hash value of the value, modulo Nint, long, decimal, date, timestamp_ltz, string binaryint
truncate(colName,W)Truncate value to width Wint, long, decimal, stringSource Type
yearsExtract the year from date or timestamp, based on 1970date, timestamp, timestamptzint
monthsExtract the month from date or timestamp, based on 1970-01-01date, timestamp, timestamptzint
daysExtract the day from date or timestamp, based on 1970-01-01date, timestamp, timestamptzint
hoursExtract the hour from timestamp, based on 1970-01-01 00:00:00timestamp, timestamptzint

Show Partitions

SHOW PARTITIONS

Lakehouse Partitioning Syntax

Introduction

To accommodate users who are accustomed to using Hive syntax, Lakehouse has implemented some syntactic sugar at the syntax level. It is particularly important to note that Apache Hive requires partition columns to be in the last position, but Apache Iceberg does not have this requirement. Therefore, when adding columns, be especially careful as adding columns in the original Hive way may place the column in the last position. Hence, the position must be specified when adding columns. For example, the syntax for creating partitions in Apache Iceberg is as follows:

CREATE TABLE prod.db.sample (
id bigint,
category string,
data string,
)
PARTITIONED BY(category)

The syntax of hive must be like this

CREATE TABLE prod.db.sample (
id bigint,
data string,
category string)
PARTITIONED BY(category string)

lakehouse is compatible with both iceberg syntax and hive syntax, with partition fields and types written in the PARTITIONED BY statement.

Supported Syntax

lakehouse is compatible with Hive's partition syntax and supports the following partition operations:

  • Insert into…partition: Insert data into one or more partitions of the table. If the partition does not exist, it will be automatically created because the partition is the data. The partition clause can be omitted, and the data will be automatically mapped according to the order of the columns, requiring that the partition columns and the table positions should be mapped consistently. In hive, you must specify the partition clause.

  • Insert overwrite…partition: Overwrite data in one or more partitions of the table. If the partition does not exist, it will be automatically created. The partition clause can be omitted, and the data will be automatically mapped according to the order of the columns, requiring that the partition columns and the table positions should be mapped consistently. The overwrite behavior follows Hive's principles: if a partition is specified and it is a static value, it is a static overwrite, meaning only the specified partition is overwritten; if a partition is specified and it is a dynamic value, or if no partition is specified, it is a dynamic overwrite, meaning all matching partitions are overwritten. If the partition in the table is a transformed partition, the partition clause does not support specifying transformation functions, and you can directly use insert overwrite table to insert data.

  •   INSERT OVERWRITE [TABLE] table_name 
      --partition keyword is optional
      [ PARTITION partition_spec] 
      [ column_list ]
      VALUES(value [,...])| select_statement
      --partition_spec is mandatory if the PARTITION keyword is specified
      partition_spec ::=
          partition_col_name = partition_col_val [ , ... ] | partition_col_name
  • Truncate partition: Clear the data in one or more partitions of the table. However, in Hive, the partition values will be retained, while in Lakehouse, they will not be retained because the partition in Lakehouse is the data. Sure, here is the translated content:

    • TRUNCATE [TABLE] table_name [PARTITION partition_spec];
      --partition_spec explanation
      partition_spec::=
  • Drop partition: Delete one or more partitions of the table.

    • ALTER TABLE table_name DROP [IF EXISTS] PARTITION 
      partition_spec[, PARTITION partition_spec, ...]
  • Rename partition: Currently, you can directly update the data.

    •  update sales set order_date='2023-02-02' where order_date= '2023-02-01';
  • Show partition: It is recommended to use SQL, select distinct for statistics, such as select distinct pt from table_pt;

Supported Partition Data Types

Data TypeSupported
TINYINTYes
SMALLINTYes
INTYes
BIGINTYes
STRINGYes
CHAR(n)Yes
VARCHAR(n)Yes
BOOLEANYes
BINARYNo
FLOATNo
DOUBLENo
DECIMAL(precision,scale)No
TIMESTAMP_LTZNo
INTERVAL (interval type)No
ARRAYNo
MAPNo
STRUCTNo

Write Partition

  • Insert into…partition: Insert data into one or more partitions of the table. If the partition does not exist, it will be automatically created because the partition is the data. The partition clause can be omitted, and the data will be automatically mapped according to the order of the columns. The required partition columns and the table positions should be mapped consistently. In Hive, you must specify the partition clause.

  • Insert overwrite…partition: Overwrite data in one or more partitions of the table. If the partition does not exist, it will be automatically created. The partition clause can be omitted, and the data will be automatically mapped according to the order of the columns. The required partition columns and the table positions should be mapped consistently. The overwrite behavior follows Hive's principles. If a partition is specified and it is a static value, it is a static overwrite, meaning only the specified partition is overwritten. If a partition is specified and it is a dynamic value, or if no partition is specified, it is a dynamic overwrite, meaning all matching partitions are overwritten. If the partition in the table is a transformed partition, the partition clause does not support specifying transformation functions. You can directly use insert overwrite table to insert data.

  •   INSERT OVERWRITE [TABLE] table_name 
      --partition keyword is optional
      [ PARTITION partition_spec] 
      [ column_list ]
      VALUES(value [,...])| select_statement
      --partition_spec is mandatory if the PARTITION keyword is specified
      partition_spec ::=
          partition_col_name = partition_col_val [ , ... ] | partition_col_name

Note: When writing partitions, a single task is currently limited to 2048 partitions. Exceeding this limit will result in an error: The count of dynamic partitions exceeds the maximum number 2048. It is recommended to count the number of partitions before inserting, for example: select count(distinct pt) from table. If you indeed have this many partitions, you can import them in batches. There is no limit to the total number of partitions in the lakehouse, but there is a limit for a single task. If your data volume is small, it is recommended not to set the cluster key and partition key. It is suggested that a single partition and cluster key be in the range of hundreds of MB to GB. For example, a parquet format file compressed to 128MB.

Example

Below are some examples of partition syntax:

-- Create a table partitioned by year and month
CREATE TABLE sales (
  order_id INT,
  customer_id INT,
  amount DOUBLE
) 
PARTITIONED BY (order_date string);

-- Insert data into the table without specifying a partition
INSERT INTO sales VALUES
(1, 101, 100.0, '2023-01-01'),
(2, 102, 200.0, '2023-01-02'),
(3, 103, 300.0, '2023-02-01'),
(4, 104, 400.0, '2023-02-02');

-- Insert data into the table, specifying a partition
INSERT INTO sales PARTITION (order_date='2023-03-01') VALUES
(5, 105, 500.0),
(6, 106, 600.0);
+----------+-------------+--------+------------+
| order_id | customer_id | amount | order_date |
+----------+-------------+--------+------------+
| 1        | 101         | 100.0  | 2023-01-01 |
| 5        | 105         | 500.0  | 2023-03-01 |
| 6        | 106         | 600.0  | 2023-03-01 |
| 2        | 102         | 200.0  | 2023-01-02 |
| 3        | 103         | 300.0  | 2023-02-01 |
| 4        | 104         | 400.0  | 2023-02-02 |
+----------+-------------+--------+------------+
-- Overwrite data in the table, specifying a partition
INSERT OVERWRITE sales PARTITION (order_date='2023-03-01') VALUES
(7, 107, 700.0),
(8, 108, 800.0);
+----------+-------------+--------+------------+
| order_id | customer_id | amount | order_date |
+----------+-------------+--------+------------+
| 1        | 101         | 100.0  | 2023-01-01 |
| 2        | 102         | 200.0  | 2023-01-02 |
| 7        | 107         | 700.0  | 2023-03-01 |
| 8        | 108         | 800.0  | 2023-03-01 |
| 3        | 103         | 300.0  | 2023-02-01 |
| 4        | 104         | 400.0  | 2023-02-02 |
+----------+-------------+--------+------------+
-- Overwrite data in the table without specifying a partition
INSERT OVERWRITE sales VALUES
(9, 109, 900.0, '2023-04-01'),
(10, 110, 1000.0, '2023-04-02');
+----------+-------------+--------+------------+
| order_id | customer_id | amount | order_date |
+----------+-------------+--------+------------+
| 1        | 101         | 100.0  | 2023-01-01 |
| 9        | 109         | 900.0  | 2023-04-01 |
| 2        | 102         | 200.0  | 2023-01-02 |
| 7        | 107         | 700.0  | 2023-03-01 |
| 8        | 108         | 800.0  | 2023-03-01 |
| 3        | 103         | 300.0  | 2023-02-01 |
| 4        | 104         | 400.0  | 2023-02-02 |
| 10       | 110         | 1000.0 | 2023-04-02 |
+----------+-------------+--------+------------+

-- Rename partition, modify partition value
update sales set order_date='2023-02-02' where order_date= '2023-02-01';
+----------+-------------+--------+------------+
| order_id | customer_id | amount | order_date |
+----------+-------------+--------+------------+
| 1        | 101         | 100.0  | 2023-01-01 |
| 2        | 102         | 200.0  | 2023-01-02 |
| 4        | 104         | 400.0  | 2023-02-02 |
| 3        | 103         | 300.0  | 2023-02-02 |
| 7        | 107         | 700.0  | 2023-03-01 |
| 8        | 108         | 800.0  | 2023-03-01 |
| 9        | 109         | 900.0  | 2023-04-01 |
| 10       | 110         | 1000.0 | 2023-04-02 |
+----------+-------------+--------+------------+
-- Truncate data in the partition
TRUNCATE TABLE sales PARTITION (order_date='2023-03-01');
+----------+-------------+--------+------------+
| order_id | customer_id | amount | order_date |
+----------+-------------+--------+------------+
| 1        | 101         | 100.0  | 2023-01-01 |
| 2        | 102         | 200.0  | 2023-01-02 |
| 4        | 104         | 400.0  | 2023-02-02 |
| 3        | 103         | 300.0  | 2023-02-02 |
| 9        | 109         | 900.0  | 2023-04-01 |
| 10       | 110         | 1000.0 | 2023-04-02 |
+----------+-------------+--------+------------+
-- Delete partition
ALTER TABLE sales DROP PARTITION (order_date='2023-02-02');
+----------+-------------+--------+------------+
| order_id | customer_id | amount | order_date |
+----------+-------------+--------+------------+
| 1        | 101         | 100.0  | 2023-01-01 |
| 2        | 102         | 200.0  | 2023-01-02 |
| 9        | 109         | 900.0  | 2023-04-01 |
| 10       | 110         | 1000.0 | 2023-04-02 |
+----------+-------------+--------+------------+
-- Add column, specify before partition column
ALTER TABLE sales add column col1  string after amount;
+-------------------------+-----------+---------+
|       column_name       | data_type | comment |
+-------------------------+-----------+---------+
| order_id                | int       |         |
| customer_id             | int       |         |
| amount                  | double    |         |
| col1                    | string    |         |
| order_date              | string    |         |
| # Partition Information |           |         |
| # col_name              | data_type | comment |
| order_date              | string    |         |
+-------------------------+-----------+---------+

Iceberg Syntax to Create Partitions

-- Create a table partitioned by year and month
CREATE TABLE sales_ice (
  order_id INT,
  customer_id INT,
  order_date string,
  amount DOUBLE
) 
PARTITIONED BY (order_date);

-- Insert data into the table without specifying the partition, corresponding in order
INSERT INTO sales_ice VALUES
(1, 101, '2023-01-01',100.0),
(2, 102, '2023-01-02',200.0 ),
(3, 103, '2023-02-01',300.0 ),
(4, 104, '2023-02-02',400.0);

-- Insert data into the table, specifying the partition
INSERT INTO sales_ice PARTITION (order_date='2023-03-01') VALUES
(5, 105, 500.0),
(6, 106, 600.0);
+----------+-------------+------------+--------+
| order_id | customer_id | order_date | amount |
+----------+-------------+------------+--------+
| 1        | 101         | 2023-01-01 | 100.0  |
| 2        | 102         | 2023-01-02 | 200.0  |
| 3        | 103         | 2023-02-01 | 300.0  |
| 4        | 104         | 2023-02-02 | 400.0  |
| 5        | 105         | 2023-03-01 | 500.0  |
| 6        | 106         | 2023-03-01 | 600.0  |
+----------+-------------+------------+--------+
-- Overwrite data in the table, specifying the partition
INSERT OVERWRITE sales_ice PARTITION (order_date='2023-03-01') VALUES
(7, 107, 700.0),
(8, 108, 800.0);
+----------+-------------+------------+--------+
| order_id | customer_id | order_date | amount |
+----------+-------------+------------+--------+
| 1        | 101         | 2023-01-01 | 100.0  |
| 2        | 102         | 2023-01-02 | 200.0  |
| 3        | 103         | 2023-02-01 | 300.0  |
| 4        | 104         | 2023-02-02 | 400.0  |
| 7        | 107         | 2023-03-01 | 700.0  |
| 8        | 108         | 2023-03-01 | 800.0  |
+----------+-------------+------------+--------+
-- Dynamic partitioning
INSERT OVERWRITE sales_ice PARTITION (order_date) VALUES
(11, 111,'2023-03-01', 700.0),
(12, 112,'2023-03-01', 800.0);
+----------+-------------+------------+--------+
| order_id | customer_id | order_date | amount |
+----------+-------------+------------+--------+
| 1        | 101         | 2023-01-01 | 100.0  |
| 2        | 102         | 2023-01-02 | 200.0  |
| 3        | 103         | 2023-02-01 | 300.0  |
| 4        | 104         | 2023-02-02 | 400.0  |
| 11       | 111         | 2023-03-01 | 700.0  |
| 12       | 112         | 2023-03-01 | 800.0  |
+----------+-------------+------------+--------+
-- Overwrite data in the table without specifying the partition, mapping according to the data
INSERT OVERWRITE sales_ice VALUES
(9, 109,'2023-04-01', 900.0 ),
(10, 110, '2023-04-02',1000.0 );
+----------+-------------+------------+--------+
| order_id | customer_id | order_date | amount |
+----------+-------------+------------+--------+
| 1        | 101         | 2023-01-01 | 100.0  |
| 2        | 102         | 2023-01-02 | 200.0  |
| 3        | 103         | 2023-02-01 | 300.0  |
| 4        | 104         | 2023-02-02 | 400.0  |
| 11       | 111         | 2023-03-01 | 700.0  |
| 12       | 112         | 2023-03-01 | 800.0  |
| 9        | 109         | 2023-04-01 | 900.0  |
| 10       | 110         | 2023-04-02 | 1000.0 |
+----------+-------------+------------+--------+