Partitioning

Partitioning is a method of grouping similar rows together at the time of writing to speed up queries. Using partitioning can achieve data pruning and optimize queries. When querying a table, specify the partition to be queried through the WHERE clause to avoid full table scans, improve processing efficiency, and reduce computing resources.

Lakehouse Partitioning

Lakehouse partitioning is similar to Apache Iceberg's hidden partitioning. For ease of understanding, we refer to Apache Iceberg's partitioning concepts to introduce Lakehouse partitioning. Apache Iceberg is an open-source table format that supports two partitioning methods: identity partitioning and transform partitioning.

Identity Partitioning: Uses one or more columns of the table as partition keys, dividing the data into different partitions based on the column values. Partition values are generated by obtaining column values and optionally transforming them.
Transform Partitioning: Uses one or more columns of the table after applying a transformation function as partition keys, dividing the data into different partitions based on the transformed values. For example, if a table has a timestamp column, the years(timestamp) function can be used to partition the data by the number of years since 1970.

Partition information and data file paths are stored in metadata. The advantage of this approach is that partitioning strategies can be modified without affecting the data, and partition information can be hidden, eliminating the need to specify partition conditions when writing SQL. Additionally, there is no limit to the number of partitions since the data itself is the partition.

Lakehouse is compatible with some of Apache Iceberg's transform partitioning functions. However, Apache Iceberg's year, month, day, and hour transform partitioning functions conflict with common date function names in databases. Therefore, in Lakehouse, the transform partitioning functions are named years, months, days, and hours. Below are the partition transform functions supported by Lakehouse:

Partition Function Name	Description	Source Type	Result Type
bucket(numbucket,colNmae)	Hash value of the value, modulo N	int, long, decimal, date, timestamp_ltz, string binary	int
truncate(colName,W)	Truncate value to width W	int, long, decimal, string	Source Type
years	Extract the year from date or timestamp, based on 1970	date, timestamp, timestamptz	int
months	Extract the month from date or timestamp, based on 1970-01-01	date, timestamp, timestamptz	int
days	Extract the day from date or timestamp, based on 1970-01-01	date, timestamp, timestamptz	int
hours	Extract the hour from timestamp, based on 1970-01-01 00:00:00	timestamp, timestamptz	int

Show Partitions

SHOW PARTITIONS

Lakehouse Partitioning Syntax

Introduction

To accommodate users who are accustomed to using Hive syntax, Lakehouse has implemented some syntactic sugar at the syntax level. It is particularly important to note that Apache Hive requires partition columns to be in the last position, but Apache Iceberg does not have this requirement. Therefore, when adding columns, be especially careful as adding columns in the original Hive way may place the column in the last position. Hence, the position must be specified when adding columns. For example, the syntax for creating partitions in Apache Iceberg is as follows:


CREATE TABLE prod.db.sample (
id bigint,
category string，
data string,
)
PARTITIONED BY(category)

The syntax of hive must be like this


CREATE TABLE prod.db.sample (
id bigint,
data string,
category string)
PARTITIONED BY(category string)

lakehouse is compatible with both iceberg syntax and hive syntax, with partition fields and types written in the PARTITIONED BY statement.

Supported Syntax

lakehouse is compatible with Hive's partition syntax and supports the following partition operations:

Insert into…partition: Insert data into one or more partitions of the table. If the partition does not exist, it will be automatically created because the partition is the data. The partition clause can be omitted, and the data will be automatically mapped according to the order of the columns, requiring that the partition columns and the table positions should be mapped consistently. In hive, you must specify the partition clause.
Insert overwrite…partition: Overwrite data in one or more partitions of the table. If the partition does not exist, it will be automatically created. The partition clause can be omitted, and the data will be automatically mapped according to the order of the columns, requiring that the partition columns and the table positions should be mapped consistently. The overwrite behavior follows Hive's principles: if a partition is specified and it is a static value, it is a static overwrite, meaning only the specified partition is overwritten; if a partition is specified and it is a dynamic value, or if no partition is specified, it is a dynamic overwrite, meaning all matching partitions are overwritten. If the partition in the table is a transformed partition, the partition clause does not support specifying transformation functions, and you can directly use insert overwrite table to insert data.


  INSERT OVERWRITE [TABLE] table_name 
  --partition keyword is optional
  [ PARTITION partition_spec] 
  [ column_list ]
  VALUES(value [,...])| select_statement
  --partition_spec is mandatory if the PARTITION keyword is specified
  partition_spec ::=
      partition_col_name = partition_col_val [ , ... ] | partition_col_name

Truncate partition: Clear the data in one or more partitions of the table. However, in Hive, the partition values will be retained, while in Lakehouse, they will not be retained because the partition in Lakehouse is the data. Sure, here is the translated content:


TRUNCATE [TABLE] table_name [PARTITION partition_spec];
--partition_spec explanation
partition_spec::=

Drop partition: Delete one or more partitions of the table.


ALTER TABLE table_name DROP [IF EXISTS] PARTITION 
partition_spec[, PARTITION partition_spec, ...]

Rename partition: Currently, you can directly update the data.


 update sales set order_date='2023-02-02' where order_date= '2023-02-01';

Show partition: It is recommended to use SQL, select distinct for statistics, such as select distinct pt from table_pt;

Supported Partition Data Types

Data Type	Supported
TINYINT	Yes
SMALLINT	Yes
INT	Yes
BIGINT	Yes
STRING	Yes
CHAR(n)	Yes
VARCHAR(n)	Yes
BOOLEAN	Yes
BINARY	No
FLOAT	No
DOUBLE	No
DECIMAL(precision,scale)	No
TIMESTAMP_LTZ	No
INTERVAL (interval type)	No
ARRAY	No
MAP	No
STRUCT	No

Write Partition

Insert into…partition: Insert data into one or more partitions of the table. If the partition does not exist, it will be automatically created because the partition is the data. The partition clause can be omitted, and the data will be automatically mapped according to the order of the columns. The required partition columns and the table positions should be mapped consistently. In Hive, you must specify the partition clause.
Insert overwrite…partition: Overwrite data in one or more partitions of the table. If the partition does not exist, it will be automatically created. The partition clause can be omitted, and the data will be automatically mapped according to the order of the columns. The required partition columns and the table positions should be mapped consistently. The overwrite behavior follows Hive's principles. If a partition is specified and it is a static value, it is a static overwrite, meaning only the specified partition is overwritten. If a partition is specified and it is a dynamic value, or if no partition is specified, it is a dynamic overwrite, meaning all matching partitions are overwritten. If the partition in the table is a transformed partition, the partition clause does not support specifying transformation functions. You can directly use insert overwrite table to insert data.


  INSERT OVERWRITE [TABLE] table_name 
  --partition keyword is optional
  [ PARTITION partition_spec] 
  [ column_list ]
  VALUES(value [,...])| select_statement
  --partition_spec is mandatory if the PARTITION keyword is specified
  partition_spec ::=
      partition_col_name = partition_col_val [ , ... ] | partition_col_name

Note: When writing partitions, a single task is currently limited to 2048 partitions. Exceeding this limit will result in an error: The count of dynamic partitions exceeds the maximum number 2048. It is recommended to count the number of partitions before inserting, for example: select count(distinct pt) from table. If you indeed have this many partitions, you can import them in batches. There is no limit to the total number of partitions in the lakehouse, but there is a limit for a single task. If your data volume is small, it is recommended not to set the cluster key and partition key. It is suggested that a single partition and cluster key be in the range of hundreds of MB to GB. For example, a parquet format file compressed to 128MB.

Example

Below are some examples of partition syntax:


-- Create a table partitioned by year and month
CREATE TABLE sales (
  order_id INT,
  customer_id INT,
  amount DOUBLE
) 
PARTITIONED BY (order_date string);

-- Insert data into the table without specifying a partition
INSERT INTO sales VALUES
(1, 101, 100.0, '2023-01-01'),
(2, 102, 200.0, '2023-01-02'),
(3, 103, 300.0, '2023-02-01'),
(4, 104, 400.0, '2023-02-02');

-- Insert data into the table, specifying a partition
INSERT INTO sales PARTITION (order_date='2023-03-01') VALUES
(5, 105, 500.0),
(6, 106, 600.0);
+----------+-------------+--------+------------+
| order_id | customer_id | amount | order_date |
+----------+-------------+--------+------------+
| 1        | 101         | 100.0  | 2023-01-01 |
| 5        | 105         | 500.0  | 2023-03-01 |
| 6        | 106         | 600.0  | 2023-03-01 |
| 2        | 102         | 200.0  | 2023-01-02 |
| 3        | 103         | 300.0  | 2023-02-01 |
| 4        | 104         | 400.0  | 2023-02-02 |
+----------+-------------+--------+------------+
-- Overwrite data in the table, specifying a partition
INSERT OVERWRITE sales PARTITION (order_date='2023-03-01') VALUES
(7, 107, 700.0),
(8, 108, 800.0);
+----------+-------------+--------+------------+
| order_id | customer_id | amount | order_date |
+----------+-------------+--------+------------+
| 1        | 101         | 100.0  | 2023-01-01 |
| 2        | 102         | 200.0  | 2023-01-02 |
| 7        | 107         | 700.0  | 2023-03-01 |
| 8        | 108         | 800.0  | 2023-03-01 |
| 3        | 103         | 300.0  | 2023-02-01 |
| 4        | 104         | 400.0  | 2023-02-02 |
+----------+-------------+--------+------------+
-- Overwrite data in the table without specifying a partition
INSERT OVERWRITE sales VALUES
(9, 109, 900.0, '2023-04-01'),
(10, 110, 1000.0, '2023-04-02');
+----------+-------------+--------+------------+
| order_id | customer_id | amount | order_date |
+----------+-------------+--------+------------+
| 1        | 101         | 100.0  | 2023-01-01 |
| 9        | 109         | 900.0  | 2023-04-01 |
| 2        | 102         | 200.0  | 2023-01-02 |
| 7        | 107         | 700.0  | 2023-03-01 |
| 8        | 108         | 800.0  | 2023-03-01 |
| 3        | 103         | 300.0  | 2023-02-01 |
| 4        | 104         | 400.0  | 2023-02-02 |
| 10       | 110         | 1000.0 | 2023-04-02 |
+----------+-------------+--------+------------+

-- Rename partition, modify partition value
update sales set order_date='2023-02-02' where order_date= '2023-02-01';
+----------+-------------+--------+------------+
| order_id | customer_id | amount | order_date |
+----------+-------------+--------+------------+
| 1        | 101         | 100.0  | 2023-01-01 |
| 2        | 102         | 200.0  | 2023-01-02 |
| 4        | 104         | 400.0  | 2023-02-02 |
| 3        | 103         | 300.0  | 2023-02-02 |
| 7        | 107         | 700.0  | 2023-03-01 |
| 8        | 108         | 800.0  | 2023-03-01 |
| 9        | 109         | 900.0  | 2023-04-01 |
| 10       | 110         | 1000.0 | 2023-04-02 |
+----------+-------------+--------+------------+
-- Truncate data in the partition
TRUNCATE TABLE sales PARTITION (order_date='2023-03-01');
+----------+-------------+--------+------------+
| order_id | customer_id | amount | order_date |
+----------+-------------+--------+------------+
| 1        | 101         | 100.0  | 2023-01-01 |
| 2        | 102         | 200.0  | 2023-01-02 |
| 4        | 104         | 400.0  | 2023-02-02 |
| 3        | 103         | 300.0  | 2023-02-02 |
| 9        | 109         | 900.0  | 2023-04-01 |
| 10       | 110         | 1000.0 | 2023-04-02 |
+----------+-------------+--------+------------+
-- Delete partition
ALTER TABLE sales DROP PARTITION (order_date='2023-02-02');
+----------+-------------+--------+------------+
| order_id | customer_id | amount | order_date |
+----------+-------------+--------+------------+
| 1        | 101         | 100.0  | 2023-01-01 |
| 2        | 102         | 200.0  | 2023-01-02 |
| 9        | 109         | 900.0  | 2023-04-01 |
| 10       | 110         | 1000.0 | 2023-04-02 |
+----------+-------------+--------+------------+
-- Add column, specify before partition column
ALTER TABLE sales add column col1  string after amount;
+-------------------------+-----------+---------+
|       column_name       | data_type | comment |
+-------------------------+-----------+---------+
| order_id                | int       |         |
| customer_id             | int       |         |
| amount                  | double    |         |
| col1                    | string    |         |
| order_date              | string    |         |
| # Partition Information |           |         |
| # col_name              | data_type | comment |
| order_date              | string    |         |
+-------------------------+-----------+---------+

Iceberg Syntax to Create Partitions


-- Create a table partitioned by year and month
CREATE TABLE sales_ice (
  order_id INT,
  customer_id INT,
  order_date string,
  amount DOUBLE
) 
PARTITIONED BY (order_date);

-- Insert data into the table without specifying the partition, corresponding in order
INSERT INTO sales_ice VALUES
(1, 101, '2023-01-01',100.0),
(2, 102, '2023-01-02',200.0 ),
(3, 103, '2023-02-01',300.0 ),
(4, 104, '2023-02-02',400.0);

-- Insert data into the table, specifying the partition
INSERT INTO sales_ice PARTITION (order_date='2023-03-01') VALUES
(5, 105, 500.0),
(6, 106, 600.0);
+----------+-------------+------------+--------+
| order_id | customer_id | order_date | amount |
+----------+-------------+------------+--------+
| 1        | 101         | 2023-01-01 | 100.0  |
| 2        | 102         | 2023-01-02 | 200.0  |
| 3        | 103         | 2023-02-01 | 300.0  |
| 4        | 104         | 2023-02-02 | 400.0  |
| 5        | 105         | 2023-03-01 | 500.0  |
| 6        | 106         | 2023-03-01 | 600.0  |
+----------+-------------+------------+--------+
-- Overwrite data in the table, specifying the partition
INSERT OVERWRITE sales_ice PARTITION (order_date='2023-03-01') VALUES
(7, 107, 700.0),
(8, 108, 800.0);
+----------+-------------+------------+--------+
| order_id | customer_id | order_date | amount |
+----------+-------------+------------+--------+
| 1        | 101         | 2023-01-01 | 100.0  |
| 2        | 102         | 2023-01-02 | 200.0  |
| 3        | 103         | 2023-02-01 | 300.0  |
| 4        | 104         | 2023-02-02 | 400.0  |
| 7        | 107         | 2023-03-01 | 700.0  |
| 8        | 108         | 2023-03-01 | 800.0  |
+----------+-------------+------------+--------+
-- Dynamic partitioning
INSERT OVERWRITE sales_ice PARTITION (order_date) VALUES
(11, 111,'2023-03-01', 700.0),
(12, 112,'2023-03-01', 800.0);
+----------+-------------+------------+--------+
| order_id | customer_id | order_date | amount |
+----------+-------------+------------+--------+
| 1        | 101         | 2023-01-01 | 100.0  |
| 2        | 102         | 2023-01-02 | 200.0  |
| 3        | 103         | 2023-02-01 | 300.0  |
| 4        | 104         | 2023-02-02 | 400.0  |
| 11       | 111         | 2023-03-01 | 700.0  |
| 12       | 112         | 2023-03-01 | 800.0  |
+----------+-------------+------------+--------+
-- Overwrite data in the table without specifying the partition, mapping according to the data
INSERT OVERWRITE sales_ice VALUES
(9, 109,'2023-04-01', 900.0 ),
(10, 110, '2023-04-02',1000.0 );
+----------+-------------+------------+--------+
| order_id | customer_id | order_date | amount |
+----------+-------------+------------+--------+
| 1        | 101         | 2023-01-01 | 100.0  |
| 2        | 102         | 2023-01-02 | 200.0  |
| 3        | 103         | 2023-02-01 | 300.0  |
| 4        | 104         | 2023-02-02 | 400.0  |
| 11       | 111         | 2023-03-01 | 700.0  |
| 12       | 112         | 2023-03-01 | 800.0  |
| 9        | 109         | 2023-04-01 | 900.0  |
| 10       | 110         | 2023-04-02 | 1000.0 |
+----------+-------------+------------+--------+