Clustered Key and Sorted Key
In big data storage and analytics, the way data is organized has a significant impact on query performance and storage efficiency. This article explains the concepts and usage of Clustered Keys and Sorted Keys, with practical examples, to help you optimize how your data is organized.
Clustered Key
The Clustered Key determines how table data is distributed. When you designate one or more columns as the Clustered Key (hash key), Lakehouse hashes the values of those columns and distributes rows into different data buckets. This dispersion helps avoid data skew and hotspots while improving parallel processing.
Criteria for Selecting Clustered Keys
- Choose columns with a wide range of values and few duplicate values as Clustered Keys to achieve uniform data distribution.
- When performing JOIN operations, if the join key matches the Clustered Key, performance can improve significantly.
- Clustered Keys suit scenarios with large data volumes; the recommended data bucket size is about 128MB to 1GB, depending on the data compression ratio and access pattern.
- If no Clustered Key is specified, 256 buckets are used by default.
Notes
- Avoid configuring buckets that are too small or too numerous, which generates a large number of small files and degrades metadata management and I/O efficiency.
- Too many small files will lead to poor data locality, increased task scheduling overhead, and reduced processing efficiency.
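To illustrate the JOIN guideline above, the sketch below creates two tables clustered on the same join key. The syntax follows the common Hive-style `CLUSTERED BY ... INTO n BUCKETS` form, which Lakehouse's DDL resembles; the exact keywords, table names, and columns here are illustrative and may differ in your environment.

```sql
-- Both tables are bucketed on user_id with the same bucket count,
-- so a join on user_id can proceed bucket-by-bucket without a full shuffle.
CREATE TABLE orders (
    order_id BIGINT,
    user_id  BIGINT,
    amount   DECIMAL(10, 2)
) CLUSTERED BY (user_id) INTO 64 BUCKETS;

CREATE TABLE users (
    user_id   BIGINT,
    user_name STRING
) CLUSTERED BY (user_id) INTO 64 BUCKETS;

-- The join key matches the Clustered Key on both sides.
SELECT o.order_id, u.user_name, o.amount
FROM orders o
JOIN users u ON o.user_id = u.user_id;
```

Keeping the bucket count identical on both sides is what lets matching keys land in corresponding buckets.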
Sorted Key
The Sorted Key controls how data is sorted within each file. For queries that sort or filter on the Sorted Key columns, storing the data pre-sorted can improve performance.
Notes on Using Sorted Keys
- You can specify ascending (ASC) or descending (DESC) order for the Sorted Key.
- Although Sorted Keys can improve query performance, sorting a large amount of data during insertion may consume more resources.
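A Sorted Key is typically declared together with a Clustered Key. The sketch below (Hive-style syntax; the table, columns, and bucket count are assumptions for illustration) shows a query that can benefit from the stored sort order:

```sql
-- Data within each bucket is stored in descending event_time order.
CREATE TABLE event_log (
    event_id   BIGINT,
    device_id  STRING,
    event_time TIMESTAMP
) CLUSTERED BY (device_id) SORTED BY (event_time DESC) INTO 32 BUCKETS;

-- Because rows are already stored in event_time order, this query can
-- avoid a full sort and skip files outside the requested time range.
SELECT event_id, event_time
FROM event_log
WHERE device_id = 'dev-42'
ORDER BY event_time DESC
LIMIT 100;
```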
Practical Application Examples
Example 1: Create a Table and Specify Clustered Key and Sorted Key
In this example, a table named sales_data is created, and the data is distributed into 50 buckets based on the hash value of the product_id column. Within each bucket, the data is sorted in descending order by the sale_date column.
Example 2: Optimizing Query Performance in a Data Warehouse
Suppose you are working with a data warehouse containing a large number of transaction records. You can optimize query performance in the following way:
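One possible design is sketched below (Hive-style syntax; the bucket count and the columns other than customer_id and transaction_date are assumptions for illustration):

```sql
CREATE TABLE transaction_records (
    transaction_id   BIGINT,
    customer_id      BIGINT,
    transaction_date DATE,
    amount           DECIMAL(12, 2)
) CLUSTERED BY (customer_id)
  SORTED BY (transaction_date)
  INTO 128 BUCKETS;

-- A per-customer lookup benefits twice: bucketing narrows the scan to the
-- customer's bucket, and the per-bucket sort order avoids a full sort.
SELECT transaction_id, transaction_date, amount
FROM transaction_records
WHERE customer_id = 1001
ORDER BY transaction_date;
```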
In this example, the transaction_records table is bucketed by customer_id, and the data within each bucket is sorted by transaction_date. This design improves the efficiency of querying transaction records by customer.