Clustered Key and Sorted Key
In big data storage and analytics, the way data is organized has a significant impact on query performance and storage efficiency. This article explains the concepts and usage of Clustered Keys and Sorted Keys, with practical examples, to help you optimize how your data is organized.
Clustered Key
The Clustered Key determines how table data is distributed. When you designate one or more columns as the Clustered Key (hash key), Lakehouse hashes the values of those columns and distributes rows into different data buckets. This dispersion helps avoid data skew and hotspots while improving parallel processing.
Criteria for Selecting Clustered Keys
- Choose columns with a wide range of values and few duplicate values as Clustered Keys to achieve uniform data distribution.
- When performing JOIN operations, if the join key matches the Clustered Key, performance can improve significantly.
- Clustered Keys suit scenarios with large data volumes; the recommended data bucket size is about 128MB to 1GB, depending on the data compression ratio and access pattern.
- If no Clustered Key is specified, 256 buckets are used by default.
Notes
- Avoid configuring buckets that are too small or too numerous, which generates a large number of small files and degrades metadata management and I/O efficiency.
- Too many small files will lead to poor data locality, increased task scheduling overhead, and reduced processing efficiency.
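To illustrate the JOIN guideline above, the sketch below creates two tables clustered on the same join key. The syntax follows the common Hive-style `CLUSTERED BY ... INTO n BUCKETS` form, which Lakehouse's DDL resembles; the exact keywords, table names, and columns here are illustrative and may differ in your environment.

```sql
-- Both tables are bucketed on user_id with the same bucket count,
-- so a join on user_id can proceed bucket-by-bucket without a full shuffle.
CREATE TABLE orders (
    order_id BIGINT,
    user_id  BIGINT,
    amount   DECIMAL(10, 2)
) CLUSTERED BY (user_id) INTO 64 BUCKETS;

CREATE TABLE users (
    user_id   BIGINT,
    user_name STRING
) CLUSTERED BY (user_id) INTO 64 BUCKETS;

-- The join key matches the Clustered Key on both sides.
SELECT o.order_id, u.user_name, o.amount
FROM orders o
JOIN users u ON o.user_id = u.user_id;
```

Keeping the bucket count identical on both sides is what lets matching keys land in corresponding buckets.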
Sorted Key
The Sorted Key controls how data is sorted within each file. For queries that sort or filter on the Sorted Key columns, storing the data pre-sorted can improve performance.
Notes on Using Sorted Keys
- You can specify ascending (ASC) or descending (DESC) order for the Sorted Key.
- Although Sorted Keys can improve query performance, sorting a large amount of data during insertion may consume more resources.
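A Sorted Key is typically declared together with a Clustered Key. The sketch below (Hive-style syntax; the table, columns, and bucket count are assumptions for illustration) shows a query that can benefit from the stored sort order:

```sql
-- Data within each bucket is stored in descending event_time order.
CREATE TABLE event_log (
    event_id   BIGINT,
    device_id  STRING,
    event_time TIMESTAMP
) CLUSTERED BY (device_id) SORTED BY (event_time DESC) INTO 32 BUCKETS;

-- Because rows are already stored in event_time order, this query can
-- avoid a full sort and skip files outside the requested time range.
SELECT event_id, event_time
FROM event_log
WHERE device_id = 'dev-42'
ORDER BY event_time DESC
LIMIT 100;
```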
Practical Application Examples
Example 1: Create a Table and Specify Clustered Key and Sorted Key
In this example, a table named sales_data is created, and the data is distributed into 50 buckets based on the hash value of the product_id column. Within each bucket, the data is sorted in descending order by the sale_date column.
Example 2: Optimizing Query Performance in a Data Warehouse
Suppose you are working with a data warehouse containing a large number of transaction records. You can optimize query performance in the following way:
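One possible design is sketched below (Hive-style syntax; the bucket count and the columns other than customer_id and transaction_date are assumptions for illustration):

```sql
CREATE TABLE transaction_records (
    transaction_id   BIGINT,
    customer_id      BIGINT,
    transaction_date DATE,
    amount           DECIMAL(12, 2)
) CLUSTERED BY (customer_id)
  SORTED BY (transaction_date)
  INTO 128 BUCKETS;

-- A per-customer lookup benefits twice: bucketing narrows the scan to the
-- customer's bucket, and the per-bucket sort order avoids a full sort.
SELECT transaction_id, transaction_date, amount
FROM transaction_records
WHERE customer_id = 1001
ORDER BY transaction_date;
```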
In this example, the transaction_records table is bucketed by customer_id, and the data within each bucket is sorted by transaction_date. This design improves the efficiency of querying transaction records by customer.