CUME_DIST
Description
The CUME_DIST function is used to calculate the cumulative distribution ratio of the current row within the specified partition. Specifically, it returns a double type value representing the proportion of the current row and all preceding rows (within the partition) to the total number of rows in the partition. This function is commonly used in data analysis to understand the distribution of data across different groups.
Usage Instructions
- The PARTITION BY clause is used to divide the data into different partitions. If the PARTITION BY clause is not specified, the entire dataset will be considered as one partition.
- The ORDER BY clause is used to specify how the data should be sorted within each partition. The result of the CUME_DIST function depends on the sorting order of the ORDER BY clause.
- If there are identical values in the ORDER BY column, the CUME_DIST function will return the same result for these rows, which is the row_number() of the last row divided by the number of rows in the window.
Return Results
- The return value type is double.
- The result is equal to
last_peer_row_number / partition_row_count
, where last_peer_row_number represents the maximum row_number() value of the current row and all preceding rows (within the partition), and partition_row_count represents the total number of rows in the partition.
Example
Results:
In this example, we can see:
- When a = 'A', there are two rows (b = 1 and b = 2), their cume_dist are 0.67 (2/3) and 1 (3/3) respectively.
- When a = 'B', there is only one row (b = 3), its cume_dist is 1 (1/1).
More Examples
- Calculate the cumulative distribution of employee salaries within each department:
- 计算每个类别内商品销售额的累计分布:
- Calculate the cumulative distribution of the number of users in each age group:
Through these examples, you can better understand the application of the CUME_DIST function in different scenarios.