approx_top_k
Description
This function returns the k most frequent items in the specified column (col) together with their approximate counts. It is implemented with a probabilistic data structure, so the counts may contain some error, but in most cases the result reflects the distribution of the data well.
Parameter Description
col
: The input column, which can be of numeric type, string type, or nested type.
k
: The number of most frequent items to return. Must be an integer greater than 0.
maxItemsTracked
: Optional parameter that specifies the maximum number of items to track. The default value is 10000. If the specified maxItemsTracked is greater than or equal to k, the specified value is used; otherwise, k is used as the maximum number of items to track.
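The rule for resolving the tracking limit can be sketched in a few lines of Python. This is only an illustration of the parameter description; the helper name effective_max_items_tracked and the constant are hypothetical and are not part of the function's API.

```python
DEFAULT_MAX_ITEMS_TRACKED = 10000  # default stated in the parameter description


def effective_max_items_tracked(k, max_items_tracked=DEFAULT_MAX_ITEMS_TRACKED):
    """Hypothetical helper: resolve the tracking limit as the text describes."""
    if not isinstance(k, int) or isinstance(k, bool) or k <= 0:
        raise ValueError("k must be an integer greater than 0")
    # A specified limit is used only when it is at least k; otherwise k is used.
    return max_items_tracked if max_items_tracked >= k else k


print(effective_max_items_tracked(5))       # 10000 (default applies)
print(effective_max_items_tracked(5, 100))  # 100   (specified value >= k)
print(effective_max_items_tracked(50, 20))  # 50    (specified value < k, so k wins)
```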
Return Result
The function returns a structured array, where each element is a struct containing three fields: item (value of the original input type), count (long integer, representing the approximate number of occurrences of the item), and approximation (boolean, indicating whether the result is an approximation; always true). The array is sorted in descending order by the count field.
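To make the result shape concrete, the following Python sketch models one returned struct and the ordering of the array. The names TopKEntry and shape_result are hypothetical stand-ins, and the input counts are made-up illustration values rather than real function output.

```python
from typing import Any, Dict, List, NamedTuple


class TopKEntry(NamedTuple):
    """Hypothetical Python stand-in for one struct in the returned array."""
    item: Any            # value of the original input type
    count: int           # approximate number of occurrences
    approximation: bool  # always True, per the description of the return value


def shape_result(approx_counts: Dict[Any, int], k: int) -> List[TopKEntry]:
    # Sort by count, highest first, and keep at most k entries.
    # The text does not specify tie order; sorted() is stable, so ties keep
    # the dict's insertion order in this sketch.
    ranked = sorted(approx_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [TopKEntry(item, count, True) for item, count in ranked[:k]]


result = shape_result({"apple": 7, "pear": 2, "fig": 4}, k=2)
for entry in result:
    print(entry.item, entry.count, entry.approximation)
```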
Usage Example
The following examples demonstrate how to use the approx_top_k function to get the most common items in the data and their occurrence counts.
Example 1:
Example 2:
Example 3:
Note:
- Because the approx_top_k function is based on a probabilistic data structure, the returned results may contain some error. In practice, you can adjust the maxItemsTracked parameter to balance accuracy and performance.
- When processing large amounts of data, you can reduce the value of maxItemsTracked to improve performance. Note, however, that this may reduce the accuracy of the results.