COLLECT_SET

collect_set([distinct] expr)

Description

The collect_set function is used to collect and return a set of unique elements from a given set of input data. This function ensures that the returned array does not contain duplicate elements, making the result more concise and clear. When the distinct keyword is specified, the function will perform deduplication on the result. However, note that even without specifying distinct, the collect_set function will still perform deduplication, so the distinct keyword does not affect the result in this scenario.

Parameter Description

  • expr: Can be an expression of any data type.

Return Result

  • Returns an array, where the element type in the array is the same as the input parameter type.
  • Using the distinct keyword does not affect the deduplication result.
  • The order of elements in the returned array is not guaranteed to be the same as the input order.
  • Input null values will not be included in the result array.

Example

Example 1: Basic Usage

SELECT collect_set(col) FROM VALUES (1), (2), (1), (null) AS tab(col);
+------------------+
| collect_set(col) |
+------------------+
| [2,1]            |
+------------------+

Example 2: Deduplication Operation

SELECT collect_set(distinct col) FROM VALUES ("a"), ("b"), (null), ("c"), ("a") AS tab(col);
+---------------------------+
| collect_set(DISTINCT col) |
+---------------------------+
| ["c","b","a"]             |
+---------------------------+

Notes

  • When handling large amounts of data, please note that the collect_set function may consume a significant amount of memory resources.
  • Since the order of the elements in the returned array is not guaranteed to be the same as the input order, if sorting is needed, please use other functions such as ARRAY_SORT to sort the results.