APPROX_HISTOGRAM

Description

The approx_histogram function is used to generate an approximate histogram based on a given numeric column col and the number of buckets n. This function returns an array where each element is a structure (STRUCT) containing three fields: minimum value (min), maximum value (max), and count. This function is very useful for statistical data distribution, especially when dealing with large amounts of data, as it allows for a quick understanding of the data distribution.

Parameter Description

  • col: The numeric column for which the histogram needs to be generated.
  • n: The number of buckets in the histogram, which must be a positive integer constant.

Return Result

The function returns an array of structures, each containing three fields:

  • min: The minimum value of the histogram bucket.
  • max: The maximum value of the histogram bucket.
  • count: The number of data points falling within the bucket.

Usage Example

The following examples demonstrate how to use the approx_histogram function to generate histograms with different numbers of buckets.

Example 1: Generating a histogram with 2 buckets

SELECT approx_histogram(col, 2) FROM VALUES (0), (1), (2), (3), (4) AS t(col);
+-------------------------------------------------------------------+
|                     approx_histogram(col, 2)                      |
+-------------------------------------------------------------------+
| [{"min":0.0,"max":0.5,"count":1},{"min":0.5,"max":4.0,"count":4}] |
+-------------------------------------------------------------------+

Example 2: Generate a Histogram with 5 Buckets

SELECT approx_histogram(col, 5) FROM VALUES (0), (1), (2), (3), (4) AS t(col);
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                     approx_histogram(col, 5)                                                                      |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [{"min":0.0,"max":0.0,"count":1},{"min":0.0,"max":1.0,"count":1},{"min":1.0,"max":2.0,"count":1},{"min":2.0,"max":3.0,"count":1},{"min":3.0,"max":4.0,"count":2}] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Example 3: Generate a Histogram with 10 Buckets

SELECT approx_histogram(col, 10) FROM VALUES (0),(0.1), (0.2), (0.3), (0.4), (0.5), (0.6), (0.7), (0.8), (0.9) AS t(col);
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                    approx_histogram(col, 10)                             |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [{"min":0.0,"max":0.0,"count":1},{"min":0.0,"max":0.1,"count":1},{"min":0.1,"max":0.2,"count":1},{"min":0.2,"max":0.3,"count":1},{"min":0.3,"max":0.4,"count":1},{"min":0.4,"max":0.5,"count":1},{"min": |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+