Lakehouse Data Grouping and Aggregation Guide

Overview

Data grouping and aggregation is a core operation in data analysis, used to group data by dimensions and compute statistical metrics. Singdata Lakehouse provides comprehensive aggregate function support, including basic aggregation, conditional aggregation, and approximate aggregation. This guide categorizes usage by business scenario to help you quickly master efficient data aggregation methods.


SQL Commands Covered

Command/FunctionPurposeApplicable Scenario
GROUP BYGroup by columnDimensional aggregation analysis
COUNT(*) / COUNT(col)CountCount rows or non-null values
SUM() / AVG() / MIN() / MAX()Numeric aggregationSum, average, min/max
HAVINGFilter grouped resultsSubstitute for WHERE on aggregated data
COUNT(DISTINCT col)Distinct countCount unique values
APPROX_COUNT_DISTINCT()Approximate distinct countFast estimation for large data volumes

Prerequisites

The following examples use a simulated sales table sales:

-- Create test table
CREATE TABLE IF NOT EXISTS sales (
    sale_id INT,
    region STRING,
    product STRING,
    amount DOUBLE,
    quantity INT,
    sale_date DATE
);

-- Insert test data
INSERT INTO sales VALUES
(1, 'East', 'Phone', 5000, 10, '2024-06-01'),
(2, 'East', 'Laptop', 8000, 5, '2024-06-01'),
(3, 'West', 'Phone', 4500, 9, '2024-06-02'),
(4, 'West', 'Tablet', 3000, 8, '2024-06-02'),
(5, 'East', 'Phone', 5000, 10, '2024-06-03'),
(6, 'South', 'Laptop', 8000, 5, '2024-06-03'),
(7, 'South', 'Tablet', 3000, 8, '2024-06-04');

Basic Grouping and Aggregation

Group by dimension and compute aggregate metrics -- the most common data analysis operation.

-- Compute sales and order count by region
SELECT 
    region,
    COUNT(*) as order_count,
    SUM(amount) as total_sales,
    AVG(amount) as avg_sales
FROM sales
GROUP BY region
ORDER BY total_sales DESC;

Result:

regionorder_counttotal_salesavg_sales
East3180006000
West275003750
South2110005500

Filter Aggregation Results

Use the HAVING clause to filter results after grouping, replacing aggregate conditions that cannot be used in WHERE.

-- Query regions with total sales greater than 10000
SELECT 
    region,
    SUM(amount) as total_sales
FROM sales
GROUP BY region
HAVING SUM(amount) > 10000
ORDER BY total_sales DESC;

Result:

regiontotal_sales
East18000
South11000

Difference Between WHERE and HAVING

  • WHERE filters rows before grouping and cannot use aggregate functions.
  • HAVING filters groups after grouping and can use aggregate functions.
-- Correct: filter date first, then group
SELECT region, SUM(amount) as total_sales
FROM sales
WHERE sale_date >= '2024-06-02'
GROUP BY region
HAVING SUM(amount) > 5000;

Conditional Aggregation

Use CASE WHEN with aggregate functions to implement conditional statistics, avoiding multiple table scans.

-- Compute sales for each product type by region
SELECT 
    region,
    SUM(CASE WHEN product = 'Phone' THEN amount ELSE 0 END) as phone_sales,
    SUM(CASE WHEN product = 'Laptop' THEN amount ELSE 0 END) as laptop_sales,
    SUM(CASE WHEN product = 'Tablet' THEN amount ELSE 0 END) as tablet_sales,
    SUM(amount) as total_sales
FROM sales
GROUP BY region
ORDER BY region;

Result:

regionphone_saleslaptop_salestablet_salestotal_sales
East100008000018000
South08000300011000
West4500030007500

Distinct Count

Use COUNT(DISTINCT col) to count unique values.

-- Count distinct product types sold in each region
SELECT 
    region,
    COUNT(DISTINCT product) as unique_products,
    SUM(quantity) as total_quantity
FROM sales
GROUP BY region
ORDER BY unique_products DESC;

Result:

regionunique_productstotal_quantity
East225
South213
West217

Approximate Aggregation

When data volumes are very large, exact distinct counts can be slow. Lakehouse provides approximate aggregate functions that significantly improve performance within an acceptable error range.

-- Use approximate distinct count (suitable for tens of millions of rows and above)
SELECT 
    region,
    APPROX_COUNT_DISTINCT(product) as approx_unique_products,
    COUNT(DISTINCT product) as exact_unique_products
FROM sales
GROUP BY region;

Result:

regionapprox_unique_productsexact_unique_products
East22
South22
West22

Clean Up Test Data

After completing aggregation verification, it is recommended to clean up test tables:

-- Drop test table
DROP TABLE IF EXISTS sales;

Important Notes

  1. GROUP BY Position: Column positions can be used in GROUP BY (e.g., GROUP BY 1), but using column names is recommended for readability.
  2. NULL Value Handling: COUNT(col) does not count NULL values; COUNT(*) counts all rows. GROUP BY groups NULL values into one group.
  3. Performance Optimization: Filter with WHERE before GROUP BY to significantly reduce aggregation computation.
  4. Approximate Function Applicability: When data exceeds millions of rows and precision requirements are moderate, prefer the APPROX_* function family.