Lakehouse Data Grouping and Aggregation Guide
Overview
Data grouping and aggregation is a core operation in data analysis, used to group data by dimensions and compute statistical metrics. Singdata Lakehouse provides comprehensive aggregate function support, including basic aggregation, conditional aggregation, and approximate aggregation. This guide categorizes usage by business scenario to help you quickly master efficient data aggregation methods.
Quick Navigation
- Basic Grouping and Aggregation -- Use GROUP BY to compute departmental statistics
- Filter Aggregation Results -- Use HAVING to filter groups
- Conditional Aggregation -- Use CASE WHEN + aggregate functions for conditional statistics
- Distinct Count -- Use COUNT(DISTINCT) to count unique values
- Approximate Aggregation -- Use APPROX_COUNT_DISTINCT to accelerate large-volume statistics
SQL Commands Covered
| Command/Function | Purpose | Applicable Scenario |
|---|---|---|
GROUP BY | Group by column | Dimensional aggregation analysis |
COUNT(*) / COUNT(col) | Count | Count rows or non-null values |
SUM() / AVG() / MIN() / MAX() | Numeric aggregation | Sum, average, min/max |
HAVING | Filter grouped results | Substitute for WHERE on aggregated data |
COUNT(DISTINCT col) | Distinct count | Count unique values |
APPROX_COUNT_DISTINCT() | Approximate distinct count | Fast estimation for large data volumes |
Prerequisites
The following examples use a simulated sales table sales:
Basic Grouping and Aggregation
Group by dimension and compute aggregate metrics -- the most common data analysis operation.
Result:
| region | order_count | total_sales | avg_sales |
|---|---|---|---|
| East | 3 | 18000 | 6000 |
| West | 2 | 7500 | 3750 |
| South | 2 | 11000 | 5500 |
Filter Aggregation Results
Use the HAVING clause to filter results after grouping, replacing aggregate conditions that cannot be used in WHERE.
Result:
| region | total_sales |
|---|---|
| East | 18000 |
| South | 11000 |
Difference Between WHERE and HAVING
WHEREfilters rows before grouping and cannot use aggregate functions.HAVINGfilters groups after grouping and can use aggregate functions.
Conditional Aggregation
Use CASE WHEN with aggregate functions to implement conditional statistics, avoiding multiple table scans.
Result:
| region | phone_sales | laptop_sales | tablet_sales | total_sales |
|---|---|---|---|---|
| East | 10000 | 8000 | 0 | 18000 |
| South | 0 | 8000 | 3000 | 11000 |
| West | 4500 | 0 | 3000 | 7500 |
Distinct Count
Use COUNT(DISTINCT col) to count unique values.
Result:
| region | unique_products | total_quantity |
|---|---|---|
| East | 2 | 25 |
| South | 2 | 13 |
| West | 2 | 17 |
Approximate Aggregation
When data volumes are very large, exact distinct counts can be slow. Lakehouse provides approximate aggregate functions that significantly improve performance within an acceptable error range.
Result:
| region | approx_unique_products | exact_unique_products |
|---|---|---|
| East | 2 | 2 |
| South | 2 | 2 |
| West | 2 | 2 |
Clean Up Test Data
After completing aggregation verification, it is recommended to clean up test tables:
Important Notes
- GROUP BY Position: Column positions can be used in
GROUP BY(e.g.,GROUP BY 1), but using column names is recommended for readability. - NULL Value Handling:
COUNT(col)does not count NULL values;COUNT(*)counts all rows.GROUP BYgroups NULL values into one group. - Performance Optimization: Filter with
WHEREbeforeGROUP BYto significantly reduce aggregation computation. - Approximate Function Applicability: When data exceeds millions of rows and precision requirements are moderate, prefer the
APPROX_*function family.
