Data Transformation Using Window Functions (Windows)
Let's first understand the basic concepts and common use cases of using window functions for data transformation in the ETL/ELT (Extract, Transform, Load) process.
Basic Concepts
Window functions are a type of function in SQL specifically used to perform complex multi-row operations within a specified data set (i.e., "window"). Window functions can retain row-level details while performing calculations within a specific data window.
-
OVER keyword defines the scope of the window.
-
PARTITION BY defines how to partition the data, applying the window function within each partition. If no partition is specified, the entire table is considered as one partition.
-
Function applied to the current row. The function result adds an extra column in the output.
-
ORDER BY defines the sorting within the window.
Common Window Functions
- RANK(): Ranking function, assigns a rank to each row within each partition.
- DENSE_RANK(): Similar to RANK(), but does not skip ranks.
- ROW_NUMBER(): Assigns a unique number to each row within each partition.
- SUM(): Cumulative sum.
- AVG(): Calculates the average.
- LAG(): Retrieves data from a previous row.
- LEAD(): Retrieves data from a subsequent row.
Use Cases
1. Data Deduplication and Tagging
Window functions are often used for data deduplication and tagging duplicate rows. For example, we can use window functions to number each group and delete duplicate rows except the first one.
2. Data Partitioning and Aggregation
Window functions can be used to perform aggregation operations within partitions, such as cumulative sums and moving averages.
3. Data Sorting and Ranking
Through window functions, data can be sorted and ranked, and the results can be used for subsequent calculations.
4. Data Completion and Lag/Lead Columns
Using the LAG()
and LEAD()
functions, you can obtain previous/next row data to complete missing data.
Using window functions for ETL data transformation can effectively improve the flexibility and efficiency of data processing, making complex data analysis and transformation operations faster and simpler.
Data Model
TPC-H data represents the data warehouse of an auto parts supplier, which records orders, items that make up the orders (lineitem), suppliers, customers, parts sold (part), regions, countries, and parts suppliers (partsupp).
Singdata Lakehouse has built-in shared TPC-H data, which each user can directly use by adding the data context, for example:
Data Transformation Using Singdata Lakehouse SQL Window Functions
Window Functions Have Four Basic Parts
- Partition: Defines a set of rows based on the values of specified columns. If no partition is specified, the entire table is considered as one partition.
- Order By: This optional clause specifies how to sort the rows within a partition.
- Function: The function applied to the current row. The function result adds an extra column in the output.
- Window Frame: Within a partition, the window frame allows you to specify the rows to be considered in the function calculation.
The SUM
function in the above query is an aggregate function. Notice how running_sum
accumulates (i.e., aggregates) o_totalprice
over all rows. The rows themselves are ordered in ascending order by their order date.
Reference: Standard aggregate functions are MIN, MAX, AVG, SUM, & COUNT
. Modern data systems offer a variety of powerful aggregate functions. Please refer to your database documentation to learn about the available aggregate functions. Please read this article for a list of aggregate functions available in Lakehouse.
Using Ranking Functions to Get Top/Bottom n Rows
If you are dealing with a problem that requires getting the top/bottom n rows (defined by some value), then use row functions.
Let's look at an example of how to use row functions:
From the orders
table, get the top 3 customers with the highest spending each day. The schema of the orders
table is as follows:
Standard Ranking Functions
- ``:Ranks rows within the window frame from 1 to n. Rows with the same value (as defined by the "ORDER BY" clause) receive the same rank, and the ranking numbers that would have been used if the values were different are skipped.
- ``:Ranks rows within the window frame from 1 to n. Rows with the same value (as defined by the "ORDER BY" clause) receive the same rank, and no ranking numbers are skipped.
- ``:Adds row numbers from 1 to n within the window frame, without creating any duplicate values.
Now we have seen how to use window functions and how to use ranking and aggregation functions.
Why define a window frame with partitions?
While our functions operate on rows within partitions, the window frame provides a more refined way to operate on a selected set of rows within a partition.
When we need to operate on a set of rows within a partition (e.g., sliding window), we can use the window frame to define these rows.
Consider a scenario where you have sales data and you want to calculate the 3-day moving average sales for each store:
In this example:
- PARTITION BY store_id ensures that the calculation is performed separately for each store.
- ORDER BY sale_date defines the order of rows within each partition.
- ROWS BETWEEN 2 PRECEDING AND CURRENT ROW specifies the window frame, considering the current row and the two preceding rows to calculate the moving average.
If the window frame is not defined, the function may not provide the specific moving average calculation you need.
Using ROWS to Define Window Frame
ROWS: Used to select a set of rows relative to the current row based on position.
-
The row definition format is
ROWS BETWEEN start_point AND end_point
.-
start_point and end_point can be any of the following three (in the correct order):
- n PRECEDING: The n rows before the current row. UNBOUNDED PRECEDING means all rows before the current row.
- n FOLLOWING: The n rows after the current row. UNBOUNDED FOLLOWING means all rows after the current row.
-
Let's see how to use relative row numbers to define the window range.
Consider this window function:
Write an SQL query to get the following output from the orders table:
- o_custkey 2. order_month: Format as YYYY-MM, use strftime(o_orderdate, '%Y-%m') AS order_month 3. total_price: The sum of o_totalprice for that month 4. three_mo_total_price_avg: The average total_price for the past, current, and next month for that customer
Using RANGE to Define Window Frames
-
RANGE: Used to select a set of rows related to the current row based on the values of the columns specified in the
ORDER BY
clause.-
Range definition format
RANGE BETWEEN start_point AND end_point
. -
start_point and end_point can be any of the following:
- CURRENT ROW: The current row.
- n PRECEDING: All rows within the specified range and n units before the current row value.
- n FOLLOWING: All rows within the specified range and n units after the current row value.
- UNBOUNDED PRECEDING: All rows before the current row in the partition.
- UNBOUNDED FOLLOWING: All rows after the current row in the partition.
-
RANGE
is particularly useful when dealing with numerical or date/time ranges, allowing calculations such as running totals, moving averages, or cumulative distributions.
-
Let's see how RANGE
works with AVG(total price) OVER (PARTITION BY customer id ORDER BY date RANGE BETWEEN INTERVAL '1' DAY PRECEDING AND '1' DAY FOLLOWING)
.
Now that we've seen how to create window frames using ROWS, let's explore how to do this using RANGE.
-
Write a query to get the following output from the orders table:
- order_month,
- o_custkey,
- total_price,
- three_mo_total_price_avg
- consecutive_three_mo_total_price_avg: The average total_price for consecutive 3 months for the customer. Note that this should only include months arranged in chronological order.
Review
-
When using window functions:
- Calculate running metrics (similar to
GROUP BY
, but retains all rows) - Rank rows based on specific columns
- Access values from other rows relative to the current row
- Calculate running metrics (similar to
-
Windows have four key parts: Partition, Order By, Function, Window Frame
-
Define window frames using ROWS or RANGE
-
Window functions are costly; be mindful of performance