Data Transformation with CTE
Let's first understand the basic concepts, advantages, and common use cases of using Common Table Expression (CTE) with Lakehouse SQL for data transformation.
Basic Concepts
A Common Table Expression (CTE) is an expression that defines a temporary result set that exists within the execution scope of a SQL query. It is typically used to simplify complex queries, perform recursive queries, or break down query steps. CTEs are usually introduced by the WITH keyword and can be used in subsequent SELECT
, INSERT
, UPDATE
, and DELETE
statements.
Basic syntax:
Advantages
-
High Readability: CTE can make complex queries more readable and maintainable. By breaking down complex query logic into multiple understandable parts, CTE enhances the structure and clarity of the query. CTE makes testing complex queries simpler.
- CTE is a
SELECT
statement that can be reused within a single query. - Complex SQL queries often involve multiple subqueries. Multiple subqueries can make the code difficult to read.
- Using Common Table Expressions (CTE) can make your queries more readable.
- CTE is a
-
Strong Reusability: CTE can be referenced multiple times within the same query, thus avoiding code duplication and improving query reusability.
-
Step-by-Step Data Transformation: CTE allows the data transformation process to be implemented step by step, with each step defined as an independent CTE, making it easier to debug and test.
-
Support for Recursive Queries: CTE supports recursive queries, making it very suitable for handling recursive data structures with hierarchical relationships, such as tree-structured data.
Usage Scenarios
Here are some common scenarios for data transformation using CTE:
1. Data Cleaning and Transformation
CTE can be used to clean and transform data step by step, such as removing duplicates and correcting data formats:
2. Grouping and Aggregation
CTE can be used for complex grouping and aggregation operations, and the aggregation results can be referenced in subsequent queries:
3. Data Merging and Joining
Using CTE can simplify multi-table joins and data merging operations:
Data Model
TPC-H data represents a data warehouse for an auto parts dealer, recording orders, items that make up the orders (lineitem), suppliers, customers, parts sold (part), regions, countries, and parts suppliers (partsupp).
Singdata Lakehouse has built-in shared TPC-H data, which each user can directly use by adding the data context, for example:
Data Transformation Using Singdata Lakehouse SQL CTE
How to Define CTE
Calculate the Amount Lost Due to Discounts
Use the lineitem
table to get the price of items in the order (excluding discounts) and compare it with the order. First, determine the granularity of the comparison. Think step by step, first get the price of all items in the order excluding discounts, and then compare it with the order data of totalprice
that has already calculated the discount.
Do not overuse CTE. Pay attention to code readability.
-
CTE can help improve the readability and reusability of queries.
-
Do not overuse CTE; pay attention to the size of the query.
- SQL queries with multiple temporary tables are better than 1000-line SQL queries with numerous CTEs.
- Keep the number of CTEs in each query small (depending on the size of the query, but usually < 5).
-
If you have doubts about the performance of CTE, check your query plan.