Lakehouse SQL Guide

Lakehouse SQL Query Statements

Description

Lakehouse supports data queries using standard SQL SELECT statements. This document will provide a detailed introduction to the basic syntax of query statements, parameter descriptions, and usage examples to help you perform data queries more efficiently.

Syntax


[WITH cte [, ...] ]
SELECT 
[ hints ] 
[ ALL | DISTINCT ]
select_expr [, (except_expr)] ...
FROM table_reference
[WHERE where_condition ] 
[GROUP BY [GROUPING SETS | ROLLUP | CUBE] {col_name | expr | position}]
    [ HAVING having_condition ]
[ ORDER BY order_condition [ ASC | DESC ] ]
[ LIMIT  <number> [OFFSET <number>]]

Parameter Description

WITH cte (Optional) Common Table Expression, used to define a temporary result set in the query.
ALL | DISTINCT：Filters the result set, all for all, distinct will filter out duplicate columns, default is all.


--Indicates not to deduplicate cp_start_date_sk
SELECT ALL cp_start_date_sk FROM catalog_page;
--Indicates to deduplicate cp_start_date_sk
SELECT DISTINCT cp_start_date_sk FROM catalog_page;

hints: Help Lakehouse optimizer make better planning decisions. Currently supports map join as shown in the following example


SELECT /*+ MAPJOIN (t2) */ * FROM table1 t1
JOIN table2 t2
ON (t1.emp_id = t2.emp_id);

select_expr (Required) Specify the columns to query, supporting column names, column expressions, etc. For example: col1_name, col2_name, column expression, ...
Exclude Columns (Optional) Optional. The except_expr format is except(col1_name, col2_name, ...). When you want to read most of the columns in the table while excluding a few columns, you can use the SELECT * except(col1_name, col2_name, ...) from ...; statement, which means that the specified columns (col1, col2) will be excluded when reading the table data.

Example command is as follows.


--Table structure is as follows
DESC students;
+-------------+-----------+---------+
| column_name | data_type | comment |
+-------------+-----------+---------+
| name        | string    |         |
| class       | string    |         |
+-------------+-----------+---------+
--Exclude the class column
SELECT * EXCEPT(class) FROM students LIMIT 1;
+-------+
| name  |
+-------+
| Alice |
+-------+

where_condition (Optional) Filter conditions used to screen data that meets specified criteria. Supports relational operators, like, rlike, in, not in, between…and, etc.

Use with relational operators to filter data that meets specified criteria. Relational operators include:
- >、<、=、>=、<=、<>
- like、rlike
- in、not in
- between…and

GROUP BY expression

Optional. Typically, group by is used in conjunction with aggregate functions to group data based on specified ordinary columns, partition columns, or regular expressions. Grouping Sets, Rollup, Cube are extensions of group by. For details, refer to GROUPING SET. The usage rules for group by are as follows:

The group by operation has a higher priority than the select operation, so the values of group by are the column names of the select input table or expressions composed of columns from the input table. Note that:
- When the value of group by is a regular expression, the complete expression of the column must be used.
- Columns that do not use aggregate functions in the select statement must appear in the group by.

having_condition Optional. Typically, the having clause is used with aggregate functions to achieve filtering.
order_condition (Optional) Globally sort all data by specified columns or constants. The default is ascending order, and the desc keyword can be used for descending order. By default, ascending order places NULL values at the beginning, while descending order places NULL at the end. Order by is a time-consuming and resource-intensive operation because all data needs to be sent to one node for sorting, which requires more memory compared to operations without sorting.
LIMIT ... OFFSET (Optional)

LIMIT <number> indicates that the query result only returns the first <number> records, where <number> is a positive integer. This syntax can be used for pagination or to limit the amount of data queried. Supports the LIMIT m,n syntax. Using limit offset only makes sense when combined with order by, otherwise the data may be inconsistent each time it is executed.
OFFSET <number> indicates that the query result skips the first <number> records and then returns the remaining records, where <number> is a positive integer. This syntax can be used to specify the starting position of the query. The OFFSET keyword can also be replaced with a comma.
LIMIT and OFFSET can be used simultaneously or individually. If used simultaneously, OFFSET must follow LIMIT. For example, LIMIT 10 OFFSET 20 means the query result skips the first 20 records and then returns the next 10 records.

Query Historical Version Data

In addition to the standard SELECT options, Lakehouse also supports users accessing historical data at any point within a defined time period, including changed or deleted data. Supports querying tables, dynamic tables, and materialized views. Note: The historical query of objects depends on the data retention period. The current preview version has a default data retention period of 7 days, which will be adjusted to 1 day in the future. You can adjust the retention period by executing the ALTER command. Note that modifying the retention period may increase storage costs. For specific usage, refer to TIME TRAVEL.


SELECT 
    table_identifier TIMESTAMP AS OF timestamp_expression

By using the TIMESTAMP AS OF clause, users can specify a specific point in time to query the exact position in the table's history within the retention period or the data just before the specified point. The timestamp_expression is a parameter that returns a timestamp type expression, such as:

'2023-11-07 14:49:18', a string that can be cast to a timestamp.
CAST('2023-11-07 14:49:18' AS TIMESTAMP).
CURRENT_TIMESTAMP() - INTERVAL 12 HOURS. The version from 12 hours ago.
Any expression that is itself a timestamp type or can be cast to a timestamp. Usage example


SELECT * FROM events TIMESTAMP AS OF TIMESTAMP'2024-10-18 22:15:12.013'

Syntactic Sugar: Trailing Commas

In SQL statements, using Trailing Commas can make the statements easier to read and edit. Even if there is an extra comma after the last value or parameter, it will not cause an error. For example:


SELECT
    column1,
    column2,
    column3, -- This comma is allowed
FROM
    table_name;

<Notes> - This feature is supported by Singdata. </Notes> ```SQL SELECT client_ip, client_identity, userid, user_agent, log_time -- status_code FROM server_logs; ``` ## Notes

It is recommended that you limit the size of the submitted query text (i.e., SQL statements) to 5MB per statement. SQL texts larger than 5MB cannot be submitted. If you have SQL texts exceeding 5MB, please submit a ticket for resolution. Of course, here are usage examples and best practices generated based on the Apache Doris SELECT statement documentation and Lakehouse SQL query statement documentation you provided:

Examples

Basic Query Sure, here is the translated content:


-- Query the names and classes of all students
2. **Using the WITH Clause (Common Table Expressions)**
   ```SQL
   WITH ranked_students AS (
     SELECT name, class, RANK() OVER (ORDER BY score DESC) as rank
     FROM students
   )
   SELECT * FROM ranked_students WHERE rank <= 10;

Deduplication Query


-- Query different class names
SELECT DISTINCT class FROM students;

Use HINTS to Optimize Queries


SELECT /*+ MAPJOIN(t2) */ * FROM students t1
JOIN classes t2 ON t1.class_id = t2.id;

Conditional Filtering


-- Query the names of students older than 20 years
SELECT name FROM students WHERE age > 20;

GROUP BY and Aggregate Functions


-- Group by class and query the average score of each class
SELECT class, AVG(score) FROM students GROUP BY class;

Using the HAVING Clause


-- Query classes with an average score greater than 60
SELECT class, AVG(score) as avg_score FROM students GROUP BY class HAVING avg_score > 60;

ORDER BY and LIMIT


-- Query the top 5 students with scores in descending order
SELECT name, score FROM students ORDER BY score DESC LIMIT 5;

Query Historical Version Data


-- Query the student table data at the timestamp 2024-10-18 22:15:12.013
SELECT * FROM students TIMESTAMP AS OF TIMESTAMP'2024-10-18 22:15:12.013';

join Example Refer to Join for related usage

Best Practices

Utilize Partition and Bucket Filtering
- Whenever possible, use the partition and bucket of Lakehouse as data filtering conditions to reduce the data scanning range.
Use Index Fields
- Fully utilize the index fields of Lakehouse as data filtering conditions to accelerate query speed. Refer to Index.
Use Aggregation Reasonably
- Aggregation operations should be used when the data volume is large to reduce data transmission and improve query efficiency.
Use ORDER BY and LIMIT for Pagination
- When pagination queries are needed, using the combination of ORDER BY and LIMIT can effectively retrieve data for specific pages.
Pay Attention to Data Type Matching
- When using UNION or JOIN, ensure that the columns being joined have the same data type.
Avoid Large Query Texts
- Limit the query text size to within 5MB to avoid submission failures.
Use HAVING Clause to Filter Aggregated Results
- The HAVING clause should be used after the aggregation function to filter the aggregated result set.
Use WITH Clause to Simplify Complex Queries
- Common Table Expressions (WITH clause) can simplify complex queries, making them easier to understand and maintain.
Pay Attention to Query Costs
- Modifying the data retention period may increase storage costs. Set the data retention period reasonably to balance query needs and costs.