Lakehouse SQL Query Statements
Description
Lakehouse supports data queries using standard SQL SELECT statements. This document will provide a detailed introduction to the basic syntax of query statements, parameter descriptions, and usage examples to help you perform data queries more efficiently.
Syntax
Parameter Description
- WITH cte (Optional) Common Table Expression, used to define a temporary result set in the query.
ALL | DISTINCT
:Filters the result set,all
for all,distinct
will filter out duplicate columns, default isall
.
- hints: Help Lakehouse optimizer make better planning decisions. Currently supports map join as shown in the following example
-
select_expr (Required) Specify the columns to query, supporting column names, column expressions, etc. For example: col1_name, col2_name, column expression, ...
-
Exclude Columns (Optional) Optional. The
except_expr
format isexcept(col1_name, col2_name, ...)
. When you want to read most of the columns in the table while excluding a few columns, you can use theSELECT * except(col1_name, col2_name, ...) from ...;
statement, which means that the specified columns (col1, col2) will be excluded when reading the table data.Example command is as follows.
- where_condition (Optional) Filter conditions used to screen data that meets specified criteria. Supports relational operators, like, rlike, in, not in, between…and, etc.
-
Use with relational operators to filter data that meets specified criteria. Relational operators include:
>
、<
、=
、>=
、<=
、<>
like
、rlike
in
、not in
between…and
- GROUP BY expression
Optional. Typically, group by is used in conjunction with aggregate functions to group data based on specified ordinary columns, partition columns, or regular expressions. Grouping Sets
, Rollup
, Cube
are extensions of group by. For details, refer to GROUPING SET. The usage rules for group by are as follows:
-
The group by operation has a higher priority than the select operation, so the values of group by are the column names of the select input table or expressions composed of columns from the input table. Note that:
- When the value of group by is a regular expression, the complete expression of the column must be used.
- Columns that do not use aggregate functions in the select statement must appear in the group by.
-
having_condition Optional. Typically, the
having
clause is used with aggregate functions to achieve filtering. -
order_condition (Optional) Globally sort all data by specified columns or constants. The default is ascending order, and the
desc
keyword can be used for descending order. By default, ascending order placesNULL
values at the beginning, while descending order placesNULL
at the end. Order by is a time-consuming and resource-intensive operation because all data needs to be sent to one node for sorting, which requires more memory compared to operations without sorting. -
LIMIT ... OFFSET (Optional)
- LIMIT <number> indicates that the query result only returns the first <number> records, where <number> is a positive integer. This syntax can be used for pagination or to limit the amount of data queried. Supports the LIMIT m,n syntax. Using limit offset only makes sense when combined with order by, otherwise the data may be inconsistent each time it is executed.
- OFFSET <number> indicates that the query result skips the first <number> records and then returns the remaining records, where <number> is a positive integer. This syntax can be used to specify the starting position of the query. The OFFSET keyword can also be replaced with a comma.
- LIMIT and OFFSET can be used simultaneously or individually. If used simultaneously, OFFSET must follow LIMIT. For example, LIMIT 10 OFFSET 20 means the query result skips the first 20 records and then returns the next 10 records.
Query Historical Version Data
In addition to the standard SELECT
options, Lakehouse also supports users accessing historical data at any point within a defined time period, including changed or deleted data. Supports querying tables, dynamic tables, and materialized views.
Note: The historical query of objects depends on the data retention period. The current preview version has a default data retention period of 7 days, which will be adjusted to 1 day in the future. You can adjust the retention period by executing the ALTER command. Note that modifying the retention period may increase storage costs. For specific usage, refer to TIME TRAVEL.
By using the TIMESTAMP AS OF clause, users can specify a specific point in time to query the exact position in the table's history within the retention period or the data just before the specified point. The timestamp_expression is a parameter that returns a timestamp type expression, such as:
'2023-11-07 14:49:18'
, a string that can be cast to a timestamp.CAST('2023-11-07 14:49:18' AS TIMESTAMP)
.CURRENT_TIMESTAMP() - INTERVAL 12 HOURS
. The version from 12 hours ago.- Any expression that is itself a timestamp type or can be cast to a timestamp. Usage example
Syntactic Sugar: Trailing Commas
In SQL statements, using Trailing Commas can make the statements easier to read and edit. Even if there is an extra comma after the last value or parameter, it will not cause an error. For example:
<Notes> - This feature is supported by Singdata. </Notes> ```SQL SELECT client_ip, client_identity, userid, user_agent, log_time -- status_code FROM server_logs; ``` ## Notes
- It is recommended that you limit the size of the submitted query text (i.e., SQL statements) to 5MB per statement. SQL texts larger than 5MB cannot be submitted. If you have SQL texts exceeding 5MB, please submit a ticket for resolution.
Of course, here are usage examples and best practices generated based on the Apache Doris
SELECT
statement documentation and Lakehouse SQL query statement documentation you provided:
Examples
- Basic Query Sure, here is the translated content:
- Deduplication Query
- Use HINTS to Optimize Queries
- Conditional Filtering
- GROUP BY and Aggregate Functions
- Using the HAVING Clause
- ORDER BY and LIMIT
- Query Historical Version Data
- join Example Refer to Join for related usage
Best Practices
-
Utilize Partition and Bucket Filtering
- Whenever possible, use the partition and bucket of Lakehouse as data filtering conditions to reduce the data scanning range.
-
Use Index Fields
- Fully utilize the index fields of Lakehouse as data filtering conditions to accelerate query speed. Refer to Index.
-
Use Aggregation Reasonably
- Aggregation operations should be used when the data volume is large to reduce data transmission and improve query efficiency.
-
Use ORDER BY and LIMIT for Pagination
- When pagination queries are needed, using the combination of ORDER BY and LIMIT can effectively retrieve data for specific pages.
-
Pay Attention to Data Type Matching
- When using UNION or JOIN, ensure that the columns being joined have the same data type.
-
Avoid Large Query Texts
- Limit the query text size to within 5MB to avoid submission failures.
-
Use HAVING Clause to Filter Aggregated Results
- The HAVING clause should be used after the aggregation function to filter the aggregated result set.
-
Use WITH Clause to Simplify Complex Queries
- Common Table Expressions (WITH clause) can simplify complex queries, making them easier to understand and maintain.
-
Pay Attention to Query Costs
- Modifying the data retention period may increase storage costs. Set the data retention period reasonably to balance query needs and costs.