Lakehouse Basic Data Filtering and Sorting Guide

Overview

Data querying is the first step in data analysis. Singdata Lakehouse provides complete SQL filtering and sorting capabilities, supporting everything from simple conditional filtering to complex expression combinations. This guide categorizes usage by common business scenarios to help you quickly master efficient data exploration methods.


SQL Commands Covered

Command/ClausePurposeApplicable Scenario
SELECTSpecify returned columnsColumn pruning, reduce data transfer
SELECT * EXCEPT(...)Exclude specified columnsQuickly exclude sensitive fields
WHERERow-level filteringFilter data by condition
ORDER BYResult sortingSort by single or multiple columns
LIMITLimit returned rowsData exploration, paginated queries

Prerequisites

The following examples use a simulated employee table employees:

-- Create test table
CREATE TABLE IF NOT EXISTS employees (
    emp_id INT,
    emp_name STRING,
    dept STRING,
    salary DOUBLE,
    hire_date DATE
);

-- Insert test data
INSERT INTO employees VALUES
(1, 'Alice', 'Engineering', 12000, '2020-03-15'),
(2, 'Bob', 'Engineering', 9500, '2021-07-01'),
(3, 'Carol', 'Marketing', 8500, '2019-11-20'),
(4, 'David', 'Marketing', 7800, '2022-01-10'),
(5, 'Eve', 'HR', 6000, '2023-05-05');

Basic Column Selection

Explicitly specify the needed columns when querying to avoid returning unnecessary data. In columnar storage, this can significantly improve query performance.

-- Query only name and department
SELECT emp_name, dept
FROM employees;

Result:

emp_namedept
AliceEngineering
BobEngineering
CarolMarketing
DavidMarketing
EveHR

Conditional Filtering

Use the WHERE clause to filter data by condition. Lakehouse supports a rich set of comparison and logical operators.

Scenario 1: Numeric Range Filtering

-- Query employees with salary greater than 8000
SELECT emp_name, dept, salary
FROM employees
WHERE salary > 8000;

Scenario 2: Multiple Condition Combinations

-- Query Engineering department employees with salary above 10000
SELECT emp_name, dept, salary
FROM employees
WHERE dept = 'Engineering' AND salary > 10000;

Scenario 3: IN and BETWEEN

-- Query employees in Marketing or HR departments
SELECT emp_name, dept
FROM employees
WHERE dept IN ('Marketing', 'HR');

-- Query employees hired between 2020 and 2022
SELECT emp_name, hire_date
FROM employees
WHERE hire_date BETWEEN '2020-01-01' AND '2022-12-31';

Scenario 4: Fuzzy Matching

-- Query employees whose names start with 'A'
SELECT emp_name
FROM employees
WHERE emp_name LIKE 'A%';

Result Sorting

Use ORDER BY to sort query results, supporting ascending (ASC, default) and descending (DESC) order.

-- Sort by salary in descending order
SELECT emp_name, dept, salary
FROM employees
ORDER BY salary DESC;

-- Multi-column sort: first by department ascending, then by salary descending
SELECT emp_name, dept, salary
FROM employees
ORDER BY dept ASC, salary DESC;

Result (multi-column sort):

emp_namedeptsalary
AliceEngineering12000
BobEngineering9500
EveHR6000
CarolMarketing8500
DavidMarketing7800

Limit Returned Rows

Use LIMIT to control the number of returned rows, commonly used for data exploration and paginated queries.

-- View only the top 3 records
SELECT emp_name, dept, salary
FROM employees
ORDER BY salary DESC
LIMIT 3;

Result:

emp_namedeptsalary
AliceEngineering12000
BobEngineering9500
CarolMarketing8500

Exclude Specific Columns

Use the EXCEPT clause to quickly exclude unneeded columns, especially useful for wide table queries.

-- Query all columns except emp_id
SELECT * EXCEPT(emp_id)
FROM employees;

Result:

emp_namedeptsalaryhire_date
AliceEngineering120002020-03-15
BobEngineering95002021-07-01
CarolMarketing85002019-11-20
DavidMarketing78002022-01-10
EveHR60002023-05-05

Clean Up Test Data

After completing query verification, it is recommended to clean up test tables:

-- Drop test table
DROP TABLE IF EXISTS employees;

Important Notes

  1. Column Pruning Optimization: Explicitly specify SELECT column names and avoid SELECT * to significantly reduce I/O and network transfer.
  2. Partition Pruning: If the table is partitioned, filter directly on the partition column in WHERE to skip irrelevant partitions.
  3. Sorting Performance: ORDER BY triggers a full sort; for large data volumes, use it with LIMIT.
  4. NULL Value Sorting: ORDER BY places NULL values last (ascending) or first (descending) by default; use NULLS FIRST/LAST to control.