Singdata - Documents

Overview

TABLESAMPLE is a data sampling method provided by the Lakehouse platform, supporting random sampling based on probability or a fixed number of rows. It includes two sampling strategies: SYSTEM (file-level) and ROW (row-level), catering to different performance and accuracy requirements.

Syntax


SELECT <column_list>
FROM <table_name>
  TABLESAMPLE [ROW | SYSTEM] ( { <probability>  | <num> ROWS } )
[ ...other clauses... ]

Parameter Description

Sampling Type

Type	Description
ROW	Row-level random: Each row is judged independently, with precise result row counts, suitable for small datasets.
SYSTEM	File-level random: Screens randomly by file blocks, offering higher performance but possibly fluctuating results, suitable for big data scenarios.
(Default)	Automatically selects SYSTEM mode if not specified.

Sampling Quantity

Format	Description
`probability PERCENT`	Samples by percentage (0~100), e.g., `30` .
`<number> ROWS`	Specifies the number of rows to sample, e.g., `5 ROWS`.

Usage Example

Basic Sampling


CREATE VIEW test(id, name) AS
  VALUES (1, 'Lisa'),
         (2, 'Mary'),
         (3, 'Evan'),
         (4, 'Fred'),
         (5, 'Alex'),
         (6, 'Mark'),
         (7, 'Lily'),
         (8, 'Lucy'),
         (9, 'Eric'),
         (10, 'Adam');
-- Randomly sample 30% of the data (SYSTEM mode)
SELECT * FROM test TABLESAMPLE (30 );
-- Randomly sample 5 rows of data (SYSTEM mode)
SELECT * FROM test TABLESAMPLE (5 ROWS);

-- Exactly sample 5 rows (ROW mode)
SELECT * FROM employee TABLESAMPLE ROW (5 ROWS);