Overview
TABLESAMPLE
is a data sampling method provided by the Lakehouse platform, supporting random sampling based on probability or a fixed number of rows. It includes two sampling strategies: SYSTEM (file-level) and ROW (row-level), catering to different performance and accuracy requirements.
Syntax
SELECT <column_list>
FROM <table_name>
TABLESAMPLE [ROW | SYSTEM] ( { <probability> | <num> ROWS } )
[ ...other clauses... ]
Parameter Description
Sampling Type
Type | Description |
---|
ROW | Row-level random: Each row is judged independently, with precise result row counts, suitable for small datasets. |
SYSTEM | File-level random: Screens randomly by file blocks, offering higher performance but possibly fluctuating results, suitable for big data scenarios. |
(Default) | Automatically selects SYSTEM mode if not specified. |
Sampling Quantity
Format | Description |
---|
probability PERCENT | Samples by percentage (0~100), e.g., 30 . |
<number> ROWS | Specifies the number of rows to sample, e.g., 5 ROWS . |
Usage Example
Basic Sampling
CREATE VIEW test(id, name) AS
VALUES (1, 'Lisa'),
(2, 'Mary'),
(3, 'Evan'),
(4, 'Fred'),
(5, 'Alex'),
(6, 'Mark'),
(7, 'Lily'),
(8, 'Lucy'),
(9, 'Eric'),
(10, 'Adam');
-- Randomly sample 30% of the data (SYSTEM mode)
SELECT * FROM test TABLESAMPLE (30 );
-- Randomly sample 5 rows of data (SYSTEM mode)
SELECT * FROM test TABLESAMPLE (5 ROWS);
-- Exactly sample 5 rows (ROW mode)
SELECT * FROM employee TABLESAMPLE ROW (5 ROWS);