Overview

TABLESAMPLE is a data sampling method provided by the Lakehouse platform, supporting random sampling based on probability or a fixed number of rows. It includes two sampling strategies: SYSTEM (file-level) and ROW (row-level), catering to different performance and accuracy requirements.

Syntax

SELECT <column_list>
FROM <table_name>
  TABLESAMPLE [ROW | SYSTEM] ( { <probability>  | <num> ROWS } )
[ ...other clauses... ]

Parameter Description

Sampling Type

TypeDescription
ROWRow-level random: Each row is judged independently, with precise result row counts, suitable for small datasets.
SYSTEMFile-level random: Screens randomly by file blocks, offering higher performance but possibly fluctuating results, suitable for big data scenarios.
(Default)Automatically selects SYSTEM mode if not specified.

Sampling Quantity

FormatDescription
probability PERCENTSamples by percentage (0~100), e.g., 30 .
<number> ROWSSpecifies the number of rows to sample, e.g., 5 ROWS.

Usage Example

Basic Sampling

CREATE VIEW test(id, name) AS
  VALUES (1, 'Lisa'),
         (2, 'Mary'),
         (3, 'Evan'),
         (4, 'Fred'),
         (5, 'Alex'),
         (6, 'Mark'),
         (7, 'Lily'),
         (8, 'Lucy'),
         (9, 'Eric'),
         (10, 'Adam');
-- Randomly sample 30% of the data (SYSTEM mode)
SELECT * FROM test TABLESAMPLE (30 );
-- Randomly sample 5 rows of data (SYSTEM mode)
SELECT * FROM test TABLESAMPLE (5 ROWS);

-- Exactly sample 5 rows (ROW mode)
SELECT * FROM employee TABLESAMPLE ROW (5 ROWS);