Lakehouse Data Sampling Exploration Guide

Overview

When working with large-scale datasets, full table scans are often time-consuming. Data sampling allows you to quickly obtain a representative subset of data for exploratory analysis, model training, or query debugging. Singdata Lakehouse provides multiple sampling methods, including random sampling, fixed-row sampling, and bucket sampling. This guide categorizes usage by business scenario to help you quickly master efficient data sampling methods.


SQL Commands Covered

Command/FunctionPurposeApplicable Scenario
LIMIT nLimit the number of returned rowsQuick view of table structure, first few rows
ORDER BY RAND()Random orderingGet random samples, suitable for small datasets
TABLESAMPLETable-level samplingFast sampling for large data volumes, best performance
col % N = kModulo bucketingUniform sampling, guarantees repeatability

Prerequisites

The following examples use a simulated large table large_events (approximately 1000 rows):

-- Create test table
CREATE TABLE IF NOT EXISTS large_events (
    event_id INT,
    event_type STRING,
    user_id INT,
    amount DOUBLE,
    event_time TIMESTAMP
);

-- Insert 1000 rows of test data
INSERT INTO large_events
SELECT 
    seq,
    CASE seq % 3 WHEN 0 THEN 'click' WHEN 1 THEN 'view' ELSE 'purchase' END,
    seq % 100,
    ROUND(RAND() * 100, 2),
    TIMESTAMP '2024-06-01 00:00:00' + INTERVAL (seq % 1440) MINUTE
FROM (SELECT SEQUENCE(1, 1000) AS seqs) t, EXPLODE(t.seqs) AS seq;

Fixed Row Sampling

Use LIMIT to quickly get the first N rows of data, suitable for viewing table structure and data distribution overview.

-- View the first 5 rows
SELECT * FROM large_events LIMIT 5;

Result:

event_idevent_typeuser_idamountevent_time
1view134.562024-06-01 00:01:00
2purchase278.902024-06-01 00:02:00
...............

Random Sampling

Use ORDER BY RAND() to obtain random samples. Suitable for unbiased sample requirements, but performance is poor with large data volumes.

-- Randomly get 10 rows
SELECT * FROM large_events
ORDER BY RAND()
LIMIT 10;

Applicable Scenarios:

  • Datasets are small (under a million rows)
  • Strict randomness is required
  • Model training data partitioning

Percentage Sampling

Use TABLESAMPLE for fixed-ratio sampling with optimal performance, suitable for quick exploration of large data volumes.

-- Sample 10% of the data
SELECT * FROM large_events TABLESAMPLE(10);

Result:

  • The actual number of returned rows is approximately 10% of the total (with slight variation).
  • Sampling is based on underlying data blocks and is extremely fast with no full table scan required.

Bucket Sampling

Use modulo operation to divide data into buckets and take one bucket as a sample. Ensures uniformity and repeatability of sampling.

-- Split into 2 buckets by user_id, take bucket 0 (about 50% of data)
SELECT * FROM large_events
WHERE user_id % 2 = 0;

Advantages:

  • Repeatable: The same conditions return the same sample each time, facilitating debugging and comparison.
  • Uniform: Hash functions ensure even sample distribution.
  • Efficient: No sorting required, filter directly.

Clean Up Test Data

After completing sampling verification, it is recommended to clean up test tables:

-- Drop test table
DROP TABLE IF EXISTS large_events;

Important Notes

  1. LIMIT Order Instability: LIMIT without ORDER BY does not guarantee return order; add sorting for stable results.
  2. RAND() Performance: ORDER BY RAND() requires a full table scan and sort; avoid for large data volumes.
  3. TABLESAMPLE Precision: Based on data block sampling, the returned row count may fluctuate by 5%-10%.
  4. Sampling and Partitions: If a table is partitioned, sampling spans all partitions. To sample a single partition, add a WHERE filter on the partition column first.