Lakehouse Data Sampling Exploration Guide
Overview
When working with large-scale datasets, full table scans are often time-consuming. Data sampling allows you to quickly obtain a representative subset of data for exploratory analysis, model training, or query debugging. Singdata Lakehouse provides multiple sampling methods, including random sampling, fixed-row sampling, and bucket sampling. This guide categorizes usage by business scenario to help you quickly master efficient data sampling methods.
Quick Navigation
- Fixed Row Sampling -- Use LIMIT to get the first N rows
- Random Sampling -- Use ORDER BY RAND() to get random samples
- Percentage Sampling -- Use TABLESAMPLE for proportional sampling
- Bucket Sampling -- Use HASH function for evenly distributed sampling
SQL Commands Covered
| Command/Function | Purpose | Applicable Scenario |
|---|---|---|
LIMIT n | Limit the number of returned rows | Quick view of table structure, first few rows |
ORDER BY RAND() | Random ordering | Get random samples, suitable for small datasets |
TABLESAMPLE | Table-level sampling | Fast sampling for large data volumes, best performance |
col % N = k | Modulo bucketing | Uniform sampling, guarantees repeatability |
Prerequisites
The following examples use a simulated large table large_events (approximately 1000 rows):
Fixed Row Sampling
Use LIMIT to quickly get the first N rows of data, suitable for viewing table structure and data distribution overview.
Result:
| event_id | event_type | user_id | amount | event_time |
|---|---|---|---|---|
| 1 | view | 1 | 34.56 | 2024-06-01 00:01:00 |
| 2 | purchase | 2 | 78.90 | 2024-06-01 00:02:00 |
| ... | ... | ... | ... | ... |
Random Sampling
Use ORDER BY RAND() to obtain random samples. Suitable for unbiased sample requirements, but performance is poor with large data volumes.
Applicable Scenarios:
- Datasets are small (under a million rows)
- Strict randomness is required
- Model training data partitioning
Percentage Sampling
Use TABLESAMPLE for fixed-ratio sampling with optimal performance, suitable for quick exploration of large data volumes.
Result:
- The actual number of returned rows is approximately 10% of the total (with slight variation).
- Sampling is based on underlying data blocks and is extremely fast with no full table scan required.
Bucket Sampling
Use modulo operation to divide data into buckets and take one bucket as a sample. Ensures uniformity and repeatability of sampling.
Advantages:
- Repeatable: The same conditions return the same sample each time, facilitating debugging and comparison.
- Uniform: Hash functions ensure even sample distribution.
- Efficient: No sorting required, filter directly.
Clean Up Test Data
After completing sampling verification, it is recommended to clean up test tables:
Important Notes
- LIMIT Order Instability:
LIMITwithoutORDER BYdoes not guarantee return order; add sorting for stable results. - RAND() Performance:
ORDER BY RAND()requires a full table scan and sort; avoid for large data volumes. - TABLESAMPLE Precision: Based on data block sampling, the returned row count may fluctuate by 5%-10%.
- Sampling and Partitions: If a table is partitioned, sampling spans all partitions. To sample a single partition, add a
WHEREfilter on the partition column first.
