Lakehouse Time Series Analysis Guide
Overview
Time series analysis is a core scenario in data analysis, widely applied to business trend monitoring, user behavior analysis, and operational metric computation. Singdata Lakehouse provides complete time function and window function support, including time truncation, date formatting, rolling window aggregation, and date series generation. This guide categorizes usage by business scenario to help you quickly master efficient time series analysis methods.
Quick Navigation
- Aggregate by Time Granularity -- Use DATE_FORMAT / DATE_TRUNC to summarize by month, week, or day
- Rolling Window Calculation -- Use RANGE BETWEEN INTERVAL to compute 7-day rolling averages
- Year-over-Year / Month-over-Month Calculation -- Use LAG to compute monthly growth rate
- Time Interval Calculation -- Use DATEDIFF + LAG to compute event interval days
- Fill Missing Dates with Zero -- Use SEQUENCE + EXPLODE to fill dates with no data
- Active Users in Last N Days -- Use INTERVAL to filter recently active users
SQL Commands Covered
| Command/Function | Purpose | Applicable Scenario |
|---|---|---|
DATE_FORMAT() | Format timestamp as string | Group by month/day, display |
DATE_TRUNC() | Truncate to specified time granularity | Aggregate by week/month/year |
DATEDIFF() | Calculate day difference between two dates | Event interval, retention analysis |
LAG() / LEAD() | Access previous/next row data | MoM, YoY calculation |
AVG() OVER (RANGE BETWEEN INTERVAL ...) | Time-range-based rolling aggregation | 7-day/30-day rolling average |
SEQUENCE() + EXPLODE() | Generate continuous date series and expand to rows | Date padding, calendar table |
INTERVAL | Time offset | Date arithmetic, range filter |
COALESCE() | Null value replacement | Zero-fill, default value fill |
Prerequisites
The following examples use a simulated user event table doc_ts_events:
Scenario 1: Aggregate by Time Granularity
Count the number of events and total amount by month and event type -- the most common time series aggregation requirement.
Execution Result:
| month | event_type | event_count | total_amount |
|---|---|---|---|
| 2024-01 | purchase | 4 | 460.5 |
| 2024-01 | refund | 1 | 30 |
| 2024-02 | purchase | 4 | 615 |
| 2024-02 | refund | 1 | 50 |
| 2024-03 | purchase | 4 | 625 |
| 2024-03 | refund | 1 | 25 |
To aggregate by week, use DATE_TRUNC('week', ...) to truncate to Monday:
Execution Result (first 5 rows):
| week_start | event_count | weekly_amount |
|---|---|---|
| 2024-01-01 | 1 | 120.5 |
| 2024-01-08 | 1 | 85 |
| 2024-01-15 | 1 | 200 |
| 2024-01-22 | 1 | 55 |
| 2024-01-29 | 1 | 300 |
Scenario 2: Rolling Window Calculation
Compute a 7-day rolling average of amounts for each trading day to smooth short-term fluctuations and observe trends.
RANGE BETWEEN INTERVAL N DAY PRECEDING AND CURRENT ROW defines the window based on time range (not row count), correctly handling non-consecutive dates.
Execution Result:
| sale_day | daily_amount | rolling_7d_avg |
|---|---|---|
| 2024-01-05 | 120.5 | 120.5 |
| 2024-01-10 | 85 | 102.75 |
| 2024-01-20 | 200 | 200 |
| 2024-01-25 | 55 | 127.5 |
| 2024-02-03 | 300 | 300 |
| 2024-02-08 | 150 | 225 |
| 2024-02-18 | 90 | 90 |
| 2024-02-22 | 75 | 82.5 |
| 2024-03-01 | 220 | 220 |
| 2024-03-07 | 180 | 200 |
| 2024-03-18 | 130 | 130 |
| 2024-03-25 | 95 | 95 |
Scenario 3: YoY/MoM Calculation
Use LAG() to get the previous period's data and compute the month-over-month growth rate.
Execution Result:
| month | monthly_amount | prev_month_amount | mom_growth_pct |
|---|---|---|---|
| 2024-01 | 460.5 | NULL | NULL |
| 2024-02 | 615 | 460.5 | 33.55 |
| 2024-03 | 625 | 615 | 1.63 |
Scenario 4: Time Interval Calculation
Compute the number of days between consecutive events for each user, used to analyze activity frequency and purchase cycles.
Execution Result:
| user_id | event_type | event_day | prev_event_day | days_since_last |
|---|---|---|---|---|
| 101 | purchase | 2024-01-05 | NULL | NULL |
| 101 | refund | 2024-01-15 | 2024-01-05 | 10 |
| 101 | purchase | 2024-02-03 | 2024-01-15 | 19 |
| 101 | purchase | 2024-02-22 | 2024-02-03 | 19 |
| 101 | refund | 2024-03-12 | 2024-02-22 | 19 |
| 102 | purchase | 2024-01-10 | NULL | NULL |
| 102 | purchase | 2024-01-25 | 2024-01-10 | 15 |
| 102 | purchase | 2024-02-18 | 2024-01-25 | 24 |
| 102 | purchase | 2024-03-18 | 2024-02-18 | 29 |
| 103 | purchase | 2024-01-20 | NULL | NULL |
| 103 | refund | 2024-02-14 | 2024-01-20 | 25 |
| 103 | purchase | 2024-03-07 | 2024-02-14 | 22 |
| 104 | purchase | 2024-02-08 | NULL | NULL |
| 104 | purchase | 2024-03-01 | 2024-02-08 | 22 |
| 104 | purchase | 2024-03-25 | 2024-03-01 | 24 |
Scenario 5: Fill Missing Dates with Zero
When some dates have no data, direct GROUP BY skips those dates, causing gaps in line charts. Use SEQUENCE + EXPLODE to generate a complete date series, then LEFT JOIN actual data and use COALESCE to fill NULLs with zero.
Execution Result (first 10 rows):
| cal_day | daily_amount |
|---|---|
| 2024-01-01 | 0 |
| 2024-01-02 | 0 |
| 2024-01-03 | 0 |
| 2024-01-04 | 0 |
| 2024-01-05 | 120.5 |
| 2024-01-06 | 0 |
| 2024-01-07 | 0 |
| 2024-01-08 | 0 |
| 2024-01-09 | 0 |
| 2024-01-10 | 85 |
Scenario 6: Active Users in Last N Days
Filter users who had at least 2 purchases in the last 90 days, and count their purchase frequency and last active date.
Execution Result:
| user_id | purchase_count | last_active_day |
|---|---|---|
| 104 | 3 | 2024-03-25 |
| 102 | 4 | 2024-03-18 |
| 103 | 2 | 2024-03-07 |
| 101 | 3 | 2024-02-22 |
Clean Up Test Data
After completing time series analysis verification, it is recommended to clean up test tables:
Important Notes
- Timezone Handling:
TIMESTAMPtype stores UTC time.DATE_FORMATandDATE_TRUNCparse in the session timezone (default UTC+8). When inserting data, useCAST('yyyy-MM-dd HH:mm:ss' AS TIMESTAMP)to explicitly specify time literals and avoid implicit conversion failures. - RANGE vs ROWS:
ROWS BETWEEN N PRECEDINGdefines the window by row count, whileRANGE BETWEEN INTERVAL N DAY PRECEDINGdefines it by time range. Results differ when dates are non-consecutive; time series analysis typically usesRANGE. - DATE_FORMAT Format Strings: Lakehouse uses Java-style format strings:
yyyyfor four-digit year,MMfor month,ddfor date,HHfor 24-hour hour. Format letters are case-sensitive. - DATEDIFF Parameter Order:
DATEDIFF(end_date, start_date)returns the number of days asend_date - start_date. - Window Function Execution Order: Window functions execute after
WHEREandGROUP BY; window function results cannot be used directly inWHERE. UseQUALIFYor subqueries for filtering. - SEQUENCE with Large Date Ranges:
SEQUENCEgenerates an array, andEXPLODEexpands it to rows. When generating large date ranges (e.g., multiple years), be mindful of the impact on query performance.
