Lakehouse Time Series Analysis Guide

Overview

Time series analysis is a core scenario in data analysis, widely applied to business trend monitoring, user behavior analysis, and operational metric computation. Singdata Lakehouse provides complete time function and window function support, including time truncation, date formatting, rolling window aggregation, and date series generation. This guide categorizes usage by business scenario to help you quickly master efficient time series analysis methods.


SQL Commands Covered

Command/FunctionPurposeApplicable Scenario
DATE_FORMAT()Format timestamp as stringGroup by month/day, display
DATE_TRUNC()Truncate to specified time granularityAggregate by week/month/year
DATEDIFF()Calculate day difference between two datesEvent interval, retention analysis
LAG() / LEAD()Access previous/next row dataMoM, YoY calculation
AVG() OVER (RANGE BETWEEN INTERVAL ...)Time-range-based rolling aggregation7-day/30-day rolling average
SEQUENCE() + EXPLODE()Generate continuous date series and expand to rowsDate padding, calendar table
INTERVALTime offsetDate arithmetic, range filter
COALESCE()Null value replacementZero-fill, default value fill

Prerequisites

The following examples use a simulated user event table doc_ts_events:

-- Create test table
CREATE TABLE IF NOT EXISTS doc_ts_events (
    event_id   INT,
    user_id    INT,
    event_type STRING,
    amount     DOUBLE,
    event_time TIMESTAMP
);

-- Insert test data (covering Jan 2024 through Mar 2024)
INSERT INTO doc_ts_events VALUES
(1,  101, 'purchase', 120.50, CAST('2024-01-05 10:00:00' AS TIMESTAMP)),
(2,  102, 'purchase',  85.00, CAST('2024-01-10 14:30:00' AS TIMESTAMP)),
(3,  101, 'refund',    30.00, CAST('2024-01-15 09:00:00' AS TIMESTAMP)),
(4,  103, 'purchase', 200.00, CAST('2024-01-20 16:00:00' AS TIMESTAMP)),
(5,  102, 'purchase',  55.00, CAST('2024-01-25 11:00:00' AS TIMESTAMP)),
(6,  101, 'purchase', 300.00, CAST('2024-02-03 10:00:00' AS TIMESTAMP)),
(7,  104, 'purchase', 150.00, CAST('2024-02-08 13:00:00' AS TIMESTAMP)),
(8,  103, 'refund',    50.00, CAST('2024-02-14 15:00:00' AS TIMESTAMP)),
(9,  102, 'purchase',  90.00, CAST('2024-02-18 09:30:00' AS TIMESTAMP)),
(10, 101, 'purchase',  75.00, CAST('2024-02-22 17:00:00' AS TIMESTAMP)),
(11, 104, 'purchase', 220.00, CAST('2024-03-01 10:00:00' AS TIMESTAMP)),
(12, 103, 'purchase', 180.00, CAST('2024-03-07 14:00:00' AS TIMESTAMP)),
(13, 101, 'refund',    25.00, CAST('2024-03-12 11:00:00' AS TIMESTAMP)),
(14, 102, 'purchase', 130.00, CAST('2024-03-18 16:00:00' AS TIMESTAMP)),
(15, 104, 'purchase',  95.00, CAST('2024-03-25 12:00:00' AS TIMESTAMP));

Scenario 1: Aggregate by Time Granularity

Count the number of events and total amount by month and event type -- the most common time series aggregation requirement.

-- Aggregate by month and event type
SELECT
    DATE_FORMAT(event_time, 'yyyy-MM') AS month,
    event_type,
    COUNT(*)                           AS event_count,
    ROUND(SUM(amount), 2)              AS total_amount
FROM doc_ts_events
GROUP BY DATE_FORMAT(event_time, 'yyyy-MM'), event_type
ORDER BY month, event_type;

Execution Result:

monthevent_typeevent_counttotal_amount
2024-01purchase4460.5
2024-01refund130
2024-02purchase4615
2024-02refund150
2024-03purchase4625
2024-03refund125

To aggregate by week, use DATE_TRUNC('week', ...) to truncate to Monday:

-- Aggregate by natural week (week_start is the Monday of each week)
SELECT
    DATE_FORMAT(DATE_TRUNC('week', event_time), 'yyyy-MM-dd') AS week_start,
    COUNT(*)                                                   AS event_count,
    ROUND(SUM(amount), 2)                                      AS weekly_amount
FROM doc_ts_events
WHERE event_type = 'purchase'
GROUP BY DATE_TRUNC('week', event_time)
ORDER BY week_start;

Execution Result (first 5 rows):

week_startevent_countweekly_amount
2024-01-011120.5
2024-01-08185
2024-01-151200
2024-01-22155
2024-01-291300

Scenario 2: Rolling Window Calculation

Compute a 7-day rolling average of amounts for each trading day to smooth short-term fluctuations and observe trends.

RANGE BETWEEN INTERVAL N DAY PRECEDING AND CURRENT ROW defines the window based on time range (not row count), correctly handling non-consecutive dates.

WITH daily_sales AS (
    SELECT
        CAST(DATE_FORMAT(event_time, 'yyyy-MM-dd') AS DATE) AS sale_day,
        ROUND(SUM(amount), 2)                               AS daily_amount
    FROM doc_ts_events
    WHERE event_type = 'purchase'
    GROUP BY DATE_FORMAT(event_time, 'yyyy-MM-dd')
)
SELECT
    sale_day,
    daily_amount,
    ROUND(
        AVG(daily_amount) OVER (
            ORDER BY sale_day
            RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW
        ), 2
    ) AS rolling_7d_avg
FROM daily_sales
ORDER BY sale_day;

Execution Result:

sale_daydaily_amountrolling_7d_avg
2024-01-05120.5120.5
2024-01-1085102.75
2024-01-20200200
2024-01-2555127.5
2024-02-03300300
2024-02-08150225
2024-02-189090
2024-02-227582.5
2024-03-01220220
2024-03-07180200
2024-03-18130130
2024-03-259595

Scenario 3: YoY/MoM Calculation

Use LAG() to get the previous period's data and compute the month-over-month growth rate.

WITH monthly_sales AS (
    SELECT
        DATE_FORMAT(event_time, 'yyyy-MM') AS month,
        ROUND(SUM(amount), 2)              AS monthly_amount
    FROM doc_ts_events
    WHERE event_type = 'purchase'
    GROUP BY DATE_FORMAT(event_time, 'yyyy-MM')
)
SELECT
    month,
    monthly_amount,
    LAG(monthly_amount, 1) OVER (ORDER BY month)                AS prev_month_amount,
    ROUND(
        (monthly_amount - LAG(monthly_amount, 1) OVER (ORDER BY month))
        / LAG(monthly_amount, 1) OVER (ORDER BY month) * 100,
        2
    )                                                           AS mom_growth_pct
FROM monthly_sales
ORDER BY month;

Execution Result:

monthmonthly_amountprev_month_amountmom_growth_pct
2024-01460.5NULLNULL
2024-02615460.533.55
2024-036256151.63

Scenario 4: Time Interval Calculation

Compute the number of days between consecutive events for each user, used to analyze activity frequency and purchase cycles.

SELECT
    user_id,
    event_type,
    DATE_FORMAT(event_time, 'yyyy-MM-dd')                           AS event_day,
    LAG(DATE_FORMAT(event_time, 'yyyy-MM-dd'), 1)
        OVER (PARTITION BY user_id ORDER BY event_time)             AS prev_event_day,
    DATEDIFF(
        DATE_FORMAT(event_time, 'yyyy-MM-dd'),
        LAG(DATE_FORMAT(event_time, 'yyyy-MM-dd'), 1)
            OVER (PARTITION BY user_id ORDER BY event_time)
    )                                                               AS days_since_last
FROM doc_ts_events
ORDER BY user_id, event_time;

Execution Result:

user_idevent_typeevent_dayprev_event_daydays_since_last
101purchase2024-01-05NULLNULL
101refund2024-01-152024-01-0510
101purchase2024-02-032024-01-1519
101purchase2024-02-222024-02-0319
101refund2024-03-122024-02-2219
102purchase2024-01-10NULLNULL
102purchase2024-01-252024-01-1015
102purchase2024-02-182024-01-2524
102purchase2024-03-182024-02-1829
103purchase2024-01-20NULLNULL
103refund2024-02-142024-01-2025
103purchase2024-03-072024-02-1422
104purchase2024-02-08NULLNULL
104purchase2024-03-012024-02-0822
104purchase2024-03-252024-03-0124

Scenario 5: Fill Missing Dates with Zero

When some dates have no data, direct GROUP BY skips those dates, causing gaps in line charts. Use SEQUENCE + EXPLODE to generate a complete date series, then LEFT JOIN actual data and use COALESCE to fill NULLs with zero.

WITH date_spine AS (
    -- Generate continuous dates from 2024-01-01 to 2024-03-31
    SELECT EXPLODE(
        SEQUENCE(
            CAST('2024-01-01' AS DATE),
            CAST('2024-03-31' AS DATE),
            INTERVAL 1 DAY
        )
    ) AS cal_day
),
daily_sales AS (
    SELECT
        CAST(DATE_FORMAT(event_time, 'yyyy-MM-dd') AS DATE) AS sale_day,
        ROUND(SUM(amount), 2)                               AS daily_amount
    FROM doc_ts_events
    WHERE event_type = 'purchase'
    GROUP BY DATE_FORMAT(event_time, 'yyyy-MM-dd')
)
SELECT
    d.cal_day,
    COALESCE(s.daily_amount, 0) AS daily_amount
FROM date_spine d
LEFT JOIN daily_sales s ON d.cal_day = s.sale_day
ORDER BY d.cal_day
LIMIT 10;

Execution Result (first 10 rows):

cal_daydaily_amount
2024-01-010
2024-01-020
2024-01-030
2024-01-040
2024-01-05120.5
2024-01-060
2024-01-070
2024-01-080
2024-01-090
2024-01-1085

Scenario 6: Active Users in Last N Days

Filter users who had at least 2 purchases in the last 90 days, and count their purchase frequency and last active date.

SELECT
    user_id,
    COUNT(*)                                   AS purchase_count,
    MAX(DATE_FORMAT(event_time, 'yyyy-MM-dd')) AS last_active_day
FROM doc_ts_events
WHERE event_type = 'purchase'
  AND event_time >= CAST('2024-03-31' AS TIMESTAMP) - INTERVAL 90 DAY
GROUP BY user_id
HAVING COUNT(*) >= 2
ORDER BY last_active_day DESC;

Execution Result:

user_idpurchase_countlast_active_day
10432024-03-25
10242024-03-18
10322024-03-07
10132024-02-22

Clean Up Test Data

After completing time series analysis verification, it is recommended to clean up test tables:

-- Drop test table
DROP TABLE IF EXISTS doc_ts_events;

Important Notes

  1. Timezone Handling: TIMESTAMP type stores UTC time. DATE_FORMAT and DATE_TRUNC parse in the session timezone (default UTC+8). When inserting data, use CAST('yyyy-MM-dd HH:mm:ss' AS TIMESTAMP) to explicitly specify time literals and avoid implicit conversion failures.
  2. RANGE vs ROWS: ROWS BETWEEN N PRECEDING defines the window by row count, while RANGE BETWEEN INTERVAL N DAY PRECEDING defines it by time range. Results differ when dates are non-consecutive; time series analysis typically uses RANGE.
  3. DATE_FORMAT Format Strings: Lakehouse uses Java-style format strings: yyyy for four-digit year, MM for month, dd for date, HH for 24-hour hour. Format letters are case-sensitive.
  4. DATEDIFF Parameter Order: DATEDIFF(end_date, start_date) returns the number of days as end_date - start_date.
  5. Window Function Execution Order: Window functions execute after WHERE and GROUP BY; window function results cannot be used directly in WHERE. Use QUALIFY or subqueries for filtering.
  6. SEQUENCE with Large Date Ranges: SEQUENCE generates an array, and EXPLODE expands it to rows. When generating large date ranges (e.g., multiple years), be mindful of the impact on query performance.