Time Series Data Warehouse: Power Load Analysis and Forecasting
Building a multi-layer time-series data warehouse from PJM grid hourly load data to output peak-valley pricing strategies, load curve analysis, and anomaly detection alerts. Based on the real PJM Interconnection power dataset (2018, 4 days, 96 hourly records), this guide provides an end-to-end demonstration of the complete Kafka PIPE → Bronze → Silver → Gold build process, covering four core time-series capabilities: Window Functions (LAG / LEAD / ROWS BETWEEN), peak-valley identification, year-over-year comparison, and Z-score anomaly detection.
Overview
The typical pipeline for a power load data warehouse is: smart meter reporting → real-time ingestion → raw storage (Bronze) → hourly aggregation and cleansing (Silver) → daily peak-valley metrics (Gold) → load forecasting and BI.
Singdata Lakehouse addresses several core challenges with the following combination:
Problem
Solution
High-volume real-time ingestion of smart meter minute-level data
Kafka PIPE for continuous ingestion — no consumer code required
Hourly aggregation and peak-valley annotation need auto-update with upstream data
Dynamic Table with declarative SQL; system auto-computes incrementally
Load curve analysis requires comparison with previous and next hours
LAG / LEAD window functions for cross-row reference without self-JOINs
Sliding average to smooth noisy data
ROWS BETWEEN N PRECEDING AND CURRENT ROW
Same-season winter/summer comparison analysis
Conditional aggregation + CASE WHEN MONTH()
Daily peak-valley price spread and peak-valley ratio statistics
Gold layer aggregation supporting drill-down at any granularity
SQL Commands Used
Command / Function
Purpose
Notes
CREATE TABLE
Create Bronze layer raw load table and meter metadata table
Static tables serving as upstream sources for Dynamic Tables
CREATE BLOOMFILTER INDEX
Create index on event_time column
Accelerates point-query filtering for time range queries
System auto-detects upstream changes and refreshes incrementally
LAG / LEAD
Reference previous/next row load values
Compute hourly change amounts and trends
AVG ... OVER (ROWS BETWEEN)
Sliding window average
Smooth noise, identify trends
STDDEV
Calculate intraday load standard deviation
Basis for Z-score anomaly detection
REFRESH DYNAMIC TABLE
Manually trigger a single refresh
Used for initial build or debugging
Prerequisites
All examples in this guide run under the best_practice_energy_ts schema.
CREATE SCHEMA IF NOT EXISTS best_practice_energy_ts;
Bronze Layer: Raw Load Data Table
Create Tables
doc_pjme_load_raw stores hourly load data from the PJM Eastern Interconnection (PJME), with each row representing one hourly observation.
CREATE TABLE IF NOT EXISTS best_practice_energy_ts.doc_pjme_load_raw (
event_time TIMESTAMP,
load_mw DOUBLE
);
Also create a meter region master data table for subsequent dimensional enrichment:
CREATE TABLE IF NOT EXISTS best_practice_energy_ts.doc_meter_metadata (
meter_id STRING,
region STRING,
voltage_level STRING,
capacity_mw DOUBLE,
install_year INT,
operator STRING
);
Create BloomFilter Index
Time-series queries almost always include time range filtering on event_time. This column is suitable for a BloomFilter Index to accelerate equality and range filtering.
CREATE BLOOMFILTER INDEX IF NOT EXISTS idx_bf_event_time
ON TABLE doc_pjme_load_raw (event_time);
⚠️ Note: CREATE BLOOMFILTER INDEX requires the same schema context as the target table. Before executing, run USE SCHEMA best_practice_energy_ts or use the -s best_practice_energy_ts parameter, otherwise you will get an "index and table must in the same schema" error.
Configure Kafka PIPE
Smart meter readings are streamed in real-time via a Kafka topic. Replace the broker address and topic name with your actual values before using in production.
Option 1: Write via Kafka (recommended)
First create a raw string receiver table, then create the PIPE:
CREATE TABLE IF NOT EXISTS best_practice_energy_ts.doc_kafka_raw_load (
kafka_value STRING
);
CREATE PIPE IF NOT EXISTS best_practice_energy_ts.pipe_energy_load
VIRTUAL_CLUSTER = 'DEFAULT'
BATCH_INTERVAL_IN_SECONDS = '60'
AS
COPY INTO best_practice_energy_ts.doc_kafka_raw_load
FROM (
SELECT CAST(value AS STRING) AS kafka_value
FROM READ_KAFKA(
'<kafka-broker>:9092', -- replace with actual broker address
'energy.load.realtime', -- topic name
'',
'cz_energy_consumer', -- consumer group ID
'','','','',
'raw', 'raw',
0,
map()
)
);
💡 Tip: In the PIPE DDL, positional parameters 5–8 for READ_KAFKA (start/end offsets, timestamps) must be left empty and are managed automatically by the PIPE at runtime.
Python producer example to trigger Kafka writes (using kafka-python):
from kafka import KafkaProducer
import json, time, random
producer = KafkaProducer(
bootstrap_servers=['<kafka-broker>:9092'],
value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
# Simulate smart meter reporting once per minute
while True:
record = {
"event_time": time.strftime('%Y-%m-%d %H:%M:%S'),
"meter_id": "PJME",
"load_mw": round(random.uniform(25000, 55000), 1)
}
producer.send('energy.load.realtime', value=record)
print(f"Sent: {record}")
time.sleep(60)
Option 2: INSERT simulation (when Kafka is unavailable)
If Kafka is not yet configured, use INSERT INTO to write directly to the target table, simulating the effect of messages already parsed and written, to validate subsequent Dynamic Table logic.
This guide uses the PJM Hourly Energy Consumption dataset (CC0 license), selecting 2 typical winter (January) and 2 typical summer (July) days from 2018 — 96 hourly records total.
Import data from local CSV (recommended):
-- Step 1: Upload local CSV to User Volume via SQL PUT
PUT '/path/to/your/doc_pjme_load_raw.csv' TO USER VOLUME FILE 'doc_pjme_load_raw.csv';
-- Step 2: COPY INTO the table from User Volume
COPY INTO best_practice_energy_ts.doc_pjme_load_raw
FROM USER VOLUME
USING csv
OPTIONS('header'='true', 'sep'=',', 'nullValue'='')
FILES ('doc_pjme_load_raw.csv');
You can also insert small batches of test data inline (no CSV file needed):
INSERT INTO best_practice_energy_ts.doc_pjme_load_raw (event_time, load_mw) VALUES
(CAST('2018-01-01 00:00:00' AS TIMESTAMP), 39928.0),
(CAST('2018-01-01 01:00:00' AS TIMESTAMP), 38925.0),
-- ... 96 records total, covering 2018-01-01, 2018-01-15, 2018-07-01, 2018-07-15
(CAST('2018-07-15 23:00:00' AS TIMESTAMP), 37301.0);
Verify data load:
SELECT COUNT(*) AS total_rows FROM best_practice_energy_ts.doc_pjme_load_raw;
total_rows
----------
96
Insert meter metadata:
Import from local CSV (recommended):
-- Step 1: Upload local CSV to User Volume via SQL PUT
PUT '/path/to/your/doc_meter_metadata.csv' TO USER VOLUME FILE 'doc_meter_metadata.csv';
-- Step 2: COPY INTO the table from User Volume
COPY INTO best_practice_energy_ts.doc_meter_metadata
FROM USER VOLUME
USING csv
OPTIONS('header'='true', 'sep'=',', 'nullValue'='')
FILES ('doc_meter_metadata.csv');
You can also insert small batches of test data inline (no CSV file needed):
Silver Layer Dynamic Table: Hourly Aggregation and Peak-Valley Annotation
The Silver layer aggregates Bronze raw data at hourly granularity, computing averages, peaks, and valleys, and annotating tariff_period (peak / valley) by electricity pricing time slot.
Peak-valley time slot definition (example — adjust according to actual business needs):
Peak time: 09:00–21:59 daily
Valley time: 00:00–08:59, 22:00–23:59
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_energy_ts.doc_silver_hourly_load
AS
SELECT
DATE_TRUNC('hour', event_time) AS hour_ts,
DATE(event_time) AS load_date,
HOUR(event_time) AS load_hour,
ROUND(AVG(load_mw), 1) AS avg_load_mw,
ROUND(MAX(load_mw), 1) AS max_load_mw,
ROUND(MIN(load_mw), 1) AS min_load_mw,
COUNT(*) AS data_points,
CASE
WHEN HOUR(event_time) BETWEEN 9 AND 21 THEN 'peak'
ELSE 'valley'
END AS tariff_period
FROM best_practice_energy_ts.doc_pjme_load_raw
WHERE load_mw IS NOT NULL AND load_mw > 0
GROUP BY DATE_TRUNC('hour', event_time), DATE(event_time), HOUR(event_time);
⚠️ Note: Do not write REFRESH INTERVAL in Dynamic Table DDL. Refresh scheduling is managed via Lakehouse Studio Tasks. See the "Configure Refresh Scheduling Task" section below.
Manually trigger the first refresh:
REFRESH DYNAMIC TABLE best_practice_energy_ts.doc_silver_hourly_load;
SELECT COUNT(*) AS silver_count FROM best_practice_energy_ts.doc_silver_hourly_load;
silver_count
------------
96
Configure Silver Layer Refresh Scheduling Task
💡 Tip: The examples below use cz-cli (the Singdata ClickZetta Lakehouse CLI tool). If cz-cli is not installed, see cz-cli Setup Guide. If you prefer not to use the command line, you can also run SQL in the Development → SQL Editor of Lakehouse Studio, and configure/trigger scheduled tasks in the Studio → Tasks page.
Create a refresh task in Lakehouse Studio under the energy_ts folder via cz-cli:
After task creation, you can attach data quality check rules (e.g., COUNT(*) > 0) and alert notifications to refresh_silver_hourly_load in the Lakehouse Studio task interface, without modifying the Dynamic Table definition itself.
Time Series Analysis: Window Functions in Practice
The following analyses are all based on Silver layer data, demonstrating three typical time-series computation patterns.
Hourly Load Change (LAG / LEAD)
LAG references the previous hour's data, LEAD references the next hour's data, computing the load change for each hour:
SELECT
load_date,
load_hour,
avg_load_mw,
LAG(avg_load_mw, 1) OVER (PARTITION BY load_date ORDER BY load_hour) AS prev_hour_mw,
LEAD(avg_load_mw, 1) OVER (PARTITION BY load_date ORDER BY load_hour) AS next_hour_mw,
ROUND(
avg_load_mw
- LAG(avg_load_mw, 1) OVER (PARTITION BY load_date ORDER BY load_hour),
1
) AS hour_delta_mw
FROM best_practice_energy_ts.doc_silver_hourly_load
WHERE load_date = CAST('2018-07-01' AS DATE)
ORDER BY load_hour;
Result interpretation: Load continuously declines from 1 AM to 7 AM (hour_delta_mw is negative), dropping from 37,751 MW to 27,992 MW — a 26% decrease. This matches the pattern of reduced summer nighttime cooling demand. Starting at 7 AM the delta turns positive, signaling the start of the morning peak.
3-Hour Rolling Average (ROWS BETWEEN)
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW computes the rolling average over the past 3 hours, smoothing short-term fluctuations:
SELECT
load_date,
load_hour,
avg_load_mw,
ROUND(AVG(avg_load_mw) OVER (
PARTITION BY load_date
ORDER BY load_hour
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
), 1) AS rolling_3h_avg_mw
FROM best_practice_energy_ts.doc_silver_hourly_load
WHERE load_date = CAST('2018-07-01' AS DATE)
ORDER BY load_hour;
Result interpretation: The rolling average (rolling_3h_avg_mw) changes more gradually than the instantaneous value (avg_load_mw) — especially during the sharp morning load surge (hours 8–9), where the rolling average creates a lag effect. This can be used to filter noise from meter fluctuations.
Intraday Peak-Valley Identification and Peak Percentage
Calculate the daily peak, valley, and each hour's percentage of the peak:
SELECT
load_date,
load_hour,
avg_load_mw,
tariff_period,
MAX(avg_load_mw) OVER (PARTITION BY load_date) AS daily_peak_mw,
MIN(avg_load_mw) OVER (PARTITION BY load_date) AS daily_valley_mw,
ROUND(100.0 * avg_load_mw / MAX(avg_load_mw) OVER (PARTITION BY load_date), 1) AS pct_of_peak
FROM best_practice_energy_ts.doc_silver_hourly_load
WHERE load_date = CAST('2018-07-01' AS DATE)
ORDER BY load_hour;
Result interpretation: The intraday peak-valley spread for 2018-07-01 reaches 23,811 MW (peak-valley ratio 46%, see Gold layer), far higher than winter (approximately 16%). At 6 AM (27,992 MW) is the daily trough — only 54% of peak — making it the optimal time slot for energy storage charging and valley electricity pricing.
The Gold layer aggregates Silver layer data at daily granularity, outputting peak-valley price spreads, peak-valley ratios, and time-period average values for BI dashboards and pricing systems.
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_energy_ts.doc_gold_daily_load_profile
AS
SELECT
load_date,
COUNT(*) AS hours_recorded,
ROUND(AVG(avg_load_mw), 1) AS daily_avg_mw,
ROUND(MAX(max_load_mw), 1) AS daily_peak_mw,
ROUND(MIN(min_load_mw), 1) AS daily_valley_mw,
ROUND(MAX(max_load_mw) - MIN(min_load_mw), 1) AS peak_valley_spread_mw,
ROUND(
100.0 * (MAX(max_load_mw) - MIN(min_load_mw)) / MAX(max_load_mw), 1
) AS peak_valley_ratio_pct,
ROUND(
SUM(CASE WHEN tariff_period = 'peak' THEN avg_load_mw ELSE 0 END)
/ NULLIF(SUM(CASE WHEN tariff_period = 'peak' THEN 1 ELSE 0 END), 0),
1
) AS peak_period_avg_mw,
ROUND(
SUM(CASE WHEN tariff_period = 'valley' THEN avg_load_mw ELSE 0 END)
/ NULLIF(SUM(CASE WHEN tariff_period = 'valley' THEN 1 ELSE 0 END), 0),
1
) AS valley_period_avg_mw,
CASE
WHEN MONTH(load_date) IN (6,7,8) THEN 'summer'
WHEN MONTH(load_date) IN (12,1,2) THEN 'winter'
WHEN MONTH(load_date) IN (3,4,5) THEN 'spring'
ELSE 'autumn'
END AS season
FROM best_practice_energy_ts.doc_silver_hourly_load
GROUP BY load_date;
Manually trigger refresh and view results:
REFRESH DYNAMIC TABLE best_practice_energy_ts.doc_gold_daily_load_profile;
SELECT load_date, daily_avg_mw, daily_peak_mw, daily_valley_mw,
peak_valley_spread_mw, peak_valley_ratio_pct,
peak_period_avg_mw, valley_period_avg_mw, season
FROM best_practice_energy_ts.doc_gold_daily_load_profile
ORDER BY load_date;
Winter vs. summer: Winter (January) peak-valley spread is approximately 6,601–7,007 MW with a 15–17% ratio. Summer (July) peak-valley spread reaches 15,636–23,811 MW with a 37–46% ratio. Summer electricity consumption is extremely concentrated in the afternoon and evening (air conditioning load), making it the highest-value interval for peak-valley pricing.
2018-07-01 extreme day: Peak of 51,803 MW is the highest in the dataset with a 46% peak-valley ratio. Meanwhile 2018-07-15 only reaches 42,348 MW, showing significant load variation across days within the same month — monthly averages are not suitable for estimating daily peaks.
Peak vs. valley period average difference: On summer day 2018-07-01, peak period average (46,463 MW) is 1.38x valley period average (33,638 MW), indicating the strongest demand response incentive from time-of-use pricing in summer.
💡 Tip: In Lakehouse Studio, you can set a dependency relationship for refresh_gold_daily_load_profile to trigger only after refresh_silver_hourly_load completes successfully, ensuring the Gold layer reads the latest Silver layer data.
Winter/Summer Seasonal Comparison
Group Silver layer data by month and compare average winter vs. summer load at each hourly point:
SELECT
load_hour,
ROUND(AVG(CASE WHEN MONTH(load_date) = 7 THEN avg_load_mw END), 1) AS summer_avg_mw,
ROUND(AVG(CASE WHEN MONTH(load_date) = 1 THEN avg_load_mw END), 1) AS winter_avg_mw,
ROUND(
AVG(CASE WHEN MONTH(load_date) = 7 THEN avg_load_mw END)
- AVG(CASE WHEN MONTH(load_date) = 1 THEN avg_load_mw END),
1
) AS summer_vs_winter_delta
FROM best_practice_energy_ts.doc_silver_hourly_load
GROUP BY load_hour
ORDER BY load_hour;
Result interpretation: Winter load is generally higher than summer from midnight to mid-morning (hours 0–11), with the difference peaking at -12,561 MW at 7 AM (winter is approximately 46% higher than summer). The root cause is that winter early-morning heating demand combined with later sunrise shifts the morning peak earlier. After noon, summer gradually overtakes winter, with afternoon air conditioning load taking over. This pattern provides important guidance for dynamically adjusting peak-valley time slots by season.
Anomaly Detection: Z-Score Method
Identify anomalous time slots within the day based on intraday standard deviation. Z-score absolute value > 2.0 is flagged as an anomaly:
WITH stats AS (
SELECT
load_date,
AVG(avg_load_mw) AS mean_mw,
STDDEV(avg_load_mw) AS std_mw
FROM best_practice_energy_ts.doc_silver_hourly_load
GROUP BY load_date
)
SELECT
h.load_date,
h.load_hour,
h.avg_load_mw,
ROUND((h.avg_load_mw - s.mean_mw) / NULLIF(s.std_mw, 0), 2) AS z_score,
CASE
WHEN ABS((h.avg_load_mw - s.mean_mw) / NULLIF(s.std_mw, 0)) > 2.0 THEN 'anomaly'
ELSE 'normal'
END AS anomaly_flag
FROM best_practice_energy_ts.doc_silver_hourly_load h
JOIN stats s ON h.load_date = s.load_date
WHERE h.load_date = CAST('2018-07-01' AS DATE)
ORDER BY h.load_hour;
First 10 rows (excerpt):
load_date | load_hour | avg_load_mw | z_score | anomaly_flag
-----------+-----------+-------------+---------+-------------
2018-07-01 | 0 | 37751 | -0.32 | normal
2018-07-01 | 1 | 34716 | -0.66 | normal
2018-07-01 | 2 | 32345 | -0.93 | normal
2018-07-01 | 3 | 30546 | -1.14 | normal
2018-07-01 | 4 | 29300 | -1.28 | normal
2018-07-01 | 5 | 28511 | -1.37 | normal
2018-07-01 | 6 | 27992 | -1.43 | normal
2018-07-01 | 7 | 28211 | -1.4 | normal
2018-07-01 | 8 | 30337 | -1.16 | normal
2018-07-01 | 9 | 33759 | -0.77 | normal
Intraday standard deviation statistics:
SELECT
load_date,
ROUND(AVG(avg_load_mw), 1) AS mean_mw,
ROUND(STDDEV(avg_load_mw), 1) AS stddev_mw,
ROUND(AVG(avg_load_mw) + 2 * STDDEV(avg_load_mw), 1) AS upper_2sigma,
ROUND(AVG(avg_load_mw) - 2 * STDDEV(avg_load_mw), 1) AS lower_2sigma
FROM best_practice_energy_ts.doc_silver_hourly_load
GROUP BY load_date
ORDER BY load_date;
Result interpretation: Winter (January) intraday standard deviation is approximately 2,000–2,500 MW with a flat load curve. Summer (July) standard deviation reaches 5,000–8,800 MW with extreme intraday volatility. The 2σ upper bound for summer 2018-07-01 is 58,252 MW, while the actual peak of 51,803 MW does not exceed this limit — so there are no Z-score anomalies for this day. This means although the absolute peak is high, it falls within the statistical pattern for that day. Equipment failures causing instantaneous spikes would trigger Z-score > 2 anomaly flags.
Load Forecasting (External Function Integration)
Gold layer data can be connected to time-series forecasting models. The following shows the architectural approach for calling external Prophet / ARIMA forecasting services via External Function (illustrative code — actual deployment requires configuring an API Connection):
-- Create External Function calling Prophet forecasting service (illustrative)
-- CREATE EXTERNAL FUNCTION best_practice_energy_ts.predict_next_24h(
-- history_load ARRAY<DOUBLE>,
-- history_timestamps ARRAY<STRING>
-- )
-- RETURNS STRUCT<forecast_mw ARRAY<DOUBLE>, forecast_timestamps ARRAY<STRING>>
-- LANGUAGE PYTHON
-- HANDLER = 'ProphetForecast.predict'
-- RESOURCES = 'volume://functions/prophet_forecast.zip'
-- CONNECTION = my_api_connection;
-- Using the forecast function (illustrative — uncomment to call after External Function is configured)
SELECT
load_date,
daily_avg_mw,
daily_peak_mw
-- predict_next_24h(...) AS forecast_result -- uncomment when calling in practice
FROM best_practice_energy_ts.doc_gold_daily_load_profile
ORDER BY load_date;
Smart Meter (real-time) Historical CSV (batch)
| |
v pipe_energy_load v INSERT
doc_kafka_raw_load doc_pjme_load_raw
| BloomFilter idx: event_time
|
doc_meter_metadata (dimension reference)
|
v Studio Task: every 30 min
doc_silver_hourly_load (Dynamic Table)
avg/max/min load_mw · tariff_period
LAG / LEAD · ROWS BETWEEN rolling avg
|
v Studio Task: daily 01:00
doc_gold_daily_load_profile (Dynamic Table)
daily peak/valley · spread · ratio · season
|
+-------------+-------------+
v v v
Load Curve BI Peak-Valley External Function
Dashboard Pricing System Prophet Forecast
Notes
Do not write REFRESH INTERVAL in Dynamic Table DDL: All Dynamic Tables in this guide have no REFRESH INTERVAL in their DDL. Refresh scheduling is managed uniformly through Lakehouse Studio Tasks, which allows monitoring alerts and data quality checks to be attached to the same task, and makes it easy to adjust refresh frequency without rebuilding the Dynamic Table.
BloomFilter Index only applies to new data: CREATE BLOOMFILTER INDEX takes effect for data written after the index is created. For tables with large amounts of existing data, note that the BLOOMFILTER type currently does not support BUILD INDEX — rebuilding the table is required to cover existing data.
Idempotency of Silver layer aggregation: The Silver layer Dynamic Table groups Bronze layer data by DATE_TRUNC('hour', event_time). If the Bronze layer contains multiple records per hour (e.g., minute-level data), AVG / MAX / MIN will aggregate correctly. If the Bronze layer uses INSERT OVERWRITE, it causes the Dynamic Table to fall back to full refresh mode. Use append writes (INSERT INTO) instead.
Intraday limitation of Z-score: The Z-score in this guide uses the current day's mean and standard deviation as the baseline, suitable for detecting anomalous time slots within the day. For cross-day detection (e.g., comparing against the same hour historically), use multi-day window statistics — e.g., historical mean and standard deviation at the same hourly point.
Winter/summer comparison limitations: This dataset contains only 2 days each from January and July 2018. Statistical conclusions are only for illustrating the analytical method. For production use, the complete yearly dataset should be used (PJM dataset has approximately 140,000 rows from 2002–2018) for robust conclusions.
Peak-valley time slots can be adjusted for your business: This guide uses 09:00–21:59 as peak time. You can modify the CASE WHEN in the Silver layer DDL according to actual electricity pricing policies (e.g., residential vs. commercial peak-valley time slots differ). No changes to downstream Gold layer logic are needed.