Time Series Data Warehouse: Power Load Analysis and Forecasting

Building a multi-layer time-series data warehouse from PJM grid hourly load data to output peak-valley pricing strategies, load curve analysis, and anomaly detection alerts. Based on the real PJM Interconnection power dataset (2018, 4 days, 96 hourly records), this guide provides an end-to-end demonstration of the complete Kafka PIPE → Bronze → Silver → Gold build process, covering four core time-series capabilities: Window Functions (LAG / LEAD / ROWS BETWEEN), peak-valley identification, year-over-year comparison, and Z-score anomaly detection.


Overview

The typical pipeline for a power load data warehouse is: smart meter reporting → real-time ingestion → raw storage (Bronze) → hourly aggregation and cleansing (Silver) → daily peak-valley metrics (Gold) → load forecasting and BI.

Singdata Lakehouse addresses several core challenges with the following combination:

ProblemSolution
High-volume real-time ingestion of smart meter minute-level dataKafka PIPE for continuous ingestion — no consumer code required
Hourly aggregation and peak-valley annotation need auto-update with upstream dataDynamic Table with declarative SQL; system auto-computes incrementally
Load curve analysis requires comparison with previous and next hoursLAG / LEAD window functions for cross-row reference without self-JOINs
Sliding average to smooth noisy dataROWS BETWEEN N PRECEDING AND CURRENT ROW
Same-season winter/summer comparison analysisConditional aggregation + CASE WHEN MONTH()
Daily peak-valley price spread and peak-valley ratio statisticsGold layer aggregation supporting drill-down at any granularity

SQL Commands Used

Command / FunctionPurposeNotes
CREATE TABLECreate Bronze layer raw load table and meter metadata tableStatic tables serving as upstream sources for Dynamic Tables
CREATE BLOOMFILTER INDEXCreate index on event_time columnAccelerates point-query filtering for time range queries
CREATE PIPECreate Kafka continuous ingestion pipelineBound to the Bronze layer target table
CREATE DYNAMIC TABLECreate Silver / Gold layer incremental computation tablesSystem auto-detects upstream changes and refreshes incrementally
LAG / LEADReference previous/next row load valuesCompute hourly change amounts and trends
AVG ... OVER (ROWS BETWEEN)Sliding window averageSmooth noise, identify trends
STDDEVCalculate intraday load standard deviationBasis for Z-score anomaly detection
REFRESH DYNAMIC TABLEManually trigger a single refreshUsed for initial build or debugging

Prerequisites

All examples in this guide run under the best_practice_energy_ts schema.

CREATE SCHEMA IF NOT EXISTS best_practice_energy_ts;


Bronze Layer: Raw Load Data Table

Create Tables

doc_pjme_load_raw stores hourly load data from the PJM Eastern Interconnection (PJME), with each row representing one hourly observation.

CREATE TABLE IF NOT EXISTS best_practice_energy_ts.doc_pjme_load_raw ( event_time TIMESTAMP, load_mw DOUBLE );

Also create a meter region master data table for subsequent dimensional enrichment:

CREATE TABLE IF NOT EXISTS best_practice_energy_ts.doc_meter_metadata ( meter_id STRING, region STRING, voltage_level STRING, capacity_mw DOUBLE, install_year INT, operator STRING );

Create BloomFilter Index

Time-series queries almost always include time range filtering on event_time. This column is suitable for a BloomFilter Index to accelerate equality and range filtering.

CREATE BLOOMFILTER INDEX IF NOT EXISTS idx_bf_event_time ON TABLE doc_pjme_load_raw (event_time);

Configure Kafka PIPE

Smart meter readings are streamed in real-time via a Kafka topic. Replace the broker address and topic name with your actual values before using in production.

Option 1: Write via Kafka (recommended)

First create a raw string receiver table, then create the PIPE:

CREATE TABLE IF NOT EXISTS best_practice_energy_ts.doc_kafka_raw_load ( kafka_value STRING ); CREATE PIPE IF NOT EXISTS best_practice_energy_ts.pipe_energy_load VIRTUAL_CLUSTER = 'DEFAULT' BATCH_INTERVAL_IN_SECONDS = '60' AS COPY INTO best_practice_energy_ts.doc_kafka_raw_load FROM ( SELECT CAST(value AS STRING) AS kafka_value FROM READ_KAFKA( '<kafka-broker>:9092', -- replace with actual broker address 'energy.load.realtime', -- topic name '', 'cz_energy_consumer', -- consumer group ID '','','','', 'raw', 'raw', 0, map() ) );

Python producer example to trigger Kafka writes (using kafka-python):

from kafka import KafkaProducer import json, time, random producer = KafkaProducer( bootstrap_servers=['<kafka-broker>:9092'], value_serializer=lambda v: json.dumps(v).encode('utf-8') ) # Simulate smart meter reporting once per minute while True: record = { "event_time": time.strftime('%Y-%m-%d %H:%M:%S'), "meter_id": "PJME", "load_mw": round(random.uniform(25000, 55000), 1) } producer.send('energy.load.realtime', value=record) print(f"Sent: {record}") time.sleep(60)

Option 2: INSERT simulation (when Kafka is unavailable)

If Kafka is not yet configured, use INSERT INTO to write directly to the target table, simulating the effect of messages already parsed and written, to validate subsequent Dynamic Table logic.

This guide uses the PJM Hourly Energy Consumption dataset (CC0 license), selecting 2 typical winter (January) and 2 typical summer (July) days from 2018 — 96 hourly records total.

Import data from local CSV (recommended):

-- Step 1: Upload local CSV to User Volume via SQL PUT PUT '/path/to/your/doc_pjme_load_raw.csv' TO USER VOLUME FILE 'doc_pjme_load_raw.csv';

-- Step 2: COPY INTO the table from User Volume COPY INTO best_practice_energy_ts.doc_pjme_load_raw FROM USER VOLUME USING csv OPTIONS('header'='true', 'sep'=',', 'nullValue'='') FILES ('doc_pjme_load_raw.csv');

You can also insert small batches of test data inline (no CSV file needed):

INSERT INTO best_practice_energy_ts.doc_pjme_load_raw (event_time, load_mw) VALUES (CAST('2018-01-01 00:00:00' AS TIMESTAMP), 39928.0), (CAST('2018-01-01 01:00:00' AS TIMESTAMP), 38925.0), -- ... 96 records total, covering 2018-01-01, 2018-01-15, 2018-07-01, 2018-07-15 (CAST('2018-07-15 23:00:00' AS TIMESTAMP), 37301.0);

Verify data load:

SELECT COUNT(*) AS total_rows FROM best_practice_energy_ts.doc_pjme_load_raw;

total_rows ---------- 96

Insert meter metadata:

Import from local CSV (recommended):

-- Step 1: Upload local CSV to User Volume via SQL PUT PUT '/path/to/your/doc_meter_metadata.csv' TO USER VOLUME FILE 'doc_meter_metadata.csv';

-- Step 2: COPY INTO the table from User Volume COPY INTO best_practice_energy_ts.doc_meter_metadata FROM USER VOLUME USING csv OPTIONS('header'='true', 'sep'=',', 'nullValue'='') FILES ('doc_meter_metadata.csv');

You can also insert small batches of test data inline (no CSV file needed):

INSERT INTO best_practice_energy_ts.doc_meter_metadata VALUES ('PJME', 'East PJM', '345kV', 60000.0, 1997, 'PJM Interconnection'), ('AEP', 'American Electric Power', '138kV', 22000.0, 2004, 'AEP Ohio'), ('DAYTON', 'Dayton Power', '69kV', 6500.0, 2003, 'AES Ohio'), ('COMED', 'ComEd Chicago', '345kV', 25000.0, 2002, 'Commonwealth Edison'), ('DEOK', 'Duke Energy Ohio-KY', '138kV', 8000.0, 2004, 'Duke Energy');


Silver Layer Dynamic Table: Hourly Aggregation and Peak-Valley Annotation

The Silver layer aggregates Bronze raw data at hourly granularity, computing averages, peaks, and valleys, and annotating tariff_period (peak / valley) by electricity pricing time slot.

Peak-valley time slot definition (example — adjust according to actual business needs):

  • Peak time: 09:00–21:59 daily
  • Valley time: 00:00–08:59, 22:00–23:59

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_energy_ts.doc_silver_hourly_load AS SELECT DATE_TRUNC('hour', event_time) AS hour_ts, DATE(event_time) AS load_date, HOUR(event_time) AS load_hour, ROUND(AVG(load_mw), 1) AS avg_load_mw, ROUND(MAX(load_mw), 1) AS max_load_mw, ROUND(MIN(load_mw), 1) AS min_load_mw, COUNT(*) AS data_points, CASE WHEN HOUR(event_time) BETWEEN 9 AND 21 THEN 'peak' ELSE 'valley' END AS tariff_period FROM best_practice_energy_ts.doc_pjme_load_raw WHERE load_mw IS NOT NULL AND load_mw > 0 GROUP BY DATE_TRUNC('hour', event_time), DATE(event_time), HOUR(event_time);

Manually trigger the first refresh:

REFRESH DYNAMIC TABLE best_practice_energy_ts.doc_silver_hourly_load; SELECT COUNT(*) AS silver_count FROM best_practice_energy_ts.doc_silver_hourly_load;

silver_count ------------ 96

Configure Silver Layer Refresh Scheduling Task

Create a refresh task in Lakehouse Studio under the energy_ts folder via cz-cli:

# 1. Create folder cz-cli task create-folder energy_ts -p skill_test # 2. Create SQL task cz-cli task create refresh_silver_hourly_load --type SQL --folder energy_ts -p skill_test # 3. Set task content cz-cli task save-content refresh_silver_hourly_load \ --content "REFRESH DYNAMIC TABLE best_practice_energy_ts.doc_silver_hourly_load;" \ -p skill_test # 4. Set schedule: refresh every 30 minutes cz-cli task save-cron refresh_silver_hourly_load \ --cron "0 */30 * * * ? *" -p skill_test

After task creation, you can attach data quality check rules (e.g., COUNT(*) > 0) and alert notifications to refresh_silver_hourly_load in the Lakehouse Studio task interface, without modifying the Dynamic Table definition itself.


Time Series Analysis: Window Functions in Practice

The following analyses are all based on Silver layer data, demonstrating three typical time-series computation patterns.

Hourly Load Change (LAG / LEAD)

LAG references the previous hour's data, LEAD references the next hour's data, computing the load change for each hour:

SELECT load_date, load_hour, avg_load_mw, LAG(avg_load_mw, 1) OVER (PARTITION BY load_date ORDER BY load_hour) AS prev_hour_mw, LEAD(avg_load_mw, 1) OVER (PARTITION BY load_date ORDER BY load_hour) AS next_hour_mw, ROUND( avg_load_mw - LAG(avg_load_mw, 1) OVER (PARTITION BY load_date ORDER BY load_hour), 1 ) AS hour_delta_mw FROM best_practice_energy_ts.doc_silver_hourly_load WHERE load_date = CAST('2018-07-01' AS DATE) ORDER BY load_hour;

First 8 rows (excerpt):

load_date | load_hour | avg_load_mw | prev_hour_mw | next_hour_mw | hour_delta_mw -----------+-----------+-------------+--------------+--------------+-------------- 2018-07-01 | 0 | 37751 | null | 34716 | null 2018-07-01 | 1 | 34716 | 37751 | 32345 | -3035 2018-07-01 | 2 | 32345 | 34716 | 30546 | -2371 2018-07-01 | 3 | 30546 | 32345 | 29300 | -1799 2018-07-01 | 4 | 29300 | 30546 | 28511 | -1246 2018-07-01 | 5 | 28511 | 29300 | 27992 | -789 2018-07-01 | 6 | 27992 | 28511 | 28211 | -519 2018-07-01 | 7 | 28211 | 27992 | 30337 | 219

Result interpretation: Load continuously declines from 1 AM to 7 AM (hour_delta_mw is negative), dropping from 37,751 MW to 27,992 MW — a 26% decrease. This matches the pattern of reduced summer nighttime cooling demand. Starting at 7 AM the delta turns positive, signaling the start of the morning peak.

3-Hour Rolling Average (ROWS BETWEEN)

ROWS BETWEEN 2 PRECEDING AND CURRENT ROW computes the rolling average over the past 3 hours, smoothing short-term fluctuations:

SELECT load_date, load_hour, avg_load_mw, ROUND(AVG(avg_load_mw) OVER ( PARTITION BY load_date ORDER BY load_hour ROWS BETWEEN 2 PRECEDING AND CURRENT ROW ), 1) AS rolling_3h_avg_mw FROM best_practice_energy_ts.doc_silver_hourly_load WHERE load_date = CAST('2018-07-01' AS DATE) ORDER BY load_hour;

First 10 rows (excerpt):

load_date | load_hour | avg_load_mw | rolling_3h_avg_mw -----------+-----------+-------------+------------------ 2018-07-01 | 0 | 37751 | 37751 2018-07-01 | 1 | 34716 | 36233.5 2018-07-01 | 2 | 32345 | 34937.3 2018-07-01 | 3 | 30546 | 32535.7 2018-07-01 | 4 | 29300 | 30730.3 2018-07-01 | 5 | 28511 | 29452.3 2018-07-01 | 6 | 27992 | 28601 2018-07-01 | 7 | 28211 | 28238 2018-07-01 | 8 | 30337 | 28846.7 2018-07-01 | 9 | 33759 | 30769

Result interpretation: The rolling average (rolling_3h_avg_mw) changes more gradually than the instantaneous value (avg_load_mw) — especially during the sharp morning load surge (hours 8–9), where the rolling average creates a lag effect. This can be used to filter noise from meter fluctuations.

Intraday Peak-Valley Identification and Peak Percentage

Calculate the daily peak, valley, and each hour's percentage of the peak:

SELECT load_date, load_hour, avg_load_mw, tariff_period, MAX(avg_load_mw) OVER (PARTITION BY load_date) AS daily_peak_mw, MIN(avg_load_mw) OVER (PARTITION BY load_date) AS daily_valley_mw, ROUND(100.0 * avg_load_mw / MAX(avg_load_mw) OVER (PARTITION BY load_date), 1) AS pct_of_peak FROM best_practice_energy_ts.doc_silver_hourly_load WHERE load_date = CAST('2018-07-01' AS DATE) ORDER BY load_hour;

First 10 rows (excerpt):

load_date | load_hour | avg_load_mw | tariff_period | daily_peak_mw | daily_valley_mw | pct_of_peak -----------+-----------+-------------+---------------+---------------+-----------------+------------ 2018-07-01 | 0 | 37751 | valley | 51803 | 27992 | 72.9 2018-07-01 | 1 | 34716 | valley | 51803 | 27992 | 67 2018-07-01 | 2 | 32345 | valley | 51803 | 27992 | 62.4 2018-07-01 | 3 | 30546 | valley | 51803 | 27992 | 59 2018-07-01 | 4 | 29300 | valley | 51803 | 27992 | 56.6 2018-07-01 | 5 | 28511 | valley | 51803 | 27992 | 55 2018-07-01 | 6 | 27992 | valley | 51803 | 27992 | 54 2018-07-01 | 7 | 28211 | valley | 51803 | 27992 | 54.5 2018-07-01 | 8 | 30337 | valley | 51803 | 27992 | 58.6 2018-07-01 | 9 | 33759 | peak | 51803 | 27992 | 65.2

Result interpretation: The intraday peak-valley spread for 2018-07-01 reaches 23,811 MW (peak-valley ratio 46%, see Gold layer), far higher than winter (approximately 16%). At 6 AM (27,992 MW) is the daily trough — only 54% of peak — making it the optimal time slot for energy storage charging and valley electricity pricing.


Gold Layer Dynamic Table: Daily Peak-Valley Metrics

The Gold layer aggregates Silver layer data at daily granularity, outputting peak-valley price spreads, peak-valley ratios, and time-period average values for BI dashboards and pricing systems.

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_energy_ts.doc_gold_daily_load_profile AS SELECT load_date, COUNT(*) AS hours_recorded, ROUND(AVG(avg_load_mw), 1) AS daily_avg_mw, ROUND(MAX(max_load_mw), 1) AS daily_peak_mw, ROUND(MIN(min_load_mw), 1) AS daily_valley_mw, ROUND(MAX(max_load_mw) - MIN(min_load_mw), 1) AS peak_valley_spread_mw, ROUND( 100.0 * (MAX(max_load_mw) - MIN(min_load_mw)) / MAX(max_load_mw), 1 ) AS peak_valley_ratio_pct, ROUND( SUM(CASE WHEN tariff_period = 'peak' THEN avg_load_mw ELSE 0 END) / NULLIF(SUM(CASE WHEN tariff_period = 'peak' THEN 1 ELSE 0 END), 0), 1 ) AS peak_period_avg_mw, ROUND( SUM(CASE WHEN tariff_period = 'valley' THEN avg_load_mw ELSE 0 END) / NULLIF(SUM(CASE WHEN tariff_period = 'valley' THEN 1 ELSE 0 END), 0), 1 ) AS valley_period_avg_mw, CASE WHEN MONTH(load_date) IN (6,7,8) THEN 'summer' WHEN MONTH(load_date) IN (12,1,2) THEN 'winter' WHEN MONTH(load_date) IN (3,4,5) THEN 'spring' ELSE 'autumn' END AS season FROM best_practice_energy_ts.doc_silver_hourly_load GROUP BY load_date;

Manually trigger refresh and view results:

REFRESH DYNAMIC TABLE best_practice_energy_ts.doc_gold_daily_load_profile; SELECT load_date, daily_avg_mw, daily_peak_mw, daily_valley_mw, peak_valley_spread_mw, peak_valley_ratio_pct, peak_period_avg_mw, valley_period_avg_mw, season FROM best_practice_energy_ts.doc_gold_daily_load_profile ORDER BY load_date;

load_date | daily_avg_mw | daily_peak_mw | daily_valley_mw | peak_valley_spread_mw | peak_valley_ratio_pct | peak_period_avg_mw | valley_period_avg_mw | season -----------+--------------+---------------+-----------------+-----------------------+-----------------------+--------------------+---------------------+------- 2018-01-01 | 40191 | 44343 | 37742 | 6601 | 14.9 | 40964 | 39277.5 | winter 2018-01-15 | 39257.8 | 42249 | 35242 | 7007 | 16.6 | 40938.5 | 37271.5 | winter 2018-07-01 | 40584.8 | 51803 | 27992 | 23811 | 46 | 46463 | 33637.9 | summer 2018-07-15 | 34204.1 | 42348 | 26712 | 15636 | 36.9 | 37203.7 | 30659.1 | summer

Result interpretation:

  • Winter vs. summer: Winter (January) peak-valley spread is approximately 6,601–7,007 MW with a 15–17% ratio. Summer (July) peak-valley spread reaches 15,636–23,811 MW with a 37–46% ratio. Summer electricity consumption is extremely concentrated in the afternoon and evening (air conditioning load), making it the highest-value interval for peak-valley pricing.
  • 2018-07-01 extreme day: Peak of 51,803 MW is the highest in the dataset with a 46% peak-valley ratio. Meanwhile 2018-07-15 only reaches 42,348 MW, showing significant load variation across days within the same month — monthly averages are not suitable for estimating daily peaks.
  • Peak vs. valley period average difference: On summer day 2018-07-01, peak period average (46,463 MW) is 1.38x valley period average (33,638 MW), indicating the strongest demand response incentive from time-of-use pricing in summer.

Configure Gold Layer Refresh Scheduling Task

# Create daily 01:00 refresh task (depends on Silver layer data being updated) cz-cli task create refresh_gold_daily_load_profile --type SQL --folder energy_ts -p skill_test cz-cli task save-content refresh_gold_daily_load_profile \ --content "REFRESH DYNAMIC TABLE best_practice_energy_ts.doc_gold_daily_load_profile;" \ -p skill_test # Refresh daily at 01:00 AM cz-cli task save-cron refresh_gold_daily_load_profile \ --cron "0 0 1 * * ? *" -p skill_test


Winter/Summer Seasonal Comparison

Group Silver layer data by month and compare average winter vs. summer load at each hourly point:

SELECT load_hour, ROUND(AVG(CASE WHEN MONTH(load_date) = 7 THEN avg_load_mw END), 1) AS summer_avg_mw, ROUND(AVG(CASE WHEN MONTH(load_date) = 1 THEN avg_load_mw END), 1) AS winter_avg_mw, ROUND( AVG(CASE WHEN MONTH(load_date) = 7 THEN avg_load_mw END) - AVG(CASE WHEN MONTH(load_date) = 1 THEN avg_load_mw END), 1 ) AS summer_vs_winter_delta FROM best_practice_energy_ts.doc_silver_hourly_load GROUP BY load_hour ORDER BY load_hour;

First 12 hours (excerpt):

load_hour | summer_avg_mw | winter_avg_mw | summer_vs_winter_delta ----------+---------------+---------------+----------------------- 0 | 36147 | 38300 | -2153 1 | 33269 | 37325 | -4056 2 | 31082.5 | 36793 | -5710.5 3 | 29477.5 | 36525 | -7047.5 4 | 28386 | 36603 | -8217 5 | 27671.5 | 37167.5 | -9496 6 | 27352 | 38453.5 | -11101.5 7 | 27480 | 40041.5 | -12561.5 8 | 29054 | 40980 | -11926 9 | 31691.5 | 41242 | -9550.5 10 | 34662 | 40961.5 | -6299.5 11 | 37456 | 40531.5 | -3075.5

Result interpretation: Winter load is generally higher than summer from midnight to mid-morning (hours 0–11), with the difference peaking at -12,561 MW at 7 AM (winter is approximately 46% higher than summer). The root cause is that winter early-morning heating demand combined with later sunrise shifts the morning peak earlier. After noon, summer gradually overtakes winter, with afternoon air conditioning load taking over. This pattern provides important guidance for dynamically adjusting peak-valley time slots by season.


Anomaly Detection: Z-Score Method

Identify anomalous time slots within the day based on intraday standard deviation. Z-score absolute value > 2.0 is flagged as an anomaly:

WITH stats AS ( SELECT load_date, AVG(avg_load_mw) AS mean_mw, STDDEV(avg_load_mw) AS std_mw FROM best_practice_energy_ts.doc_silver_hourly_load GROUP BY load_date ) SELECT h.load_date, h.load_hour, h.avg_load_mw, ROUND((h.avg_load_mw - s.mean_mw) / NULLIF(s.std_mw, 0), 2) AS z_score, CASE WHEN ABS((h.avg_load_mw - s.mean_mw) / NULLIF(s.std_mw, 0)) > 2.0 THEN 'anomaly' ELSE 'normal' END AS anomaly_flag FROM best_practice_energy_ts.doc_silver_hourly_load h JOIN stats s ON h.load_date = s.load_date WHERE h.load_date = CAST('2018-07-01' AS DATE) ORDER BY h.load_hour;

First 10 rows (excerpt):

load_date | load_hour | avg_load_mw | z_score | anomaly_flag -----------+-----------+-------------+---------+------------- 2018-07-01 | 0 | 37751 | -0.32 | normal 2018-07-01 | 1 | 34716 | -0.66 | normal 2018-07-01 | 2 | 32345 | -0.93 | normal 2018-07-01 | 3 | 30546 | -1.14 | normal 2018-07-01 | 4 | 29300 | -1.28 | normal 2018-07-01 | 5 | 28511 | -1.37 | normal 2018-07-01 | 6 | 27992 | -1.43 | normal 2018-07-01 | 7 | 28211 | -1.4 | normal 2018-07-01 | 8 | 30337 | -1.16 | normal 2018-07-01 | 9 | 33759 | -0.77 | normal

Intraday standard deviation statistics:

SELECT load_date, ROUND(AVG(avg_load_mw), 1) AS mean_mw, ROUND(STDDEV(avg_load_mw), 1) AS stddev_mw, ROUND(AVG(avg_load_mw) + 2 * STDDEV(avg_load_mw), 1) AS upper_2sigma, ROUND(AVG(avg_load_mw) - 2 * STDDEV(avg_load_mw), 1) AS lower_2sigma FROM best_practice_energy_ts.doc_silver_hourly_load GROUP BY load_date ORDER BY load_date;

load_date | mean_mw | stddev_mw | upper_2sigma | lower_2sigma -----------+----------+-----------+--------------+------------- 2018-01-01 | 40191 | 2081.4 | 44353.9 | 36028.2 2018-01-15 | 39257.8 | 2500.5 | 44258.8 | 34256.7 2018-07-01 | 40584.8 | 8833.6 | 58252.1 | 22917.6 2018-07-15 | 34204.1 | 5398.7 | 45001.4 | 23406.8

Result interpretation: Winter (January) intraday standard deviation is approximately 2,000–2,500 MW with a flat load curve. Summer (July) standard deviation reaches 5,000–8,800 MW with extreme intraday volatility. The 2σ upper bound for summer 2018-07-01 is 58,252 MW, while the actual peak of 51,803 MW does not exceed this limit — so there are no Z-score anomalies for this day. This means although the absolute peak is high, it falls within the statistical pattern for that day. Equipment failures causing instantaneous spikes would trigger Z-score > 2 anomaly flags.


Load Forecasting (External Function Integration)

Gold layer data can be connected to time-series forecasting models. The following shows the architectural approach for calling external Prophet / ARIMA forecasting services via External Function (illustrative code — actual deployment requires configuring an API Connection):

-- Create External Function calling Prophet forecasting service (illustrative) -- CREATE EXTERNAL FUNCTION best_practice_energy_ts.predict_next_24h( -- history_load ARRAY<DOUBLE>, -- history_timestamps ARRAY<STRING> -- ) -- RETURNS STRUCT<forecast_mw ARRAY<DOUBLE>, forecast_timestamps ARRAY<STRING>> -- LANGUAGE PYTHON -- HANDLER = 'ProphetForecast.predict' -- RESOURCES = 'volume://functions/prophet_forecast.zip' -- CONNECTION = my_api_connection; -- Using the forecast function (illustrative — uncomment to call after External Function is configured) SELECT load_date, daily_avg_mw, daily_peak_mw -- predict_next_24h(...) AS forecast_result -- uncomment when calling in practice FROM best_practice_energy_ts.doc_gold_daily_load_profile ORDER BY load_date;


Data Warehouse Object Overview

After the full build, objects under the best_practice_energy_ts schema:

SHOW TABLES IN best_practice_energy_ts;

schema_name | table_name | is_dynamic -------------------------+------------------------------+----------- best_practice_energy_ts | doc_gold_daily_load_profile | true best_practice_energy_ts | doc_kafka_raw_load | false best_practice_energy_ts | doc_meter_metadata | false best_practice_energy_ts | doc_pjme_load_raw | false best_practice_energy_ts | doc_silver_hourly_load | true

Architecture:

Smart Meter (real-time) Historical CSV (batch) | | v pipe_energy_load v INSERT doc_kafka_raw_load doc_pjme_load_raw | BloomFilter idx: event_time | doc_meter_metadata (dimension reference) | v Studio Task: every 30 min doc_silver_hourly_load (Dynamic Table) avg/max/min load_mw · tariff_period LAG / LEAD · ROWS BETWEEN rolling avg | v Studio Task: daily 01:00 doc_gold_daily_load_profile (Dynamic Table) daily peak/valley · spread · ratio · season | +-------------+-------------+ v v v Load Curve BI Peak-Valley External Function Dashboard Pricing System Prophet Forecast


Notes

  • Do not write REFRESH INTERVAL in Dynamic Table DDL: All Dynamic Tables in this guide have no REFRESH INTERVAL in their DDL. Refresh scheduling is managed uniformly through Lakehouse Studio Tasks, which allows monitoring alerts and data quality checks to be attached to the same task, and makes it easy to adjust refresh frequency without rebuilding the Dynamic Table.

  • BloomFilter Index only applies to new data: CREATE BLOOMFILTER INDEX takes effect for data written after the index is created. For tables with large amounts of existing data, note that the BLOOMFILTER type currently does not support BUILD INDEX — rebuilding the table is required to cover existing data.

  • Idempotency of Silver layer aggregation: The Silver layer Dynamic Table groups Bronze layer data by DATE_TRUNC('hour', event_time). If the Bronze layer contains multiple records per hour (e.g., minute-level data), AVG / MAX / MIN will aggregate correctly. If the Bronze layer uses INSERT OVERWRITE, it causes the Dynamic Table to fall back to full refresh mode. Use append writes (INSERT INTO) instead.

  • Intraday limitation of Z-score: The Z-score in this guide uses the current day's mean and standard deviation as the baseline, suitable for detecting anomalous time slots within the day. For cross-day detection (e.g., comparing against the same hour historically), use multi-day window statistics — e.g., historical mean and standard deviation at the same hourly point.

  • Winter/summer comparison limitations: This dataset contains only 2 days each from January and July 2018. Statistical conclusions are only for illustrating the analytical method. For production use, the complete yearly dataset should be used (PJM dataset has approximately 140,000 rows from 2002–2018) for robust conclusions.

  • Peak-valley time slots can be adjusted for your business: This guide uses 09:00–21:59 as peak time. You can modify the CASE WHEN in the Silver layer DDL according to actual electricity pricing policies (e.g., residential vs. commercial peak-valley time slots differ). No changes to downstream Gold layer logic are needed.