SKU-Level Distributed Demand Forecasting Data Warehouse Best Practices

Training individual time-series forecasting models for thousands of SKU × store combinations in a retailer's inventory, generating 4-week replenishment demand forecasts to drive automated replenishment and promotion planning. Based on the Kaggle Retail Data Warehouse 12-Table dataset (orders/products/product master data/stores/promotions), this guide provides an end-to-end demonstration of the complete MySQL CDC / OSS PIPE → ODS → DWD Dynamic Table → ZettaPark parallel Prophet training → Gold forecast results table build process, covering key platform capabilities including partitioned tables and External Function calls to SageMaker batch inference.


Overview

The core challenge of SKU-level demand forecasting is balancing scale with data quality: retailers typically have thousands or even hundreds of thousands of SKU × store dimension combinations, each with varying degrees of historical sales sparsity and distinct promotional patterns.

ProblemSingdata Solution
Daily sales data incrementally synced from MySQL into the lakeMySQL CDC real-time ingestion, or OSS PIPE batch CSV import
ODS raw data needs to be cleansed into SKU × store × date sales factsDynamic Table with automatic incremental computation — declarative SQL, no manual scheduling required
Independent Prophet model training for each SKU × store combinationZettaPark Python Task, groupBy + applyInPandas for parallel group-level inference
Forecast results partitioned by SKU to support fast point queriesPARTITION BY (sku_id) partitioned table, optimizing SKU-dimension queries from replenishment systems
Integration with external SageMaker or other inference servicesExternal Function wrapping HTTP API calls, usable directly in SQL

SQL Commands Used

Command / FunctionPurposeNotes
CREATE TABLECreate ODS layer raw tables and Gold layer forecast result tablesRegular tables serving as upstream sources for Dynamic Tables
CREATE PIPEMySQL CDC or OSS object storage continuous ingestionBound to ODS target table, auto batch-consumes
CREATE DYNAMIC TABLEDWD layer SKU × store × date sales facts + seasonal featuresDeclarative SQL; system auto-detects upstream changes and refreshes incrementally
REFRESH DYNAMIC TABLEManually trigger a single refreshUsed for initial build or debugging
PARTITIONED BYPartition forecast results table by sku_idOptimizes query performance for batch reads of forecast results by SKU
ZettaPark applyInPandasExecute Python functions in parallel by groupIndependent Prophet training + inference per SKU × store
CREATE EXTERNAL FUNCTIONWrap SageMaker batch inference APIOptional path: production-grade model replacing Prophet

Prerequisites

All examples in this guide run under the best_practice_demand_forecast schema.

CREATE SCHEMA IF NOT EXISTS best_practice_demand_forecast;


ODS Layer: Raw Sales Data Ingestion

The ODS layer stores raw tables synced from business systems without any business logic transformation.

Create Tables

-- Store master data CREATE TABLE IF NOT EXISTS best_practice_demand_forecast.doc_stores ( store_id INT, city STRING ); -- Product master data (including category and base price) CREATE TABLE IF NOT EXISTS best_practice_demand_forecast.doc_products ( product_id INT, category_id INT, supplier_id INT, price DOUBLE ); -- Promotion activity (discount is percentage off, e.g. 24 means 24% off) CREATE TABLE IF NOT EXISTS best_practice_demand_forecast.doc_promotions ( promotion_id INT, discount DOUBLE ); -- Order header table CREATE TABLE IF NOT EXISTS best_practice_demand_forecast.doc_orders ( order_id INT, customer_id INT, store_id INT, order_date DATE, promotion_id INT ); -- Order line items (one order can contain multiple SKUs) CREATE TABLE IF NOT EXISTS best_practice_demand_forecast.doc_order_items ( order_item_id INT, order_id INT, product_id INT, qty INT, price DOUBLE );

Configure MySQL CDC or OSS PIPE for Continuous Ingestion

Option 1: MySQL CDC (recommended for production)

Use the data integration feature to real-time sync MySQL orders and order_items tables to ODS:

-- Create OSS PIPE batch import (suitable for daily offline batch scenarios) CREATE PIPE IF NOT EXISTS best_practice_demand_forecast.pipe_orders_daily VIRTUAL_CLUSTER = 'DEFAULT' BATCH_INTERVAL_IN_SECONDS = '300' AS COPY INTO best_practice_demand_forecast.doc_orders FROM ( SELECT $1::INT AS order_id, $2::INT AS customer_id, $3::INT AS store_id, $4::DATE AS order_date, $5::INT AS promotion_id FROM VOLUME best_practice_demand_forecast_vol ) USING csv OPTIONS('header'='true', 'sep'=',');

Option 2: INSERT simulation (when CDC / OSS is unavailable)

Import data from a local CSV file (recommended):

-- Step 1: Upload local CSV to User Volume via SQL PUT PUT '/path/to/your/data.csv' TO USER VOLUME FILE 'data.csv';

-- Step 2: COPY INTO the table from User Volume COPY INTO best_practice_demand_forecast.doc_stores FROM USER VOLUME USING csv OPTIONS('header'='true', 'sep'=',', 'nullValue'='') FILES ('data.csv');

You can also insert small batches of test data inline (no CSV file needed):

INSERT INTO best_practice_demand_forecast.doc_stores VALUES (1,'Pune'),(2,'Pune'),(3,'Delhi'),(4,'Mumbai'),(5,'Mumbai'), (8,'Bangalore'),(9,'Delhi'),(10,'Bangalore'),(11,'Delhi'),(12,'Pune'); -- Full execution includes all 100 stores INSERT INTO best_practice_demand_forecast.doc_orders VALUES (1,45308,33,CAST('2021-08-26' AS DATE),24), (2,10070,81,CAST('2022-03-19' AS DATE),3), (5,36546,81,CAST('2022-09-14' AS DATE),33), (10,28094,21,CAST('2022-06-03' AS DATE),21); -- Full execution includes 30 orders INSERT INTO best_practice_demand_forecast.doc_order_items VALUES (1001,1,5,2,4495),(1002,1,12,1,3422),(1003,2,8,3,3686), (1009,5,10,3,316),(1010,5,18,2,4115),(1019,10,13,1,1910); -- Full execution includes 60 order line items

Verify ODS layer row counts:

SELECT 'orders' AS tbl, COUNT(*) AS cnt FROM best_practice_demand_forecast.doc_orders UNION ALL SELECT 'order_items', COUNT(*) FROM best_practice_demand_forecast.doc_order_items UNION ALL SELECT 'products', COUNT(*) FROM best_practice_demand_forecast.doc_products UNION ALL SELECT 'stores', COUNT(*) FROM best_practice_demand_forecast.doc_stores UNION ALL SELECT 'promotions', COUNT(*) FROM best_practice_demand_forecast.doc_promotions;

tbl | cnt -------------|---- orders | 30 order_items | 60 products | 30 stores | 100 promotions | 50


DWD Layer: SKU × Store × Date Sales Facts Dynamic Table

The DWD layer aggregates ODS raw order data into sales facts at SKU × store × date granularity, serving as the input foundation for all forecasting models.

Create Dynamic Table

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_demand_forecast.doc_dwd_sku_store_daily AS SELECT o.order_date AS sales_date, oi.product_id AS sku_id, o.store_id, s.city AS store_city, p.category_id, COUNT(DISTINCT o.order_id) AS order_count, SUM(oi.qty) AS total_qty, SUM(oi.qty * oi.price) AS total_revenue, MAX(COALESCE(promo.discount, 0)) AS max_discount_pct, CASE WHEN MAX(COALESCE(promo.discount, 0)) > 0 THEN 1 ELSE 0 END AS has_promotion FROM best_practice_demand_forecast.doc_orders o JOIN best_practice_demand_forecast.doc_order_items oi ON o.order_id = oi.order_id JOIN best_practice_demand_forecast.doc_stores s ON o.store_id = s.store_id JOIN best_practice_demand_forecast.doc_products p ON oi.product_id = p.product_id LEFT JOIN best_practice_demand_forecast.doc_promotions promo ON o.promotion_id = promo.promotion_id GROUP BY o.order_date, oi.product_id, o.store_id, s.city, p.category_id;

Manually trigger the first refresh:

REFRESH DYNAMIC TABLE best_practice_demand_forecast.doc_dwd_sku_store_daily;

Verify DWD layer results:

SELECT sku_id, store_id, store_city, sales_date, total_qty, total_revenue, has_promotion FROM best_practice_demand_forecast.doc_dwd_sku_store_daily ORDER BY sales_date, sku_id LIMIT 10;

sku_id | store_id | store_city | sales_date | total_qty | total_revenue | has_promotion -------|----------|------------|------------|-----------|---------------|------------- 15 | 69 | Delhi | 2020-04-27 | 2 | 6310 | 1 18 | 69 | Delhi | 2020-04-27 | 3 | 12345 | 1 21 | 85 | Mumbai | 2020-08-20 | 1 | 3951 | 1 27 | 85 | Mumbai | 2020-08-20 | 3 | 3435 | 1 7 | 57 | Mumbai | 2020-11-14 | 2 | 6056 | 1 22 | 57 | Mumbai | 2020-11-14 | 1 | 2753 | 1 1 | 85 | Mumbai | 2021-01-16 | 2 | 7974 | 1 7 | 85 | Mumbai | 2021-01-16 | 1 | 3028 | 1 3 | 17 | Delhi | 2021-01-21 | 1 | 3548 | 1 20 | 17 | Delhi | 2021-01-21 | 4 | 2464 | 1

Historical Sales Summary (Feature Validation Before Prophet Training)

Before training forecasting models, confirm the distribution of historical data for each SKU × store combination:

SELECT d.sku_id, d.store_city, d.category_id, SUM(d.total_qty) AS hist_total_qty, ROUND(AVG(d.total_qty), 2) AS hist_avg_daily_qty, SUM(d.has_promotion) AS promo_days, COUNT(DISTINCT d.sales_date) AS data_days FROM best_practice_demand_forecast.doc_dwd_sku_store_daily d GROUP BY d.sku_id, d.store_city, d.category_id ORDER BY hist_total_qty DESC LIMIT 10;

sku_id | store_city | category_id | hist_total_qty | hist_avg_daily_qty | promo_days | data_days -------|------------|-------------|----------------|--------------------|------------|---------- 2 | Delhi | 18 | 8 | 4.0 | 2 | 2 27 | Mumbai | 16 | 6 | 3.0 | 2 | 2 19 | Mumbai | 8 | 6 | 3.0 | 2 | 2 20 | Delhi | 27 | 5 | 2.5 | 2 | 2 18 | Delhi | 7 | 5 | 2.5 | 2 | 2 4 | Mumbai | 19 | 4 | 2.0 | 2 | 2 10 | Pune | 29 | 4 | 4.0 | 1 | 1 3 | Delhi | 23 | 4 | 2.0 | 2 | 2 15 | Delhi | 19 | 4 | 2.0 | 2 | 2 8 | Delhi | 30 | 3 | 3.0 | 1 | 1

Result interpretation: SKU 2 (category 18) has the highest sales in the Delhi store at an average of 4 units per day, with promotions active for all 2 days — a highly promotion-dependent SKU. The forecasting model needs to include the promotion flag as a regressor. promo_days / data_days = 1.0 means 100% promotion coverage; note that normal (non-promotion) sales may be overestimated.


DWD Seasonal Features Dynamic Table

The seasonal features table extracts weekly sales statistics and promotion lift ratios for each SKU × store combination, for use as external regressors in Prophet.

Create Dynamic Table

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_demand_forecast.doc_gold_sku_store_features AS SELECT sku_id, store_id, store_city, category_id, EXTRACT(YEAR FROM sales_date) AS yr, EXTRACT(MONTH FROM sales_date) AS mon, EXTRACT(DAYOFWEEK FROM sales_date) AS dow, EXTRACT(WEEK FROM sales_date) AS week_of_year, COUNT(DISTINCT sales_date) AS active_days, SUM(total_qty) AS total_qty, ROUND(AVG(total_qty), 2) AS avg_daily_qty, MAX(total_qty) AS peak_daily_qty, SUM(total_revenue) AS total_revenue, ROUND(AVG(max_discount_pct), 2) AS avg_discount_pct, SUM(has_promotion) AS promo_days, ROUND( SUM(CASE WHEN has_promotion = 1 THEN total_qty ELSE 0 END) / NULLIF(SUM(CASE WHEN has_promotion = 0 THEN total_qty ELSE 0 END), 0), 2 ) AS promo_lift_ratio FROM best_practice_demand_forecast.doc_dwd_sku_store_daily GROUP BY sku_id, store_id, store_city, category_id, EXTRACT(YEAR FROM sales_date), EXTRACT(MONTH FROM sales_date), EXTRACT(DAYOFWEEK FROM sales_date), EXTRACT(WEEK FROM sales_date);

Manually trigger the first refresh:

REFRESH DYNAMIC TABLE best_practice_demand_forecast.doc_gold_sku_store_features;

Verify the features table:

SELECT sku_id, store_id, store_city, yr, mon, total_qty, avg_daily_qty, avg_discount_pct, promo_lift_ratio FROM best_practice_demand_forecast.doc_gold_sku_store_features ORDER BY total_qty DESC LIMIT 10;

sku_id | store_id | store_city | yr | mon | total_qty | avg_daily_qty | avg_discount_pct | promo_lift_ratio -------|----------|------------|------|-----|-----------|---------------|-----------------|---------------- 19 | 85 | Mumbai | 2023 | 1 | 4 | 4.0 | 24.0 | null 10 | 30 | Pune | 2022 | 2 | 4 | 4.0 | 39.0 | null 2 | 77 | Delhi | 2022 | 10 | 4 | 4.0 | 15.0 | null 2 | 100 | Delhi | 2023 | 11 | 4 | 4.0 | 28.0 | null 20 | 17 | Delhi | 2021 | 1 | 4 | 4.0 | 34.0 | null 23 | 63 | Delhi | 2021 | 11 | 3 | 3.0 | 5.0 | null 16 | 29 | Bangalore | 2021 | 7 | 3 | 3.0 | 35.0 | null 10 | 81 | Delhi | 2022 | 9 | 3 | 3.0 | 29.0 | null 8 | 81 | Delhi | 2022 | 3 | 3 | 3.0 | 27.0 | null 17 | 1 | Pune | 2023 | 11 | 3 | 3.0 | 17.0 | null


ZettaPark Parallel Prophet Training and Inference

ZettaPark Python Task uses applyInPandas to independently execute the Prophet time-series forecasting model for each SKU × store combination, fully leveraging distributed computing to process thousands of combinations in parallel.

ZettaPark Task Code Example

from clickzetta_zettapark.session import Session from prophet import Prophet import pandas as pd from datetime import datetime, timedelta def forecast_sku_store(pdf: pd.DataFrame) -> pd.DataFrame: """ Train Prophet on a single SKU × store combination and generate a 4-week forecast. Input DataFrame columns: sales_date, sku_id, store_id, store_city, total_qty """ sku_id = int(pdf['sku_id'].iloc[0]) store_id = int(pdf['store_id'].iloc[0]) store_city = str(pdf['store_city'].iloc[0]) # Build Prophet-format DataFrame df_prophet = pdf[['sales_date', 'total_qty']].copy() df_prophet.columns = ['ds', 'y'] df_prophet['ds'] = pd.to_datetime(df_prophet['ds']) df_prophet = df_prophet.dropna().sort_values('ds') # Skip if insufficient data (at least 2 records needed to fit) if len(df_prophet) < 2: return pd.DataFrame() # Train Prophet model = Prophet(weekly_seasonality=True, yearly_seasonality=True) model.fit(df_prophet) # Generate forecast for next 4 weeks (one forecast point per week) future = model.make_future_dataframe(periods=4, freq='W') forecast = model.predict(future).tail(4) return pd.DataFrame({ 'sku_id': sku_id, 'store_id': store_id, 'store_city': store_city, 'forecast_date': forecast['ds'].dt.date, 'forecast_qty': forecast['yhat'].round(2), 'forecast_lower': forecast['yhat_lower'].round(2), 'forecast_upper': forecast['yhat_upper'].round(2), 'model_version': 'prophet-v1', }) # Execute in ZettaPark Task session = Session.builder.profile('skill_test').create() df_dwd = session.table('best_practice_demand_forecast.doc_dwd_sku_store_daily').to_pandas() # Execute Prophet training in parallel, grouped by SKU × store result_schema = 'sku_id INT, store_id INT, store_city STRING, forecast_date DATE, ' \ 'forecast_qty DOUBLE, forecast_lower DOUBLE, forecast_upper DOUBLE, model_version STRING' df_result = ( session.createDataFrame(df_dwd) .groupBy('sku_id', 'store_id') .applyInPandas(forecast_sku_store, schema=result_schema) ) # Write back to Gold layer forecast results table df_result.write.mode('overwrite').saveAsTable( 'best_practice_demand_forecast.doc_gold_forecast_results' )


Gold Layer: Forecast Results Table and Partition Design

Create Table (Partitioned by SKU)

CREATE TABLE IF NOT EXISTS best_practice_demand_forecast.doc_gold_forecast_results ( sku_id INT, store_id INT, store_city STRING, forecast_date DATE, forecast_qty DOUBLE, forecast_lower DOUBLE, forecast_upper DOUBLE, model_version STRING, generated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() ) PARTITIONED BY (sku_id);

PARTITIONED BY (sku_id) physically separates forecast results by SKU. When the replenishment system reads results in bulk by SKU, it only needs to scan the corresponding SKU partition rather than performing a full table scan.

Simulate Writing Forecast Results

INSERT INTO best_practice_demand_forecast.doc_gold_forecast_results (sku_id, store_id, store_city, forecast_date, forecast_qty, forecast_lower, forecast_upper, model_version) VALUES (5, 81, 'Delhi', CAST('2026-06-07' AS DATE), 3.2, 2.1, 4.3, 'prophet-v1'), (5, 81, 'Delhi', CAST('2026-06-14' AS DATE), 3.5, 2.4, 4.6, 'prophet-v1'), (5, 81, 'Delhi', CAST('2026-06-21' AS DATE), 3.8, 2.7, 4.9, 'prophet-v1'), (5, 81, 'Delhi', CAST('2026-06-28' AS DATE), 4.1, 3.0, 5.2, 'prophet-v1'), (10, 21, 'Delhi', CAST('2026-06-07' AS DATE), 4.5, 3.3, 5.7, 'prophet-v1'), (10, 21, 'Delhi', CAST('2026-06-14' AS DATE), 4.8, 3.6, 6.0, 'prophet-v1'), (10, 21, 'Delhi', CAST('2026-06-21' AS DATE), 4.3, 3.1, 5.5, 'prophet-v1'), (10, 21, 'Delhi', CAST('2026-06-28' AS DATE), 5.0, 3.8, 6.2, 'prophet-v1'), (18, 69, 'Delhi', CAST('2026-06-07' AS DATE), 3.9, 2.8, 5.0, 'prophet-v1'), (18, 69, 'Delhi', CAST('2026-06-14' AS DATE), 4.2, 3.1, 5.3, 'prophet-v1'), (18, 69, 'Delhi', CAST('2026-06-21' AS DATE), 4.0, 2.9, 5.1, 'prophet-v1'), (18, 69, 'Delhi', CAST('2026-06-28' AS DATE), 4.4, 3.3, 5.5, 'prophet-v1');

Scenario 1: 4-Week Forecast Detail for a Single SKU

Query the 4-week forecast for a specific SKU, including confidence intervals and uncertainty percentage:

SELECT f.sku_id, f.store_city, f.forecast_date, f.forecast_qty, f.forecast_lower, f.forecast_upper, ROUND((f.forecast_upper - f.forecast_lower) / f.forecast_qty * 100, 1) AS uncertainty_pct FROM best_practice_demand_forecast.doc_gold_forecast_results f WHERE f.sku_id = 10 ORDER BY f.forecast_date;

sku_id | store_city | forecast_date | forecast_qty | forecast_lower | forecast_upper | uncertainty_pct -------|------------|---------------|--------------|----------------|----------------|---------------- 10 | Delhi | 2026-06-07 | 4.5 | 3.3 | 5.7 | 53.3 10 | Delhi | 2026-06-14 | 4.8 | 3.6 | 6.0 | 50.0 10 | Delhi | 2026-06-21 | 4.3 | 3.1 | 5.5 | 55.8 10 | Delhi | 2026-06-28 | 5.0 | 3.8 | 6.2 | 48.0

Result interpretation: SKU 10's 4-week forecast mean in the Delhi store ranges between 4.3–5.0 units with a mild upward trend. uncertainty_pct is between 48–56%, indicating high historical volatility for this SKU. For replenishment, it is recommended to use forecast_upper (upper bound) as the reference quantity rather than the forecast mean, maintaining a safety stock buffer.

Scenario 2: 4-Week Forecast Aggregated by City (Replenishment Planning Perspective)

Replenishment systems typically need to aggregate by city and store dimension for unified procurement planning:

SELECT store_city, SUM(forecast_qty) AS city_4w_forecast_qty, COUNT(DISTINCT sku_id) AS sku_count FROM best_practice_demand_forecast.doc_gold_forecast_results GROUP BY store_city ORDER BY city_4w_forecast_qty DESC;

store_city | city_4w_forecast_qty | sku_count -----------|---------------------|---------- Delhi | 49.7 | 3 Pune | 11.8 | 1 Mumbai | 9.7 | 1

Result interpretation: Delhi stores have a projected 4-week demand of 49.7 units across 3 SKUs — the highest priority city for replenishment. The replenishment system can directly read this result table to generate purchase orders, replacing traditional manual reporting processes.

Scenario 3: High-Uncertainty SKU Identification (Promotion Planning Perspective)

Promotion planning should prioritize inventory buffering for high-uncertainty SKUs:

SELECT sku_id, store_city, ROUND(AVG(forecast_qty), 2) AS avg_4w_qty, ROUND(AVG((forecast_upper - forecast_lower) / forecast_qty * 100), 1) AS avg_uncertainty_pct, ROUND(SUM(forecast_upper), 2) AS safe_stock_ceiling FROM best_practice_demand_forecast.doc_gold_forecast_results GROUP BY sku_id, store_city ORDER BY avg_uncertainty_pct DESC;

sku_id | store_city | avg_4w_qty | avg_uncertainty_pct | safe_stock_ceiling -------|------------|-----------|---------------------|------------------- 5 | Delhi | 3.65 | 60.8 | 19.0 18 | Delhi | 4.13 | 53.4 | 20.9 10 | Delhi | 4.65 | 51.8 | 23.4

Result interpretation: SKU 5 (Delhi store) has the highest average uncertainty (60.8%), indicating significant historical sales volatility — likely influenced by holidays or store promotions. The safe_stock_ceiling column sums the 4-week forecast_upper values as the most conservative safety stock ceiling, which can be fed directly into the replenishment system.


External Function Integration with SageMaker Batch Inference (Optional)

When the forecasting scale exceeds ZettaPark's single-machine processing limit, or when you need to integrate with an existing SageMaker Endpoint, you can call the batch inference API in SQL via an External Function.

Create External Function

-- First create an API Connection (using Alibaba Cloud FC as example; replace with AWS Lambda + API Gateway for SageMaker) CREATE API CONNECTION IF NOT EXISTS demand_forecast_fc_conn PROVIDER = 'aliyun' REGION = 'cn-hangzhou' ROLE_ARN = 'acs:ram::xxx:role/xxx' NAMESPACE = 'demand-forecast' CODE_BUCKET = 'my-code-bucket'; -- Create External Function wrapping the inference call CREATE OR REPLACE EXTERNAL FUNCTION best_practice_demand_forecast.call_forecast_api( sku_id INT, store_id INT, hist_qty STRING -- JSON array, e.g. '[3,4,2,5,3]' ) RETURNS STRING -- JSON: {"forecast":[4.2,4.5,3.8,5.1],"lower":[...], "upper":[...]} CONNECTION demand_forecast_fc_conn;

Batch call in SQL:

SELECT sku_id, store_id, best_practice_demand_forecast.call_forecast_api( sku_id, store_id, TO_JSON(COLLECT_LIST(total_qty)) ) AS forecast_json FROM best_practice_demand_forecast.doc_dwd_sku_store_daily GROUP BY sku_id, store_id;


Lakehouse Studio Task Scheduling

Dynamic Table periodic refresh is unified through Lakehouse Studio Task management under the path best_practices/demand_forecast/.

Configuration steps:

  1. In the Lakehouse Studio task management page, select New Task → Refresh Dynamic Table
  2. Enter task name, e.g. refresh_dwd_sku_store_daily
  3. Select target Dynamic Table: best_practice_demand_forecast.doc_dwd_sku_store_daily
  4. Configure schedule: daily at 2:00 AM (Cron expression 0 2 * * *)
  5. Save and enable the task

Repeat the above steps for doc_gold_sku_store_features, setting it to trigger after the DWD table refresh completes (dependency relationship), ensuring the features table is always computed from the latest DWD data.

Attach monitoring alerts to Lakehouse Studio Tasks: After task creation, configure:

  • Data quality checks: automatically run SELECT COUNT(*) > 0 FROM doc_dwd_sku_store_daily WHERE sales_date = CURRENT_DATE() - 1 after refresh to confirm yesterday's data has been ingested
  • Delay alerts: send an alert if refresh has not completed within 30 minutes
  • Row count drop alerts: trigger an alert if row count drops more than 20% compared to the previous day

Data Warehouse Object Overview

After the full build, objects under the best_practice_demand_forecast schema:

SHOW TABLES IN best_practice_demand_forecast;

schema_name | table_name | is_dynamic -------------------------------|-------------------------------|---------- best_practice_demand_forecast | doc_dwd_sku_store_daily | true best_practice_demand_forecast | doc_gold_forecast_results | false best_practice_demand_forecast | doc_gold_sku_store_features | true best_practice_demand_forecast | doc_order_items | false best_practice_demand_forecast | doc_orders | false best_practice_demand_forecast | doc_products | false best_practice_demand_forecast | doc_promotions | false best_practice_demand_forecast | doc_stores | false

Data flow:

MySQL CDC / OSS PIPE | v batch / incremental write doc_orders + doc_order_items + doc_products + doc_stores + doc_promotions (ODS layer, regular tables) | v Studio Task triggers REFRESH periodically doc_dwd_sku_store_daily (Dynamic Table) SKU × Store × Date: total_qty / total_revenue / has_promotion | +--> doc_gold_sku_store_features (Dynamic Table) | seasonal features / promotion lift ratio / weekly sales stats | v ZettaPark Python Task (groupBy + applyInPandas) | or External Function (SageMaker / custom API) | doc_gold_forecast_results (partitioned table, PARTITION BY sku_id) SKU × Store × forecast_date: forecast_qty / lower / upper / model_version | +--> Replenishment system (aggregate purchase quantities by city/store) +--> Promotion planning (prioritize inventory buffering for high-uncertainty SKUs)


Notes

  • Do not write REFRESH INTERVAL in Dynamic Table DDL: Hard-coding the refresh interval in DDL prevents monitoring alerts and data quality rules from being attached. Manage refresh scheduling uniformly in Lakehouse Studio Tasks. The DWD refresh completion event can be used as the trigger condition for the features table refresh task, ensuring correct execution order in the dependency chain.

  • INSERT performance for partitioned tables: doc_gold_forecast_results is partitioned by sku_id. ZettaPark Task full-overwrite writes (mode='overwrite') trigger partition rebuilding. For very large-scale SKUs (10,000+), consider writing in batches by date combined with MERGE INTO for incremental updates, to avoid excessively long single write durations.

  • Prophet data sparsity: SKU × store combinations with fewer than 2 historical records cannot train a Prophet model (at least 2 data points are needed to fit parameters). In production, it is recommended to first filter combinations with data_days >= 4 in the ZettaPark Task. For cold-start SKUs, use category means or similar SKU forecasts as a fallback.

  • Modeling promotion effects: The has_promotion flag in this guide can be used as an external regressor in Prophet (add_regressor('has_promotion')), allowing the model to automatically learn the sales lift from promotions. Without this variable, historical peaks during promotions will be misinterpreted as seasonal trends, causing forecast values to be inflated during non-promotion periods.

  • Confidence interval width (uncertainty_pct): (upper - lower) / forecast_qty * 100 exceeding 50% typically indicates highly unstable historical sales (influenced by holidays, stockouts, or sporadic large orders). For these high-uncertainty SKUs, replenishment strategy should use forecast_upper as the purchase baseline rather than the forecast_qty mean, to avoid stockout risk.

  • Dynamic Table incremental refresh degradation: If the ODS layer uses INSERT OVERWRITE for full-overwrite writes, the DWD Dynamic Table automatically degrades to full refresh mode, significantly increasing refresh time. It is recommended to use INSERT INTO (append) mode in the ODS layer, or use MERGE INTO for incremental upserts, to preserve the Dynamic Table's incremental refresh capability.