Building a Retail Chain POS Operations Data Warehouse
Integrate POS transaction data and inventory data from 100 chain stores nationwide to build a store operations analysis and SKU turnover data warehouse. Using the Retail Data Warehouse 12-Table Dataset, this guide provides a complete end-to-end walkthrough from ODS raw POS transactions to ADS fast/slow-moving SKU analysis and store rankings. It covers six core platform capabilities: MySQL CDC multi-table full database mirroring, Dynamic Table incremental aggregation, static partitioning by store_id, BloomFilter Index, External Schema (Hive), and Time Travel month-end reconciliation.
Overview
Core challenges in a retail chain data warehouse:
Challenge
Singdata Solution
POS systems across regional stores use separate databases with sharded tables — need unified consolidation
MySQL CDC multi-table full database mirroring, merging sharded databases into a single ODS layer
Multi-layer aggregation across ODS→DWD→DWS→ADS, error-prone with manual scheduling
Dynamic Table cascading refresh with declarative SQL; system automatically maintains the dependency chain
Historical order queries for a single store are slow; cross-partition full scans are expensive
PARTITIONED BY (store_id) + SESSION_CONFIGS parameterized partitioning, refreshing per store
product_id in order line items is a high-cardinality column with frequent point lookups
BloomFilter Index for millisecond-level skipping of non-matching data blocks
Years of historical data still in a Hive cluster — don't want to migrate it
External Schema (Hive) for direct SQL queries against external historical tables
Month-end reconciliation requires comparing historical sales snapshots with current system data
Time Travel with TIMESTAMP AS OF to roll back to any historical version
SQL Commands Used
Command / Feature
Purpose
Notes
CREATE TABLE
Create ODS raw POS transaction tables and dimension tables
Regular tables, serving as upstream for Dynamic Tables
CREATE BLOOMFILTER INDEX
Create a BloomFilter index on the product_id column
Suitable for point-lookup filtering on high-cardinality columns
After unzipping, you get 12 CSV files: stores.csv (100 rows), orders.csv (300k rows), order_items.csv (600k rows), products.csv (10k rows), etc. This guide uses 100 stores, 100 orders, and 200 order line items as the demo dataset.
ODS Layer: Multi-Store POS Raw Data
The ODS layer directly receives raw data from regional POS systems. Tables are created with natural distribution by business primary key — no aggregation or transformation.
MySQL CDC Ingestion
In production, each regional POS system is typically an independent MySQL instance. A Studio multi-table real-time sync task (task_type=281) uses CDC to sync Binlog changes in real time to the ODS layer.
⚠️ Note: MySQL CDC connections do not use CREATE STORAGE CONNECTION. Singdata Lakehouse's STORAGE CONNECTION only supports object storage (OSS/HDFS, etc.) and Kafka types — it does not support MySQL. MySQL data sources must be configured in Studio's datasource management, and CDC tasks are created through the multi-table real-time sync feature.
Source MySQL preparation
Confirm the following parameters are correctly configured on the source MySQL:
Parameter
Required value
How to check
log_bin
ON
SHOW GLOBAL VARIABLES LIKE 'log_bin'
binlog_format
ROW
SHOW GLOBAL VARIABLES LIKE 'binlog_format'
binlog_row_image
FULL
SHOW GLOBAL VARIABLES LIKE 'binlog_row_image'
The sync account needs the following permissions:
GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'cdc_user'@'%';
Studio multi-table real-time sync configuration steps
Add the MySQL data source in Studio datasource management (one data source per regional database, e.g., ds_mysql_pos_north, ds_mysql_pos_south)
Create a new task in Studio Development → Multi-table real-time sync, selecting the sync mode:
Full database mirror: database-level granularity, auto-adapts to new tables, suitable for initial full database ingestion
Multi-table mirror: sync selected tables, suitable when only some tables are needed
Configure the source data source, select the target workspace and schema (best_practice_retail_pos)
Set sync_mode to Full sync + incremental, which pulls historical data first then switches to CDC
Submit the task and manually start it in Studio
After the task starts, it goes through three phases: initialize → full sync → incremental CDC. End-to-end latency in the incremental phase is seconds. Multiple regional databases can have separate tasks writing to their respective tables under the same schema.
💡 Tip: If MySQL CDC + Kafka is not configured yet, you can directly write simulated data with INSERT INTO. All subsequent Dynamic Table and query logic remains the same. The table creation and data examples below use the INSERT approach.
Create Tables
Store master table (partitioned)
CREATE TABLE IF NOT EXISTS best_practice_retail_pos.doc_ods_stores (
store_id INT,
city STRING
);
Product dimension tables
CREATE TABLE IF NOT EXISTS best_practice_retail_pos.doc_ods_categories (
category_id INT,
category_name STRING
);
CREATE TABLE IF NOT EXISTS best_practice_retail_pos.doc_ods_products (
product_id INT,
category_id INT,
supplier_id INT,
price DOUBLE
);
CREATE TABLE IF NOT EXISTS best_practice_retail_pos.doc_ods_promotions (
promotion_id INT,
discount INT -- discount percentage, e.g. 24 means 24% off (76% of original price)
);
Customer and order tables
CREATE TABLE IF NOT EXISTS best_practice_retail_pos.doc_ods_customers (
customer_id INT,
city STRING,
signup_date DATE
);
CREATE TABLE IF NOT EXISTS best_practice_retail_pos.doc_ods_orders (
order_id INT,
customer_id INT,
store_id INT,
order_date DATE,
promotion_id INT
)
PARTITIONED BY (store_id);
CREATE TABLE IF NOT EXISTS best_practice_retail_pos.doc_ods_order_items (
order_item_id INT,
order_id INT,
product_id INT,
qty INT,
price DOUBLE
);
Payment and return tables
CREATE TABLE IF NOT EXISTS best_practice_retail_pos.doc_ods_payments (
payment_id INT,
order_id INT,
amount DOUBLE
);
CREATE TABLE IF NOT EXISTS best_practice_retail_pos.doc_ods_returns (
return_id INT,
order_item_id INT,
refund DOUBLE
);
Load Data
💡 Tip: The examples below use cz-cli (the Singdata ClickZetta Lakehouse command-line tool). If cz-cli is not installed, see the cz-cli setup guide. If you prefer not to use the command line, you can run SQL in Lakehouse Studio's Development → SQL Editor and configure and trigger scheduled tasks in the Studio → Tasks page.
Import Kaggle CSV files via cz-cli (replace with PIPE auto-ingestion in production).
Import from local CSV (recommended)
Save each table's data as a CSV file, then import in bulk via User Volume:
-- Step 1: Upload the local CSV file to User Volume via SQL PUT
PUT '/path/to/stores.csv' TO USER VOLUME FILE 'stores.csv';
-- Step 2: COPY INTO the table from User Volume
COPY INTO best_practice_retail_pos.doc_ods_stores
FROM USER VOLUME
USING csv
OPTIONS('header'='true', 'sep'=',', 'nullValue'='')
FILES ('stores.csv');
Repeat the above steps for all other tables (doc_ods_orders, doc_ods_order_items, doc_ods_products, doc_ods_customers, doc_ods_payments, doc_ods_returns, etc.), uploading the corresponding CSV files and running COPY INTO for each.
You can also insert a small batch of test data inline (no CSV file required):
-- Example: load store master data (100 stores)
INSERT INTO best_practice_retail_pos.doc_ods_stores VALUES
(1,'Pune'),(2,'Pune'),(3,'Delhi'),(4,'Mumbai'),(5,'Mumbai'),
-- ... 100 rows total
(100,'Delhi');
-- Load orders (100) and order items (200)
INSERT INTO best_practice_retail_pos.doc_ods_orders VALUES
(1,45,33,CAST('2021-08-26' AS DATE),24),
(2,10,81,CAST('2022-03-19' AS DATE),3),
-- ...
(100,31,63,CAST('2022-03-30' AS DATE),33);
Verify ODS table row counts:
SELECT
(SELECT COUNT(*) FROM best_practice_retail_pos.doc_ods_stores) AS stores,
(SELECT COUNT(*) FROM best_practice_retail_pos.doc_ods_orders) AS orders,
(SELECT COUNT(*) FROM best_practice_retail_pos.doc_ods_order_items) AS items,
(SELECT COUNT(*) FROM best_practice_retail_pos.doc_ods_products) AS products,
(SELECT COUNT(*) FROM best_practice_retail_pos.doc_ods_payments) AS payments,
(SELECT COUNT(*) FROM best_practice_retail_pos.doc_ods_returns) AS returns;
Filtering order line items by product_id is a frequent operation (SKU sales ranking, fast/slow-moving SKU analysis). The product_id column has cardinality on the order of the product count, making it a good candidate for a BloomFilter Index:
CREATE BLOOMFILTER INDEX IF NOT EXISTS idx_bf_product_id
ON TABLE doc_ods_order_items (product_id);
⚠️ Note: CREATE BLOOMFILTER INDEX requires the same schema context as the target table. Run USE SCHEMA best_practice_retail_pos or pass -s best_practice_retail_pos before executing. The BloomFilter Index only applies automatically to data written after it is created. Existing data requires BUILD INDEX to rebuild (the BLOOMFILTER type does not support BUILD INDEX; to cover existing data the table must be rebuilt).
External Schema: Accessing Hive Historical Archive Data
When historical archive data is still in a Hive cluster, query it directly via External Schema without migrating:
-- Step 1: Create a Catalog Connection pointing to Hive Metastore
CREATE CATALOG CONNECTION IF NOT EXISTS conn_hive_pos
TYPE HMS
HIVE_METASTORE_URIS = 'thrift://hive-metastore:9083'
STORAGE_CONNECTION = 'conn_oss_archive';
-- Step 2: Mount the Hive database as an External Schema
CREATE EXTERNAL SCHEMA IF NOT EXISTS pos_hive_archive
CONNECTION conn_hive_pos
OPTIONS (SCHEMA = 'pos_archive_db');
-- Step 3: Query historical archived orders directly (two-level naming, like a local table)
SELECT order_id, store_id, order_date, total_amount
FROM pos_hive_archive.historical_orders
WHERE store_id = 33
AND order_date >= CAST('2019-01-01' AS DATE)
ORDER BY order_date DESC
LIMIT 5;
⚠️ Note: Tables under an External Schema only support SELECT queries; INSERT / UPDATE / DELETE are not supported. To backfill historical data into a local Lakehouse table, run INSERT INTO best_practice_retail_pos.doc_ods_orders SELECT ... FROM pos_hive_archive.historical_orders.
DWD Layer: Standardized Sales Detail
The DWD layer JOINs multiple ODS fact tables with dimension tables to produce a wide table with derived discounted amounts and return flags, forming a complete view of each transaction.
Create Table
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_retail_pos.doc_dwd_sales_detail
AS
SELECT
oi.order_item_id,
o.order_id,
o.store_id,
s.city AS store_city,
o.order_date,
o.customer_id,
c.city AS customer_city,
oi.product_id,
p.category_id,
cat.category_name,
p.price AS list_price,
oi.qty,
oi.price AS unit_price,
CAST(oi.qty * oi.price AS DOUBLE) AS gross_amount,
COALESCE(pr.discount, 0) AS discount_pct,
ROUND(oi.qty * oi.price * (1.0 - COALESCE(pr.discount, 0) / 100.0), 2) AS net_amount,
CASE WHEN r.return_id IS NOT NULL THEN 1 ELSE 0 END AS is_returned,
COALESCE(r.refund, 0) AS refund_amount
FROM best_practice_retail_pos.doc_ods_order_items oi
JOIN best_practice_retail_pos.doc_ods_orders o ON oi.order_id = o.order_id
JOIN best_practice_retail_pos.doc_ods_stores s ON o.store_id = s.store_id
JOIN best_practice_retail_pos.doc_ods_customers c ON o.customer_id = c.customer_id
JOIN best_practice_retail_pos.doc_ods_products p ON oi.product_id = p.product_id
JOIN best_practice_retail_pos.doc_ods_categories cat ON p.category_id = cat.category_id
LEFT JOIN best_practice_retail_pos.doc_ods_promotions pr ON o.promotion_id = pr.promotion_id
LEFT JOIN best_practice_retail_pos.doc_ods_returns r ON oi.order_item_id = r.order_item_id;
⚠️ Note: Do not include REFRESH INTERVAL in the CREATE DYNAMIC TABLE DDL. Refresh scheduling is managed through Studio Tasks (see below), which allows attaching data quality checks and alert rules to the same task.
Result interpretation: Row 4 (product 23, promotion discount 34%) has net_amount (5586.24) = 8464 × (1 - 0.34) — the discount calculation is correct. Row 3 has is_returned = 0, indicating this order item was not returned; the LEFT JOIN to doc_ods_returns returned NULL, which was converted to 0 by CASE WHEN.
DWS Layer: Store Daily Sales Summary
The DWS layer aggregates DWD details at store + transaction date granularity to support daily reports, weekly trends, and period-over-period analysis.
Store Daily Summary Dynamic Table
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_retail_pos.doc_dws_store_daily_sales
AS
SELECT
store_id,
store_city,
order_date,
COUNT(DISTINCT order_id) AS order_count,
COUNT(order_item_id) AS item_count,
SUM(qty) AS total_qty,
ROUND(SUM(gross_amount), 2) AS gross_revenue,
ROUND(SUM(net_amount), 2) AS net_revenue,
ROUND(AVG(discount_pct), 2) AS avg_discount_pct,
SUM(is_returned) AS return_count,
ROUND(SUM(refund_amount), 2) AS total_refund,
COUNT(DISTINCT product_id) AS sku_count
FROM best_practice_retail_pos.doc_dwd_sales_detail
GROUP BY store_id, store_city, order_date;
REFRESH DYNAMIC TABLE best_practice_retail_pos.doc_dws_store_daily_sales;
SELECT store_id, store_city, order_date, order_count, total_qty,
gross_revenue, net_revenue, avg_discount_pct, return_count, sku_count
FROM best_practice_retail_pos.doc_dws_store_daily_sales
ORDER BY net_revenue DESC
LIMIT 8;
Result interpretation: Delhi store 17 (2021-12-09) tops the single-day chart with ¥22,831 net revenue and a discount rate of only 9%, indicating low promotional intensity but high average transaction value around the holiday period. Hyderabad store 55 (discount rate 25.5%) had higher transaction volume (qty=9) but net revenue was compressed by the discount.
Static Partition Refresh by Store
When you need to precisely backfill a specific store's data for a specific month, use PARTITIONED BY (store_id) + SESSION_CONFIGS parameterized partitioning:
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_retail_pos.doc_dws_store_date_partition (
store_id, order_date, order_count, item_count, total_qty, net_revenue
)
PARTITIONED BY (store_id)
AS
SELECT
store_id,
order_date,
COUNT(DISTINCT order_id) AS order_count,
COUNT(order_item_id) AS item_count,
SUM(qty) AS total_qty,
ROUND(SUM(net_amount), 2) AS net_revenue
FROM best_practice_retail_pos.doc_dwd_sales_detail
WHERE store_id = CAST(SESSION_CONFIGS()['dt.args.store_id'] AS INT)
GROUP BY store_id, order_date;
⚠️ Note: A partitioned Dynamic Table must explicitly declare PARTITIONED BY — you cannot rely on dynamic partition inference. SESSION_CONFIGS()['dt.args.xxx'] returns type STRING; you must CAST it to the target type (INT / DATE, etc.) before it can match the partition column. Otherwise refresh will fail with a type incompatibility error.
💡 Tip: Studio Tasks support attaching data quality checks (e.g., net_revenue > 0) and alert notifications to the same task. If DWD refresh results in 0 rows in doc_dws_store_daily_sales on a given day, you can configure an alert on the task to send a Lark message to the on-call person.
ADS Layer: Fast/Slow-Moving SKU Analysis and Store Rankings
SKU Sales Velocity Analysis
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_retail_pos.doc_ads_sku_velocity
AS
SELECT
product_id,
category_id,
category_name,
SUM(qty) AS total_sold_qty,
COUNT(DISTINCT order_id) AS order_count,
ROUND(SUM(net_amount), 2) AS total_net_revenue,
COUNT(DISTINCT store_id) AS store_coverage,
SUM(is_returned) AS return_count,
ROUND(SUM(is_returned) * 100.0 / NULLIF(COUNT(*), 0), 2) AS return_rate_pct,
CASE
WHEN SUM(qty) >= 10 THEN 'fast_moving'
WHEN SUM(qty) >= 5 THEN 'normal'
ELSE 'slow_moving'
END AS velocity_label,
ROUND(SUM(net_amount) / NULLIF(COUNT(DISTINCT store_id), 0), 2) AS revenue_per_store
FROM best_practice_retail_pos.doc_dwd_sales_detail
GROUP BY product_id, category_id, category_name;
REFRESH DYNAMIC TABLE best_practice_retail_pos.doc_ads_sku_velocity;
-- View SKU count and revenue distribution by velocity tier
SELECT velocity_label, COUNT(*) AS sku_count,
ROUND(SUM(total_net_revenue), 2) AS label_revenue
FROM best_practice_retail_pos.doc_ads_sku_velocity
GROUP BY velocity_label
ORDER BY label_revenue DESC;
Result interpretation: Of 100 SKUs, 37 are normal movers (qty 5–9) and 61 are slow-moving (qty < 5). Slow-moving SKUs contributed ¥392,062 in net revenue, about 43% of the total. This shows that high-priced slow movers still contribute significant revenue — procurement decisions should consider both total_net_revenue and total_sold_qty, not just sales volume alone.
View top 10 best-selling SKUs:
SELECT product_id, category_name, total_sold_qty, order_count,
total_net_revenue, store_coverage, velocity_label
FROM best_practice_retail_pos.doc_ads_sku_velocity
ORDER BY total_sold_qty DESC
LIMIT 10;
Result interpretation: Product 20 (Cat_27) sold only 2 units at ~¥477 each — a typical low-price, low-volume double slow mover, prioritize for promotional clearance. Products 54, 66, and 2 each completed just 1 sale at 1 store; consider delisting or concentrating inventory in high-traffic stores for trial sales.
Store Revenue Rankings
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_retail_pos.doc_ads_store_ranking
AS
SELECT
store_id,
store_city,
SUM(order_count) AS total_orders,
SUM(item_count) AS total_items,
SUM(total_qty) AS total_qty,
ROUND(SUM(gross_revenue), 2) AS gross_revenue,
ROUND(SUM(net_revenue), 2) AS net_revenue,
ROUND(AVG(avg_discount_pct), 2) AS avg_discount_pct,
SUM(return_count) AS total_returns,
ROUND(SUM(return_count) * 100.0 / NULLIF(SUM(item_count), 0), 2) AS return_rate_pct,
RANK() OVER (ORDER BY SUM(net_revenue) DESC) AS revenue_rank
FROM best_practice_retail_pos.doc_dws_store_daily_sales
GROUP BY store_id, store_city;
REFRESH DYNAMIC TABLE best_practice_retail_pos.doc_ads_store_ranking;
SELECT store_id, store_city, total_orders, total_qty,
gross_revenue, net_revenue, avg_discount_pct,
total_returns, return_rate_pct, revenue_rank
FROM best_practice_retail_pos.doc_ads_store_ranking
ORDER BY revenue_rank
LIMIT 10;
Hyderabad store 85 (rank 1) has ¥31,331 net revenue, but avg_discount_pct = 24.33 (relatively high discount rate) and a return rate of 16.67% — there are quality concerns behind the high revenue. Investigate the high-return SKU mix.
Delhi store 17 (rank 2) has a discount rate of only 21.5% and a return rate of 0% — a truly high-quality store that can be used as an operational benchmark to promote best practices.
Delhi store 87 (rank 5) has a discount rate as high as 34% and a return rate of 25%. High-promotion-driven revenue is unsustainable; gradually adjust the promotion strategy.
Complete scheduling chain: refresh_dwd_sales_detail (01:30) → refresh_dws_store_daily (01:30, depends on the previous) → refresh_ads_* (02:00).
Time Travel: Month-End Reconciliation
During month-end financial reconciliation, you need to roll back to the order snapshot at the month-end cutoff time and compare it with the current data to detect any subsequently backloaded orders.
-- View the ODS order table version history
DESC HISTORY best_practice_retail_pos.doc_ods_orders;
Scenario: Month-end financial reconciliation — Assume the month-end cutoff was version 3 (60 orders). Now the system shows 100 orders, and you need to identify the 40 orders added after the cutoff:
-- Roll back to the store revenue snapshot at the month-end cutoff
SELECT
snap.order_date,
snap.store_id,
COUNT(*) AS orders_in_snapshot,
SUM(p.amount) AS snapshot_revenue
FROM best_practice_retail_pos.doc_ods_orders TIMESTAMP AS OF '2026-06-06 14:41:15.488' snap
JOIN best_practice_retail_pos.doc_ods_payments p ON snap.order_id = p.order_id
GROUP BY snap.order_date, snap.store_id
ORDER BY snapshot_revenue DESC
LIMIT 5;
-- Find orders backloaded after the month-end cutoff (present in current system, absent in historical snapshot)
SELECT o_cur.order_id, o_cur.store_id, o_cur.order_date
FROM best_practice_retail_pos.doc_ods_orders o_cur
LEFT JOIN (
SELECT order_id
FROM best_practice_retail_pos.doc_ods_orders
TIMESTAMP AS OF '2026-06-06 14:41:15.488'
) snap ON o_cur.order_id = snap.order_id
WHERE snap.order_id IS NULL
ORDER BY o_cur.order_id
LIMIT 10;
Result interpretation: order_id 61–100 (40 orders total) were backloaded after the month-end cutoff. Submit these order IDs to finance for review to determine whether to include them in the current month's accounting or carry them forward to the next month — this is the core value of Time Travel in month-end reconciliation scenarios.
💡 Tip: TIMESTAMP AS OF accepts literal constants (e.g., '2026-06-06 14:41:15.488'), but does not support expressions like NOW() - INTERVAL 30 DAY. To use relative time, first compute the specific timestamp with SELECT CURRENT_TIMESTAMP() - INTERVAL 30 DAYS, then substitute it into the query.
Data Warehouse Object Summary
All objects in the best_practice_retail_pos schema:
BloomFilter Index does not automatically apply to existing data: CREATE BLOOMFILTER INDEX only accelerates new data written after creation. Existing data must be written again after the index is created, or you accept that existing data will not benefit from BloomFilter acceleration.
Partitioned Dynamic Tables must explicitly declare PARTITIONED BY: You cannot rely on the system to infer partition columns automatically. SESSION_CONFIGS()['dt.args.xxx'] returns STRING and must be CAST to the target type before it can match the partition column; otherwise refresh will fail with a type incompatibility error.
Do not write REFRESH INTERVAL for REFRESH DYNAMIC TABLE: All periodic refreshes for Dynamic Tables are managed through Studio Tasks. Studio Tasks support attaching data quality rules and alerts to the same task. Writing REFRESH INTERVAL in the DDL bypasses this management mechanism.
Time Travel TIMESTAMP AS OF only accepts constants: Runtime expressions like NOW() - INTERVAL N DAY are not supported. Pre-calculate the target timestamp before calling. DESC HISTORY returns UTC time; add 8 hours when converting to local time (UTC+8).
External Schema is read-only: External tables under pos_hive_archive do not support INSERT / UPDATE / DELETE. To import historical data into local Lakehouse tables, use INSERT INTO ... SELECT ... FROM pos_hive_archive.xxx for explicit migration.
ODS layer doc_ods_orders uses PARTITIONED BY (store_id, order_date): If the upstream CDC uses INSERT OVERWRITE (rather than INSERT INTO), Dynamic Tables will fall back to full refresh. Use append-write mode (INSERT only) to preserve Dynamic Table incremental refresh capability.