Manufacturing Quality Control Data Warehouse Best Practices (SPC Statistical Process Control)
Build a three-layer quality control data warehouse from MES system real-time inspection data and manual sampling records to support SPC control charts, Cpk process capability analysis, and defect Pareto analysis. Using a dataset of 1,000 manufacturing defect records (covering 100 product types, 3 production lines, and 3 defect categories), this guide demonstrates the complete end-to-end build process: Kafka PIPE → Bronze → Silver → Gold, covering the practical application of four platform capabilities: BloomFilter Index, SQL UDFs, Dynamic Tables, and sliding window statistics.
Overview
The typical data pipeline for manufacturing quality control is: online inspection reporting → real-time ingestion → raw storage (Bronze) → cleansing and labeling (Silver) → SPC metric aggregation (Gold).
Singdata Lakehouse addresses several core challenges through the following combination:
Problem
Singdata Solution
MES inspection data written in real-time at high frequency
Kafka PIPE continuous ingestion, no need to write consumers manually
ingest_time uses DEFAULT CURRENT_TIMESTAMP(), automatically populated when Kafka PIPE writes records — no need to carry this field in the message body.
Create BloomFilter Index
Both Silver and Gold layers will frequently filter by product_id, a high-cardinality column (100 product types). BloomFilter Index is appropriate here.
CREATE BLOOMFILTER INDEX idx_bf_product_id
ON TABLE doc_defect_events (product_id);
⚠️ Note: CREATE BLOOMFILTER INDEX requires the same schema context as the target table. Run USE SCHEMA best_practice_manufacturing_spc first, or specify -s best_practice_manufacturing_spc in cz-cli, otherwise you will get an "index and table must in the same schema" error.
spec_ucl / spec_lcl are product design specifications used for Cpk calculation. The UCL/LCL in production control charts are calculated from actual data (see Silver layer).
Configure Kafka PIPE (Real-time Ingestion)
Option 1: Write via actual Kafka (recommended)
In production, the MES system pushes inspection results to a Kafka topic and PIPE automatically consumes and writes to the Bronze layer. Python producer example:
Create Kafka PIPE (the DDL will attempt to connect to the broker for validation):
CREATE TABLE IF NOT EXISTS best_practice_manufacturing_spc.kafka_raw_defects (value STRING);
CREATE PIPE IF NOT EXISTS best_practice_manufacturing_spc.pipe_defect_events
VIRTUAL_CLUSTER = 'DEFAULT'
BATCH_INTERVAL_IN_SECONDS = '60'
AS
COPY INTO best_practice_manufacturing_spc.kafka_raw_defects
FROM (
SELECT CAST(value AS STRING) AS value
FROM READ_KAFKA(
'<kafka-broker>:9092',
'mes_defect_events',
'',
'cz_mes_consumer',
'','','','',
'raw', 'raw',
0,
map()
)
);
💡 Tip: In the PIPE DDL, positional parameters 5–8 in READ_KAFKA (start/end offsets) must be left empty and are managed automatically by the PIPE at runtime.
Option 2: INSERT simulation (without a Kafka environment)
💡 Tip: The examples below use cz-cli (the Singdata Lakehouse command-line tool). If cz-cli is not installed, refer to the cz-cli setup guide. Alternatively, you can execute SQL in Lakehouse Studio → Development → SQL Editor and configure scheduled tasks on the Studio → Tasks page.
If Kafka is not yet configured, save the data as a local CSV file and use cz-cli to upload it to User Volume for import via COPY INTO (recommended):
Load from local CSV (recommended)
-- Step 1: Upload the local CSV file to User Volume via SQL PUT
PUT '/path/to/defect_events_data.csv' TO USER VOLUME FILE 'defect_events_data.csv';
-- Step 2: COPY INTO the table from User Volume
COPY INTO best_practice_manufacturing_spc.doc_defect_events
FROM USER VOLUME
USING csv
OPTIONS('header'='true', 'sep'=',', 'nullValue'='')
FILES ('defect_events_data.csv');
You can also insert small batches of test data inline (no CSV file needed):
This guide uses the Kaggle dataset (fahmidachowdhury/manufacturing-defects, 1,000 records) loaded via bulk INSERT to verify the complete computation pipeline:
INSERT INTO best_practice_manufacturing_spc.doc_defect_events
(defect_id, product_id, defect_type, defect_date,
defect_location, severity, inspection_method, repair_cost)
VALUES
(1, 15, 'Structural', CAST('2024-06-06' AS DATE), 'Component', 'Minor', 'Visual Inspection', 245.47),
(2, 6, 'Functional', CAST('2024-04-26' AS DATE), 'Component', 'Minor', 'Visual Inspection', 26.87),
(3, 84, 'Structural', CAST('2024-02-15' AS DATE), 'Internal', 'Minor', 'Automated Testing', 835.81),
(4, 10, 'Functional', CAST('2024-03-28' AS DATE), 'Internal', 'Critical', 'Automated Testing', 444.47)
-- ... 1,000 records total
;
Verify Bronze layer row count:
SELECT COUNT(*) AS total_rows FROM best_practice_manufacturing_spc.doc_defect_events;
total_rows
----------
1000
View data distribution overview:
SELECT defect_type, COUNT(*) AS defect_count
FROM best_practice_manufacturing_spc.doc_defect_events
GROUP BY defect_type
ORDER BY defect_count DESC;
Upload manual sampling spreadsheets to Volume, then bulk import via COPY INTO. Supports automatically skipping already-imported files (idempotent):
COPY INTO best_practice_manufacturing_spc.doc_defect_events
(defect_id, product_id, defect_type, defect_date,
defect_location, severity, inspection_method, repair_cost)
FROM (
SELECT
$1::INT AS defect_id,
$2::INT AS product_id,
$3 AS defect_type,
$4::DATE AS defect_date,
$5 AS defect_location,
$6 AS severity,
$7 AS inspection_method,
$8::DOUBLE AS repair_cost
FROM @best_practice_manufacturing_spc.sampling_volume/defects_data.csv
)
USING csv
OPTIONS('header'='true', 'sep'=',');
💡 Tip: COPY INTO defaults to file-level idempotent deduplication — running the same file multiple times will not re-import it. To allow re-importing the same file, add the FORCE = TRUE option.
SQL UDFs: Cpk and Severity Scoring
Cpk Process Capability Index
Cpk (Process Capability Index) measures how well a production process is centered relative to specification requirements and its variation level. Cpk ≥ 1.33 indicates good process capability; < 1.0 means the process does not meet specification requirements.
CREATE OR REPLACE FUNCTION best_practice_manufacturing_spc.calc_cpk(
avg_val DOUBLE,
std_val DOUBLE,
ucl DOUBLE,
lcl DOUBLE
)
RETURNS DOUBLE
AS CASE
WHEN std_val <= 0 THEN NULL
ELSE LEAST((ucl - avg_val) / (3.0 * std_val),
(avg_val - lcl) / (3.0 * std_val))
END;
Validate the function (target value centered, process standard deviation 1.2, specification range ±5):
SELECT best_practice_manufacturing_spc.calc_cpk(100.5, 1.2, 105.0, 95.0) AS cpk_sample;
cpk_sample
----------
1.25
💡 Tip: Cpk = 1.25 corresponds to a sigma level of approximately 3.75σ and a PPM of about 197 — a state of "acceptable but with room for improvement." Production line targets typically require Cpk ≥ 1.33 (PPM ≤ 64).
Severity Scoring UDF
Maps text-based severity to numeric scores for weighted risk aggregation in the Silver layer:
CREATE OR REPLACE FUNCTION best_practice_manufacturing_spc.severity_score(
severity STRING
)
RETURNS INT
AS CASE severity
WHEN 'Critical' THEN 3
WHEN 'Moderate' THEN 2
WHEN 'Minor' THEN 1
ELSE 0
END;
Silver Layer Dynamic Table: Cleansing and Dimension Enrichment
The Silver layer performs two tasks on top of Bronze raw defect events:
LEFT JOIN doc_product_master to enrich each event with production line, product category, and specification limits
Compute severity_score and is_critical flags to allow direct aggregation in the Gold layer
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_manufacturing_spc.silver_defect_enriched
REFRESH INTERVAL 10 MINUTE VCLUSTER DEFAULT
AS
SELECT
e.defect_id,
e.product_id,
e.defect_type,
e.defect_date,
e.defect_location,
e.severity,
e.inspection_method,
e.repair_cost,
e.ingest_time,
p.product_name,
p.production_line,
p.product_category,
p.spec_ucl,
p.spec_lcl,
p.spec_target,
best_practice_manufacturing_spc.severity_score(e.severity) AS severity_score,
CASE
WHEN e.severity = 'Critical' THEN 1
ELSE 0
END AS is_critical,
DATE_TRUNC('month', e.defect_date) AS defect_month
FROM best_practice_manufacturing_spc.doc_defect_events e
LEFT JOIN best_practice_manufacturing_spc.doc_product_master p ON e.product_id = p.product_id;
⚠️ Note: Do not use REFRESH INTERVAL in Dynamic Table DDL for production scheduling. Instead, create a "refresh Dynamic Table" task in Lakehouse Studio with a Cron expression. This way, monitoring alerts and data quality rules can be added to the same task for unified observability (see the "Studio Refresh Task Configuration" section). The REFRESH INTERVAL 10 MINUTE in the DDL sets the refresh capability of the DT; actual triggering is controlled by the Studio Task.
Manually trigger the first refresh:
REFRESH DYNAMIC TABLE best_practice_manufacturing_spc.silver_defect_enriched;
SELECT COUNT(*) AS silver_count
FROM best_practice_manufacturing_spc.silver_defect_enriched;
silver_count
------------
1000
View distribution by production line and defect type (direct query on Silver layer):
SELECT
production_line,
defect_type,
COUNT(*) AS defect_count,
SUM(is_critical) AS critical_count,
ROUND(AVG(repair_cost), 2) AS avg_repair_cost
FROM best_practice_manufacturing_spc.silver_defect_enriched
GROUP BY production_line, defect_type
ORDER BY production_line, defect_count DESC;
Result interpretation: Line-3 Structural defects have the highest average repair cost (540.60), and the Critical ratio (42/122 = 34.4%) is also higher than the same defect type on Line-1 and Line-2 — this is the priority target for rework cost management.
SPC Control Charts: Sliding Window UCL/LCL Calculation
SPC control charts calculate process control upper and lower limits (UCL/LCL) using process mean (μ) and standard deviation (σ) to identify out-of-control points. This section uses window functions to implement a c control chart (count-type), suitable for defect count data.
Control limit formulas:
UCL = μ + 3σ
LCL = max(0, μ − 3σ) (count data lower limit cannot be negative)
WITH monthly_stats AS (
-- First aggregate by product + month to avoid nested aggregate function errors
SELECT
product_id,
defect_month,
COUNT(*) AS monthly_defects
FROM best_practice_manufacturing_spc.silver_defect_enriched
GROUP BY product_id, defect_month
)
SELECT
product_id,
defect_month,
monthly_defects,
-- Rolling 3-month average (moving average)
ROUND(AVG(monthly_defects) OVER (
PARTITION BY product_id
ORDER BY defect_month
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
), 2) AS rolling_3m_avg,
-- Full historical process mean (center line of control chart)
ROUND(AVG(monthly_defects) OVER (PARTITION BY product_id), 2) AS process_mean,
ROUND(STDDEV_SAMP(monthly_defects) OVER (PARTITION BY product_id), 2) AS process_std,
-- UCL / LCL
ROUND(AVG(monthly_defects) OVER (PARTITION BY product_id)
+ 3 * STDDEV_SAMP(monthly_defects) OVER (PARTITION BY product_id), 2) AS ucl,
ROUND(GREATEST(0, AVG(monthly_defects) OVER (PARTITION BY product_id)
- 3 * STDDEV_SAMP(monthly_defects) OVER (PARTITION BY product_id)), 2) AS lcl,
-- Process status determination
CASE
WHEN monthly_defects > AVG(monthly_defects) OVER (PARTITION BY product_id)
+ 3 * STDDEV_SAMP(monthly_defects) OVER (PARTITION BY product_id)
THEN 'OUT_OF_CONTROL'
WHEN monthly_defects < GREATEST(0, AVG(monthly_defects) OVER (PARTITION BY product_id)
- 3 * STDDEV_SAMP(monthly_defects) OVER (PARTITION BY product_id))
THEN 'OUT_OF_CONTROL'
ELSE 'IN_CONTROL'
END AS spc_status
FROM monthly_stats
WHERE product_id IN (10, 14, 15)
ORDER BY product_id, defect_month;
Result interpretation: The sample products (10, 14, 15) are all IN_CONTROL during the observation period, with monthly defect counts not exceeding the UCL. Product 15 has the smallest process variation (σ = 0.71), indicating the most stable inspection process. Product 14's June defect count of 4 is approaching the UCL (5.49) — it is advisable to monitor subsequent trends.
⚠️ Note: Window functions do not allow nested aggregation (SUM(SUM(col)) OVER (...) will produce an "aggregate function cannot contain another aggregate function" error in grouped aggregation scenarios). The correct approach is to first complete group aggregation in a CTE, then apply window functions on the outer layer.
Cpk Analysis: Process Capability by Production Line
Apply the calc_cpk UDF to repair cost data aggregated by production line (using repair cost as a substitute for physical measurement values in this demonstration):
WITH line_stats AS (
SELECT
production_line,
COUNT(*) AS total_defects,
ROUND(AVG(repair_cost), 2) AS avg_repair_cost,
ROUND(STDDEV_SAMP(repair_cost), 2) AS std_repair_cost
FROM best_practice_manufacturing_spc.silver_defect_enriched
GROUP BY production_line
)
SELECT
production_line,
total_defects,
avg_repair_cost,
std_repair_cost,
ROUND(best_practice_manufacturing_spc.calc_cpk(
avg_repair_cost,
std_repair_cost,
1000.0, -- Upper specification limit (maximum acceptable repair cost)
0.0 -- Lower specification limit
), 3) AS repair_cost_cpk
FROM line_stats
ORDER BY production_line;
Result interpretation: All three production lines have repair_cost_cpk well below 1.0, indicating that the repair cost distribution has excessive variation relative to the specification range (0–1000). The main reason is the wide cost distribution range (10–1000), with standard deviation around 290. Line-2 has the highest Cpk (0.587), indicating slightly better cost concentration than the other two lines.
Gold Layer Dynamic Table: Monthly Production Line Aggregation
The Gold layer aggregates Silver layer data at production_line + defect_month granularity, outputting monthly defect trends and Critical ratios for quality management dashboards.
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_manufacturing_spc.gold_production_line_monthly
PARTITIONED BY (production_line)
REFRESH INTERVAL 10 MINUTE VCLUSTER DEFAULT
TBLPROPERTIES ('static_partitions' = 'true')
AS
SELECT
production_line,
defect_month,
COUNT(*) AS total_defects,
SUM(is_critical) AS critical_defects,
ROUND(SUM(is_critical)*100.0/COUNT(*), 2) AS critical_rate_pct,
ROUND(SUM(repair_cost), 2) AS total_repair_cost,
ROUND(AVG(repair_cost), 2) AS avg_repair_cost,
COUNT(DISTINCT product_id) AS affected_products
FROM best_practice_manufacturing_spc.silver_defect_enriched
GROUP BY production_line, defect_month;
⚠️ Note: A Dynamic Table with partitions must explicitly declare TBLPROPERTIES ('static_partitions' = 'true') to use static partitioning mode. Without this declaration, the system defaults to dynamic partition inference, which may cause abnormal partition data overwrites during incremental refresh.
View Line-3 monthly trends (the production line with the largest Critical ratio fluctuation):
SELECT production_line, defect_month, total_defects, critical_rate_pct, total_repair_cost
FROM best_practice_manufacturing_spc.gold_production_line_monthly
WHERE production_line = 'Line-3'
ORDER BY defect_month;
Result interpretation: Line-3 had the highest defect count in January (86), but the Critical ratio (38.37%) was not the highest for the period — April reached 48.89% with a relatively low total defect count (45). This "low volume but high severity" pattern may indicate a raw material batch issue, requiring further investigation by product_id dimension.
Gold Layer Dynamic Table: Defect Pareto Analysis
Pareto analysis applies the "80/20 rule" to identify the small number of root cause categories responsible for the majority of defect costs.
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_manufacturing_spc.gold_defect_pareto
REFRESH INTERVAL 10 MINUTE VCLUSTER DEFAULT
AS
SELECT
defect_type,
severity,
defect_location,
COUNT(*) AS defect_count,
ROUND(SUM(repair_cost), 2) AS total_repair_cost,
ROUND(COUNT(*)*100.0 / SUM(COUNT(*)) OVER (), 2) AS defect_pct,
SUM(COUNT(*)) OVER (
ORDER BY COUNT(*) DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_count,
ROUND(SUM(COUNT(*)) OVER (
ORDER BY COUNT(*) DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS cumulative_pct
FROM best_practice_manufacturing_spc.silver_defect_enriched
GROUP BY defect_type, severity, defect_location
ORDER BY defect_count DESC;
Result interpretation: The Top 10 categories account for 42.60% of total defects. Surface Structural defects have the highest count, but if sorted by repair cost, Functional × Critical × Component (39 records, total cost 23,047) has the highest per-unit cost (591 per unit). Recommended Pareto remediation priority: first address Surface Structural Minor defects (highest count), then address Component Functional Critical defects (highest per-unit cost).
View total comparison of three main defect types:
SELECT defect_type,
SUM(defect_count) AS total,
ROUND(SUM(total_repair_cost), 2) AS total_cost
FROM best_practice_manufacturing_spc.gold_defect_pareto
GROUP BY defect_type
ORDER BY total DESC;
PPM (Parts Per Million) measures defects per million units and is the standardized quality metric in SPC and Six Sigma systems:
SELECT
defect_type,
COUNT(*) AS defect_count,
ROUND(COUNT(*) * 1000000.0 / 1000, 0) AS ppm_rate
FROM best_practice_manufacturing_spc.doc_defect_events
GROUP BY defect_type
ORDER BY ppm_rate DESC;
Result interpretation: PPM for all three defect types is above 300,000, corresponding to a sigma level of approximately 2σ (Six Sigma target is 3.4 PPM, i.e., 6σ). This is a characteristic of the dataset — 1,000 out of 1,000 items have defects. In production use, the denominator should be the total number of inspected units (including conforming units), typically sourced from MES production completion records, requiring an additional table join.
Studio Refresh Task Configuration
Periodic refresh of Dynamic Tables is managed via Lakehouse Studio Tasks, to which monitoring alerts and data quality rules can be added on the same task.
Use cz-cli task commands to create refresh tasks (equivalent to operating in the Studio UI):
After tasks are created, you can see the two refresh tasks under best_practices/manufacturing_spc/ in the Studio UI. Click a task → Alert Configuration to bind "refresh failure notification" or "data row count is 0 alert" rules. Once ready, publish the tasks:
MES System (Kafka) Manual Sampling (CSV)
| |
v Kafka PIPE (60s batch) v COPY INTO (Volume)
kafka_raw_defects doc_defect_events (Bronze)
| BloomFilter Index (product_id)
doc_product_master ----+ LEFT JOIN
(production_line | UCL / LCL / spec_target)
|
v Studio Task: refresh every 10 min
silver_defect_enriched (Dynamic Table)
severity_score UDF · is_critical · defect_month
| |
+-----------------+ +------------------+
v Studio Task: refresh every 10 min v
gold_production_line_monthly (DT) gold_defect_pareto (DT)
PARTITIONED BY production_line cumulative_pct (Pareto 80%)
static_partitions = true Window Function ORDER BY
| |
v v
Quality Dashboard Root Cause Analysis
Cpk · UCL/LCL 80/20 Defect Focus
Notes
Window functions do not support nested aggregation: Writing STDDEV_SAMP(COUNT(*)) OVER (...) (nesting an aggregate function inside a window function) will produce an "aggregate function cannot contain another aggregate function" error. The correct approach is to first complete the GROUP BY aggregation in a CTE, then apply window functions to the result columns in the outer query.
Partitioned Dynamic Tables must declare static_partitions: A DT with PARTITIONED BY must set TBLPROPERTIES ('static_partitions' = 'true'). Without this declaration, the system uses dynamic partition inference, which may cause old partition data to be overwritten or lost during incremental refresh.
BloomFilter Index does not auto-apply to existing data: CREATE BLOOMFILTER INDEX only takes effect for data written after creation and does not support BUILD INDEX to cover existing data (BLOOMFILTER type does not have this capability; rebuilding the table and re-inserting is required to cover existing data).
Dynamic Table refresh scheduling managed via Lakehouse Studio Tasks: Do not rely on REFRESH INTERVAL in the DDL for production scheduling. Configure Cron in Studio Tasks instead, which allows binding alert rules and data quality checks to the same task for unified observability.
Denominator selection in PPM calculation: The PPM demonstration in this guide uses 1,000 inspection records as the denominator, for illustration only. In production, the PPM denominator should be the total number of inspected units (including conforming units), typically sourced from MES production completion data and requiring an additional table.
Handling std_val = 0 in Cpk: The calc_cpk UDF returns NULL when std_val <= 0 to avoid division by zero. This occurs with a sample size of 1. Before Gold layer aggregation, filter out groups with insufficient sample size (HAVING COUNT(*) > 1).