Manufacturing Quality Control Data Warehouse Best Practices (SPC Statistical Process Control)

Build a three-layer quality control data warehouse from MES system real-time inspection data and manual sampling records to support SPC control charts, Cpk process capability analysis, and defect Pareto analysis. Using a dataset of 1,000 manufacturing defect records (covering 100 product types, 3 production lines, and 3 defect categories), this guide demonstrates the complete end-to-end build process: Kafka PIPE → Bronze → Silver → Gold, covering the practical application of four platform capabilities: BloomFilter Index, SQL UDFs, Dynamic Tables, and sliding window statistics.


Overview

The typical data pipeline for manufacturing quality control is: online inspection reporting → real-time ingestion → raw storage (Bronze) → cleansing and labeling (Silver) → SPC metric aggregation (Gold).

Singdata Lakehouse addresses several core challenges through the following combination:

ProblemSingdata Solution
MES inspection data written in real-time at high frequencyKafka PIPE continuous ingestion, no need to write consumers manually
Manual sampling CSV bulk importVolume + COPY INTO, supports incremental loading
Bronze → Silver → Gold automatic incremental computationDynamic Table with declarative SQL; system automatically schedules dependency chain
High-cardinality product_id column with frequent point lookupsBloomFilter Index for fast on-demand filtering
Cpk process capability and severity scoring logic reusableSQL UDF encapsulates formulas, callable from both Silver and Gold layers
UCL/LCL sliding window control limit calculationWindow functions + CTE, computes process mean and 3σ control limits per product
Efficient querying of large-scale historical inspection data by production lineGold layer Dynamic Table with static partitioning (PARTITIONED BY production_line)

SQL Commands Used

Command / FunctionPurposeNotes
CREATE TABLECreate Bronze layer raw defect event table and product master data tableRegular tables, upstream to Dynamic Tables
CREATE BLOOMFILTER INDEXCreate BloomFilter index on product_id columnSuitable for equality filtering on high-cardinality columns
CREATE PIPECreate Kafka continuous ingestion pipelineBound to Bronze layer target table
COPY INTOBulk import manual sampling CSV filesLoaded from Volume, supports incremental loading
CREATE FUNCTIONCreate SQL UDFs calc_cpk and severity_scoreEncapsulate Cpk calculation and severity scoring
CREATE DYNAMIC TABLECreate Silver / Gold layer incremental computation tablesSystem automatically identifies upstream changes and refreshes incrementally
REFRESH DYNAMIC TABLEManually trigger one refreshUsed during initial build or debugging
AVG / STDDEV_SAMP ... OVERSliding window mean and standard deviationCalculate UCL/LCL process control limits

Prerequisites

All examples in this guide run under the best_practice_manufacturing_spc schema.

CREATE SCHEMA IF NOT EXISTS best_practice_manufacturing_spc;


Bronze Layer: Raw Defect Event Table

Create Table

CREATE TABLE IF NOT EXISTS best_practice_manufacturing_spc.doc_defect_events ( defect_id INT, product_id INT, defect_type STRING, defect_date DATE, defect_location STRING, severity STRING, inspection_method STRING, repair_cost DOUBLE, ingest_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP() );

ingest_time uses DEFAULT CURRENT_TIMESTAMP(), automatically populated when Kafka PIPE writes records — no need to carry this field in the message body.

Create BloomFilter Index

Both Silver and Gold layers will frequently filter by product_id, a high-cardinality column (100 product types). BloomFilter Index is appropriate here.

CREATE BLOOMFILTER INDEX idx_bf_product_id ON TABLE doc_defect_events (product_id);

Product Master Data Table

CREATE TABLE IF NOT EXISTS best_practice_manufacturing_spc.doc_product_master ( product_id INT, product_name STRING, production_line STRING, product_category STRING, spec_ucl DOUBLE, -- Upper specification limit (USL) spec_lcl DOUBLE, -- Lower specification limit (LSL) spec_target DOUBLE -- Target value );

spec_ucl / spec_lcl are product design specifications used for Cpk calculation. The UCL/LCL in production control charts are calculated from actual data (see Silver layer).

Configure Kafka PIPE (Real-time Ingestion)

Option 1: Write via actual Kafka (recommended)

In production, the MES system pushes inspection results to a Kafka topic and PIPE automatically consumes and writes to the Bronze layer. Python producer example:

from kafka import KafkaProducer import json, datetime, random producer = KafkaProducer( bootstrap_servers=['<kafka-broker>:9092'], value_serializer=lambda v: json.dumps(v).encode('utf-8') ) def send_defect_event(defect_id, product_id): message = { "defect_id": defect_id, "product_id": product_id, "defect_type": random.choice(["Structural", "Functional", "Cosmetic"]), "defect_date": datetime.date.today().isoformat(), "defect_location": random.choice(["Component", "Internal", "Surface"]), "severity": random.choice(["Minor", "Moderate", "Critical"]), "inspection_method": random.choice(["Automated Testing", "Visual Inspection", "Manual Testing"]), "repair_cost": round(random.uniform(10.0, 1000.0), 2) } producer.send('mes_defect_events', value=message) producer.flush() # Simulate sending 10 records for i in range(1001, 1011): send_defect_event(i, random.randint(1, 100)) producer.close()

Create Kafka PIPE (the DDL will attempt to connect to the broker for validation):

CREATE TABLE IF NOT EXISTS best_practice_manufacturing_spc.kafka_raw_defects (value STRING); CREATE PIPE IF NOT EXISTS best_practice_manufacturing_spc.pipe_defect_events VIRTUAL_CLUSTER = 'DEFAULT' BATCH_INTERVAL_IN_SECONDS = '60' AS COPY INTO best_practice_manufacturing_spc.kafka_raw_defects FROM ( SELECT CAST(value AS STRING) AS value FROM READ_KAFKA( '<kafka-broker>:9092', 'mes_defect_events', '', 'cz_mes_consumer', '','','','', 'raw', 'raw', 0, map() ) );

Option 2: INSERT simulation (without a Kafka environment)

If Kafka is not yet configured, save the data as a local CSV file and use cz-cli to upload it to User Volume for import via COPY INTO (recommended):

Load from local CSV (recommended)

-- Step 1: Upload the local CSV file to User Volume via SQL PUT PUT '/path/to/defect_events_data.csv' TO USER VOLUME FILE 'defect_events_data.csv';

-- Step 2: COPY INTO the table from User Volume COPY INTO best_practice_manufacturing_spc.doc_defect_events FROM USER VOLUME USING csv OPTIONS('header'='true', 'sep'=',', 'nullValue'='') FILES ('defect_events_data.csv');

You can also insert small batches of test data inline (no CSV file needed):

This guide uses the Kaggle dataset (fahmidachowdhury/manufacturing-defects, 1,000 records) loaded via bulk INSERT to verify the complete computation pipeline:

INSERT INTO best_practice_manufacturing_spc.doc_defect_events (defect_id, product_id, defect_type, defect_date, defect_location, severity, inspection_method, repair_cost) VALUES (1, 15, 'Structural', CAST('2024-06-06' AS DATE), 'Component', 'Minor', 'Visual Inspection', 245.47), (2, 6, 'Functional', CAST('2024-04-26' AS DATE), 'Component', 'Minor', 'Visual Inspection', 26.87), (3, 84, 'Structural', CAST('2024-02-15' AS DATE), 'Internal', 'Minor', 'Automated Testing', 835.81), (4, 10, 'Functional', CAST('2024-03-28' AS DATE), 'Internal', 'Critical', 'Automated Testing', 444.47) -- ... 1,000 records total ;

Verify Bronze layer row count:

SELECT COUNT(*) AS total_rows FROM best_practice_manufacturing_spc.doc_defect_events;

total_rows ---------- 1000

View data distribution overview:

SELECT defect_type, COUNT(*) AS defect_count FROM best_practice_manufacturing_spc.doc_defect_events GROUP BY defect_type ORDER BY defect_count DESC;

defect_type | defect_count ------------+------------- Structural | 352 Functional | 339 Cosmetic | 309

Manual Sampling CSV Import (Volume + COPY INTO)

Upload manual sampling spreadsheets to Volume, then bulk import via COPY INTO. Supports automatically skipping already-imported files (idempotent):

COPY INTO best_practice_manufacturing_spc.doc_defect_events (defect_id, product_id, defect_type, defect_date, defect_location, severity, inspection_method, repair_cost) FROM ( SELECT $1::INT AS defect_id, $2::INT AS product_id, $3 AS defect_type, $4::DATE AS defect_date, $5 AS defect_location, $6 AS severity, $7 AS inspection_method, $8::DOUBLE AS repair_cost FROM @best_practice_manufacturing_spc.sampling_volume/defects_data.csv ) USING csv OPTIONS('header'='true', 'sep'=',');


SQL UDFs: Cpk and Severity Scoring

Cpk Process Capability Index

Cpk (Process Capability Index) measures how well a production process is centered relative to specification requirements and its variation level. Cpk ≥ 1.33 indicates good process capability; < 1.0 means the process does not meet specification requirements.

Formula: Cpk = min((USL - μ) / (3σ), (μ - LSL) / (3σ))

CREATE OR REPLACE FUNCTION best_practice_manufacturing_spc.calc_cpk( avg_val DOUBLE, std_val DOUBLE, ucl DOUBLE, lcl DOUBLE ) RETURNS DOUBLE AS CASE WHEN std_val <= 0 THEN NULL ELSE LEAST((ucl - avg_val) / (3.0 * std_val), (avg_val - lcl) / (3.0 * std_val)) END;

Validate the function (target value centered, process standard deviation 1.2, specification range ±5):

SELECT best_practice_manufacturing_spc.calc_cpk(100.5, 1.2, 105.0, 95.0) AS cpk_sample;

cpk_sample ---------- 1.25

Severity Scoring UDF

Maps text-based severity to numeric scores for weighted risk aggregation in the Silver layer:

CREATE OR REPLACE FUNCTION best_practice_manufacturing_spc.severity_score( severity STRING ) RETURNS INT AS CASE severity WHEN 'Critical' THEN 3 WHEN 'Moderate' THEN 2 WHEN 'Minor' THEN 1 ELSE 0 END;


Silver Layer Dynamic Table: Cleansing and Dimension Enrichment

The Silver layer performs two tasks on top of Bronze raw defect events:

  1. LEFT JOIN doc_product_master to enrich each event with production line, product category, and specification limits
  2. Compute severity_score and is_critical flags to allow direct aggregation in the Gold layer

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_manufacturing_spc.silver_defect_enriched REFRESH INTERVAL 10 MINUTE VCLUSTER DEFAULT AS SELECT e.defect_id, e.product_id, e.defect_type, e.defect_date, e.defect_location, e.severity, e.inspection_method, e.repair_cost, e.ingest_time, p.product_name, p.production_line, p.product_category, p.spec_ucl, p.spec_lcl, p.spec_target, best_practice_manufacturing_spc.severity_score(e.severity) AS severity_score, CASE WHEN e.severity = 'Critical' THEN 1 ELSE 0 END AS is_critical, DATE_TRUNC('month', e.defect_date) AS defect_month FROM best_practice_manufacturing_spc.doc_defect_events e LEFT JOIN best_practice_manufacturing_spc.doc_product_master p ON e.product_id = p.product_id;

Manually trigger the first refresh:

REFRESH DYNAMIC TABLE best_practice_manufacturing_spc.silver_defect_enriched; SELECT COUNT(*) AS silver_count FROM best_practice_manufacturing_spc.silver_defect_enriched;

silver_count ------------ 1000

View distribution by production line and defect type (direct query on Silver layer):

SELECT production_line, defect_type, COUNT(*) AS defect_count, SUM(is_critical) AS critical_count, ROUND(AVG(repair_cost), 2) AS avg_repair_cost FROM best_practice_manufacturing_spc.silver_defect_enriched GROUP BY production_line, defect_type ORDER BY production_line, defect_count DESC;

production_line | defect_type | defect_count | critical_count | avg_repair_cost ----------------+-------------+--------------+----------------+---------------- Line-1 | Structural | 114 | 40 | 494.87 Line-1 | Functional | 111 | 38 | 517.15 Line-1 | Cosmetic | 94 | 35 | 497.84 Line-2 | Structural | 116 | 30 | 470.31 Line-2 | Cosmetic | 113 | 35 | 499.46 Line-2 | Functional | 104 | 35 | 521.81 Line-3 | Functional | 124 | 41 | 485.75 Line-3 | Structural | 122 | 42 | 540.60 Line-3 | Cosmetic | 102 | 37 | 544.72

Result interpretation: Line-3 Structural defects have the highest average repair cost (540.60), and the Critical ratio (42/122 = 34.4%) is also higher than the same defect type on Line-1 and Line-2 — this is the priority target for rework cost management.


SPC Control Charts: Sliding Window UCL/LCL Calculation

SPC control charts calculate process control upper and lower limits (UCL/LCL) using process mean (μ) and standard deviation (σ) to identify out-of-control points. This section uses window functions to implement a c control chart (count-type), suitable for defect count data.

Control limit formulas:

  • UCL = μ + 3σ
  • LCL = max(0, μ − 3σ) (count data lower limit cannot be negative)

WITH monthly_stats AS ( -- First aggregate by product + month to avoid nested aggregate function errors SELECT product_id, defect_month, COUNT(*) AS monthly_defects FROM best_practice_manufacturing_spc.silver_defect_enriched GROUP BY product_id, defect_month ) SELECT product_id, defect_month, monthly_defects, -- Rolling 3-month average (moving average) ROUND(AVG(monthly_defects) OVER ( PARTITION BY product_id ORDER BY defect_month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW ), 2) AS rolling_3m_avg, -- Full historical process mean (center line of control chart) ROUND(AVG(monthly_defects) OVER (PARTITION BY product_id), 2) AS process_mean, ROUND(STDDEV_SAMP(monthly_defects) OVER (PARTITION BY product_id), 2) AS process_std, -- UCL / LCL ROUND(AVG(monthly_defects) OVER (PARTITION BY product_id) + 3 * STDDEV_SAMP(monthly_defects) OVER (PARTITION BY product_id), 2) AS ucl, ROUND(GREATEST(0, AVG(monthly_defects) OVER (PARTITION BY product_id) - 3 * STDDEV_SAMP(monthly_defects) OVER (PARTITION BY product_id)), 2) AS lcl, -- Process status determination CASE WHEN monthly_defects > AVG(monthly_defects) OVER (PARTITION BY product_id) + 3 * STDDEV_SAMP(monthly_defects) OVER (PARTITION BY product_id) THEN 'OUT_OF_CONTROL' WHEN monthly_defects < GREATEST(0, AVG(monthly_defects) OVER (PARTITION BY product_id) - 3 * STDDEV_SAMP(monthly_defects) OVER (PARTITION BY product_id)) THEN 'OUT_OF_CONTROL' ELSE 'IN_CONTROL' END AS spc_status FROM monthly_stats WHERE product_id IN (10, 14, 15) ORDER BY product_id, defect_month;

product_id | defect_month | monthly_defects | rolling_3m_avg | process_mean | process_std | ucl | lcl | spc_status -----------+---------------------+-----------------+----------------+--------------+-------------+------+-----+----------- 10 | 2024-01-01T00:00:00 | 2 | 2.0 | 2.67 | 1.21 | 6.30 | 0 | IN_CONTROL 10 | 2024-02-01T00:00:00 | 2 | 2.0 | 2.67 | 1.21 | 6.30 | 0 | IN_CONTROL 10 | 2024-03-01T00:00:00 | 3 | 2.33 | 2.67 | 1.21 | 6.30 | 0 | IN_CONTROL 10 | 2024-04-01T00:00:00 | 1 | 2.0 | 2.67 | 1.21 | 6.30 | 0 | IN_CONTROL 10 | 2024-05-01T00:00:00 | 4 | 2.67 | 2.67 | 1.21 | 6.30 | 0 | IN_CONTROL 10 | 2024-06-01T00:00:00 | 4 | 3.0 | 2.67 | 1.21 | 6.30 | 0 | IN_CONTROL 14 | 2024-01-01T00:00:00 | 1 | 1.0 | 2.20 | 1.10 | 5.49 | 0 | IN_CONTROL 14 | 2024-02-01T00:00:00 | 2 | 1.50 | 2.20 | 1.10 | 5.49 | 0 | IN_CONTROL 14 | 2024-04-01T00:00:00 | 2 | 1.67 | 2.20 | 1.10 | 5.49 | 0 | IN_CONTROL 14 | 2024-05-01T00:00:00 | 2 | 2.0 | 2.20 | 1.10 | 5.49 | 0 | IN_CONTROL 14 | 2024-06-01T00:00:00 | 4 | 2.67 | 2.20 | 1.10 | 5.49 | 0 | IN_CONTROL 15 | 2024-02-01T00:00:00 | 2 | 2.0 | 2.0 | 0.71 | 4.12 | 0 | IN_CONTROL 15 | 2024-03-01T00:00:00 | 2 | 2.0 | 2.0 | 0.71 | 4.12 | 0 | IN_CONTROL 15 | 2024-04-01T00:00:00 | 1 | 1.67 | 2.0 | 0.71 | 4.12 | 0 | IN_CONTROL 15 | 2024-05-01T00:00:00 | 3 | 2.0 | 2.0 | 0.71 | 4.12 | 0 | IN_CONTROL 15 | 2024-06-01T00:00:00 | 2 | 2.0 | 2.0 | 0.71 | 4.12 | 0 | IN_CONTROL

Result interpretation: The sample products (10, 14, 15) are all IN_CONTROL during the observation period, with monthly defect counts not exceeding the UCL. Product 15 has the smallest process variation (σ = 0.71), indicating the most stable inspection process. Product 14's June defect count of 4 is approaching the UCL (5.49) — it is advisable to monitor subsequent trends.


Cpk Analysis: Process Capability by Production Line

Apply the calc_cpk UDF to repair cost data aggregated by production line (using repair cost as a substitute for physical measurement values in this demonstration):

WITH line_stats AS ( SELECT production_line, COUNT(*) AS total_defects, ROUND(AVG(repair_cost), 2) AS avg_repair_cost, ROUND(STDDEV_SAMP(repair_cost), 2) AS std_repair_cost FROM best_practice_manufacturing_spc.silver_defect_enriched GROUP BY production_line ) SELECT production_line, total_defects, avg_repair_cost, std_repair_cost, ROUND(best_practice_manufacturing_spc.calc_cpk( avg_repair_cost, std_repair_cost, 1000.0, -- Upper specification limit (maximum acceptable repair cost) 0.0 -- Lower specification limit ), 3) AS repair_cost_cpk FROM line_stats ORDER BY production_line;

production_line | total_defects | avg_repair_cost | std_repair_cost | repair_cost_cpk ----------------+---------------+-----------------+-----------------+---------------- Line-1 | 319 | 503.50 | 302.99 | 0.546 Line-2 | 333 | 496.29 | 281.61 | 0.587 Line-3 | 348 | 522.27 | 284.83 | 0.559

Result interpretation: All three production lines have repair_cost_cpk well below 1.0, indicating that the repair cost distribution has excessive variation relative to the specification range (0–1000). The main reason is the wide cost distribution range (10–1000), with standard deviation around 290. Line-2 has the highest Cpk (0.587), indicating slightly better cost concentration than the other two lines.


Gold Layer Dynamic Table: Monthly Production Line Aggregation

The Gold layer aggregates Silver layer data at production_line + defect_month granularity, outputting monthly defect trends and Critical ratios for quality management dashboards.

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_manufacturing_spc.gold_production_line_monthly PARTITIONED BY (production_line) REFRESH INTERVAL 10 MINUTE VCLUSTER DEFAULT TBLPROPERTIES ('static_partitions' = 'true') AS SELECT production_line, defect_month, COUNT(*) AS total_defects, SUM(is_critical) AS critical_defects, ROUND(SUM(is_critical)*100.0/COUNT(*), 2) AS critical_rate_pct, ROUND(SUM(repair_cost), 2) AS total_repair_cost, ROUND(AVG(repair_cost), 2) AS avg_repair_cost, COUNT(DISTINCT product_id) AS affected_products FROM best_practice_manufacturing_spc.silver_defect_enriched GROUP BY production_line, defect_month;

Manually trigger the first refresh:

REFRESH DYNAMIC TABLE best_practice_manufacturing_spc.gold_production_line_monthly;

View Line-3 monthly trends (the production line with the largest Critical ratio fluctuation):

SELECT production_line, defect_month, total_defects, critical_rate_pct, total_repair_cost FROM best_practice_manufacturing_spc.gold_production_line_monthly WHERE production_line = 'Line-3' ORDER BY defect_month;

production_line | defect_month | total_defects | critical_rate_pct | total_repair_cost ----------------+---------------------+---------------+-------------------+----------------- Line-3 | 2024-01-01T00:00:00 | 86 | 38.37 | 43501.08 Line-3 | 2024-02-01T00:00:00 | 49 | 44.90 | 24149.23 Line-3 | 2024-03-01T00:00:00 | 60 | 25.00 | 31843.13 Line-3 | 2024-04-01T00:00:00 | 45 | 48.89 | 23162.77 Line-3 | 2024-05-01T00:00:00 | 61 | 26.23 | 32361.05 Line-3 | 2024-06-01T00:00:00 | 47 | 25.53 | 26731.13

Result interpretation: Line-3 had the highest defect count in January (86), but the Critical ratio (38.37%) was not the highest for the period — April reached 48.89% with a relatively low total defect count (45). This "low volume but high severity" pattern may indicate a raw material batch issue, requiring further investigation by product_id dimension.


Gold Layer Dynamic Table: Defect Pareto Analysis

Pareto analysis applies the "80/20 rule" to identify the small number of root cause categories responsible for the majority of defect costs.

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_manufacturing_spc.gold_defect_pareto REFRESH INTERVAL 10 MINUTE VCLUSTER DEFAULT AS SELECT defect_type, severity, defect_location, COUNT(*) AS defect_count, ROUND(SUM(repair_cost), 2) AS total_repair_cost, ROUND(COUNT(*)*100.0 / SUM(COUNT(*)) OVER (), 2) AS defect_pct, SUM(COUNT(*)) OVER ( ORDER BY COUNT(*) DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS cumulative_count, ROUND(SUM(COUNT(*)) OVER ( ORDER BY COUNT(*) DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS cumulative_pct FROM best_practice_manufacturing_spc.silver_defect_enriched GROUP BY defect_type, severity, defect_location ORDER BY defect_count DESC;

Manually trigger the first refresh:

REFRESH DYNAMIC TABLE best_practice_manufacturing_spc.gold_defect_pareto;

View Top 10 defect categories (sorted by count descending, with cumulative percentage):

SELECT defect_type, severity, defect_location, defect_count, total_repair_cost, defect_pct, cumulative_pct FROM best_practice_manufacturing_spc.gold_defect_pareto ORDER BY defect_count DESC LIMIT 10;

defect_type | severity | defect_location | defect_count | total_repair_cost | defect_pct | cumulative_pct ------------+----------+-----------------+--------------+-------------------+------------+--------------- Structural | Minor | Surface | 51 | 25935.06 | 5.10 | 5.10 Structural | Critical | Surface | 46 | 21754.09 | 4.60 | 9.70 Structural | Minor | Internal | 44 | 25146.07 | 4.40 | 22.90 Structural | Minor | Component | 44 | 22656.89 | 4.40 | 14.10 Functional | Critical | Internal | 44 | 22739.99 | 4.40 | 18.50 Functional | Moderate | Component | 41 | 21378.39 | 4.10 | 27.00 Cosmetic | Moderate | Surface | 40 | 18387.07 | 4.00 | 31.00 Functional | Critical | Component | 39 | 23047.16 | 3.90 | 38.80 Cosmetic | Minor | Surface | 39 | 19181.07 | 3.90 | 34.90 Functional | Minor | Surface | 38 | 18227.96 | 3.80 | 42.60

Result interpretation: The Top 10 categories account for 42.60% of total defects. Surface Structural defects have the highest count, but if sorted by repair cost, Functional × Critical × Component (39 records, total cost 23,047) has the highest per-unit cost (591 per unit). Recommended Pareto remediation priority: first address Surface Structural Minor defects (highest count), then address Component Functional Critical defects (highest per-unit cost).

View total comparison of three main defect types:

SELECT defect_type, SUM(defect_count) AS total, ROUND(SUM(total_repair_cost), 2) AS total_cost FROM best_practice_manufacturing_spc.gold_defect_pareto GROUP BY defect_type ORDER BY total DESC;

defect_type | total | total_cost ------------+-------+----------- Structural | 352 | 176923.85 Functional | 339 | 171905.58 Cosmetic | 309 | 158797.72


PPM Defect Rate Calculation

PPM (Parts Per Million) measures defects per million units and is the standardized quality metric in SPC and Six Sigma systems:

SELECT defect_type, COUNT(*) AS defect_count, ROUND(COUNT(*) * 1000000.0 / 1000, 0) AS ppm_rate FROM best_practice_manufacturing_spc.doc_defect_events GROUP BY defect_type ORDER BY ppm_rate DESC;

defect_type | defect_count | ppm_rate ------------+--------------+--------- Structural | 352 | 352000 Functional | 339 | 339000 Cosmetic | 309 | 309000

Result interpretation: PPM for all three defect types is above 300,000, corresponding to a sigma level of approximately 2σ (Six Sigma target is 3.4 PPM, i.e., 6σ). This is a characteristic of the dataset — 1,000 out of 1,000 items have defects. In production use, the denominator should be the total number of inspected units (including conforming units), typically sourced from MES production completion records, requiring an additional table join.


Studio Refresh Task Configuration

Periodic refresh of Dynamic Tables is managed via Lakehouse Studio Tasks, to which monitoring alerts and data quality rules can be added on the same task.

Use cz-cli task commands to create refresh tasks (equivalent to operating in the Studio UI):

# 1. Create task folder cz-cli task create-folder "manufacturing_spc" --parent 186117 -p skill_test # Returns: {"data":187106} <- record folder id # 2. Create Silver layer refresh task cz-cli task create "refresh_silver_defect_enriched" \ --type SQL --folder 187106 -p skill_test # Returns: {"data":{"id":10354655, ...}} # 3. Set refresh SQL content cz-cli task save-content "refresh_silver_defect_enriched" \ --content "REFRESH DYNAMIC TABLE best_practice_manufacturing_spc.silver_defect_enriched;" \ -p skill_test # 4. Configure Cron schedule (every 10 minutes) cz-cli task save-cron "refresh_silver_defect_enriched" \ --cron "*/10 * * * *" -p skill_test # 5. Create Gold layer refresh task (same as above) cz-cli task create "refresh_gold_production_line_monthly" \ --type SQL --folder 187106 -p skill_test cz-cli task save-content "refresh_gold_production_line_monthly" \ --content "REFRESH DYNAMIC TABLE best_practice_manufacturing_spc.gold_production_line_monthly;" \ -p skill_test cz-cli task save-cron "refresh_gold_production_line_monthly" \ --cron "*/10 * * * *" -p skill_test

After tasks are created, you can see the two refresh tasks under best_practices/manufacturing_spc/ in the Studio UI. Click a task → Alert Configuration to bind "refresh failure notification" or "data row count is 0 alert" rules. Once ready, publish the tasks:

cz-cli task deploy "refresh_silver_defect_enriched" -p skill_test cz-cli task deploy "refresh_gold_production_line_monthly" -p skill_test


Data Warehouse Object Summary

SHOW TABLES IN best_practice_manufacturing_spc;

schema_name | table_name | is_dynamic -----------------------------------+---------------------------------+----------- best_practice_manufacturing_spc | doc_defect_events | false best_practice_manufacturing_spc | doc_product_master | false best_practice_manufacturing_spc | silver_defect_enriched | true best_practice_manufacturing_spc | gold_production_line_monthly | true best_practice_manufacturing_spc | gold_defect_pareto | true

Data flow overview:

MES System (Kafka) Manual Sampling (CSV) | | v Kafka PIPE (60s batch) v COPY INTO (Volume) kafka_raw_defects doc_defect_events (Bronze) | BloomFilter Index (product_id) doc_product_master ----+ LEFT JOIN (production_line | UCL / LCL / spec_target) | v Studio Task: refresh every 10 min silver_defect_enriched (Dynamic Table) severity_score UDF · is_critical · defect_month | | +-----------------+ +------------------+ v Studio Task: refresh every 10 min v gold_production_line_monthly (DT) gold_defect_pareto (DT) PARTITIONED BY production_line cumulative_pct (Pareto 80%) static_partitions = true Window Function ORDER BY | | v v Quality Dashboard Root Cause Analysis Cpk · UCL/LCL 80/20 Defect Focus


Notes

  • Window functions do not support nested aggregation: Writing STDDEV_SAMP(COUNT(*)) OVER (...) (nesting an aggregate function inside a window function) will produce an "aggregate function cannot contain another aggregate function" error. The correct approach is to first complete the GROUP BY aggregation in a CTE, then apply window functions to the result columns in the outer query.

  • Partitioned Dynamic Tables must declare static_partitions: A DT with PARTITIONED BY must set TBLPROPERTIES ('static_partitions' = 'true'). Without this declaration, the system uses dynamic partition inference, which may cause old partition data to be overwritten or lost during incremental refresh.

  • BloomFilter Index does not auto-apply to existing data: CREATE BLOOMFILTER INDEX only takes effect for data written after creation and does not support BUILD INDEX to cover existing data (BLOOMFILTER type does not have this capability; rebuilding the table and re-inserting is required to cover existing data).

  • Dynamic Table refresh scheduling managed via Lakehouse Studio Tasks: Do not rely on REFRESH INTERVAL in the DDL for production scheduling. Configure Cron in Studio Tasks instead, which allows binding alert rules and data quality checks to the same task for unified observability.

  • Denominator selection in PPM calculation: The PPM demonstration in this guide uses 1,000 inspection records as the denominator, for illustration only. In production, the PPM denominator should be the total number of inspected units (including conforming units), typically sourced from MES production completion data and requiring an additional table.

  • Handling std_val = 0 in Cpk: The calc_cpk UDF returns NULL when std_val <= 0 to avoid division by zero. This occurs with a sample size of 1. Before Gold layer aggregation, filter out groups with insufficient sample size (HAVING COUNT(*) > 1).