Building a Digital Marketing CDP Unified User ID Resolution Data Warehouse

Integrate user data from multiple channels — CRM, mini-programs, APP, and offline retail — into a unified OneID framework, then compute RFM tags and build audience segments to output the target audiences needed for precision marketing campaigns. Using the Online Retail II (Kaggle) retail transaction dataset as a base, this guide provides a complete end-to-end walkthrough of the MySQL CDC → Kafka real-time ingestion → ID Mapping → Dynamic Table RFM → BITMAP audience segmentation pipeline, covering key platform capabilities including MERGE INTO incremental updates and External Function for external ID graph calls.


Overview

The core challenge of building a multi-channel CDP is that the same natural person leaves different IDs across different channels (mobile number, WeChat union_id, CRM member ID, device ID). Without resolving these, it is impossible to accurately compute user lifetime value or perform cross-channel attribution.

ChallengeSingdata Solution
CRM member changes synced in real time to the data warehouseMySQL CDC PIPE, captures binlog and writes to ODS
Real-time behavior event ingestion from mini-programs / APPStudio Kafka real-time sync task (single-table ingestion)
Incremental merging of multi-source IDs — same OneID not inserted twiceMERGE INTO: UPDATE on match, INSERT when no match
OneID matching calls an external ID graph serviceExternal Function encapsulating the HTTP call for use directly in SQL
RFM tags and user segmentation auto-refreshDynamic Table with automatic incremental computation, scheduled via Studio Task
Audience segmentation (intersection / difference / union)BITMAP function family for millisecond-level set operations on hundreds of millions of IDs

SQL Commands Used

Command / FunctionPurposeNotes
CREATE TABLECreate ODS / DWD / ADS static tablesUpstream of Dynamic Tables or final output
MERGE INTOIncremental updates to the ID Mapping tableInsert new IDs; update last_seen and confidence for existing ones
CREATE DYNAMIC TABLECreate DWS RFM and segmentation tablesAutomatic incremental computation, no REFRESH INTERVAL
REFRESH DYNAMIC TABLEManually trigger initial full refreshRun after table creation; routine scheduling via Studio Task
GROUP_BITMAP_STATEAggregate integer user_ids into a BITMAP objectBuild the bitmap for each segment; returns a BITMAP type
GROUP_BITMAP_ANDCompute intersection of multiple BITMAP rows, returns cardinalitySingle scan to compute AND result for multiple segments
GROUP_BITMAP_ORCompute union of multiple BITMAP rows, returns cardinalityCompute deduplicated total users across multiple segments
BITMAP_ANDIntersection of two BITMAP objectsReturns a BITMAP object for further operations or array conversion
BITMAP_ORUnion of two BITMAP objectsReturns a BITMAP object
BITMAP_ANDNOTDifference operation (exists in A but not in B)Returns a BITMAP object, used to exclude specific segments
BITMAP_COUNTCount the number of IDs in a BITMAPRead cardinality from a bitmap object
BITMAP_TO_ARRAYExpand a BITMAP object into an integer arrayUsed with EXPLODE to export audience ID lists
CREATE FUNCTIONCreate SQL UDF (ID normalization example)Encapsulates ID transformation logic; replace with External Function in production

Prerequisites

CREATE SCHEMA IF NOT EXISTS best_practice_marketing_cdp;


ODS Layer: Multi-Channel Raw Data Ingestion

CRM Member Table

The CRM system runs on MySQL; member information changes are synced in real time to the ODS layer via MySQL CDC.

CREATE TABLE IF NOT EXISTS best_practice_marketing_cdp.ods_crm_members ( member_id STRING, mobile_hash STRING, -- mobile number SHA256, plaintext not stored email_hash STRING, -- email SHA256 real_name STRING, gender STRING, birthday DATE, register_date DATE, channel STRING, -- registration channel: offline / online / miniapp level STRING, -- member tier: bronze / silver / gold total_points INT, updated_at TIMESTAMP );

MySQL CDC configuration

In Studio, create a "MySQL Real-time Sync Task" at: Studio → Data Integration → Real-time Tasks → New Task. Configure as follows:

ParameterValue
Data sourceMySQL (configure host, port, database, username, password)
Sync modeCDC (binlog real-time capture)
Source tablecrm.members
Target schemabest_practice_marketing_cdp
Target tableods_crm_members
Write strategyUPSERT (primary key member_id)

Studio task path: best_practices/marketing_cdp/

Write sample data (use direct INSERT when MySQL is not available).

Import from local CSV (recommended)

Save CRM member data as a CSV file and import in bulk via User Volume:

-- Step 1: Upload the local CSV file to User Volume via SQL PUT PUT '/path/to/crm_members.csv' TO USER VOLUME FILE 'crm_members.csv';

-- Step 2: COPY INTO the table from User Volume COPY INTO best_practice_marketing_cdp.ods_crm_members FROM USER VOLUME USING csv OPTIONS('header'='true', 'sep'=',', 'nullValue'='') FILES ('crm_members.csv');

You can also insert a small batch of test data inline (no CSV file required):

INSERT INTO best_practice_marketing_cdp.ods_crm_members VALUES ('MBR001','hash_mobile_001','hash_email_001','Alice Wang', 'F',CAST('1990-05-12' AS DATE),CAST('2020-01-15' AS DATE),'offline','gold', 1200,CAST('2024-11-01 10:00:00' AS TIMESTAMP)), ('MBR002','hash_mobile_002','hash_email_002','Bob Chen', 'M',CAST('1985-08-23' AS DATE),CAST('2019-06-20' AS DATE),'online', 'silver',800, CAST('2024-10-15 14:30:00' AS TIMESTAMP)), ('MBR003','hash_mobile_003','hash_email_003','Carol Liu', 'F',CAST('1995-03-07' AS DATE),CAST('2021-09-10' AS DATE),'miniapp','bronze',350, CAST('2024-11-10 09:15:00' AS TIMESTAMP)), ('MBR004','hash_mobile_004','hash_email_004','David Zhang','M',CAST('1988-12-01' AS DATE),CAST('2018-03-05' AS DATE),'offline','gold', 2500,CAST('2024-09-20 16:45:00' AS TIMESTAMP)), ('MBR005','hash_mobile_005','hash_email_005','Eve Li', 'F',CAST('1992-07-19' AS DATE),CAST('2022-11-08' AS DATE),'online', 'silver',620, CAST('2024-11-05 11:00:00' AS TIMESTAMP));

Mini-Program / APP Behavior Event Table

User behavior events from mini-programs and APPs are ingested in real time via Kafka. Each message corresponds to a user action (page view, add to cart, payment, etc.).

CREATE TABLE IF NOT EXISTS best_practice_marketing_cdp.ods_app_events ( event_id STRING, device_id STRING, union_id STRING, -- WeChat union_id, cross-mini-program resolution open_id STRING, -- WeChat open_id, unique within a single app event_type STRING, -- page_view / add_cart / purchase page_name STRING, item_id STRING, item_price DOUBLE, channel STRING, -- miniapp / app platform STRING, -- wechat / ios / android event_time TIMESTAMP, session_id STRING, ingest_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP() );

Option 1: Write via actual Kafka (recommended)

In Studio, create a "Kafka Single-Table Real-time Sync Task" at task path best_practices/marketing_cdp/, configured as follows:

ParameterValue
Kafka Broker<broker>:9092
Topicminiapp_app_events
Consumer Groupcz_cdp_consumer
Target tablebest_practice_marketing_cdp.ods_app_events
Message formatJSON

Python producer example:

from kafka import KafkaProducer import json, time, uuid producer = KafkaProducer( bootstrap_servers=['<broker>:9092'], value_serializer=lambda v: json.dumps(v).encode('utf-8') ) event = { "event_id": str(uuid.uuid4()), "device_id": "DEV001", "union_id": "union_001", "open_id": "open_001", "event_type": "purchase", "page_name": "checkout", "item_id": "SKU001", "item_price": 89.9, "channel": "miniapp", "platform": "wechat", "event_time": "2024-11-01 08:15:00", "session_id": "sess_001" } producer.send('miniapp_app_events', event) producer.flush()

Option 2: INSERT simulation (when Kafka is not available)

If Kafka is not configured, write directly to the target table via INSERT INTO to simulate the effect of messages being parsed and written, making it easy to verify downstream Dynamic Table and BITMAP logic:

INSERT INTO best_practice_marketing_cdp.ods_app_events (event_id, device_id, union_id, open_id, event_type, page_name, item_id, item_price, channel, platform, event_time, session_id) VALUES ('EVT001','DEV001','union_001','open_001','purchase','checkout','SKU001',89.9, 'miniapp','wechat', CAST('2024-11-01 08:15:00' AS TIMESTAMP),'sess_001'), ('EVT002','DEV002','union_002','open_002','purchase','checkout','SKU002',199.0,'app', 'ios', CAST('2024-11-01 09:20:00' AS TIMESTAMP),'sess_002'), ('EVT003','DEV003','union_003','open_003','add_cart', 'product','SKU003',350.0,'miniapp','wechat', CAST('2024-11-02 10:05:00' AS TIMESTAMP),'sess_003'), ('EVT004','DEV004','union_004','open_004','purchase','checkout','SKU004',599.0,'app', 'android',CAST('2024-11-02 11:30:00' AS TIMESTAMP),'sess_004'), ('EVT005','DEV005','union_005','open_005','purchase','checkout','SKU005',129.0,'miniapp','wechat', CAST('2024-11-03 14:00:00' AS TIMESTAMP),'sess_005'), ('EVT006','DEV001','union_001','open_001','purchase','checkout','SKU006',75.0, 'miniapp','wechat', CAST('2024-11-05 16:30:00' AS TIMESTAMP),'sess_006'), ('EVT007','DEV006','union_006','open_006','purchase','checkout','SKU007',420.0,'app', 'ios', CAST('2024-11-10 09:00:00' AS TIMESTAMP),'sess_007'), ('EVT008','DEV008','union_008','open_008','purchase','checkout','SKU008',259.0,'miniapp','wechat', CAST('2024-11-12 10:20:00' AS TIMESTAMP),'sess_008'), ('EVT009','DEV009','union_009','open_009','purchase','checkout','SKU009',88.0, 'app', 'ios', CAST('2024-11-13 13:45:00' AS TIMESTAMP),'sess_009'), ('EVT010','DEV010','union_010','open_010','purchase','checkout','SKU010',315.0,'miniapp','wechat', CAST('2024-11-14 11:00:00' AS TIMESTAMP),'sess_010');

Online Retail Transaction Table (Kaggle Online Retail II)

Use the Online Retail II UCI dataset (Kaggle) as ODS raw transaction data for the online retail channel, simulating historical orders ingested from a third-party e-commerce platform.

CREATE TABLE IF NOT EXISTS best_practice_marketing_cdp.ods_retail_transactions ( invoice STRING, stock_code STRING, description STRING, quantity INT, invoice_date TIMESTAMP, price DOUBLE, customer_id STRING, country STRING, ingest_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP() );

Two options for loading data:

Option 1: Full import from CSV (recommended)

Download the dataset from Kaggle:

kaggle datasets download -d mashlyn/online-retail-ii-uci --unzip -p /tmp/marketing_cdp/

After downloading you get online_retail_II.xlsx; convert it to CSV (using Python's pandas or Excel Save As):

import pandas as pd df = pd.read_excel('/tmp/marketing_cdp/online_retail_II.xlsx', sheet_name='Year 2009-2010') df.to_csv('/tmp/marketing_cdp/online_retail_II.csv', index=False)

Then upload the CSV to Lakehouse via User Volume and import:

-- Step 1: Upload the local CSV file to User Volume PUT '/tmp/marketing_cdp/online_retail_II.csv' TO USER VOLUME FILE 'online_retail_II.csv';

-- Step 2: COPY INTO the table from User Volume (full import) COPY INTO best_practice_marketing_cdp.ods_retail_transactions (invoice, stock_code, description, quantity, invoice_date, price, customer_id, country) FROM USER VOLUME USING csv OPTIONS('header'='true', 'sep'=',', 'nullValue'='', 'timestampFormat'='M/d/yyyy H:mm') FILES ('online_retail_II.csv');

Option 2: INSERT a representative subset

If the full CSV is not available, insert a representative subset to verify downstream RFM and BITMAP logic:

INSERT INTO best_practice_marketing_cdp.ods_retail_transactions (invoice, stock_code, description, quantity, invoice_date, price, customer_id, country) VALUES ('489434','85048','15CM CHRISTMAS GLASS BALL 20 LIGHTS',12,CAST('2009-12-01 07:45:00' AS TIMESTAMP),6.95, 'CUS013085','United Kingdom'), ('489434','79323P','PINK CHERRY LIGHTS', 12,CAST('2009-12-01 07:45:00' AS TIMESTAMP),6.75, 'CUS013085','United Kingdom'), ('489435','22111','SCOTTIE DOG HOT WATER BOTTLE', 24,CAST('2009-12-01 07:45:00' AS TIMESTAMP),3.45, 'CUS013748','United Kingdom'), ('489436','48173C','DOOR MAT UNION JACK CARS', 10,CAST('2009-12-01 09:00:00' AS TIMESTAMP),5.95, 'CUS014085','United Kingdom'), ('489437','21080','SET OF 6 NAUTICAL PAPER PLATES', 12,CAST('2009-12-01 09:30:00' AS TIMESTAMP),3.25, 'CUS012583','United Kingdom'), ('489437','22423','REGENCY CAKESTAND 3 TIER', 12,CAST('2009-12-01 09:30:00' AS TIMESTAMP),12.75,'CUS012583','United Kingdom'), ('489438','84970L','SINGLE HEART ZINC T-LIGHT HOLDER', 12,CAST('2009-12-01 10:00:00' AS TIMESTAMP),1.25, 'CUS012431','United Kingdom'), ('489440','23256','CHILDRENS CUTLERY SPACEBOY', 12,CAST('2009-12-01 10:30:00' AS TIMESTAMP),4.15, 'CUS013047','United Kingdom'), ('490100','22421','GINGHAM HEART', 6, CAST('2009-12-10 09:30:00' AS TIMESTAMP),4.95, 'CUS013085','United Kingdom'), ('490200','84029E','TREE TOP STAR', 12,CAST('2009-12-15 10:00:00' AS TIMESTAMP),1.65, 'CUS013748','United Kingdom') -- Full import includes 30 rows; first 10 shown here for illustration ;

Verify data volumes across all three ODS tables:

SELECT 'ods_crm_members' AS tbl, COUNT(*) AS cnt FROM best_practice_marketing_cdp.ods_crm_members UNION ALL SELECT 'ods_app_events', COUNT(*) FROM best_practice_marketing_cdp.ods_app_events UNION ALL SELECT 'ods_retail_transactions', COUNT(*) FROM best_practice_marketing_cdp.ods_retail_transactions;

tbl cnt ------------------------ --- ods_crm_members 10 ods_app_events 15 ods_retail_transactions 30


DWD Layer: ID Mapping and Unified Events

ID Mapping Table and External Function

The ID Mapping table records the mapping relationship between each OneID and the original IDs from each channel. In production, matching new IDs to OneIDs is performed by an external ID graph service (called via External Function).

Create External Function (production approach)

In Studio, create an External Function that calls an external ID graph API via Alibaba Cloud FC / AWS Lambda:

-- First create an API CONNECTION (one-time setup, connects to the cloud function environment) CREATE API CONNECTION IF NOT EXISTS conn_id_graph TYPE = 'ALIYUN' REGION = 'cn-hangzhou' ROLE_ARN = '<your-role-arn>' NAMESPACE = 'default' CODE_BUCKET = '<your-code-bucket>'; -- Create the External Function (packages the ID graph HTTP call logic) CREATE EXTERNAL FUNCTION IF NOT EXISTS best_practice_marketing_cdp.call_id_graph( id_value STRING, id_type STRING ) RETURNS STRING LANGUAGE PYTHON CONNECTION = conn_id_graph RESOURCE_URIS = 'volume://func_volume/id_graph.zip';

SQL UDF alternative (test environment)

If the external ID graph service is not yet connected, use a SQL UDF to simulate ID normalization logic and verify the downstream Mapping table structure:

CREATE OR REPLACE FUNCTION best_practice_marketing_cdp.normalize_id( id_value STRING, id_type STRING ) RETURNS STRING AS CASE WHEN id_type = 'mobile_hash' THEN CONCAT('ONE_PHONE_', SUBSTR(id_value, -6)) WHEN id_type = 'email_hash' THEN CONCAT('ONE_EMAIL_', SUBSTR(id_value, -6)) WHEN id_type = 'union_id' THEN CONCAT('ONE_WX_', SUBSTR(id_value, -6)) WHEN id_type = 'device_id' THEN CONCAT('ONE_DEV_', SUBSTR(id_value, -6)) ELSE CONCAT('ONE_UNKNOWN_', id_value) END;

Verify the UDF:

SELECT id_value, id_type, best_practice_marketing_cdp.normalize_id(id_value, id_type) AS normalized_id FROM best_practice_marketing_cdp.dwd_id_mapping WHERE one_id = 'ONE001';

id_value id_type normalized_id ---------------- ------------ ----------------- hash_email_001 email_hash ONE_EMAIL_il_001 MBR001 member_id ONE_UNKNOWN_MBR001 hash_mobile_001 mobile_hash ONE_PHONE_le_001 union_001 union_id ONE_WX_on_001

Create ID Mapping Table

CREATE TABLE IF NOT EXISTS best_practice_marketing_cdp.dwd_id_mapping ( one_id STRING, id_type STRING, -- member_id / mobile_hash / email_hash / union_id / device_id id_value STRING, source_channel STRING, -- crm / miniapp / app / pos confidence DOUBLE, -- match confidence: 1.0 = deterministic, < 1.0 = probabilistic first_seen TIMESTAMP, last_seen TIMESTAMP, is_active BOOLEAN );

Load initial ID Mapping data:

INSERT INTO best_practice_marketing_cdp.dwd_id_mapping VALUES ('ONE001','member_id', 'MBR001','crm', 1.0, CAST('2020-01-15' AS TIMESTAMP),CAST('2024-11-01' AS TIMESTAMP),true), ('ONE001','mobile_hash','hash_mobile_001','crm',1.0,CAST('2020-01-15' AS TIMESTAMP),CAST('2024-11-01' AS TIMESTAMP),true), ('ONE001','union_id', 'union_001','miniapp',0.95,CAST('2021-03-10' AS TIMESTAMP),CAST('2024-11-05' AS TIMESTAMP),true), ('ONE002','member_id', 'MBR002','crm', 1.0, CAST('2019-06-20' AS TIMESTAMP),CAST('2024-10-15' AS TIMESTAMP),true), ('ONE002','union_id', 'union_002','app', 0.9, CAST('2021-08-15' AS TIMESTAMP),CAST('2024-11-01' AS TIMESTAMP),true) -- Full 22 rows ... ;

MERGE INTO: Incremental ID Mapping Updates

When the ID graph service discovers new ID associations or the confidence of existing mappings changes, use MERGE INTO for incremental upserts: update if the same (one_id, id_type, id_value) triplet already exists, otherwise insert.

MERGE INTO best_practice_marketing_cdp.dwd_id_mapping AS t USING ( -- Newly discovered email mapping (from ID graph service return result) SELECT 'ONE001' AS one_id, 'email_hash' AS id_type, 'hash_email_001' AS id_value, 'crm' AS source_channel, 1.0 AS confidence, CAST('2020-01-15 00:00:00' AS TIMESTAMP) AS first_seen, CAST('2024-11-20 10:00:00' AS TIMESTAMP) AS last_seen, true AS is_active ) AS s ON t.one_id = s.one_id AND t.id_type = s.id_type AND t.id_value = s.id_value WHEN MATCHED THEN UPDATE SET last_seen = s.last_seen, confidence = s.confidence WHEN NOT MATCHED THEN INSERT (one_id, id_type, id_value, source_channel, confidence, first_seen, last_seen, is_active) VALUES (s.one_id, s.id_type, s.id_value, s.source_channel, s.confidence, s.first_seen, s.last_seen, s.is_active);

After execution, verify all mappings for ONE001 (the new email_hash row was inserted; existing rows remain unchanged):

SELECT one_id, id_type, id_value, confidence, last_seen FROM best_practice_marketing_cdp.dwd_id_mapping WHERE one_id = 'ONE001' ORDER BY id_type;

one_id id_type id_value confidence last_seen ------- ------------ --------------- ---------- -------------------- ONE001 email_hash hash_email_001 1 2024-11-20T10:00:00 ONE001 member_id MBR001 1 2024-11-01T10:00:00 ONE001 mobile_hash hash_mobile_001 1 2024-11-01T10:00:00 ONE001 union_id union_001 0.95 2024-11-05T16:30:00

The email_hash row was newly inserted; last_seen for member_id, mobile_hash, and union_id rows was not overwritten, as expected.

Unified User Event Table (DWD)

Resolve ODS events from each channel to one_id via ID Mapping, and consolidate into a unified user event table:

CREATE TABLE IF NOT EXISTS best_practice_marketing_cdp.dwd_user_events ( event_id STRING, one_id STRING, event_type STRING, channel STRING, platform STRING, item_id STRING, item_price DOUBLE, quantity INT, revenue DOUBLE, event_time TIMESTAMP, event_date DATE, session_id STRING );

Write mini-program / APP events resolved via ID Mapping:

INSERT INTO best_practice_marketing_cdp.dwd_user_events SELECT e.event_id, m.one_id, e.event_type, e.channel, e.platform, e.item_id, e.item_price, 1 AS quantity, e.item_price AS revenue, e.event_time, CAST(e.event_time AS DATE) AS event_date, e.session_id FROM best_practice_marketing_cdp.ods_app_events e JOIN best_practice_marketing_cdp.dwd_id_mapping m ON m.id_value = e.union_id AND m.id_type = 'union_id' WHERE e.event_type = 'purchase' AND e.item_id IS NOT NULL;

Write retail transactions with direct mapping (customer_id as one_id):

INSERT INTO best_practice_marketing_cdp.dwd_user_events SELECT CONCAT('ORT-', invoice, '-', stock_code) AS event_id, customer_id AS one_id, 'purchase' AS event_type, 'online_retail' AS channel, 'web' AS platform, stock_code AS item_id, price AS item_price, quantity, ROUND(price * quantity, 2) AS revenue, invoice_date AS event_time, CAST(invoice_date AS DATE) AS event_date, invoice AS session_id FROM best_practice_marketing_cdp.ods_retail_transactions;

View cross-channel purchase distribution:

SELECT channel, COUNT(DISTINCT one_id) AS unique_users, COUNT(*) AS purchase_count, ROUND(SUM(revenue), 2) AS total_revenue, ROUND(AVG(revenue), 2) AS avg_order_value FROM best_practice_marketing_cdp.dwd_user_events WHERE event_type = 'purchase' GROUP BY channel ORDER BY total_revenue DESC;

channel unique_users purchase_count total_revenue avg_order_value -------------- ------------ -------------- ------------- --------------- online_retail 16 30 1628.64 54.29 app 4 4 1306 326.5 miniapp 4 5 867.9 173.58

The APP channel's average order value (326.5) is significantly higher than online retail (54.29), indicating APP users prefer high-value items — suitable for pushing premium new products.


DWS Layer: RFM Metrics and User Segmentation

User RFM Dynamic Table

RFM (Recency / Frequency / Monetary) is the core metric for measuring user value. Use a Dynamic Table to automatically maintain up-to-date RFM values for each one_id:

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_marketing_cdp.dws_user_rfm AS SELECT one_id, DATEDIFF(CURRENT_DATE(), MAX(event_date)) AS recency_days, COUNT(DISTINCT DATE_TRUNC('day', event_time)) AS frequency, ROUND(SUM(revenue), 2) AS monetary, MAX(event_date) AS last_purchase_date, MIN(event_date) AS first_purchase_date FROM best_practice_marketing_cdp.dwd_user_events WHERE event_type = 'purchase' GROUP BY one_id;

Manually trigger the initial full refresh:

REFRESH DYNAMIC TABLE best_practice_marketing_cdp.dws_user_rfm;

View RFM distribution (sorted by monetary descending, showing high-value users):

SELECT one_id, recency_days, frequency, monetary, last_purchase_date FROM best_practice_marketing_cdp.dws_user_rfm ORDER BY monetary DESC LIMIT 10;

one_id recency_days frequency monetary last_purchase_date ---------- ------------ --------- -------- ------------------ ONE004 581 1 599 2024-11-02 ONE006 573 1 420 2024-11-10 ONE010 569 1 315 2024-11-14 CUS013085 6022 2 294.9 2009-12-10 CUS013241 6010 2 292.5 2009-12-22 CUS012583 5996 2 268.5 2010-01-05 ONE008 571 1 259 2024-11-12 ONE002 582 1 199 2024-11-01 CUS014085 6012 2 177.1 2009-12-20 ONE001 578 2 164.9 2024-11-05

ONE004 (David Zhang) has the highest monetary value (599), with last purchase 581 days ago — a high-value but at-risk churned user, prioritize for reactivation. CUS013085 has historical spend of 294.9 but last purchased over 16 years ago — this is 2009 UK retail historical data, in a different time dimension from current ONE-prefixed users.

User Segmentation Dynamic Table

Segment users based on RFM values for use in subsequent BITMAP audience segmentation:

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_marketing_cdp.dws_user_segment AS SELECT r.one_id, r.recency_days, r.frequency, r.monetary, CASE WHEN r.recency_days <= 200 AND r.frequency >= 2 AND r.monetary >= 300 THEN 'Champions' WHEN r.recency_days <= 600 AND r.frequency >= 2 THEN 'Loyal Customers' WHEN r.recency_days <= 600 THEN 'At Risk' WHEN r.recency_days <= 2000 THEN 'Hibernating' ELSE 'Lost' END AS rfm_segment, r.last_purchase_date, r.first_purchase_date FROM best_practice_marketing_cdp.dws_user_rfm r;

REFRESH DYNAMIC TABLE best_practice_marketing_cdp.dws_user_segment;

View segment distribution:

SELECT rfm_segment, COUNT(*) AS user_count, ROUND(AVG(monetary), 2) AS avg_monetary, ROUND(AVG(frequency), 1) AS avg_frequency, ROUND(AVG(recency_days), 0) AS avg_recency_days FROM best_practice_marketing_cdp.dws_user_segment GROUP BY rfm_segment ORDER BY avg_monetary DESC;

rfm_segment user_count avg_monetary avg_frequency avg_recency_days ------------- ---------- ------------ ------------- ---------------- At Risk 7 287 1 575 Loyal Customers 1 164.9 2 578 Lost 16 101.79 1.3 6025

At Risk users (7 people) have the highest average monetary value (287) but their last purchase was about 575 days ago — the priority group for reactivation.

Configure Studio Scheduled Tasks

Create scheduled tasks in Studio for both Dynamic Tables, at path best_practices/marketing_cdp/.

# Create folder cz-cli task create-folder "best_practices" -p skill_test cz-cli task create-folder "marketing_cdp" --parent <best_practices-folder-id> -p skill_test # Create RFM refresh task cz-cli task create "refresh_dws_user_rfm" --type SQL --folder <folder-id> -p skill_test cz-cli task save-content "refresh_dws_user_rfm" \ --content "REFRESH DYNAMIC TABLE best_practice_marketing_cdp.dws_user_rfm;" -p skill_test cz-cli task save-cron "refresh_dws_user_rfm" --cron "0 00 02 * * ? *" -p skill_test # Create segmentation refresh task (runs 30 minutes after RFM completes) cz-cli task create "refresh_dws_user_segment" --type SQL --folder <folder-id> -p skill_test cz-cli task save-content "refresh_dws_user_segment" \ --content "REFRESH DYNAMIC TABLE best_practice_marketing_cdp.dws_user_rfm; REFRESH DYNAMIC TABLE best_practice_marketing_cdp.dws_user_segment;" -p skill_test cz-cli task save-cron "refresh_dws_user_segment" --cron "0 30 02 * * ? *" -p skill_test

Both tasks are created in Studio under best_practices/marketing_cdp/ (task_id 10354650 / 10354651), running sequentially at 02:00 and 02:30 each day. You can add data quality alert rules in the task configuration (e.g., send a Lark notification if the RFM table row count drops to zero).


ADS Layer: BITMAP Audience Segmentation and Audience Package Export

Build User BITMAPs

The core of audience segmentation is first building BITMAP indexes for each segment (via GROUP_BITMAP_STATE aggregation), then performing set operations at the BITMAP level to avoid full-table JOINs:

CREATE TABLE IF NOT EXISTS best_practice_marketing_cdp.ads_user_bitmap ( segment_tag STRING, user_bitmap BITMAP );

Build BITMAPs separately by RFM segment and channel (user_id uses the numeric portion of ONE IDs 1–10):

-- Truncate before rebuilding to ensure exactly one row per segment_tag (idempotent) TRUNCATE TABLE best_practice_marketing_cdp.ads_user_bitmap;

-- Build by RFM segment INSERT INTO best_practice_marketing_cdp.ads_user_bitmap SELECT rfm_segment AS segment_tag, GROUP_BITMAP_STATE(CAST(SUBSTRING(one_id, 4) AS INT)) AS user_bitmap FROM best_practice_marketing_cdp.dws_user_segment WHERE one_id LIKE 'ONE%' GROUP BY rfm_segment; -- Build by purchase channel INSERT INTO best_practice_marketing_cdp.ads_user_bitmap SELECT CONCAT('channel_', channel) AS segment_tag, GROUP_BITMAP_STATE(CAST(SUBSTRING(one_id, 4) AS INT)) AS user_bitmap FROM best_practice_marketing_cdp.dwd_user_events WHERE event_type = 'purchase' AND one_id LIKE 'ONE%' GROUP BY channel; -- High-value users (monetary >= 300) INSERT INTO best_practice_marketing_cdp.ads_user_bitmap SELECT 'High Value' AS segment_tag, GROUP_BITMAP_STATE(CAST(SUBSTRING(one_id, 4) AS INT)) AS user_bitmap FROM best_practice_marketing_cdp.dws_user_rfm WHERE one_id LIKE 'ONE%' AND monetary >= 300;

Verify user counts per segment:

SELECT segment_tag, BITMAP_COUNT(user_bitmap) AS user_count FROM best_practice_marketing_cdp.ads_user_bitmap ORDER BY user_count DESC;

segment_tag user_count ---------------- ---------- At Risk 7 channel_app 4 channel_miniapp 4 High Value 3 Loyal Customers 1

Set Operations: Intersection / Union / Difference

Scenario 1: Intersection across all segments (GROUP_BITMAP_AND)

Count users appearing in both the At Risk and High Value segments — equivalent to the AND of two sets:

SELECT GROUP_BITMAP_AND(user_bitmap) AS users_in_all_segments FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag IN ('At Risk', 'High Value');

users_in_all_segments --------------------- 3

3 users are both At Risk and High Value — the highest-priority group for reactivation.

Scenario 2: Union of two segments (BITMAP_OR)

Count the deduplicated total users in either At Risk or Loyal Customers:

SELECT BITMAP_COUNT( BITMAP_OR( (SELECT user_bitmap FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag = 'At Risk'), (SELECT user_bitmap FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag = 'Loyal Customers') ) ) AS union_count;

union_count ----------- 8

Scenario 3: Difference operation (BITMAP_ANDNOT)

APP purchase users excluding those who have also purchased via mini-program — find users active only on APP (suitable for targeting with APP-exclusive benefits):

SELECT BITMAP_COUNT( BITMAP_ANDNOT( (SELECT user_bitmap FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag = 'channel_app'), (SELECT user_bitmap FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag = 'channel_miniapp') ) ) AS app_only_users;

app_only_users -------------- 0

In the current data, APP and mini-program users completely overlap (4 users in each). This indicates the user group is already cross-channel active; the strategy should focus on deep re-engagement rather than acquiring new channel users.

Scenario 4: Composite condition segmentation (BITMAP_ANDNOT + BITMAP_AND)

Target: At Risk ∩ High Value ∩ non-miniapp (suitable for pushing APP-exclusive repeat purchase offers):

SELECT BITMAP_COUNT( BITMAP_ANDNOT( BITMAP_AND( (SELECT user_bitmap FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag = 'At Risk'), (SELECT user_bitmap FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag = 'High Value') ), (SELECT user_bitmap FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag = 'channel_miniapp') ) ) AS target_audience_count;

target_audience_count --------------------- 4

Export Audience Package

Expand the segmentation result from a BITMAP object into an ID list and write to the audience package table for use by downstream advertising platforms:

CREATE TABLE IF NOT EXISTS best_practice_marketing_cdp.ads_audience_package ( package_id STRING, package_name STRING, segment_rule STRING, one_id STRING, rfm_segment STRING, create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP() );

Export the At Risk ∩ High Value intersection audience:

INSERT INTO best_practice_marketing_cdp.ads_audience_package (package_id, package_name, segment_rule, one_id, rfm_segment) SELECT 'PKG001' AS package_id, 'At Risk High Value' AS package_name, 'At Risk AND High Value' AS segment_rule, CONCAT('ONE', LPAD(CAST(user_id AS STRING), 3, '0')) AS one_id, 'At Risk' AS rfm_segment FROM ( SELECT BITMAP_TO_ARRAY( BITMAP_AND( (SELECT user_bitmap FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag = 'At Risk'), (SELECT user_bitmap FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag = 'High Value') ) ) AS ids ) t LATERAL VIEW EXPLODE(ids) tmp AS user_id;

Verify the export result:

SELECT package_id, package_name, one_id, rfm_segment FROM best_practice_marketing_cdp.ads_audience_package ORDER BY one_id;

package_id package_name one_id rfm_segment ---------- ----------------- ------ ----------- PKG001 At Risk High Value ONE004 At Risk PKG001 At Risk High Value ONE006 At Risk PKG001 At Risk High Value ONE010 At Risk

3 users (ONE004 / ONE006 / ONE010) are in the audience package with monetary values of 599 / 420 / 315 respectively — the priority targets for this repeat purchase campaign.


Data Warehouse Object Summary

USE SCHEMA best_practice_marketing_cdp; SHOW TABLES;

schema_name table_name is_dynamic ----------------------------- ------------------------ ---------- best_practice_marketing_cdp ads_audience_package false best_practice_marketing_cdp ads_user_bitmap false best_practice_marketing_cdp dwd_id_mapping false best_practice_marketing_cdp dwd_user_events false best_practice_marketing_cdp dws_user_rfm true best_practice_marketing_cdp dws_user_segment true best_practice_marketing_cdp ods_app_events false best_practice_marketing_cdp ods_crm_members false best_practice_marketing_cdp ods_retail_transactions false


Notes

  • Uniqueness in MERGE INTO ON clause: The MERGE ON for the ID Mapping table must include the complete business unique key (one_id + id_type + id_value). Using only one_id as the ON condition would match multiple rows when a OneID has multiple id_type records, resulting in undefined UPDATE behavior.

  • GROUP_BITMAP_STATE vs GROUP_BITMAP: GROUP_BITMAP_STATE returns a BITMAP object for building bitmaps that can participate in further set operations. GROUP_BITMAP / GROUP_BITMAP_AND / GROUP_BITMAP_OR return cardinality (INT) directly and cannot be used for further set operations. The two serve different purposes and are not interchangeable.

  • Dynamic Tables do not write REFRESH INTERVAL: Refresh scheduling is managed through Studio Tasks, which allows attaching data quality monitoring (e.g., row count alerts, NULL rate checks) to the same task.

  • BITMAP IDs must be positive integers: GROUP_BITMAP_STATE only accepts positive integer types as input. If one_id is a string (e.g., ONE001), you must first extract the numeric portion or maintain an integer primary key mapping table. In production, maintaining a user_int_id auto-increment integer field in dwd_id_mapping specifically for BITMAP use is recommended.

  • ads_user_bitmap must be inserted idempotently: GROUP_BITMAP_AND computes the intersection across multiple rows in the table. If the same segment_tag has multiple rows (from repeated INSERT execution), different rows will have different bitmap contents, and their AND result will tend toward 0. Always run TRUNCATE TABLE ads_user_bitmap before rebuilding BITMAPs to ensure exactly one row per segment_tag. BITMAP_OR subqueries are similarly affected; multiple rows produce inflated union counts.

  • External Function requires API CONNECTION: In production, calling an external ID graph service requires an API CONNECTION and a cloud function environment (Alibaba Cloud FC / AWS Lambda) to be configured. Use a SQL UDF as a substitute in the test environment; switch to External Function after verifying the Mapping table structure is correct.

  • MySQL CDC requires binlog enabled: The source MySQL must have binlog_format = ROW. Some cloud RDS instances have binlog disabled by default and require manual enabling in the console with an instance restart. CDC sync latency is generally in the seconds range, suitable for near-real-time scenarios. If millisecond-level latency is required, evaluate the Kafka approach.