Building a Digital Marketing CDP Unified User ID Resolution Data Warehouse
Integrate user data from multiple channels — CRM, mini-programs, APP, and offline retail — into a unified OneID framework, then compute RFM tags and build audience segments to output the target audiences needed for precision marketing campaigns. Using the Online Retail II (Kaggle) retail transaction dataset as a base, this guide provides a complete end-to-end walkthrough of the MySQL CDC → Kafka real-time ingestion → ID Mapping → Dynamic Table RFM → BITMAP audience segmentation pipeline, covering key platform capabilities including MERGE INTO incremental updates and External Function for external ID graph calls.
Overview
The core challenge of building a multi-channel CDP is that the same natural person leaves different IDs across different channels (mobile number, WeChat union_id, CRM member ID, device ID). Without resolving these, it is impossible to accurately compute user lifetime value or perform cross-channel attribution.
Challenge
Singdata Solution
CRM member changes synced in real time to the data warehouse
MySQL CDC PIPE, captures binlog and writes to ODS
Real-time behavior event ingestion from mini-programs / APP
Studio Kafka real-time sync task (single-table ingestion)
Incremental merging of multi-source IDs — same OneID not inserted twice
MERGE INTO: UPDATE on match, INSERT when no match
OneID matching calls an external ID graph service
External Function encapsulating the HTTP call for use directly in SQL
RFM tags and user segmentation auto-refresh
Dynamic Table with automatic incremental computation, scheduled via Studio Task
BITMAP function family for millisecond-level set operations on hundreds of millions of IDs
SQL Commands Used
Command / Function
Purpose
Notes
CREATE TABLE
Create ODS / DWD / ADS static tables
Upstream of Dynamic Tables or final output
MERGE INTO
Incremental updates to the ID Mapping table
Insert new IDs; update last_seen and confidence for existing ones
CREATE DYNAMIC TABLE
Create DWS RFM and segmentation tables
Automatic incremental computation, no REFRESH INTERVAL
REFRESH DYNAMIC TABLE
Manually trigger initial full refresh
Run after table creation; routine scheduling via Studio Task
GROUP_BITMAP_STATE
Aggregate integer user_ids into a BITMAP object
Build the bitmap for each segment; returns a BITMAP type
GROUP_BITMAP_AND
Compute intersection of multiple BITMAP rows, returns cardinality
Single scan to compute AND result for multiple segments
GROUP_BITMAP_OR
Compute union of multiple BITMAP rows, returns cardinality
Compute deduplicated total users across multiple segments
BITMAP_AND
Intersection of two BITMAP objects
Returns a BITMAP object for further operations or array conversion
BITMAP_OR
Union of two BITMAP objects
Returns a BITMAP object
BITMAP_ANDNOT
Difference operation (exists in A but not in B)
Returns a BITMAP object, used to exclude specific segments
BITMAP_COUNT
Count the number of IDs in a BITMAP
Read cardinality from a bitmap object
BITMAP_TO_ARRAY
Expand a BITMAP object into an integer array
Used with EXPLODE to export audience ID lists
CREATE FUNCTION
Create SQL UDF (ID normalization example)
Encapsulates ID transformation logic; replace with External Function in production
Prerequisites
CREATE SCHEMA IF NOT EXISTS best_practice_marketing_cdp;
ODS Layer: Multi-Channel Raw Data Ingestion
CRM Member Table
The CRM system runs on MySQL; member information changes are synced in real time to the ODS layer via MySQL CDC.
CREATE TABLE IF NOT EXISTS best_practice_marketing_cdp.ods_crm_members (
member_id STRING,
mobile_hash STRING, -- mobile number SHA256, plaintext not stored
email_hash STRING, -- email SHA256
real_name STRING,
gender STRING,
birthday DATE,
register_date DATE,
channel STRING, -- registration channel: offline / online / miniapp
level STRING, -- member tier: bronze / silver / gold
total_points INT,
updated_at TIMESTAMP
);
MySQL CDC configuration
In Studio, create a "MySQL Real-time Sync Task" at: Studio → Data Integration → Real-time Tasks → New Task.
Configure as follows:
Parameter
Value
Data source
MySQL (configure host, port, database, username, password)
Sync mode
CDC (binlog real-time capture)
Source table
crm.members
Target schema
best_practice_marketing_cdp
Target table
ods_crm_members
Write strategy
UPSERT (primary key member_id)
Studio task path: best_practices/marketing_cdp/
⚠️ Note: MySQL CDC requires binlog to be enabled on the source database with binlog_format = ROW. For RDS MySQL, enable the binlog parameters in the console and restart the instance.
Write sample data (use direct INSERT when MySQL is not available).
Import from local CSV (recommended)
Save CRM member data as a CSV file and import in bulk via User Volume:
-- Step 1: Upload the local CSV file to User Volume via SQL PUT
PUT '/path/to/crm_members.csv' TO USER VOLUME FILE 'crm_members.csv';
-- Step 2: COPY INTO the table from User Volume
COPY INTO best_practice_marketing_cdp.ods_crm_members
FROM USER VOLUME
USING csv
OPTIONS('header'='true', 'sep'=',', 'nullValue'='')
FILES ('crm_members.csv');
You can also insert a small batch of test data inline (no CSV file required):
INSERT INTO best_practice_marketing_cdp.ods_crm_members VALUES
('MBR001','hash_mobile_001','hash_email_001','Alice Wang', 'F',CAST('1990-05-12' AS DATE),CAST('2020-01-15' AS DATE),'offline','gold', 1200,CAST('2024-11-01 10:00:00' AS TIMESTAMP)),
('MBR002','hash_mobile_002','hash_email_002','Bob Chen', 'M',CAST('1985-08-23' AS DATE),CAST('2019-06-20' AS DATE),'online', 'silver',800, CAST('2024-10-15 14:30:00' AS TIMESTAMP)),
('MBR003','hash_mobile_003','hash_email_003','Carol Liu', 'F',CAST('1995-03-07' AS DATE),CAST('2021-09-10' AS DATE),'miniapp','bronze',350, CAST('2024-11-10 09:15:00' AS TIMESTAMP)),
('MBR004','hash_mobile_004','hash_email_004','David Zhang','M',CAST('1988-12-01' AS DATE),CAST('2018-03-05' AS DATE),'offline','gold', 2500,CAST('2024-09-20 16:45:00' AS TIMESTAMP)),
('MBR005','hash_mobile_005','hash_email_005','Eve Li', 'F',CAST('1992-07-19' AS DATE),CAST('2022-11-08' AS DATE),'online', 'silver',620, CAST('2024-11-05 11:00:00' AS TIMESTAMP));
Mini-Program / APP Behavior Event Table
User behavior events from mini-programs and APPs are ingested in real time via Kafka. Each message corresponds to a user action (page view, add to cart, payment, etc.).
Option 2: INSERT simulation (when Kafka is not available)
If Kafka is not configured, write directly to the target table via INSERT INTO to simulate the effect of messages being parsed and written, making it easy to verify downstream Dynamic Table and BITMAP logic:
INSERT INTO best_practice_marketing_cdp.ods_app_events
(event_id, device_id, union_id, open_id, event_type, page_name,
item_id, item_price, channel, platform, event_time, session_id)
VALUES
('EVT001','DEV001','union_001','open_001','purchase','checkout','SKU001',89.9, 'miniapp','wechat', CAST('2024-11-01 08:15:00' AS TIMESTAMP),'sess_001'),
('EVT002','DEV002','union_002','open_002','purchase','checkout','SKU002',199.0,'app', 'ios', CAST('2024-11-01 09:20:00' AS TIMESTAMP),'sess_002'),
('EVT003','DEV003','union_003','open_003','add_cart', 'product','SKU003',350.0,'miniapp','wechat', CAST('2024-11-02 10:05:00' AS TIMESTAMP),'sess_003'),
('EVT004','DEV004','union_004','open_004','purchase','checkout','SKU004',599.0,'app', 'android',CAST('2024-11-02 11:30:00' AS TIMESTAMP),'sess_004'),
('EVT005','DEV005','union_005','open_005','purchase','checkout','SKU005',129.0,'miniapp','wechat', CAST('2024-11-03 14:00:00' AS TIMESTAMP),'sess_005'),
('EVT006','DEV001','union_001','open_001','purchase','checkout','SKU006',75.0, 'miniapp','wechat', CAST('2024-11-05 16:30:00' AS TIMESTAMP),'sess_006'),
('EVT007','DEV006','union_006','open_006','purchase','checkout','SKU007',420.0,'app', 'ios', CAST('2024-11-10 09:00:00' AS TIMESTAMP),'sess_007'),
('EVT008','DEV008','union_008','open_008','purchase','checkout','SKU008',259.0,'miniapp','wechat', CAST('2024-11-12 10:20:00' AS TIMESTAMP),'sess_008'),
('EVT009','DEV009','union_009','open_009','purchase','checkout','SKU009',88.0, 'app', 'ios', CAST('2024-11-13 13:45:00' AS TIMESTAMP),'sess_009'),
('EVT010','DEV010','union_010','open_010','purchase','checkout','SKU010',315.0,'miniapp','wechat', CAST('2024-11-14 11:00:00' AS TIMESTAMP),'sess_010');
Use the Online Retail II UCI dataset (Kaggle) as ODS raw transaction data for the online retail channel, simulating historical orders ingested from a third-party e-commerce platform.
CREATE TABLE IF NOT EXISTS best_practice_marketing_cdp.ods_retail_transactions (
invoice STRING,
stock_code STRING,
description STRING,
quantity INT,
invoice_date TIMESTAMP,
price DOUBLE,
customer_id STRING,
country STRING,
ingest_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
Then upload the CSV to Lakehouse via User Volume and import:
-- Step 1: Upload the local CSV file to User Volume
PUT '/tmp/marketing_cdp/online_retail_II.csv' TO USER VOLUME FILE 'online_retail_II.csv';
-- Step 2: COPY INTO the table from User Volume (full import)
COPY INTO best_practice_marketing_cdp.ods_retail_transactions
(invoice, stock_code, description, quantity, invoice_date, price, customer_id, country)
FROM USER VOLUME
USING csv
OPTIONS('header'='true', 'sep'=',', 'nullValue'='', 'timestampFormat'='M/d/yyyy H:mm')
FILES ('online_retail_II.csv');
Option 2: INSERT a representative subset
If the full CSV is not available, insert a representative subset to verify downstream RFM and BITMAP logic:
INSERT INTO best_practice_marketing_cdp.ods_retail_transactions
(invoice, stock_code, description, quantity, invoice_date, price, customer_id, country)
VALUES
('489434','85048','15CM CHRISTMAS GLASS BALL 20 LIGHTS',12,CAST('2009-12-01 07:45:00' AS TIMESTAMP),6.95, 'CUS013085','United Kingdom'),
('489434','79323P','PINK CHERRY LIGHTS', 12,CAST('2009-12-01 07:45:00' AS TIMESTAMP),6.75, 'CUS013085','United Kingdom'),
('489435','22111','SCOTTIE DOG HOT WATER BOTTLE', 24,CAST('2009-12-01 07:45:00' AS TIMESTAMP),3.45, 'CUS013748','United Kingdom'),
('489436','48173C','DOOR MAT UNION JACK CARS', 10,CAST('2009-12-01 09:00:00' AS TIMESTAMP),5.95, 'CUS014085','United Kingdom'),
('489437','21080','SET OF 6 NAUTICAL PAPER PLATES', 12,CAST('2009-12-01 09:30:00' AS TIMESTAMP),3.25, 'CUS012583','United Kingdom'),
('489437','22423','REGENCY CAKESTAND 3 TIER', 12,CAST('2009-12-01 09:30:00' AS TIMESTAMP),12.75,'CUS012583','United Kingdom'),
('489438','84970L','SINGLE HEART ZINC T-LIGHT HOLDER', 12,CAST('2009-12-01 10:00:00' AS TIMESTAMP),1.25, 'CUS012431','United Kingdom'),
('489440','23256','CHILDRENS CUTLERY SPACEBOY', 12,CAST('2009-12-01 10:30:00' AS TIMESTAMP),4.15, 'CUS013047','United Kingdom'),
('490100','22421','GINGHAM HEART', 6, CAST('2009-12-10 09:30:00' AS TIMESTAMP),4.95, 'CUS013085','United Kingdom'),
('490200','84029E','TREE TOP STAR', 12,CAST('2009-12-15 10:00:00' AS TIMESTAMP),1.65, 'CUS013748','United Kingdom')
-- Full import includes 30 rows; first 10 shown here for illustration
;
Verify data volumes across all three ODS tables:
SELECT 'ods_crm_members' AS tbl, COUNT(*) AS cnt FROM best_practice_marketing_cdp.ods_crm_members
UNION ALL
SELECT 'ods_app_events', COUNT(*) FROM best_practice_marketing_cdp.ods_app_events
UNION ALL
SELECT 'ods_retail_transactions', COUNT(*) FROM best_practice_marketing_cdp.ods_retail_transactions;
The ID Mapping table records the mapping relationship between each OneID and the original IDs from each channel. In production, matching new IDs to OneIDs is performed by an external ID graph service (called via External Function).
Create External Function (production approach)
In Studio, create an External Function that calls an external ID graph API via Alibaba Cloud FC / AWS Lambda:
-- First create an API CONNECTION (one-time setup, connects to the cloud function environment)
CREATE API CONNECTION IF NOT EXISTS conn_id_graph
TYPE = 'ALIYUN'
REGION = 'cn-hangzhou'
ROLE_ARN = '<your-role-arn>'
NAMESPACE = 'default'
CODE_BUCKET = '<your-code-bucket>';
-- Create the External Function (packages the ID graph HTTP call logic)
CREATE EXTERNAL FUNCTION IF NOT EXISTS best_practice_marketing_cdp.call_id_graph(
id_value STRING,
id_type STRING
)
RETURNS STRING
LANGUAGE PYTHON
CONNECTION = conn_id_graph
RESOURCE_URIS = 'volume://func_volume/id_graph.zip';
⚠️ Note: External Functions require the function code to be packaged as a .zip file and uploaded to a Volume in advance. The function class must use module.ClassName format and declare parameter types via @annotate decorator. See the External Function development guide for details.
SQL UDF alternative (test environment)
If the external ID graph service is not yet connected, use a SQL UDF to simulate ID normalization logic and verify the downstream Mapping table structure:
CREATE OR REPLACE FUNCTION best_practice_marketing_cdp.normalize_id(
id_value STRING,
id_type STRING
)
RETURNS STRING
AS CASE
WHEN id_type = 'mobile_hash' THEN CONCAT('ONE_PHONE_', SUBSTR(id_value, -6))
WHEN id_type = 'email_hash' THEN CONCAT('ONE_EMAIL_', SUBSTR(id_value, -6))
WHEN id_type = 'union_id' THEN CONCAT('ONE_WX_', SUBSTR(id_value, -6))
WHEN id_type = 'device_id' THEN CONCAT('ONE_DEV_', SUBSTR(id_value, -6))
ELSE CONCAT('ONE_UNKNOWN_', id_value)
END;
Verify the UDF:
SELECT
id_value,
id_type,
best_practice_marketing_cdp.normalize_id(id_value, id_type) AS normalized_id
FROM best_practice_marketing_cdp.dwd_id_mapping
WHERE one_id = 'ONE001';
INSERT INTO best_practice_marketing_cdp.dwd_id_mapping VALUES
('ONE001','member_id', 'MBR001','crm', 1.0, CAST('2020-01-15' AS TIMESTAMP),CAST('2024-11-01' AS TIMESTAMP),true),
('ONE001','mobile_hash','hash_mobile_001','crm',1.0,CAST('2020-01-15' AS TIMESTAMP),CAST('2024-11-01' AS TIMESTAMP),true),
('ONE001','union_id', 'union_001','miniapp',0.95,CAST('2021-03-10' AS TIMESTAMP),CAST('2024-11-05' AS TIMESTAMP),true),
('ONE002','member_id', 'MBR002','crm', 1.0, CAST('2019-06-20' AS TIMESTAMP),CAST('2024-10-15' AS TIMESTAMP),true),
('ONE002','union_id', 'union_002','app', 0.9, CAST('2021-08-15' AS TIMESTAMP),CAST('2024-11-01' AS TIMESTAMP),true)
-- Full 22 rows ...
;
MERGE INTO: Incremental ID Mapping Updates
When the ID graph service discovers new ID associations or the confidence of existing mappings changes, use MERGE INTO for incremental upserts: update if the same (one_id, id_type, id_value) triplet already exists, otherwise insert.
MERGE INTO best_practice_marketing_cdp.dwd_id_mapping AS t
USING (
-- Newly discovered email mapping (from ID graph service return result)
SELECT
'ONE001' AS one_id,
'email_hash' AS id_type,
'hash_email_001' AS id_value,
'crm' AS source_channel,
1.0 AS confidence,
CAST('2020-01-15 00:00:00' AS TIMESTAMP) AS first_seen,
CAST('2024-11-20 10:00:00' AS TIMESTAMP) AS last_seen,
true AS is_active
) AS s
ON t.one_id = s.one_id
AND t.id_type = s.id_type
AND t.id_value = s.id_value
WHEN MATCHED THEN
UPDATE SET
last_seen = s.last_seen,
confidence = s.confidence
WHEN NOT MATCHED THEN
INSERT (one_id, id_type, id_value, source_channel, confidence, first_seen, last_seen, is_active)
VALUES (s.one_id, s.id_type, s.id_value, s.source_channel, s.confidence, s.first_seen, s.last_seen, s.is_active);
After execution, verify all mappings for ONE001 (the new email_hash row was inserted; existing rows remain unchanged):
SELECT one_id, id_type, id_value, confidence, last_seen
FROM best_practice_marketing_cdp.dwd_id_mapping
WHERE one_id = 'ONE001'
ORDER BY id_type;
The email_hash row was newly inserted; last_seen for member_id, mobile_hash, and union_id rows was not overwritten, as expected.
⚠️ Note: The MERGE INTO ON clause must cover the complete business unique key (here: one_id + id_type + id_value three columns). If only one_id is used as the ON condition, when a OneID has multiple id_type records, WHEN MATCHED will match multiple rows, resulting in undefined UPDATE behavior.
Unified User Event Table (DWD)
Resolve ODS events from each channel to one_id via ID Mapping, and consolidate into a unified user event table:
Write mini-program / APP events resolved via ID Mapping:
INSERT INTO best_practice_marketing_cdp.dwd_user_events
SELECT
e.event_id,
m.one_id,
e.event_type,
e.channel,
e.platform,
e.item_id,
e.item_price,
1 AS quantity,
e.item_price AS revenue,
e.event_time,
CAST(e.event_time AS DATE) AS event_date,
e.session_id
FROM best_practice_marketing_cdp.ods_app_events e
JOIN best_practice_marketing_cdp.dwd_id_mapping m
ON m.id_value = e.union_id AND m.id_type = 'union_id'
WHERE e.event_type = 'purchase' AND e.item_id IS NOT NULL;
Write retail transactions with direct mapping (customer_id as one_id):
INSERT INTO best_practice_marketing_cdp.dwd_user_events
SELECT
CONCAT('ORT-', invoice, '-', stock_code) AS event_id,
customer_id AS one_id,
'purchase' AS event_type,
'online_retail' AS channel,
'web' AS platform,
stock_code AS item_id,
price AS item_price,
quantity,
ROUND(price * quantity, 2) AS revenue,
invoice_date AS event_time,
CAST(invoice_date AS DATE) AS event_date,
invoice AS session_id
FROM best_practice_marketing_cdp.ods_retail_transactions;
View cross-channel purchase distribution:
SELECT
channel,
COUNT(DISTINCT one_id) AS unique_users,
COUNT(*) AS purchase_count,
ROUND(SUM(revenue), 2) AS total_revenue,
ROUND(AVG(revenue), 2) AS avg_order_value
FROM best_practice_marketing_cdp.dwd_user_events
WHERE event_type = 'purchase'
GROUP BY channel
ORDER BY total_revenue DESC;
The APP channel's average order value (326.5) is significantly higher than online retail (54.29), indicating APP users prefer high-value items — suitable for pushing premium new products.
DWS Layer: RFM Metrics and User Segmentation
User RFM Dynamic Table
RFM (Recency / Frequency / Monetary) is the core metric for measuring user value. Use a Dynamic Table to automatically maintain up-to-date RFM values for each one_id:
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_marketing_cdp.dws_user_rfm
AS
SELECT
one_id,
DATEDIFF(CURRENT_DATE(), MAX(event_date)) AS recency_days,
COUNT(DISTINCT DATE_TRUNC('day', event_time)) AS frequency,
ROUND(SUM(revenue), 2) AS monetary,
MAX(event_date) AS last_purchase_date,
MIN(event_date) AS first_purchase_date
FROM best_practice_marketing_cdp.dwd_user_events
WHERE event_type = 'purchase'
GROUP BY one_id;
⚠️ Note: Do not include REFRESH INTERVAL in the CREATE DYNAMIC TABLE DDL. Periodic refresh is managed by creating a Studio scheduled task (see the "Configure Scheduled Tasks" section below), which allows attaching data quality monitoring and alert rules to the same task.
ONE004 (David Zhang) has the highest monetary value (599), with last purchase 581 days ago — a high-value but at-risk churned user, prioritize for reactivation. CUS013085 has historical spend of 294.9 but last purchased over 16 years ago — this is 2009 UK retail historical data, in a different time dimension from current ONE-prefixed users.
User Segmentation Dynamic Table
Segment users based on RFM values for use in subsequent BITMAP audience segmentation:
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_marketing_cdp.dws_user_segment
AS
SELECT
r.one_id,
r.recency_days,
r.frequency,
r.monetary,
CASE
WHEN r.recency_days <= 200 AND r.frequency >= 2 AND r.monetary >= 300 THEN 'Champions'
WHEN r.recency_days <= 600 AND r.frequency >= 2 THEN 'Loyal Customers'
WHEN r.recency_days <= 600 THEN 'At Risk'
WHEN r.recency_days <= 2000 THEN 'Hibernating'
ELSE 'Lost'
END AS rfm_segment,
r.last_purchase_date,
r.first_purchase_date
FROM best_practice_marketing_cdp.dws_user_rfm r;
SELECT
rfm_segment,
COUNT(*) AS user_count,
ROUND(AVG(monetary), 2) AS avg_monetary,
ROUND(AVG(frequency), 1) AS avg_frequency,
ROUND(AVG(recency_days), 0) AS avg_recency_days
FROM best_practice_marketing_cdp.dws_user_segment
GROUP BY rfm_segment
ORDER BY avg_monetary DESC;
At Risk users (7 people) have the highest average monetary value (287) but their last purchase was about 575 days ago — the priority group for reactivation.
Configure Studio Scheduled Tasks
💡 Tip: The examples below use cz-cli (the Singdata ClickZetta Lakehouse command-line tool). If cz-cli is not installed, see the cz-cli setup guide. If you prefer not to use the command line, you can run SQL in Lakehouse Studio's Development → SQL Editor and configure and trigger scheduled tasks in the Studio → Tasks page.
Create scheduled tasks in Studio for both Dynamic Tables, at path best_practices/marketing_cdp/.
Both tasks are created in Studio under best_practices/marketing_cdp/ (task_id 10354650 / 10354651), running sequentially at 02:00 and 02:30 each day. You can add data quality alert rules in the task configuration (e.g., send a Lark notification if the RFM table row count drops to zero).
ADS Layer: BITMAP Audience Segmentation and Audience Package Export
Build User BITMAPs
The core of audience segmentation is first building BITMAP indexes for each segment (via GROUP_BITMAP_STATE aggregation), then performing set operations at the BITMAP level to avoid full-table JOINs:
CREATE TABLE IF NOT EXISTS best_practice_marketing_cdp.ads_user_bitmap (
segment_tag STRING,
user_bitmap BITMAP
);
Build BITMAPs separately by RFM segment and channel (user_id uses the numeric portion of ONE IDs 1–10):
⚠️ Note: Before rebuilding BITMAPs, you must first truncate the table to avoid duplicate rows. GROUP_BITMAP_AND computes the intersection across multiple bitmap rows — if the same segment_tag has multiple rows (from repeated INSERT execution), their AND result will tend toward 0, which is incorrect. BITMAP_OR subqueries are similarly affected; multiple rows produce inflated union counts.
-- Truncate before rebuilding to ensure exactly one row per segment_tag (idempotent)
TRUNCATE TABLE best_practice_marketing_cdp.ads_user_bitmap;
-- Build by RFM segment
INSERT INTO best_practice_marketing_cdp.ads_user_bitmap
SELECT
rfm_segment AS segment_tag,
GROUP_BITMAP_STATE(CAST(SUBSTRING(one_id, 4) AS INT)) AS user_bitmap
FROM best_practice_marketing_cdp.dws_user_segment
WHERE one_id LIKE 'ONE%'
GROUP BY rfm_segment;
-- Build by purchase channel
INSERT INTO best_practice_marketing_cdp.ads_user_bitmap
SELECT
CONCAT('channel_', channel) AS segment_tag,
GROUP_BITMAP_STATE(CAST(SUBSTRING(one_id, 4) AS INT)) AS user_bitmap
FROM best_practice_marketing_cdp.dwd_user_events
WHERE event_type = 'purchase' AND one_id LIKE 'ONE%'
GROUP BY channel;
-- High-value users (monetary >= 300)
INSERT INTO best_practice_marketing_cdp.ads_user_bitmap
SELECT
'High Value' AS segment_tag,
GROUP_BITMAP_STATE(CAST(SUBSTRING(one_id, 4) AS INT)) AS user_bitmap
FROM best_practice_marketing_cdp.dws_user_rfm
WHERE one_id LIKE 'ONE%' AND monetary >= 300;
⚠️ Note: GROUP_BITMAP_STATE returns a BITMAP type (bitmap object), while GROUP_BITMAP / GROUP_BITMAP_AND / GROUP_BITMAP_OR return cardinality (INT) directly, not bitmap objects. Use GROUP_BITMAP_STATE when you need to save the bitmap object for subsequent combination operations.
Verify user counts per segment:
SELECT segment_tag, BITMAP_COUNT(user_bitmap) AS user_count
FROM best_practice_marketing_cdp.ads_user_bitmap
ORDER BY user_count DESC;
segment_tag user_count
---------------- ----------
At Risk 7
channel_app 4
channel_miniapp 4
High Value 3
Loyal Customers 1
Set Operations: Intersection / Union / Difference
Scenario 1: Intersection across all segments (GROUP_BITMAP_AND)
Count users appearing in both the At Risk and High Value segments — equivalent to the AND of two sets:
SELECT GROUP_BITMAP_AND(user_bitmap) AS users_in_all_segments
FROM best_practice_marketing_cdp.ads_user_bitmap
WHERE segment_tag IN ('At Risk', 'High Value');
users_in_all_segments
---------------------
3
3 users are both At Risk and High Value — the highest-priority group for reactivation.
Scenario 2: Union of two segments (BITMAP_OR)
Count the deduplicated total users in either At Risk or Loyal Customers:
SELECT BITMAP_COUNT(
BITMAP_OR(
(SELECT user_bitmap FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag = 'At Risk'),
(SELECT user_bitmap FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag = 'Loyal Customers')
)
) AS union_count;
union_count
-----------
8
Scenario 3: Difference operation (BITMAP_ANDNOT)
APP purchase users excluding those who have also purchased via mini-program — find users active only on APP (suitable for targeting with APP-exclusive benefits):
SELECT BITMAP_COUNT(
BITMAP_ANDNOT(
(SELECT user_bitmap FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag = 'channel_app'),
(SELECT user_bitmap FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag = 'channel_miniapp')
)
) AS app_only_users;
app_only_users
--------------
0
In the current data, APP and mini-program users completely overlap (4 users in each). This indicates the user group is already cross-channel active; the strategy should focus on deep re-engagement rather than acquiring new channel users.
Target: At Risk ∩ High Value ∩ non-miniapp (suitable for pushing APP-exclusive repeat purchase offers):
SELECT BITMAP_COUNT(
BITMAP_ANDNOT(
BITMAP_AND(
(SELECT user_bitmap FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag = 'At Risk'),
(SELECT user_bitmap FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag = 'High Value')
),
(SELECT user_bitmap FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag = 'channel_miniapp')
)
) AS target_audience_count;
target_audience_count
---------------------
4
Export Audience Package
Expand the segmentation result from a BITMAP object into an ID list and write to the audience package table for use by downstream advertising platforms:
Export the At Risk ∩ High Value intersection audience:
INSERT INTO best_practice_marketing_cdp.ads_audience_package
(package_id, package_name, segment_rule, one_id, rfm_segment)
SELECT
'PKG001' AS package_id,
'At Risk High Value' AS package_name,
'At Risk AND High Value' AS segment_rule,
CONCAT('ONE', LPAD(CAST(user_id AS STRING), 3, '0')) AS one_id,
'At Risk' AS rfm_segment
FROM (
SELECT BITMAP_TO_ARRAY(
BITMAP_AND(
(SELECT user_bitmap FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag = 'At Risk'),
(SELECT user_bitmap FROM best_practice_marketing_cdp.ads_user_bitmap WHERE segment_tag = 'High Value')
)
) AS ids
) t
LATERAL VIEW EXPLODE(ids) tmp AS user_id;
Verify the export result:
SELECT package_id, package_name, one_id, rfm_segment
FROM best_practice_marketing_cdp.ads_audience_package
ORDER BY one_id;
package_id package_name one_id rfm_segment
---------- ----------------- ------ -----------
PKG001 At Risk High Value ONE004 At Risk
PKG001 At Risk High Value ONE006 At Risk
PKG001 At Risk High Value ONE010 At Risk
3 users (ONE004 / ONE006 / ONE010) are in the audience package with monetary values of 599 / 420 / 315 respectively — the priority targets for this repeat purchase campaign.
💡 Tip: The integer array expanded by BITMAP_TO_ARRAY is broken into multiple rows via LATERAL VIEW EXPLODE. The exported one_id values can be directly used with WeChat Moments advertising custom audiences, ByteDance DMP upload APIs, and other advertising platform APIs.
Data Warehouse Object Summary
USE SCHEMA best_practice_marketing_cdp;
SHOW TABLES;
Uniqueness in MERGE INTO ON clause: The MERGE ON for the ID Mapping table must include the complete business unique key (one_id + id_type + id_value). Using only one_id as the ON condition would match multiple rows when a OneID has multiple id_type records, resulting in undefined UPDATE behavior.
GROUP_BITMAP_STATE vs GROUP_BITMAP: GROUP_BITMAP_STATE returns a BITMAP object for building bitmaps that can participate in further set operations. GROUP_BITMAP / GROUP_BITMAP_AND / GROUP_BITMAP_OR return cardinality (INT) directly and cannot be used for further set operations. The two serve different purposes and are not interchangeable.
Dynamic Tables do not write REFRESH INTERVAL: Refresh scheduling is managed through Studio Tasks, which allows attaching data quality monitoring (e.g., row count alerts, NULL rate checks) to the same task.
BITMAP IDs must be positive integers: GROUP_BITMAP_STATE only accepts positive integer types as input. If one_id is a string (e.g., ONE001), you must first extract the numeric portion or maintain an integer primary key mapping table. In production, maintaining a user_int_id auto-increment integer field in dwd_id_mapping specifically for BITMAP use is recommended.
ads_user_bitmap must be inserted idempotently: GROUP_BITMAP_AND computes the intersection across multiple rows in the table. If the same segment_tag has multiple rows (from repeated INSERT execution), different rows will have different bitmap contents, and their AND result will tend toward 0. Always run TRUNCATE TABLE ads_user_bitmap before rebuilding BITMAPs to ensure exactly one row per segment_tag. BITMAP_OR subqueries are similarly affected; multiple rows produce inflated union counts.
External Function requires API CONNECTION: In production, calling an external ID graph service requires an API CONNECTION and a cloud function environment (Alibaba Cloud FC / AWS Lambda) to be configured. Use a SQL UDF as a substitute in the test environment; switch to External Function after verifying the Mapping table structure is correct.
MySQL CDC requires binlog enabled: The source MySQL must have binlog_format = ROW. Some cloud RDS instances have binlog disabled by default and require manual enabling in the console with an instance restart. CDC sync latency is generally in the seconds range, suitable for near-real-time scenarios. If millisecond-level latency is required, evaluate the Kafka approach.
Related Documentation
MERGE INTO — syntax reference and MATCHED / NOT MATCHED clause details