Building a SaaS Multi-Tenant Operations Data Warehouse
Integrate SaaS platform feature usage events, account subscription data, and churn records into a four-layer data warehouse to output tenant health scores and churn warning signals. Using the RavenStack SaaS dataset (500 tenants, 5,000 subscription records, 5,000 usage events, 600 churn records), this guide provides a complete end-to-end walkthrough of the Kafka PIPE → ODS → DWD → DWS → ADS pipeline, covering five platform capabilities: Dynamic Table cascading refresh, SQL UDF health scoring, Column Masking, RBAC, and Semantic View.
Overview
The core data challenge for SaaS operations teams is that feature usage events come from a high-frequency Kafka stream while account and subscription data comes from a MySQL business database. Both types of data must be integrated at the tenant dimension before meaningful churn warning signals can be produced.
Singdata Lakehouse addresses several core challenges through the following combination:
Challenge
Singdata Solution
Feature usage events written in real time, large volumes per second
Kafka PIPE continuous ingestion, batch-writing to ODS raw table every 60 seconds
Account, subscription, and churn tables from MySQL, need CDC sync
Studio MySQL CDC real-time sync task, automatically captures inserts, updates, and deletes
Long dependency chain across ODS → DWD → DWS → ADS multi-layer computation
Dynamic Table cascading refresh with declarative SQL; system automatically manages dependency order
Complex health scoring logic reused across multiple layers
SQL UDF calc_tenant_health_score encapsulating the weighted formula
Sales, customer success, and analysts have different permissions on the same table
Column Masking + RBAC, controlling field visibility by role
Analytics Agent needs natural language queries on tenant data
Semantic View encapsulating business semantics for direct Agent use
SQL Commands Used
Command / Function
Purpose
Notes
CREATE TABLE
Create ODS layer raw tables
Static tables, serving as upstream for Dynamic Tables
CREATE PIPE
Create a Kafka continuous ingestion pipeline
Bound to the ODS usage raw table
CREATE FUNCTION
Create SQL UDF calc_tenant_health_score
Encapsulates the tenant health scoring weighted formula
System auto-detects upstream changes and refreshes incrementally
REFRESH DYNAMIC TABLE
Manually trigger a single refresh
Used during initial build or debugging
CREATE VIEW
Create a Semantic View
Encapsulates business semantics for Analytics Agent queries
Prerequisites
All examples in this guide run under the best_practice_saas_dw schema.
CREATE SCHEMA IF NOT EXISTS best_practice_saas_dw;
The dataset comes from the RavenStack SaaS Subscription & Churn Analytics Dataset (MIT license), containing five tables: ravenstack_accounts, ravenstack_subscriptions, ravenstack_feature_usage, ravenstack_churn_events, and ravenstack_support_tickets. This guide uses the first four.
The ODS layer retains raw data from three source types: Kafka real-time usage events, MySQL CDC account subscription data, and MySQL CDC churn events.
Feature usage Kafka receiving table
CREATE TABLE IF NOT EXISTS best_practice_saas_dw.ods_kafka_raw_usage (
value STRING,
ingest_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
ods_kafka_raw_usage receives the raw JSON strings from Kafka messages. The downstream processes parse the value field and write into ods_feature_usage.
Ingest feature usage event streams via Studio's "New Kafka Real-time Sync Task". The Kafka PIPE validates broker connectivity during the DDL phase. Replace KAFKA_BROKER and TOPIC with actual values before running:
CREATE PIPE IF NOT EXISTS best_practice_saas_dw.pipe_feature_usage
VIRTUAL_CLUSTER = 'DEFAULT'
BATCH_INTERVAL_IN_SECONDS = '60'
AS
COPY INTO best_practice_saas_dw.ods_kafka_raw_usage
FROM (
SELECT CAST(value AS STRING) AS value
FROM READ_KAFKA(
'<kafka-broker>:9092', -- replace with your actual broker address
'saas_feature_usage', -- topic name
'',
'cz_saas_consumer', -- consumer group ID
'', '', '', '',
'raw', 'raw',
0,
map()
)
);
⚠️ Note: Positional parameters 5–8 of READ_KAFKA (start/end offsets, timestamps) must be left empty in the PIPE DDL and are managed automatically by the PIPE runtime. Once created, the PIPE is in running state by default, consuming in batches every 60 seconds.
Option 1: Write via actual Kafka (recommended)
In a production Kafka environment, send messages to the topic saas_feature_usage to trigger PIPE ingestion:
Option 2: INSERT simulation (when Kafka is not available)
If Kafka is not configured, write to ods_feature_usage using the following methods.
Import from local CSV (recommended)
-- Step 1: Upload the local CSV file to User Volume via SQL PUT
PUT '/path/to/your/ods_feature_usage.csv' TO USER VOLUME FILE 'ods_feature_usage.csv';
-- Step 2: COPY INTO the table from User Volume
COPY INTO best_practice_saas_dw.ods_feature_usage
FROM USER VOLUME
USING csv
OPTIONS('header'='true', 'sep'=',', 'nullValue'='')
FILES ('ods_feature_usage.csv');
You can also insert a small batch of test data inline (no CSV file required):
INSERT INTO best_practice_saas_dw.ods_feature_usage
(usage_id, subscription_id, usage_date, feature_name,
usage_count, usage_duration_secs, error_count, is_beta_feature)
VALUES
('U-1c6c24','S-0fcf7d', CAST('2023-07-27' AS DATE),'feature_20',9,5004,0,FALSE),
('U-f07cb8','S-c25263', CAST('2023-08-07' AS DATE),'feature_5', 9,369, 0,FALSE),
('U-a3b9d1','S-8cec59', CAST('2024-03-15' AS DATE),'feature_10',12,3200,1,TRUE),
('U-c4e2f0','S-0f6f44', CAST('2024-06-20' AS DATE),'feature_26',5,2800,0,FALSE);
Verify ODS layer data volumes (using the full dataset in this guide):
SELECT
(SELECT COUNT(*) FROM best_practice_saas_dw.ods_accounts) AS accounts,
(SELECT COUNT(*) FROM best_practice_saas_dw.ods_subscriptions) AS subscriptions,
(SELECT COUNT(*) FROM best_practice_saas_dw.ods_feature_usage) AS feature_usage,
(SELECT COUNT(*) FROM best_practice_saas_dw.ods_churn_events) AS churn_events;
account_name is the customer company name and is sensitive data. The approach: administrators see the original name; other roles see only the first 3 characters plus a mask.
CREATE OR REPLACE FUNCTION best_practice_saas_dw.mask_account_name(name STRING)
RETURNS STRING
AS CASE
WHEN current_user() IN ('privileged_user') THEN name -- replace with actual authorized username
ELSE CONCAT(SUBSTR(name, 1, 3), '****')
END;
💡 Tip: Replace 'privileged_user' with the actual usernames that need to see plaintext data. Column Masking uses the current_user() function to match the current connection's username. All authorized usernames must be explicitly listed in the IN() list.
ALTER TABLE best_practice_saas_dw.ods_accounts
CHANGE COLUMN account_name
SET MASK best_practice_saas_dw.mask_account_name;
⚠️ Note: The Column Masking effect applies transparently to all queries (including Dynamic Tables). When the DWD layer JOINs ods_accounts, non-privileged users will see masked account_name values.
Verify masking behavior (administrator account sees original names):
SELECT account_id, account_name, plan_tier
FROM best_practice_saas_dw.ods_accounts
LIMIT 5;
The account, subscription, and churn tables are continuously ingested via a Studio MySQL CDC real-time sync task. In Studio, go to Data Integration → New Task → MySQL Real-time CDC and fill in:
Source: MySQL address, credentials, tables to sync (accounts, subscriptions, churn_events)
Target: corresponding tables such as best_practice_saas_dw.ods_accounts
Task path: best_practices/saas_dw/
Once the CDC task starts, it performs a full sync of existing data first, then enters incremental capture mode. INSERTs, UPDATEs, and DELETEs are written to the ODS layer in real time.
DWD Layer: Tenant-Feature Usage Wide Table
The DWD layer JOINs ods_feature_usage (usage details), ods_subscriptions (subscriptions), and ods_accounts (accounts) into a single wide table. Downstream DWS and ADS layers aggregate directly from this table without repeated JOINs.
Create Table
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_saas_dw.dwd_tenant_feature_usage
AS
SELECT
fu.usage_id,
fu.subscription_id,
s.account_id,
a.account_name,
a.industry,
a.country,
a.plan_tier,
a.seats,
a.churn_flag AS account_churn_flag,
fu.usage_date,
fu.feature_name,
fu.usage_count,
fu.usage_duration_secs,
fu.error_count,
fu.is_beta_feature,
s.mrr_amount,
s.billing_frequency,
s.upgrade_flag,
s.downgrade_flag,
s.churn_flag AS subscription_churn_flag,
fu.ingest_time
FROM best_practice_saas_dw.ods_feature_usage fu
JOIN best_practice_saas_dw.ods_subscriptions s ON fu.subscription_id = s.subscription_id
JOIN best_practice_saas_dw.ods_accounts a ON s.account_id = a.account_id;
⚠️ Note: Do not include REFRESH INTERVAL in the Dynamic Table DDL. Periodic refresh is scheduled by creating a "Refresh Dynamic Table" task in Studio (see below), which allows attaching monitoring alerts and data quality checks to the same task.
Create Refresh Task in Studio
In Studio, go to Development → Tasks, path best_practices/saas_dw/, and create a new task:
The DWS layer aggregates DWD data at account_id + month granularity, computing each tenant's monthly feature usage breadth (distinct_features_used), depth (total_usage_count), error rate, MRR, and other core metrics as direct input for the ADS layer health scoring.
Create Table
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_saas_dw.dws_tenant_monthly_metrics
AS
SELECT
account_id,
MAX(account_name) AS account_name,
MAX(industry) AS industry,
MAX(country) AS country,
MAX(plan_tier) AS plan_tier,
MAX(seats) AS seats,
DATE_TRUNC('month', usage_date) AS usage_month,
COUNT(DISTINCT feature_name) AS distinct_features_used,
COUNT(DISTINCT CASE WHEN is_beta_feature THEN feature_name END) AS beta_features_used,
SUM(usage_count) AS total_usage_count,
SUM(usage_duration_secs) AS total_duration_secs,
ROUND(AVG(usage_duration_secs), 1) AS avg_duration_secs,
SUM(error_count) AS total_errors,
COUNT(*) AS usage_event_count,
MAX(mrr_amount) AS mrr_amount,
MAX(CAST(account_churn_flag AS INT)) AS churn_flag,
MAX(CAST(upgrade_flag AS INT)) AS upgrade_flag,
MAX(CAST(downgrade_flag AS INT)) AS downgrade_flag
FROM best_practice_saas_dw.dwd_tenant_feature_usage
GROUP BY account_id, DATE_TRUNC('month', usage_date);
Similarly create a refresh task in Studio under best_practices/saas_dw/, with an hourly schedule (DWS refreshes after DWD completes).
Manually trigger the initial refresh:
REFRESH DYNAMIC TABLE best_practice_saas_dw.dws_tenant_monthly_metrics;
SELECT COUNT(*) AS dws_row_count
FROM best_practice_saas_dw.dws_tenant_monthly_metrics;
dws_row_count
-------------
4032
View the most active tenants in the most recent month:
SELECT account_id, account_name, plan_tier, usage_month,
distinct_features_used, total_usage_count,
total_errors, mrr_amount, churn_flag
FROM best_practice_saas_dw.dws_tenant_monthly_metrics
ORDER BY usage_month DESC, total_usage_count DESC
LIMIT 8;
Result interpretation: A-503d5a and A-4c38bc are highly active tenants in the current month (high total_usage_count), but churn_flag=1. High usage does not always prevent churn — it may represent a final burst of activity before churning. The ADS layer needs to combine historical trends for a more comprehensive assessment.
ADS Layer: Tenant Health Scoring and Churn Risk Segmentation
Health Score SQL UDF
Encapsulate the scoring logic in a SQL UDF so it can be reused by all downstream Dynamic Tables and ad-hoc queries.
Scoring formula:
Feature breadth (50 points): more features used = higher engagement; full score at 5 features
Usage volume (20 points): higher monthly usage = deeper product integration; full score at 50 usages
Error rate penalty (-15 points): higher error rate = worse user experience; maximum deduction of 15 points
Upgrade bonus (+20 points): an upgrade indicates high customer satisfaction
Downgrade penalty (-20 points): downgrade is a leading indicator of churn
Base score (+10 points): guaranteed baseline for tenants with active usage
SELECT
-- High health: 6 features, 80 usages, no errors, has upgrade
best_practice_saas_dw.calc_tenant_health_score(
CAST(6 AS INT), CAST(80 AS BIGINT), CAST(0 AS BIGINT),
CAST(20 AS BIGINT), CAST(1 AS INT), CAST(0 AS INT)
) AS high_health,
-- Medium: 4 features, 40 usages, 2/15 error rate
best_practice_saas_dw.calc_tenant_health_score(
CAST(4 AS INT), CAST(40 AS BIGINT), CAST(2 AS BIGINT),
CAST(15 AS BIGINT), CAST(0 AS INT), CAST(0 AS INT)
) AS mid_health,
-- Low health: 2 features, 10 usages, 5/10 error rate, has downgrade
best_practice_saas_dw.calc_tenant_health_score(
CAST(2 AS INT), CAST(10 AS BIGINT), CAST(5 AS BIGINT),
CAST(10 AS BIGINT), CAST(0 AS INT), CAST(1 AS INT)
) AS low_health;
💡 Tip: When calling the function, parameter types must match the function signature — pass INT for distinct_features_used, BIGINT for total_usage_count/total_errors/usage_event_count, and INT for upgrade_flag/downgrade_flag. When used in a Dynamic Table's SELECT clause, the DWS layer columns are already the correct types; only distinct_features_used, upgrade_flag, and downgrade_flag need CAST(... AS INT).
Create Table
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_saas_dw.ads_tenant_health_score
AS
SELECT
m.account_id,
m.account_name,
m.industry,
m.country,
m.plan_tier,
m.seats,
m.usage_month,
m.distinct_features_used,
m.total_usage_count,
m.total_errors,
m.usage_event_count,
m.beta_features_used,
m.mrr_amount,
m.churn_flag,
m.upgrade_flag,
m.downgrade_flag,
ROUND(best_practice_saas_dw.calc_tenant_health_score(
CAST(m.distinct_features_used AS INT),
m.total_usage_count,
m.total_errors,
m.usage_event_count,
CAST(m.upgrade_flag AS INT),
CAST(m.downgrade_flag AS INT)
), 1) AS health_score,
CASE
WHEN best_practice_saas_dw.calc_tenant_health_score(
CAST(m.distinct_features_used AS INT),
m.total_usage_count, m.total_errors, m.usage_event_count,
CAST(m.upgrade_flag AS INT), CAST(m.downgrade_flag AS INT)
) >= 70 THEN 'HEALTHY'
WHEN best_practice_saas_dw.calc_tenant_health_score(
CAST(m.distinct_features_used AS INT),
m.total_usage_count, m.total_errors, m.usage_event_count,
CAST(m.upgrade_flag AS INT), CAST(m.downgrade_flag AS INT)
) >= 40 THEN 'AT_RISK'
ELSE 'CHURN_RISK'
END AS health_tier,
ce.churn_date,
ce.reason_code AS churn_reason
FROM best_practice_saas_dw.dws_tenant_monthly_metrics m
LEFT JOIN best_practice_saas_dw.ods_churn_events ce
ON m.account_id = ce.account_id
AND DATE_TRUNC('month', ce.churn_date) = m.usage_month;
Create an ADS layer refresh task in Studio under best_practices/saas_dw/, scheduled after DWS refresh completes (e.g., if DWS refreshes on the hour, set ADS for 10 minutes after the hour).
Manually trigger the initial refresh:
REFRESH DYNAMIC TABLE best_practice_saas_dw.ads_tenant_health_score;
SELECT COUNT(*) AS ads_count
FROM best_practice_saas_dw.ads_tenant_health_score;
ads_count
---------
4046
Health Tier Distribution
SELECT health_tier,
COUNT(*) AS tenant_month_count,
ROUND(AVG(health_score), 1) AS avg_score
FROM best_practice_saas_dw.ads_tenant_health_score
GROUP BY health_tier
ORDER BY avg_score DESC;
Result interpretation: The CHURN_RISK proportion is high in the dataset (87.7%), reflecting a characteristic of the RavenStack dataset — it is dominated by tenants with low feature usage volume. Most tenants use only 1–2 features per month, which lowers the base score due to the distinct_features_used / 5.0 weight. This also reflects real-world SaaS: most tenants are in a shallow usage stage; only deeply engaged top tenants can reach the HEALTHY zone.
Cross-validation of health tiers against actual churn:
SELECT health_tier, churn_flag, COUNT(*) AS cnt
FROM best_practice_saas_dw.ads_tenant_health_score
GROUP BY health_tier, churn_flag
ORDER BY health_tier, churn_flag;
About 22% of AT_RISK tenants have already churned, and about 23% of CHURN_RISK tenants have churned. The similar churn rates between the two tiers suggest there is room to improve the scoring model's differentiation on this dataset — consider incorporating historical trend features (consecutive months of decline) and external predictive models (External Function) for further optimization.
Churn Reason Analysis
SELECT reason_code,
COUNT(*) AS churn_count,
ROUND(AVG(refund_amount_usd), 2) AS avg_refund
FROM best_practice_saas_dw.ods_churn_events
GROUP BY reason_code
ORDER BY churn_count DESC;
Missing features (features) is the top churn reason, indicating product feature competitiveness is the primary area for improvement. The unknown category (95 records) with above-average refund amounts suggests that collecting more structured feedback at churn time has significant value.
Churn Rate Analysis by Plan Tier
SELECT plan_tier,
COUNT(*) AS accounts,
ROUND(AVG(seats), 1) AS avg_seats,
SUM(CAST(churn_flag AS INT)) AS churned_accounts,
ROUND(100.0 * SUM(CAST(churn_flag AS INT)) / COUNT(*), 1) AS churn_rate_pct
FROM best_practice_saas_dw.ods_accounts
GROUP BY plan_tier
ORDER BY churn_rate_pct DESC;
Churn rates across all three plan tiers are very close (all around 22%), indicating that churn is not concentrated in any particular tier but is a systemic product retention challenge. This also means that discounting strategies targeting specific tiers will have limited effect; the focus should be on feature usage depth and support quality.
Feature Usage Heat Analysis
SELECT feature_name,
COUNT(DISTINCT subscription_id) AS user_count,
SUM(usage_count) AS total_usage,
ROUND(AVG(usage_duration_secs) / 60.0, 1) AS avg_minutes,
SUM(error_count) AS total_errors,
ROUND(100.0 * SUM(CAST(is_beta_feature AS INT)) / COUNT(*), 1) AS beta_pct
FROM best_practice_saas_dw.ods_feature_usage
GROUP BY feature_name
ORDER BY user_count DESC
LIMIT 6;
feature_26 has the broadest user coverage (144 subscriptions) and the highest average usage duration (52.9 minutes) — it is a core feature. Its beta adoption rate of 11.8% indicates a significant proportion of users are using the beta version; monitor the beta error rate closely and fix issues promptly.
Semantic View: Encapsulating Business Semantics for Analytics Agent
The Semantic View wraps the ADS layer health scoring table as a business-oriented view, adding a risk_signal field that consolidates complex conditions into a single signal. The Analytics Agent can then directly query "which tenants are at churn risk" in natural language.
CREATE OR REPLACE VIEW best_practice_saas_dw.v_tenant_churn_risk AS
SELECT
account_id,
account_name,
industry,
country,
plan_tier,
seats,
usage_month,
ROUND(mrr_amount, 2) AS mrr_usd,
distinct_features_used,
total_usage_count,
total_errors,
beta_features_used,
health_score,
health_tier,
churn_flag,
churn_reason,
CASE
WHEN churn_flag = 1 AND churn_reason IS NOT NULL THEN churn_reason
WHEN health_tier = 'CHURN_RISK' AND churn_flag = 0 THEN 'predicted_risk'
ELSE 'stable'
END AS risk_signal
FROM best_practice_saas_dw.ads_tenant_health_score;
Query risk signal scale and average MRR (for prioritization):
SELECT risk_signal,
COUNT(*) AS cnt,
ROUND(AVG(mrr_usd), 0) AS avg_mrr
FROM best_practice_saas_dw.v_tenant_churn_risk
GROUP BY risk_signal
ORDER BY cnt DESC;
Result interpretation: Tenants churning due to pricing have an average MRR of $5,459; those churning due to features have an even higher average MRR ($7,096). This indicates that high-value customer churn is primarily driven by product feature competitiveness and price sensitivity — not support quality issues. The customer success team should prioritize outreach to these two risk categories with targeted retention plans.
RBAC Permission Configuration
Different roles have different permission needs for tenant data:
Role
Accessible scope
Restrictions
Data Analyst
All ADS / DWS aggregate data
Cannot view ODS raw account names (visible after masking)
Customer Success (CS)
v_tenant_churn_risk with real account names
Requires admin authorization
Sales
v_tenant_churn_risk, only their own customers
Row-level filtering implemented at the application layer
Operations Admin
All layers including ODS raw data
No restrictions
Control view access via GRANT:
-- Grant CS role access to query the Semantic View
GRANT SELECT ON VIEW best_practice_saas_dw.v_tenant_churn_risk TO ROLE cs_team;
-- Grant analyst access to query DWS / ADS layers
GRANT SELECT ON DYNAMIC TABLE best_practice_saas_dw.dws_tenant_monthly_metrics TO ROLE analyst;
GRANT SELECT ON DYNAMIC TABLE best_practice_saas_dw.ads_tenant_health_score TO ROLE analyst;
-- ODS layer accessible only by admins
GRANT SELECT ON TABLE best_practice_saas_dw.ods_accounts TO ROLE admin;
GRANT SELECT ON TABLE best_practice_saas_dw.ods_subscriptions TO ROLE admin;
Do not write REFRESH INTERVAL for Dynamic Tables: The DDL does not include a REFRESH INTERVAL parameter; all refresh scheduling is managed uniformly through Studio Tasks. This allows attaching monitoring alerts and data quality checks to the same task and makes it easy to change the schedule without rebuilding the table.
Column Masking applies transparently to Dynamic Tables: When the DWD layer JOINs ods_accounts, non-privileged users see a masked account_name (Com****). DWD / DWS / ADS also store the masked values. If the CS role needs to see the original account name, query v_tenant_churn_risk directly (CS role already has access to this view) and avoid applying the masking function on account_name in the view definition.
Health score UDF parameter types: calc_tenant_health_score is defined with DOUBLE parameters. When calling it in a Dynamic Table's SELECT clause, distinct_features_used, upgrade_flag, and downgrade_flag from the DWS table are BIGINT; they need CAST(... AS INT) before being passed in, otherwise a type mismatch parse error will occur.
Kafka PIPE offset management: Messages that arrive while the PIPE is stopped are not lost; after restart the PIPE continues from the last offset. To reset the consumption offset (e.g., for historical data backfill), configure the start offset in the Studio task or rebuild the PIPE.
Dynamic Table incremental refresh limitations: If the ODS layer uses INSERT OVERWRITE for full replacement, Dynamic Tables will fall back to full refresh. Use INSERT INTO append writes, combined with CDC and retaining the ingest_time field for incremental identification.
Column Masking: After binding SET MASK, all users querying that column via standard SQL will go through masking, including Dynamic Tables, views, and ad-hoc queries.
⚠️ Manual verification required: Column Masking currently matches authorization by username via current_user(). All usernames that need to see plaintext data must be added one by one to the whitelist in the masking function. If your Lakehouse version supports role-based dynamic evaluation (e.g., HAS_ROLE('role_name')), you can use roles instead of username lists for more flexible management. Contact Singdata technical support to confirm whether this function is supported in your version.