Online Education Learning Behavior Data Warehouse Best Practices

Integrating student learning behavior logs (course clicks, quiz submissions, assignment submissions) with course metadata into a multi-layer data warehouse to output learning effectiveness scores and at-risk student alerts. Based on the Open University Learning Analytics (OULAD) public dataset, this guide provides an end-to-end demonstration of the complete Kafka PIPE โ†’ ODS โ†’ DWD โ†’ DWS โ†’ ADS build process, covering three key platform capabilities: Inverted Index, BITMAP functions, and SQL UDFs.


Overview

The learning behavior data pipeline for an online education platform follows: real-time behavior events (Kafka) โ†’ raw storage (ODS) โ†’ cleansing and enrichment (DWD) โ†’ student course progress aggregation (DWS) โ†’ learning scores and alerts (ADS).

Singdata Lakehouse addresses several core challenges with the following combination:

ProblemSolution
High-frequency millisecond writes for student click and quiz eventsKafka PIPE for continuous ingestion โ€” no consumer code required
Automated incremental computation across ODS โ†’ DWD โ†’ DWS โ†’ ADSDynamic Table with declarative SQL; the system auto-schedules the dependency chain
Attendance and completion rate statistics by course/classGROUP_BITMAP function for fast active-student cardinality calculation
Fast filtering by activity type (quiz, oucontent)Inverted Index for full-text search on activity type tags
Reusable multi-dimensional learning scoring logicSQL UDF calc_learning_score encapsulating the weighted scoring formula
Inferring student knowledge concept masteryExternal Function calling a knowledge graph API to infer mastery status

SQL Commands Used

Command / FunctionPurposeNotes
CREATE TABLECreate ODS layer raw tablesRegular tables serving as upstream sources for Dynamic Tables
CREATE INVERTED INDEXCreate an Inverted Index on the activity_type columnEnables full-text search on activity types
CREATE PIPECreate a Kafka continuous ingestion pipelineBound to the ODS target table
CREATE FUNCTIONCreate SQL UDF calc_learning_scoreEncapsulates the weighted learning score formula
CREATE DYNAMIC TABLECreate incremental computation tables for DWD / DWS / ADS layersSystem automatically detects upstream changes and refreshes incrementally
REFRESH DYNAMIC TABLEManually trigger a single refreshUsed for initial build or debugging
GROUP_BITMAPCount active students (deduplicated)Returns BIGINT cardinality
GROUP_BITMAP_STATEBuild bitmap objects for cross-course intersection analysisReturns bitmap type, supports AND/OR operations

Prerequisites

All examples in this guide run under the best_practice_education_dw schema.

CREATE SCHEMA IF NOT EXISTS best_practice_education_dw;

Result:

{}


ODS Layer: Raw Data Tables

The ODS layer receives three types of data: real-time behavior events from Kafka, student enrollment and grade data synced via PostgreSQL CDC, and course metadata imported in batch. This guide uses the OULAD public dataset (real data from the UK Open University) to build the test data.

Create Tables

-- Student VLE behavior event table (Kafka PIPE write target) CREATE TABLE IF NOT EXISTS best_practice_education_dw.doc_ods_student_vle ( code_module STRING, code_presentation STRING, id_student BIGINT, id_site BIGINT, event_date INT, sum_click INT, ingest_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP() ); -- Student basic information table (CDC sync target) CREATE TABLE IF NOT EXISTS best_practice_education_dw.doc_ods_student_info ( code_module STRING, code_presentation STRING, id_student BIGINT, gender STRING, region STRING, highest_education STRING, imd_band STRING, age_band STRING, num_of_prev_attempts INT, studied_credits INT, disability STRING, final_result STRING ); -- Course metadata table CREATE TABLE IF NOT EXISTS best_practice_education_dw.doc_ods_courses ( code_module STRING, code_presentation STRING, module_presentation_length INT ); -- VLE resource type table CREATE TABLE IF NOT EXISTS best_practice_education_dw.doc_ods_vle ( id_site BIGINT, code_module STRING, code_presentation STRING, activity_type STRING, week_from INT, week_to INT ); -- Assignment submission and grade table CREATE TABLE IF NOT EXISTS best_practice_education_dw.doc_ods_student_assessment ( id_assessment BIGINT, id_student BIGINT, date_submitted INT, is_banked INT, score DOUBLE );

ingest_time uses DEFAULT CURRENT_TIMESTAMP() and is automatically populated when Kafka PIPE writes; it does not need to be carried in the message body.

Create Inverted Index

The activity_type column of doc_ods_vle stores activity type tags (quiz, oucontent, resource, forumng, etc.). Analysts frequently filter resources by type.

CREATE INVERTED INDEX idx_inv_activity_type ON TABLE doc_ods_vle (activity_type);

The index only takes effect for data written after it is created. If the table already has existing data, run BUILD INDEX to cover the backfill:

BUILD INDEX idx_inv_activity_type ON doc_ods_vle;

Configure Kafka PIPE

The Kafka PIPE attempts to connect to the Kafka broker and verify the subscription relationship at DDL creation time. Replace the broker address and topic name with your actual values before creating the pipeline in production.

Option 1: Write via Kafka (recommended)

The following Python example shows how to push student behavior events to a Kafka topic to trigger PIPE ingestion:

from kafka import KafkaProducer import json, time producer = KafkaProducer( bootstrap_servers=['<kafka-broker>:9092'], value_serializer=lambda v: json.dumps(v).encode('utf-8') ) # Construct a student VLE behavior event message event = { "code_module": "AAA", "code_presentation": "2013J", "id_student": 11391, "id_site": 546714, "event_date": 15, "sum_click": 5 } producer.send('edu_student_vle_events', value=event) producer.flush() print("Event sent")

Corresponding Kafka PIPE DDL:

-- First create a raw string receiver table; PIPE writes JSON strings to it CREATE TABLE IF NOT EXISTS best_practice_education_dw.kafka_raw_vle (value STRING); -- Create Kafka PIPE CREATE PIPE IF NOT EXISTS best_practice_education_dw.pipe_student_vle VIRTUAL_CLUSTER = 'DEFAULT' BATCH_INTERVAL_IN_SECONDS = '60' AS COPY INTO best_practice_education_dw.kafka_raw_vle FROM ( SELECT CAST(value AS STRING) AS value FROM READ_KAFKA( '<kafka-broker>:9092', 'edu_student_vle_events', '', 'cz_edu_consumer', '','','','', 'raw', 'raw', 0, map() ) );

Option 2: INSERT simulation (when Kafka is unavailable)

If Kafka is not yet configured, save the data as a local CSV file, upload it to a User Volume via cz-cli, and import with COPY INTO (recommended):

Import from local CSV (recommended)

-- Step 1: Upload local CSV to User Volume via SQL PUT PUT '/path/to/student_vle_data.csv' TO USER VOLUME FILE 'student_vle_data.csv';

-- Step 2: COPY INTO the table from User Volume COPY INTO best_practice_education_dw.doc_ods_student_vle FROM USER VOLUME USING csv OPTIONS('header'='true', 'sep'=',', 'nullValue'='') FILES ('student_vle_data.csv');

You can also insert small batches of test data inline (no CSV file needed):

INSERT INTO best_practice_education_dw.doc_ods_student_vle (code_module, code_presentation, id_student, id_site, event_date, sum_click) VALUES ('AAA','2013J',11391,546652, 1, 4), ('AAA','2013J',11391,546652, 2, 8), ('AAA','2013J',11391,546712, 3,15), ('AAA','2013J',11391,546714, 5, 3), ('AAA','2013J',11391,546715, 7, 6), ('AAA','2013J',11391,546712,10,12), ('AAA','2013J',11391,546714,12, 4), ('AAA','2013J',28400,546652, 1, 2), ('AAA','2013J',28400,546712, 4, 9), ('AAA','2013J',28400,546715, 6, 5), ('AAA','2013J',31604,546652, 1, 5), ('AAA','2013J',31604,546712, 3,11), ('AAA','2013J',31604,546714, 6, 2), ('EEE','2013J',70001,550001, 1,18), ('EEE','2013J',70001,550001, 3,22), ('BBB','2013J',40102,547001, 2, 7), ('BBB','2013J',40102,547002, 4, 2), ('CCC','2014J',50001,548001, 2,14), ('CCC','2014J',50001,548002, 5, 3); -- 29 records total

Verify ODS row count:

SELECT COUNT(*) AS row_count FROM best_practice_education_dw.doc_ods_student_vle;

row_count --------- 29


Learning Score UDF

Encapsulate the multi-dimensional learning effectiveness scoring logic as a SQL UDF, reusable in both DWS and ADS layers.

Scoring formula:

  • assessment_avg ร— 0.50: quiz scores contribute 50 points
  • min(total_clicks, 200) / 200 ร— 30: platform click depth contributes 30 points
  • min(submission_count, 5) / 5 ร— 10: assignment submission frequency contributes 10 points
  • min(days_active, 30) / 30 ร— 10: active days contribute 10 points

CREATE OR REPLACE FUNCTION best_practice_education_dw.calc_learning_score( total_clicks INT, assessment_avg DOUBLE, submission_count INT, days_active INT ) RETURNS DOUBLE AS GREATEST(0.0, LEAST(100.0, COALESCE(assessment_avg, 0) * 0.50 + LEAST(total_clicks, 200) / 200.0 * 30.0 + LEAST(submission_count, 5) / 5.0 * 10.0 + LEAST(days_active, 30) / 30.0 * 10.0 ));

Verify the function โ€” highly active student (120 clicks, average score 78, 3 submissions, 15 active days):

SELECT best_practice_education_dw.calc_learning_score(120, 78.0, 3, 15) AS sample_score;

sample_score ------------ 68


DWD Layer Dynamic Table: Cleansing and Enrichment

The DWD layer joins ODS raw events with student info, course metadata, and VLE resource types, outputting a standardized wide learning event table for direct aggregation by the DWS layer.

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_education_dw.dwd_learning_events AS SELECT v.code_module, v.code_presentation, v.id_student, v.id_site, vl.activity_type, v.event_date, v.sum_click, v.ingest_time, s.gender, s.region, s.highest_education, s.age_band, s.final_result AS enrollment_result, c.module_presentation_length FROM best_practice_education_dw.doc_ods_student_vle v LEFT JOIN best_practice_education_dw.doc_ods_student_info s ON v.code_module = s.code_module AND v.code_presentation = s.code_presentation AND v.id_student = s.id_student LEFT JOIN best_practice_education_dw.doc_ods_vle vl ON v.id_site = vl.id_site LEFT JOIN best_practice_education_dw.doc_ods_courses c ON v.code_module = c.code_module AND v.code_presentation = c.code_presentation;

Manually trigger the first refresh:

REFRESH DYNAMIC TABLE best_practice_education_dw.dwd_learning_events; SELECT COUNT(*) AS dwd_count FROM best_practice_education_dw.dwd_learning_events;

dwd_count --------- 29


DWS Layer Dynamic Table: Student Course Progress Aggregation

The DWS layer aggregates DWD layer data at the id_student + code_module + code_presentation granularity, outputting behavioral statistics for each student in each course as direct input to the ADS layer scoring.

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_education_dw.dws_student_course_progress AS SELECT e.code_module, e.code_presentation, e.id_student, e.gender, e.region, e.highest_education, e.age_band, e.enrollment_result, e.module_presentation_length, COUNT(DISTINCT e.event_date) AS days_active, SUM(e.sum_click) AS total_clicks, COUNT(DISTINCT e.id_site) AS distinct_resources, SUM(CASE WHEN e.activity_type = 'quiz' THEN e.sum_click ELSE 0 END) AS quiz_clicks, COUNT(DISTINCT CASE WHEN e.activity_type = 'quiz' THEN e.event_date END) AS quiz_days, SUM(CASE WHEN e.activity_type = 'oucontent' THEN e.sum_click ELSE 0 END) AS content_clicks, MAX(e.event_date) AS last_active_day FROM best_practice_education_dw.dwd_learning_events e GROUP BY e.code_module, e.code_presentation, e.id_student, e.gender, e.region, e.highest_education, e.age_band, e.enrollment_result, e.module_presentation_length;

Manually trigger the first refresh and view results:

REFRESH DYNAMIC TABLE best_practice_education_dw.dws_student_course_progress; SELECT code_module, code_presentation, id_student, days_active, total_clicks, distinct_resources, quiz_clicks, last_active_day FROM best_practice_education_dw.dws_student_course_progress ORDER BY total_clicks DESC LIMIT 8;

code_module | code_presentation | id_student | days_active | total_clicks | distinct_resources | quiz_clicks | last_active_day ------------+-------------------+------------+-------------+--------------+--------------------+-------------+---------------- AAA | 2013J | 11391 | 7 | 52 | 4 | 7 | 12 EEE | 2013J | 70001 | 2 | 40 | 1 | 0 | 3 DDD | 2013J | 60001 | 2 | 21 | 1 | 0 | 5 BBB | 2013J | 40102 | 3 | 19 | 2 | 2 | 8 AAA | 2013J | 31604 | 3 | 18 | 3 | 2 | 6 CCC | 2014J | 50001 | 2 | 17 | 2 | 3 | 5 AAA | 2013J | 28400 | 3 | 16 | 3 | 0 | 6 CCC | 2014J | 50002 | 1 | 8 | 1 | 0 | 3

Result interpretation:

  • Student 11391 (AAA course) has 52 total clicks and 7 active days โ€” the highest engagement in the dataset. quiz_clicks=7 indicates quiz interaction, and this student's composite score is likely to fall in the MEDIUM_RISK range.
  • Student 70001 (EEE course) has 40 clicks but only 2 active days, representing an intensive short-burst learning pattern. quiz_clicks=0 means no quiz participation โ€” knowledge mastery should be monitored.
  • distinct_resources measures the breadth of resources a student explored: accessing more resource types generally indicates better learning completeness.

BITMAP Functions: Course Active Student Count

GROUP_BITMAP deduplicates student IDs more efficiently than COUNT(DISTINCT) and can be combined with GROUP_BITMAP_STATE for cross-course intersection analysis:

-- Active student count per course SELECT code_module, code_presentation, GROUP_BITMAP(CAST(id_student AS BIGINT)) AS active_student_count FROM best_practice_education_dw.dws_student_course_progress GROUP BY code_module, code_presentation ORDER BY code_module;

code_module | code_presentation | active_student_count ------------+-------------------+--------------------- AAA | 2013J | 5 BBB | 2013J | 2 CCC | 2014J | 2 DDD | 2013J | 2 EEE | 2013J | 1 FFF | 2013J | 1

Cross-course comparison: total enrolled students vs. highly engaged students (total clicks > 20) per course:

SELECT a.code_module, a.code_presentation, GROUP_BITMAP(CAST(a.id_student AS BIGINT)) AS total_enrolled, GROUP_BITMAP(CASE WHEN a.total_clicks > 20 THEN CAST(a.id_student AS BIGINT) END) AS high_engagement FROM best_practice_education_dw.dws_student_course_progress a GROUP BY a.code_module, a.code_presentation ORDER BY a.code_module;

code_module | code_presentation | total_enrolled | high_engagement ------------+-------------------+----------------+---------------- AAA | 2013J | 5 | 1 BBB | 2013J | 2 | 0 CCC | 2014J | 2 | 0 DDD | 2013J | 2 | 1 EEE | 2013J | 1 | 1 FFF | 2013J | 1 | 0


ADS Layer Dynamic Table: Learning Scores and High-Risk Alerts

The ADS layer calls the calc_learning_score UDF to score each student and outputs three risk levels for direct dashboard consumption.

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_education_dw.ads_student_learning_score AS SELECT p.code_module, p.code_presentation, p.id_student, p.gender, p.region, p.highest_education, p.age_band, p.enrollment_result, p.days_active, p.total_clicks, p.distinct_resources, p.quiz_clicks, p.last_active_day, COALESCE(a.submission_count, 0) AS submission_count, COALESCE(a.avg_score, 0.0) AS avg_score, COALESCE(a.max_score, 0.0) AS max_score, best_practice_education_dw.calc_learning_score( CAST(p.total_clicks AS INT), a.avg_score, CAST(COALESCE(a.submission_count, 0) AS INT), CAST(p.days_active AS INT) ) AS learning_score, CASE WHEN best_practice_education_dw.calc_learning_score( CAST(p.total_clicks AS INT), a.avg_score, CAST(COALESCE(a.submission_count, 0) AS INT), CAST(p.days_active AS INT) ) >= 75 THEN 'LOW_RISK' WHEN best_practice_education_dw.calc_learning_score( CAST(p.total_clicks AS INT), a.avg_score, CAST(COALESCE(a.submission_count, 0) AS INT), CAST(p.days_active AS INT) ) >= 50 THEN 'MEDIUM_RISK' ELSE 'HIGH_RISK' END AS risk_level FROM best_practice_education_dw.dws_student_course_progress p LEFT JOIN ( SELECT sa.id_student, COUNT(*) AS submission_count, ROUND(AVG(CASE WHEN sa.is_banked = 0 THEN sa.score END), 2) AS avg_score, MAX(CASE WHEN sa.is_banked = 0 THEN sa.score END) AS max_score FROM best_practice_education_dw.doc_ods_student_assessment sa GROUP BY sa.id_student ) a ON p.id_student = CAST(a.id_student AS STRING);

Manually trigger the first refresh and view high-risk alert results:

REFRESH DYNAMIC TABLE best_practice_education_dw.ads_student_learning_score; SELECT code_module, id_student, days_active, total_clicks, avg_score, learning_score, risk_level FROM best_practice_education_dw.ads_student_learning_score ORDER BY learning_score ASC LIMIT 10;

code_module | id_student | days_active | total_clicks | avg_score | learning_score | risk_level ------------+------------+-------------+--------------+-----------+-----------------+----------- BBB | 41203 | 1 | 3 | 0 | 2.783 | HIGH_RISK FFF | 80001 | 1 | 2 | 32 | 18.633 | HIGH_RISK DDD | 60002 | 1 | 4 | 40 | 22.933 | HIGH_RISK AAA | 32885 | 1 | 1 | 41.5 | 25.233 | HIGH_RISK BBB | 40102 | 3 | 19 | 48.5 | 32.1 | HIGH_RISK AAA | 30268 | 2 | 4 | 65 | 37.767 | HIGH_RISK CCC | 50002 | 1 | 8 | 78 | 42.533 | HIGH_RISK AAA | 31604 | 3 | 18 | 82 | 46.7 | HIGH_RISK AAA | 28400 | 3 | 16 | 71 | 46.9 | HIGH_RISK DDD | 60001 | 2 | 21 | 82.5 | 49.067 | HIGH_RISK

View risk level distribution:

SELECT risk_level, COUNT(*) AS student_count FROM best_practice_education_dw.ads_student_learning_score GROUP BY risk_level ORDER BY risk_level;

risk_level | student_count -------------+-------------- HIGH_RISK | 10 MEDIUM_RISK | 3

Result interpretation:

  • HIGH_RISK (10 students): avg_score=0 (student 41203) means never submitted an assignment; learning score is only 2.8, indicating extreme dropout risk. Several other students have scores between 40โ€“65 but very low click counts, suggesting neither engagement nor content mastery.
  • MEDIUM_RISK (3 students): Taking student 11391 as an example โ€” 52 clicks, average score 83, 7 active days, learning score 61.6. Engagement is strong but still below the LOW_RISK threshold (75 points), primarily because submission count (5) and active days (7) haven't reached their ceiling.
  • No LOW_RISK students in the current dataset, which is expected given the small simulated dataset (only 29 behavior records). In real production environments, the proportion of highly engaged students is typically 30โ€“50%.

Lakehouse Studio Task Scheduling

Dynamic Table periodic refreshes are managed through Lakehouse Studio tasks, not by writing REFRESH INTERVAL in the DDL. Decoupling tasks from Dynamic Tables allows monitoring alerts and data quality checks to be attached to the same task.

Create Task Folder

cz-cli task create-folder "education_dw" --parent 186117 -p skill_test

{"data":186121}

Create DWD Refresh Task

cz-cli task create "Refresh_DWD_Learning_Events" --type SQL --folder 186121 -p skill_test

{"data":{"id":10354662,"studio_url":"https://..."}}

Save SQL content:

cz-cli task save-content "Refresh_DWD_Learning_Events" \ --content "REFRESH DYNAMIC TABLE best_practice_education_dw.dwd_learning_events;" \ -p skill_test

Configure hourly schedule (5-field cron expression):

cz-cli task save-cron "Refresh_DWD_Learning_Events" --cron "0 0/1 * * ?" -p skill_test

Create DWS Refresh Task

cz-cli task create "Refresh_DWS_Learning_Progress" --type SQL --folder 186121 -p skill_test cz-cli task save-content "Refresh_DWS_Learning_Progress" \ --content "REFRESH DYNAMIC TABLE best_practice_education_dw.dws_student_course_progress;" \ -p skill_test cz-cli task save-cron "Refresh_DWS_Learning_Progress" --cron "0 0/1 * * ?" -p skill_test

Create ADS Refresh Task

cz-cli task create "Refresh_ADS_Learning_Score" --type SQL --folder 186121 -p skill_test cz-cli task save-content "Refresh_ADS_Learning_Score" \ --content "REFRESH DYNAMIC TABLE best_practice_education_dw.ads_student_learning_score;" \ -p skill_test cz-cli task save-cron "Refresh_ADS_Learning_Score" --cron "0 0/1 * * ?" -p skill_test


External Function: Knowledge Concept Mastery Inference

In real-world online education scenarios, each question maps to several knowledge concepts. An external knowledge graph API can infer a student's mastery status for each concept, providing a basis for targeted remediation. The following shows a typical External Function integration pattern.

External Function Framework

# knowledge_mastery.py โ€” calls a knowledge graph API to infer knowledge concept mastery from clickzetta.zettapark.functions import annotate class KnowledgeMastery: @annotate( input_args=[("student_id", "BIGINT"), ("question_ids", "STRING"), ("scores", "STRING")], return_type="STRING" ) def evaluate(self, student_id: int, question_ids: str, scores: str) -> str: """ Calls a knowledge graph service to infer the student's mastery level for each concept. Returns a JSON string, e.g.: {"concepts": [{"id": "C001", "name": "Linear Equations", "mastery": "proficient"}]} """ import json, requests payload = { "student_id": student_id, "responses": [ {"question_id": qid, "score": float(sc)} for qid, sc in zip(question_ids.split(","), scores.split(",")) ] } # Replace with actual knowledge graph service URL resp = requests.post( "https://<knowledge-graph-service>/api/mastery", json=payload, timeout=5 ) return json.dumps(resp.json(), ensure_ascii=False)

After deploying as an External Function, it can be called directly in SQL:

-- Example: query knowledge concept mastery status for student 11391 SELECT id_student, best_practice_education_dw.infer_knowledge_mastery( id_student, '1752,1753,1754', -- corresponding question IDs '78,85,88' -- corresponding question scores ) AS mastery_result FROM best_practice_education_dw.doc_ods_student_info WHERE id_student = 11391 LIMIT 1;


Data Warehouse Object Overview

After the full build, core objects under the best_practice_education_dw schema:

SHOW TABLES IN best_practice_education_dw;

Main objects:

schema_name | table_name | is_dynamic -----------------------------+---------------------------------+----------- best_practice_education_dw | doc_ods_student_vle | false best_practice_education_dw | doc_ods_student_info | false best_practice_education_dw | doc_ods_courses | false best_practice_education_dw | doc_ods_vle | false best_practice_education_dw | doc_ods_student_assessment | false best_practice_education_dw | dwd_learning_events | true best_practice_education_dw | dws_student_course_progress | true best_practice_education_dw | ads_student_learning_score | true

Architecture:

Kafka (real-time) PostgreSQL CDC CSV Batch | | | v pipe_student_vle (PIPE) | | kafka_raw_vle | | v v doc_ods_student_vle doc_ods_student_info doc_ods_courses / doc_ods_vle Inverted Index Inverted Index (activity_type) | | | +--------------------+---+-------------------+ | v Studio Task: Refresh_DWD_Learning_Events (hourly) dwd_learning_events (Dynamic Table) LEFT JOIN enrichment ยท cleansing ยท normalization | v Studio Task: Refresh_DWS_Learning_Progress (hourly) dws_student_course_progress (Dynamic Table) days_active / total_clicks / quiz_clicks / GROUP_BITMAP | v Studio Task: Refresh_ADS_Learning_Score (hourly) ads_student_learning_score (Dynamic Table) calc_learning_score UDF ยท HIGH/MEDIUM/LOW_RISK


Notes

  • Do not write REFRESH INTERVAL in Dynamic Table DDL: Scheduling is managed via Lakehouse Studio tasks. The DDL does not include a REFRESH INTERVAL parameter; refresh tasks are created via cz-cli task create/save-content/save-cron, and alerts and data quality checks can be appended to the same task.

  • Inverted Index does not automatically apply to existing data: CREATE INVERTED INDEX only takes effect for data written after the index is created. If the table already has existing data, run BUILD INDEX idx_inv_activity_type ON doc_ods_vle to cover the backfill; otherwise MATCH_ALL queries may return empty results.

  • GROUP_BITMAP and GROUP_BITMAP_STATE have different semantics: GROUP_BITMAP returns cardinality (BIGINT); GROUP_BITMAP_STATE returns a bitmap object (supports AND/OR intersection operations). They cannot be mixed โ€” BITMAP_COUNT(GROUP_BITMAP(...)) will report a type error. Use GROUP_BITMAP_STATE followed by BITMAP_COUNT.

  • Dynamic Table incremental refresh depends on upstream change tracking: The first REFRESH performs a full snapshot computation. Subsequent incremental refreshes only process rows added or changed in the upstream ODS layer since the last refresh point. If the ODS layer uses INSERT OVERWRITE, it will cause the Dynamic Table to fall back to full refresh mode.

  • Type matching in ADS layer JOINs: dws_student_course_progress.id_student may become STRING type after Dynamic Table aggregation. When JOINing with doc_ods_student_assessment.id_student (BIGINT), explicit CAST is required, otherwise the join will fail or produce a full Cartesian product.

  • External Function network latency: Calling external knowledge graph APIs incurs network round-trip latency and is not suitable for high-frequency queries. It is recommended to materialize inference results into a dedicated ADS layer table, updated in batch once per day or per learning phase.