Online Education Learning Behavior Data Warehouse Best Practices
Integrating student learning behavior logs (course clicks, quiz submissions, assignment submissions) with course metadata into a multi-layer data warehouse to output learning effectiveness scores and at-risk student alerts. Based on the Open University Learning Analytics (OULAD) public dataset, this guide provides an end-to-end demonstration of the complete Kafka PIPE โ ODS โ DWD โ DWS โ ADS build process, covering three key platform capabilities: Inverted Index, BITMAP functions, and SQL UDFs.
Overview
The learning behavior data pipeline for an online education platform follows: real-time behavior events (Kafka) โ raw storage (ODS) โ cleansing and enrichment (DWD) โ student course progress aggregation (DWS) โ learning scores and alerts (ADS).
Singdata Lakehouse addresses several core challenges with the following combination:
Problem
Solution
High-frequency millisecond writes for student click and quiz events
Kafka PIPE for continuous ingestion โ no consumer code required
System automatically detects upstream changes and refreshes incrementally
REFRESH DYNAMIC TABLE
Manually trigger a single refresh
Used for initial build or debugging
GROUP_BITMAP
Count active students (deduplicated)
Returns BIGINT cardinality
GROUP_BITMAP_STATE
Build bitmap objects for cross-course intersection analysis
Returns bitmap type, supports AND/OR operations
Prerequisites
All examples in this guide run under the best_practice_education_dw schema.
CREATE SCHEMA IF NOT EXISTS best_practice_education_dw;
Result:
{}
ODS Layer: Raw Data Tables
The ODS layer receives three types of data: real-time behavior events from Kafka, student enrollment and grade data synced via PostgreSQL CDC, and course metadata imported in batch. This guide uses the OULAD public dataset (real data from the UK Open University) to build the test data.
Create Tables
-- Student VLE behavior event table (Kafka PIPE write target)
CREATE TABLE IF NOT EXISTS best_practice_education_dw.doc_ods_student_vle (
code_module STRING,
code_presentation STRING,
id_student BIGINT,
id_site BIGINT,
event_date INT,
sum_click INT,
ingest_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
-- Student basic information table (CDC sync target)
CREATE TABLE IF NOT EXISTS best_practice_education_dw.doc_ods_student_info (
code_module STRING,
code_presentation STRING,
id_student BIGINT,
gender STRING,
region STRING,
highest_education STRING,
imd_band STRING,
age_band STRING,
num_of_prev_attempts INT,
studied_credits INT,
disability STRING,
final_result STRING
);
-- Course metadata table
CREATE TABLE IF NOT EXISTS best_practice_education_dw.doc_ods_courses (
code_module STRING,
code_presentation STRING,
module_presentation_length INT
);
-- VLE resource type table
CREATE TABLE IF NOT EXISTS best_practice_education_dw.doc_ods_vle (
id_site BIGINT,
code_module STRING,
code_presentation STRING,
activity_type STRING,
week_from INT,
week_to INT
);
-- Assignment submission and grade table
CREATE TABLE IF NOT EXISTS best_practice_education_dw.doc_ods_student_assessment (
id_assessment BIGINT,
id_student BIGINT,
date_submitted INT,
is_banked INT,
score DOUBLE
);
ingest_time uses DEFAULT CURRENT_TIMESTAMP() and is automatically populated when Kafka PIPE writes; it does not need to be carried in the message body.
Create Inverted Index
The activity_type column of doc_ods_vle stores activity type tags (quiz, oucontent, resource, forumng, etc.). Analysts frequently filter resources by type.
CREATE INVERTED INDEX idx_inv_activity_type
ON TABLE doc_ods_vle (activity_type);
โ ๏ธ Note: CREATE INVERTED INDEX requires the same schema context as the target table. Before executing, switch schema using the -s best_practice_education_dw parameter or prefix the table name with the schema, otherwise you will get an "index and table must in the same schema" error.
The index only takes effect for data written after it is created. If the table already has existing data, run BUILD INDEX to cover the backfill:
BUILD INDEX idx_inv_activity_type ON doc_ods_vle;
Configure Kafka PIPE
The Kafka PIPE attempts to connect to the Kafka broker and verify the subscription relationship at DDL creation time. Replace the broker address and topic name with your actual values before creating the pipeline in production.
Option 1: Write via Kafka (recommended)
The following Python example shows how to push student behavior events to a Kafka topic to trigger PIPE ingestion:
-- First create a raw string receiver table; PIPE writes JSON strings to it
CREATE TABLE IF NOT EXISTS best_practice_education_dw.kafka_raw_vle (value STRING);
-- Create Kafka PIPE
CREATE PIPE IF NOT EXISTS best_practice_education_dw.pipe_student_vle
VIRTUAL_CLUSTER = 'DEFAULT'
BATCH_INTERVAL_IN_SECONDS = '60'
AS
COPY INTO best_practice_education_dw.kafka_raw_vle
FROM (
SELECT CAST(value AS STRING) AS value
FROM READ_KAFKA(
'<kafka-broker>:9092',
'edu_student_vle_events',
'',
'cz_edu_consumer',
'','','','',
'raw', 'raw',
0,
map()
)
);
๐ก Tip: Once created, the PIPE runs by default and batch-consumes every BATCH_INTERVAL_IN_SECONDS seconds. The JSON parsing from kafka_raw_vle to doc_ods_student_vle can be handled by a Dynamic Table.
Option 2: INSERT simulation (when Kafka is unavailable)
If Kafka is not yet configured, save the data as a local CSV file, upload it to a User Volume via cz-cli, and import with COPY INTO (recommended):
๐ก Tip: The examples below use cz-cli (the Singdata ClickZetta Lakehouse CLI tool). If cz-cli is not installed, see cz-cli Setup Guide. If you prefer not to use the command line, you can also run SQL in the Development โ SQL Editor of Lakehouse Studio, and configure/trigger scheduled tasks in the Studio โ Tasks page.
Import from local CSV (recommended)
-- Step 1: Upload local CSV to User Volume via SQL PUT
PUT '/path/to/student_vle_data.csv' TO USER VOLUME FILE 'student_vle_data.csv';
-- Step 2: COPY INTO the table from User Volume
COPY INTO best_practice_education_dw.doc_ods_student_vle
FROM USER VOLUME
USING csv
OPTIONS('header'='true', 'sep'=',', 'nullValue'='')
FILES ('student_vle_data.csv');
You can also insert small batches of test data inline (no CSV file needed):
Verify the function โ highly active student (120 clicks, average score 78, 3 submissions, 15 active days):
SELECT best_practice_education_dw.calc_learning_score(120, 78.0, 3, 15) AS sample_score;
sample_score
------------
68
๐ก Tip: A learning score of 68 falls in the MEDIUM_RISK range (50โ75 points), indicating the student has good engagement but average academic performance. Focus should be on assignment quality rather than attendance frequency.
DWD Layer Dynamic Table: Cleansing and Enrichment
The DWD layer joins ODS raw events with student info, course metadata, and VLE resource types, outputting a standardized wide learning event table for direct aggregation by the DWS layer.
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_education_dw.dwd_learning_events
AS
SELECT
v.code_module,
v.code_presentation,
v.id_student,
v.id_site,
vl.activity_type,
v.event_date,
v.sum_click,
v.ingest_time,
s.gender,
s.region,
s.highest_education,
s.age_band,
s.final_result AS enrollment_result,
c.module_presentation_length
FROM best_practice_education_dw.doc_ods_student_vle v
LEFT JOIN best_practice_education_dw.doc_ods_student_info s
ON v.code_module = s.code_module
AND v.code_presentation = s.code_presentation
AND v.id_student = s.id_student
LEFT JOIN best_practice_education_dw.doc_ods_vle vl
ON v.id_site = vl.id_site
LEFT JOIN best_practice_education_dw.doc_ods_courses c
ON v.code_module = c.code_module
AND v.code_presentation = c.code_presentation;
โ ๏ธ Note: Do not write REFRESH INTERVAL in Dynamic Table DDL. Refresh scheduling is managed via Lakehouse Studio tasks (see the "Lakehouse Studio Task Scheduling" section below).
Manually trigger the first refresh:
REFRESH DYNAMIC TABLE best_practice_education_dw.dwd_learning_events;
SELECT COUNT(*) AS dwd_count FROM best_practice_education_dw.dwd_learning_events;
The DWS layer aggregates DWD layer data at the id_student + code_module + code_presentation granularity, outputting behavioral statistics for each student in each course as direct input to the ADS layer scoring.
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_education_dw.dws_student_course_progress
AS
SELECT
e.code_module,
e.code_presentation,
e.id_student,
e.gender,
e.region,
e.highest_education,
e.age_band,
e.enrollment_result,
e.module_presentation_length,
COUNT(DISTINCT e.event_date) AS days_active,
SUM(e.sum_click) AS total_clicks,
COUNT(DISTINCT e.id_site) AS distinct_resources,
SUM(CASE WHEN e.activity_type = 'quiz' THEN e.sum_click ELSE 0 END) AS quiz_clicks,
COUNT(DISTINCT CASE WHEN e.activity_type = 'quiz' THEN e.event_date END) AS quiz_days,
SUM(CASE WHEN e.activity_type = 'oucontent' THEN e.sum_click ELSE 0 END) AS content_clicks,
MAX(e.event_date) AS last_active_day
FROM best_practice_education_dw.dwd_learning_events e
GROUP BY
e.code_module, e.code_presentation, e.id_student,
e.gender, e.region, e.highest_education, e.age_band,
e.enrollment_result, e.module_presentation_length;
Manually trigger the first refresh and view results:
REFRESH DYNAMIC TABLE best_practice_education_dw.dws_student_course_progress;
SELECT code_module, code_presentation, id_student,
days_active, total_clicks, distinct_resources, quiz_clicks, last_active_day
FROM best_practice_education_dw.dws_student_course_progress
ORDER BY total_clicks DESC
LIMIT 8;
Student 11391 (AAA course) has 52 total clicks and 7 active days โ the highest engagement in the dataset. quiz_clicks=7 indicates quiz interaction, and this student's composite score is likely to fall in the MEDIUM_RISK range.
Student 70001 (EEE course) has 40 clicks but only 2 active days, representing an intensive short-burst learning pattern. quiz_clicks=0 means no quiz participation โ knowledge mastery should be monitored.
distinct_resources measures the breadth of resources a student explored: accessing more resource types generally indicates better learning completeness.
BITMAP Functions: Course Active Student Count
GROUP_BITMAP deduplicates student IDs more efficiently than COUNT(DISTINCT) and can be combined with GROUP_BITMAP_STATE for cross-course intersection analysis:
-- Active student count per course
SELECT
code_module,
code_presentation,
GROUP_BITMAP(CAST(id_student AS BIGINT)) AS active_student_count
FROM best_practice_education_dw.dws_student_course_progress
GROUP BY code_module, code_presentation
ORDER BY code_module;
Cross-course comparison: total enrolled students vs. highly engaged students (total clicks > 20) per course:
SELECT
a.code_module,
a.code_presentation,
GROUP_BITMAP(CAST(a.id_student AS BIGINT)) AS total_enrolled,
GROUP_BITMAP(CASE WHEN a.total_clicks > 20 THEN CAST(a.id_student AS BIGINT) END) AS high_engagement
FROM best_practice_education_dw.dws_student_course_progress a
GROUP BY a.code_module, a.code_presentation
ORDER BY a.code_module;
๐ก Tip: GROUP_BITMAP returns cardinality (BIGINT) and is suitable for post-aggregation counting. GROUP_BITMAP_STATE returns a bitmap object that can be used for cross-query BITMAP_OR/BITMAP_AND intersection analysis โ for example, counting students enrolled in both AAA and BBB courses simultaneously.
ADS Layer Dynamic Table: Learning Scores and High-Risk Alerts
The ADS layer calls the calc_learning_score UDF to score each student and outputs three risk levels for direct dashboard consumption.
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_education_dw.ads_student_learning_score
AS
SELECT
p.code_module,
p.code_presentation,
p.id_student,
p.gender,
p.region,
p.highest_education,
p.age_band,
p.enrollment_result,
p.days_active,
p.total_clicks,
p.distinct_resources,
p.quiz_clicks,
p.last_active_day,
COALESCE(a.submission_count, 0) AS submission_count,
COALESCE(a.avg_score, 0.0) AS avg_score,
COALESCE(a.max_score, 0.0) AS max_score,
best_practice_education_dw.calc_learning_score(
CAST(p.total_clicks AS INT),
a.avg_score,
CAST(COALESCE(a.submission_count, 0) AS INT),
CAST(p.days_active AS INT)
) AS learning_score,
CASE
WHEN best_practice_education_dw.calc_learning_score(
CAST(p.total_clicks AS INT), a.avg_score,
CAST(COALESCE(a.submission_count, 0) AS INT),
CAST(p.days_active AS INT)
) >= 75 THEN 'LOW_RISK'
WHEN best_practice_education_dw.calc_learning_score(
CAST(p.total_clicks AS INT), a.avg_score,
CAST(COALESCE(a.submission_count, 0) AS INT),
CAST(p.days_active AS INT)
) >= 50 THEN 'MEDIUM_RISK'
ELSE 'HIGH_RISK'
END AS risk_level
FROM best_practice_education_dw.dws_student_course_progress p
LEFT JOIN (
SELECT
sa.id_student,
COUNT(*) AS submission_count,
ROUND(AVG(CASE WHEN sa.is_banked = 0 THEN sa.score END), 2) AS avg_score,
MAX(CASE WHEN sa.is_banked = 0 THEN sa.score END) AS max_score
FROM best_practice_education_dw.doc_ods_student_assessment sa
GROUP BY sa.id_student
) a ON p.id_student = CAST(a.id_student AS STRING);
Manually trigger the first refresh and view high-risk alert results:
REFRESH DYNAMIC TABLE best_practice_education_dw.ads_student_learning_score;
SELECT code_module, id_student, days_active, total_clicks, avg_score,
learning_score, risk_level
FROM best_practice_education_dw.ads_student_learning_score
ORDER BY learning_score ASC
LIMIT 10;
HIGH_RISK (10 students): avg_score=0 (student 41203) means never submitted an assignment; learning score is only 2.8, indicating extreme dropout risk. Several other students have scores between 40โ65 but very low click counts, suggesting neither engagement nor content mastery.
MEDIUM_RISK (3 students): Taking student 11391 as an example โ 52 clicks, average score 83, 7 active days, learning score 61.6. Engagement is strong but still below the LOW_RISK threshold (75 points), primarily because submission count (5) and active days (7) haven't reached their ceiling.
No LOW_RISK students in the current dataset, which is expected given the small simulated dataset (only 29 behavior records). In real production environments, the proportion of highly engaged students is typically 30โ50%.
Lakehouse Studio Task Scheduling
Dynamic Table periodic refreshes are managed through Lakehouse Studio tasks, not by writing REFRESH INTERVAL in the DDL. Decoupling tasks from Dynamic Tables allows monitoring alerts and data quality checks to be attached to the same task.
๐ก Tip: After creating all three tasks, configure upstream/downstream dependencies in the Lakehouse Studio UI (Refresh_DWS depends on Refresh_DWD, Refresh_ADS depends on Refresh_DWS) to ensure correct refresh order. After configuration, run cz-cli task deploy <task-name> to publish tasks and activate scheduling. You can add alert rules on each task to automatically notify when a refresh times out or fails.
In real-world online education scenarios, each question maps to several knowledge concepts. An external knowledge graph API can infer a student's mastery status for each concept, providing a basis for targeted remediation. The following shows a typical External Function integration pattern.
External Function Framework
# knowledge_mastery.py โ calls a knowledge graph API to infer knowledge concept mastery
from clickzetta.zettapark.functions import annotate
class KnowledgeMastery:
@annotate(
input_args=[("student_id", "BIGINT"), ("question_ids", "STRING"),
("scores", "STRING")],
return_type="STRING"
)
def evaluate(self, student_id: int, question_ids: str, scores: str) -> str:
"""
Calls a knowledge graph service to infer the student's mastery level for each concept.
Returns a JSON string, e.g.:
{"concepts": [{"id": "C001", "name": "Linear Equations", "mastery": "proficient"}]}
"""
import json, requests
payload = {
"student_id": student_id,
"responses": [
{"question_id": qid, "score": float(sc)}
for qid, sc in zip(question_ids.split(","), scores.split(","))
]
}
# Replace with actual knowledge graph service URL
resp = requests.post(
"https://<knowledge-graph-service>/api/mastery",
json=payload, timeout=5
)
return json.dumps(resp.json(), ensure_ascii=False)
After deploying as an External Function, it can be called directly in SQL:
-- Example: query knowledge concept mastery status for student 11391
SELECT
id_student,
best_practice_education_dw.infer_knowledge_mastery(
id_student,
'1752,1753,1754', -- corresponding question IDs
'78,85,88' -- corresponding question scores
) AS mastery_result
FROM best_practice_education_dw.doc_ods_student_info
WHERE id_student = 11391
LIMIT 1;
Data Warehouse Object Overview
After the full build, core objects under the best_practice_education_dw schema:
Kafka (real-time) PostgreSQL CDC CSV Batch
| | |
v pipe_student_vle (PIPE) | |
kafka_raw_vle | |
v v
doc_ods_student_vle doc_ods_student_info doc_ods_courses / doc_ods_vle
Inverted Index Inverted Index (activity_type)
| | |
+--------------------+---+-------------------+
|
v Studio Task: Refresh_DWD_Learning_Events (hourly)
dwd_learning_events (Dynamic Table)
LEFT JOIN enrichment ยท cleansing ยท normalization
|
v Studio Task: Refresh_DWS_Learning_Progress (hourly)
dws_student_course_progress (Dynamic Table)
days_active / total_clicks / quiz_clicks / GROUP_BITMAP
|
v Studio Task: Refresh_ADS_Learning_Score (hourly)
ads_student_learning_score (Dynamic Table)
calc_learning_score UDF ยท HIGH/MEDIUM/LOW_RISK
Notes
Do not write REFRESH INTERVAL in Dynamic Table DDL: Scheduling is managed via Lakehouse Studio tasks. The DDL does not include a REFRESH INTERVAL parameter; refresh tasks are created via cz-cli task create/save-content/save-cron, and alerts and data quality checks can be appended to the same task.
Inverted Index does not automatically apply to existing data: CREATE INVERTED INDEX only takes effect for data written after the index is created. If the table already has existing data, run BUILD INDEX idx_inv_activity_type ON doc_ods_vle to cover the backfill; otherwise MATCH_ALL queries may return empty results.
GROUP_BITMAP and GROUP_BITMAP_STATE have different semantics: GROUP_BITMAP returns cardinality (BIGINT); GROUP_BITMAP_STATE returns a bitmap object (supports AND/OR intersection operations). They cannot be mixed โ BITMAP_COUNT(GROUP_BITMAP(...)) will report a type error. Use GROUP_BITMAP_STATE followed by BITMAP_COUNT.
Dynamic Table incremental refresh depends on upstream change tracking: The first REFRESH performs a full snapshot computation. Subsequent incremental refreshes only process rows added or changed in the upstream ODS layer since the last refresh point. If the ODS layer uses INSERT OVERWRITE, it will cause the Dynamic Table to fall back to full refresh mode.
Type matching in ADS layer JOINs: dws_student_course_progress.id_student may become STRING type after Dynamic Table aggregation. When JOINing with doc_ods_student_assessment.id_student (BIGINT), explicit CAST is required, otherwise the join will fail or produce a full Cartesian product.
External Function network latency: Calling external knowledge graph APIs incurs network round-trip latency and is not suitable for high-frequency queries. It is recommended to materialize inference results into a dedicated ADS layer table, updated in batch once per day or per learning phase.