Incremental Computing Mechanism
Incremental Computing is one of the core capabilities of the Singdata Lakehouse. It significantly reduces compute resource consumption by processing only the changed data since the last refresh, rather than recalculating all data each time, while maintaining exactly the same results as full computation.
Why Incremental Computing Is Needed
Before incremental computing emerged, data processing mainly had two modes:
| Mode | Characteristics | Issues |
|---|---|---|
| Batch Processing | Full recalculation each time, T+1 day-level delivery | As data volume grows, compute time and resources grow linearly, unable to meet near-real-time needs |
| Streaming Processing | Resident service, row-by-row processing, second-level delivery | Requires understanding of complex concepts like Watermark/Window/Trigger; state storage expands as windows grow; Retraction handling is difficult |
To simultaneously meet "low latency" and "low cost", enterprises were often forced to adopt the Lambda Architecture: one batch processing pipeline + one streaming processing pipeline, maintaining two sets of code, making data consistency difficult to guarantee.
The goal of incremental computing: Using a single standard SQL, covering both batch and streaming processing scenarios, eliminating the Lambda architecture and achieving the Kappa Architecture (one engine, full pipeline real-time).
Three Computing Paradigms Compared
| Dimension | Batch Computing | Streaming Computing | Incremental Computing |
|---|---|---|---|
| Design Goal | Throughput first | Freshness first | Flexible balance of throughput, freshness, and latency |
| Trigger Method | Active computation (scheduled triggers) | Passive computation (data-driven) | Active computation (supports scheduled + dependency-triggered + real-time triggered) |
| Compute Model | Full computation | Streaming computing model (a specialization of incremental) | General Incremental Computing model (GIC) |
| Storage Model | Static storage | Pipe pipeline data model (without full set) | Dynamic Data (supports full/incremental reads and writes) |
| Data Modeling | Standard dimensional modeling (ODS/DWD/ADS) | None (pipeline mode, intermediate data not queryable) | Standard dimensional modeling (supports layering, intermediate data queryable) |
| Language Expression | Standard SQL + UDF | Non-standard SQL (requires EventTime, Watermark, etc.) | Standard SQL + UDF |
| Version Management | Supports MVCC | Not supported | Supports MVCC (version-based recomputation possible) |
| Retraction Handling | No such concept | Requires state storage to hold full data within window | Unified as a form of incremental data (incremental subtraction) |
When to Choose Which?
- Choose Batch: One-time historical data loading, full repair, insensitive to latency (T+1 acceptable)
- Choose Streaming: Requires second-level response, involves complex time windows (sliding windows, session windows), out-of-order data processing
- Choose Incremental Computing: Data arrives in batches, wants minute/hour-level freshness, requires significantly lower resource costs than streaming computing, wants to develop with standard SQL
Core Concepts
Incremental computing is built on three fundamental data concepts:
1. Snapshot
A collection of all data in a dataset at a certain point in time.
2. Delta
A set of data change records generated between two snapshots of a dataset.
3. Change Data Capture (CDC)
A standard database concept used to capture data changes. Each row change is generated by one of three operations:
| Operation | Meaning | Example |
|---|---|---|
| Insert | Add a new row | +F |
| Update | Modify a row | -B(old) -> +B(new) |
| Delete | Remove a row | -C |
Relationship Between the Three
The system abstracts input data into Delta sets through CDC. Jobs sense and consume upstream Deltas, output their own Deltas downstream, forming a complete incremental Pipeline.
Incremental Computing Principles
MVCC Mechanism
The Lakehouse maintains multiple historical versions for each table. Each data change (INSERT/UPDATE/DELETE) generates a new version, while old versions remain accessible. Dynamic Tables precisely locate Delta data by recording the source table version checkpoint from the last refresh.
Incremental Algorithm
Different operators handle incremental data differently:
| Operator | Incremental Processing Method | Needs to Read Historical Data |
|---|---|---|
| Filter | Apply filter conditions only to incremental data | No |
| Project | Apply column pruning/computation only to incremental data | No |
| Join | Join incremental data with right table historical data | Yes, needs to read right table history |
| Aggregate | Merge incremental data with own historical aggregation results | Yes, needs to read own historical results |
Example: Coffee Sales Statistics
Suppose a coffee shop has two tables:
Business requirement: Calculate total sales for each type of coffee, excluding lattes.
Declarative Definition (user only needs to write this):
Internal Incremental Execution Process of the System:
The entire process only processes the 3 newly added rows, rather than re-scanning the entire orders table.
Scheduling Modes
Incremental computing supports three scheduling modes. The same SQL can seamlessly switch without modifying query logic:
1. Immediate Consumption on Data Arrival (Real-Time Triggered)
When a data change occurs on the source table, incremental computing is triggered immediately.
- Latency: Second-level
- Applicable: Scenarios requiring extremely high freshness, such as risk control and real-time monitoring
- Note: Higher compute overhead; query complexity should not be too high
2. Periodic Scheduling by Time Interval
Triggered periodically at preset intervals (minutes/hours/days).
- Latency: Depends on refresh interval
- Applicable: Most near-real-time scenarios (minute-level/hourly reports)
- Advantage: Can accumulate data for batch processing, leverage cluster elasticity, easier capacity planning
3. Dependency-Triggered Scheduling (DAG Cascading)
Downstream computation is automatically triggered after upstream tasks complete, forming a data pipeline.
- Applicable: Data warehouse layering scenarios where you want to monitor leaf node freshness with automatic upstream adaptation
- Advantage: Only need to configure leaf node scheduling; upstream is automatically arranged
Performance and Cost
Comparison with Streaming Computing
In typical scenarios, incremental computing saves resources compared to streaming computing engines like Apache Flink, mainly because:
- No Persistent State Storage: Streaming computing needs to maintain full state within windows + Checkpoints; incremental computing is based on MVCC, reading on demand
- Vectorized Execution Engine: Native Vector Engine is several to dozens of times more efficient than Java row-based processing
- Batch Join Optimization: Incremental Join is a left-table incremental rows single-pass Hash Join with the right table; the right table is read only once, utilizing storage cache
4 Factors Affecting Performance
| Factor | Impact | Description |
|---|---|---|
| Query Complexity | Higher means greater cost | Outer Join + large amounts of historical data connection may generate many retraction operations |
| Data Change Type | Append-Only has lowest cost | Update/Delete requires interaction with historical data, higher compute cost |
| Data Change Rate | Faster means greater cost | 1 million rows/sec new vs 10,000 rows/sec new, significant resource consumption difference |
| Scheduling Frequency | Higher means greater cost | Higher frequency increases system fixed overhead (Plan generation, resource allocation) ratio |
Cost-Based Optimization Framework
Singdata Lakehouse adopts Cost-based incremental optimization, not Rule-based:
- Rule-based: Fixed incremental execution plan when SQL is defined, cannot adapt to dynamic data changes
- Cost-based: During each refresh, dynamically selects the optimal execution plan (full or incremental, operator algorithm selection) based on comprehensive data statistics, cluster resources, and incremental data volume
The system will automatically fall back to full computation in the following situations:
- Source table historical versions exceed Time Travel retention period
- Incremental data volume is too large, full computation is actually better
- Dynamic Table SQL definition has changed
Applicable Scenarios
Incremental computing is the optimal choice when the business simultaneously meets the following conditions:
- Computation logic can be described with standard SQL or UDF
- Data arrives in batches, forming a continuous incremental dataset
- Wants minute/hour-level freshness while requiring significantly lower costs than streaming computing
- Does not involve complex time windows (sliding windows, session windows) or Watermark mechanisms
- Wants to progressively upgrade the data warehouse architecture, gradually verifying near-real-time effects
Inapplicable Scenarios
- Requires sub-second latency (consider streaming processing)
- Involves complex out-of-order data processing and Watermark mechanisms
- One-time full data loading (use INSERT INTO or COPY INTO directly)
Limitations
Non-Deterministic Functions
Scenarios involving random functions weaken the advantages of incremental computing because already-computed results are no longer stable, forcing existing data to be recomputed.
| Function | Impact | Workaround |
|---|---|---|
CURRENT_TIMESTAMP() | Results change on every refresh, approaching full recomputation | Avoid direct use in Dynamic Tables |
CURRENT_DATE() | Changes daily, stable within the day | Can be used in T+1 scenarios |
RAND() | Different results on every execution | Avoid using |
DATE_FORMAT(NOW(), 'yyyy-MM') | Stable within a month, changes across months | Acceptable, incremental computing works normally within a month |
Recommendation: If random functions can remain stable within a certain time range (e.g., by month/day), incremental computing can still work effectively.
Other Limitations
- When source table changes are too large, incremental computing may approach full computation load
- Some complex queries (e.g., multi-level nested subqueries + non-equi Join) may not be able to execute incrementally; the system will automatically fall back to full computation
Complete Example: User Behavior Log Analysis
Business Scenario
An internet company wants to understand user operations on different pages in real-time, associating operations with user basic information (age, gender, region) for precise analysis.
Data Sources:
- User behavior logs: Kafka real-time event stream
- User info dimension table: MySQL CDC -> Kafka
Step 1: Real-Time Log Data Loading
Step 2: Dimension Table Incremental Update
Step 3: Log and Dimension Table Join (DWD Layer)
Step 4: Aggregate Analysis (DWS Layer)
Architecture Summary
- Data Ingestion: Pipe automatically loads from Kafka in batches, supporting Append and CDC writes
- ETL Processing: Dynamic Table declarative definition, automatic incremental refresh
- Data Warehouse Layering: ODS -> DWD -> DWS, complete layering thinking
- No Manual Intervention: System automatically completes scheduling, incremental optimization, dependency propagation
FAQ
Q: What is the difference between incremental computing and materialized views?
| Feature | Dynamic Table (Incremental Computing) | Materialized View |
|---|---|---|
| Design Goal | Build multi-layer data pipelines | Transparently improve single-table query performance |
| Query Rewrite | Does not automatically rewrite queries; must explicitly reference | Optimizer automatically rewrites queries to use MV |
| Data Sources | Supports complex queries such as multi-table JOIN, UNION | Only supports single table |
| Refresh Method | Incremental refresh, only processes Delta | Full or incremental (limited) |
| Applicable Scenarios | ETL pipelines, data warehouse layering | Accelerating specific queries |
Q: How to confirm whether a refresh is incremental or full?
Check the refresh mode and duration in the output. If a particular refresh takes significantly longer than usual, a full refresh may have been triggered.
Q: What happens to downstream Dynamic Tables after a full refresh?
Downstream Dynamic Tables perceive the upstream full refresh and automatically adapt during their next refresh. If all upstream data changes, downstream may also execute a computation close to full.
Q: How to avoid unnecessary full refreshes?
- Ensure
data_retention_daysis long enough to avoid source table historical versions expiring - Avoid using non-deterministic functions in Dynamic Table SQL
- Before modifying SQL definitions, evaluate whether the change is truly needed
Q: Can incremental computing guarantee Exactly-Once semantics?
Yes. Based on MVCC version management, each refresh is based on a determined source table version checkpoint. After fault recovery, recomputation can start from the last checkpoint, ensuring results are consistent with full computation.
Related Documentation
- Dynamic Table Introduction -- Complete Dynamic Table concepts
- Create Dynamic Table -- DDL syntax and parameter descriptions
- View Refresh History -- Monitor refresh status and mode
- Time Travel -- MVCC version management fundamentals
- Pipe Continuous Ingestion -- External data real-time ingestion
- Table Stream -- Incremental data consumption and CDC output
