Data Freshness

Data freshness measures the latency between when data is generated and when it becomes available for querying. For a weather forecast, an hourly refresh is sufficient; a nuclear power plant control system needs millisecond-level response. One hour is 3.6 million milliseconds, so freshness requirements across scenarios can differ by a factor of millions.

Setting "good enough" as the goal rather than always chasing the fastest is the starting point for designing data processing pipelines. There is only one criterion for "good enough": If data arrives N minutes late, does it cause a measurable business loss? If it doesn't, it's good enough. Changing a dashboard from updating every 5 seconds to every 5 minutes may leave decision quality unchanged — but the difference in system architecture and operational cost can be orders of magnitude.

Freshness Is Not a Switch — It's a Spectrum

"Real-time" and "offline" are simplified legacy labels that compress a continuous spectrum into two switches. The real world looks like this:

Freshness Level	Typical Interval	How Data Is Refreshed
Daily	T+1	Batch jobs run at night, full recomputation
Hourly	Every 1–4 hours	Scheduled jobs, incremental or full computation
Minute-level	Every 1–15 minutes	Detect data changes and refresh incrementally on demand
Second-level	1–30 seconds	Event-driven, continuous incremental refresh
Sub-second	< 1 second	Per-event persistent processing

The further down the table you go, the lower the latency and the higher the cost. The real engineering question is: which freshness level does your scenario actually need?
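As a rough illustration, the table can be turned into a lookup that maps a refresh interval to its freshness level. This is only a sketch: the function name `freshness_level` is hypothetical, and because the table leaves gaps between rows (for example between 30 seconds and 1 minute), the sketch assumes such intervals fall into the next coarser level.

```python
from datetime import timedelta

# Upper bounds come from the table; assigning the gaps between rows to the
# next coarser level is an assumption of this sketch, not part of the table.
UPPER_BOUNDS = [
    (timedelta(seconds=30), "Second-level"),
    (timedelta(minutes=15), "Minute-level"),
    (timedelta(hours=4), "Hourly"),
]


def freshness_level(interval: timedelta) -> str:
    """Return the freshness level that covers the given refresh interval."""
    if interval < timedelta(seconds=1):
        return "Sub-second"
    for upper, name in UPPER_BOUNDS:
        if interval <= upper:
            return name
    return "Daily"


print(freshness_level(timedelta(minutes=5)))  # Minute-level
```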

The Widespread Misuse of the "Real-Time" Label

Many scenarios labeled "real-time" don't actually require second-level or sub-second latency from a business perspective. An IDC 2025 industry survey found that 63% of data processing scenarios only need minute-level freshness to meet business requirements. For example, core report delivery can use minute-level as the standard (including fault recovery time); marketing activity analysis is explicitly positioned as "minute-level is sufficient"; and BI dashboards that refresh every 1 to 15 minutes do not degrade decision quality.

Behind this is a universal pattern: the consumers of analytical workloads are people. Reading dashboards, understanding trends, and making decisions takes minutes to hours in itself. Whether data is available in 5 minutes versus 5 seconds makes negligible difference to the final decision — but the difference in system architecture and operational cost is orders of magnitude.

In production environments, the bottleneck in end-to-end latency is often not the processing engine, but the collection frequency upstream and the consumption pace downstream. Extracting data from business systems has its own latency (CDC log polling, file arrival cycles), and after results are pushed to dashboards, users may only check them every half hour. Running the middle processing layer at second-level when upstream can't keep up and downstream doesn't use the result incurs extra cost with no corresponding value.

Consumption patterns also determine the effective upper bound on freshness. If the result goes into an emailed report that is read once a day, minute-level refresh is meaningless; T+1 is sufficient. If it feeds an automated decision API embedded in an operations system, the freshness requirement is set by the API call frequency: if the interface is called every 5 minutes, refreshing the data every 10 seconds adds no value. BI tool connection methods also affect freshness: direct queries retrieve the latest data each time, while with scheduled imports freshness is bounded by the import frequency. Freshness targets should be derived backward from the consumption end: first ask "how is the data ultimately used", then ask "how fast should the processing pipeline be".
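The API example can be checked with simple arithmetic. The sketch below assumes that a read only ever sees the most recent refresh, so every earlier refresh since the previous read goes unobserved. The function name and parameters are hypothetical.

```python
def unobserved_refresh_share(refresh_interval_s: float, read_interval_s: float) -> float:
    """Share of refreshes that no consumer read ever sees.

    Simplifying assumption: each read observes only the latest refresh.
    """
    if refresh_interval_s <= 0 or read_interval_s <= 0:
        raise ValueError("intervals must be positive")
    refreshes_per_read = read_interval_s / refresh_interval_s
    if refreshes_per_read <= 1:
        return 0.0  # data refreshes no faster than it is read
    return 1 - 1 / refreshes_per_read


# Interface called every 5 minutes, data refreshed every 10 seconds.
share = unobserved_refresh_share(refresh_interval_s=10, read_interval_s=300)
print(f"{share:.1%} of refreshes are never read")  # 96.7% of refreshes are never read
```

Under this assumption, 29 of every 30 refreshes produce a result that nothing consumes.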

Another commonly overlooked factor: stream processing persists in production more because of stateful semantics than because of latency. Exactly-once semantics, out-of-order event correction, complex event pattern matching — these are the truly irreplaceable capabilities of stream engines. Many streaming jobs exist to unify streaming and batch architectures and reuse resources, not because the business truly needs sub-second latency.

The survey figure cited above (63% of data processing scenarios only need minute-level freshness) is one signal; the engineering community is validating the same direction. Multiple engines on independent evolutionary paths are converging on the same goal: letting users declare "how fresh the data needs to be" and letting the engine decide whether to use batch or stream processing to achieve it. Flink's Materialized Table, Snowflake Dynamic Tables, and Databricks DLT (Delta Live Tables) are all doing the same thing. This is not a coincidence; it's the common answer that emerges when engineering constraints are pushed to a certain point.

Three Computational Paradigms: Incremental Computing Is Reshaping Both Ends Simultaneously

Incremental computing is not "a third option between batch and stream" — it is reshaping the landscape from both ends simultaneously, pulling batch processing and stream processing each toward the middle.

Batch Processing: Want to Be More Real-Time, But Don't Want to Learn Stream Processing

The core problem with batch processing is not that it computes incorrectly; it is the most accurate way to compute. The problem is that data freshness can't keep up with business needs. A dashboard that delivers T+1 data means users see yesterday's data: market changes this morning can't be analyzed until tomorrow.

The traditional way to solve this problem is to learn stream processing: build a stream processing pipeline, deploy a persistent cluster, and understand watermarks, windows, and triggers. But for data teams that primarily use SQL, this is like rebuilding the entire kitchen to cook one dish.

Incremental computing offers an alternative path: no need to learn stream processing, no need to change the toolchain. The same standard SQL, with REFRESH INTERVAL 5 MINUTE declared, changes the batch processing pipeline from T+1 to minute-level. For most batch processing scenarios that only need minute-level freshness, this is sufficient.

Notably, many teams don't start from stream processing at all; they start from daily batch processing. When the business starts feeling "yesterday's data isn't enough", the options often become "either accept the status quo or introduce an entire stream computing system". Introducing stream computing means a new language, new operations, and new team skills. What incremental computing offers is more than "cheaper than stream processing": it opens up the scenarios that were never built as real-time at all because the barrier was too high.

Stream Processing: Want to Be Simpler, But Don't Want to Lose Real-Time Capability

The core problem with stream processing is not "not fast enough" — it's fast enough. The problem is architectures designed for second-level latency being widely used for scenarios that only need minute-level.

Persistent (always-on) clusters occupy resources 24/7; state backend tuning requires dedicated teams; the effort of checkpoint and backpressure tuning grows non-linearly with the number of jobs. The operational burden is not uniformly heavy; it peaks in a particular scale range. Very large organizations can maintain dedicated Flink platform teams and co-locate workloads to reuse idle nighttime resources, which actually drives per-job costs down. Teams with very few jobs can absorb the operational complexity by hand. The most painful position is in the middle: job volume has grown beyond what ad hoc operations can handle, but the scale isn't sufficient to justify a dedicated stream computing platform team. Every new business line going live adds another layer to the operational burden.

The real consequence of this problem is not just "operations become harder": a new business line that only needs minute-level freshness, if required to bear the TCO and specialist requirements of a persistent streaming cluster, would rationally choose not to build it. Not because the scenario lacks value, but because the barrier blocks it. What incremental computing opens up are those scenarios that would never have been built in the first place.

Cloud-managed Flink alleviates the visible operational burden but doesn't eliminate the root cause: resources are still persistent, persistent means paying even during off-peak periods, and launching a new business line still requires asking "is it worth it?".

When a pipeline is simply "processing data from ODS to DWD for downstream dashboards" and a 3-minute delay is completely irrelevant — why pay the cost of a 24/7 persistent cluster?
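The question can be made concrete with back-of-the-envelope arithmetic. The sketch below compares compute-hours per day for an always-on cluster against refreshes that only consume resources while they run. The 20-second refresh duration is a made-up number, and the comparison assumes both options use the same cluster size and hourly rate; real ratios depend entirely on the workload.

```python
def daily_compute_hours(refresh_interval_min: float, seconds_per_refresh: float) -> float:
    """Compute-hours per day when resources are used only while a refresh runs."""
    refreshes_per_day = 24 * 60 / refresh_interval_min
    return refreshes_per_day * seconds_per_refresh / 3600


ALWAYS_ON_HOURS = 24.0  # a persistent cluster is occupied around the clock

# Hypothetical workload: refresh every 5 minutes, each refresh runs for 20 s.
on_demand = daily_compute_hours(refresh_interval_min=5, seconds_per_refresh=20)
print(f"on-demand: {on_demand:.1f} h/day, always-on: {ALWAYS_ON_HOURS:.0f} h/day")
print(f"ratio: {ALWAYS_ON_HOURS / on_demand:.0f}x")
```

With these illustrative inputs, 288 refreshes per day add up to 1.6 compute-hours against 24, a 15x gap. Shorter refreshes or longer intervals widen the gap further.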

Incremental computing offers a choice to return to simplicity: scenarios that truly need sub-second level (anti-fraud, real-time bidding, CEP) retain stream processing. Scenarios that just want data available as soon as possible use declarative SQL. No need to manage state, no need to understand checkpoints, no need to get up at night to handle backpressure — set the refresh frequency and let the engine do the work.

What Will Stay, What Will Migrate

Paradigm	Scenarios That Will Stay in Original Paradigm	Scenarios That Will Migrate to Incremental Computing
Batch processing	Month-end reconciliation, compliance auditing, scenarios requiring full-data validation	BI dashboards, operational analysis, marketing reports, daily ETL
Stream processing	Anti-fraud interception, RTB bidding, sub-second alerts, stateful CEP requiring exactly-once	CDC ingestion to data warehouse, feature engineering, real-time big screens, minute-level data warehouses

Incremental computing is not inserting a middle option between batch and stream — it is redefining both lines: batch processing is being pulled up to minute-level, and the portion of stream processing that was "only for minute-level" is being pulled down to declarative SQL. Both ends converge toward the middle. Switching from persistent computing to on-demand refresh is a change of orders of magnitude in cost, not a marginal optimization in percentages.

How Singdata Lakehouse Does It

Singdata Lakehouse implements tiered freshness through two core capabilities:

Data ingestion: Continuously ingest data through real-time sync tasks (database CDC) or Pipe (Kafka / object storage), with latency options ranging from seconds to minutes.

Dynamic Table: Define multi-level freshness using standard SQL. The REFRESH INTERVAL clause controls the refresh frequency; the engine adaptively chooses between full or incremental execution underneath. See Dynamic Table.

CREATE DYNAMIC TABLE dws_sales_dashboard
  REFRESH INTERVAL 5 MINUTE VCLUSTER DEFAULT
AS
  SELECT ...; -- Standard SQL, exactly the same as batch processing

REFRESH INTERVAL 5 MINUTE tells the engine "data needs to catch up within 5 minutes". On each refresh, the engine decides whether to use incremental or full execution; you don't need to manage it.
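To see why the choice between incremental and full execution can be left to the engine, it helps to note that both must produce the same result. The following is a conceptual sketch only, not Singdata's implementation: it uses a simple per-region sum over append-only rows (the easy case), and all names and data are hypothetical.

```python
from collections import Counter


def full_refresh(all_rows):
    """Recompute sales per region from every row seen so far."""
    totals = Counter()
    for region, amount in all_rows:
        totals[region] += amount
    return totals


def incremental_refresh(previous_totals, new_rows):
    """Fold only the rows that arrived since the last refresh into the old result."""
    totals = Counter(previous_totals)
    for region, amount in new_rows:
        totals[region] += amount
    return totals


history = [("north", 100), ("south", 40)]
delta = [("north", 25), ("west", 10)]  # rows that arrived in the last interval

state = full_refresh(history)
# Both execution modes agree; incremental only touched the two new rows.
assert incremental_refresh(state, delta) == full_refresh(history + delta)
```

Updates, deletes, joins, and non-additive aggregates make the incremental path considerably harder, which is one reason an engine may fall back to full execution.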

How to Determine What Freshness Level a Scenario Needs

First, and most critical: do you need computational semantics, or data freshness?

These two types of requirements are fundamentally different — they are not on the same dimension of fast vs. slow trade-offs.

If the business requires exactly-once guarantees, out-of-order event correction, or complex event pattern matching (such as "same account with 3 logins from different locations within 10 minutes") — that is a computational semantics problem. The core requirement of these scenarios is not "how quickly data is available" but "the computation process cannot have errors or missing data". Use stream processing; incremental computing is not a substitute.
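For a sense of what such a pattern rule looks like, here is a minimal sketch of the login example, reading it as "three distinct locations within any 10-minute window". It processes a finite, sorted list of events; a stream engine evaluates the same rule continuously and also has to cope with late and out-of-order events, which this sketch does not attempt. All names and data are hypothetical.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)
MIN_LOCATIONS = 3


def suspicious_accounts(events):
    """Return accounts that logged in from >= 3 distinct locations within 10 minutes.

    `events` is an iterable of (timestamp, account, location) tuples.
    """
    recent = defaultdict(deque)  # account -> logins still inside the window
    flagged = set()
    for ts, account, location in sorted(events):
        window = recent[account]
        window.append((ts, location))
        # Drop logins that fell out of the 10-minute window ending at `ts`.
        while ts - window[0][0] > WINDOW:
            window.popleft()
        if len({loc for _, loc in window}) >= MIN_LOCATIONS:
            flagged.add(account)
    return flagged


t0 = datetime(2025, 1, 1, 9, 0)
events = [
    (t0, "alice", "Berlin"),
    (t0 + timedelta(minutes=3), "alice", "Lagos"),
    (t0 + timedelta(minutes=7), "alice", "Lima"),
    (t0, "bob", "Oslo"),
    (t0 + timedelta(minutes=20), "bob", "Rome"),
    (t0 + timedelta(minutes=40), "bob", "Cairo"),
]
print(suspicious_accounts(events))  # {'alice'}
```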

If you simply want data to be available as soon as possible so dashboard numbers can catch up with recent events — that is a freshness problem. Use incremental computing, declare the refresh interval you need, and leave the rest to the engine.

After confirming it is a freshness problem, ask two more specific questions:

Who are the data consumers?

If they are people (dashboards, reports, BI analysis), minute-level is almost always sufficient. The pace of human decision-making determines the effective upper bound on data freshness. If the consumer is an automated decision system (anti-fraud interception, bidding engine), then second to sub-second level may need consideration — but at that point you are usually back to the computational semantics problem.

Can the upstream and downstream of the pipeline support how fast you want to go?

If the source data itself is a file produced hourly, processing the middle layer to second-level has no meaning. If results are pushed to a weekly report viewed once a week, second-level refresh has no value. Freshness is determined by the slowest link in the entire pipeline; accelerating only one segment in the middle is wasteful.
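The "slowest link" check is easy to sketch: list each stage's interval, find the largest, and see which stages run faster than it. The stage names and intervals below are hypothetical.

```python
def bottleneck(stage_intervals_s):
    """Return (stage, interval_seconds) for the slowest link in the pipeline."""
    return max(stage_intervals_s.items(), key=lambda item: item[1])


def faster_than_bottleneck(stage_intervals_s):
    """Stages that refresh more often than the slowest link.

    These stages cannot make data fresher than the bottleneck allows,
    so they are candidates for a more relaxed interval.
    """
    _, slowest = bottleneck(stage_intervals_s)
    return sorted(name for name, s in stage_intervals_s.items() if s < slowest)


pipeline = {
    "source file arrival": 3600,  # upstream produces one file per hour
    "processing refresh": 5,      # middle layer tuned to second-level
    "dashboard import": 900,      # BI tool imports every 15 minutes
}
print(bottleneck(pipeline))              # ('source file arrival', 3600)
print(faster_than_bottleneck(pipeline))  # ['dashboard import', 'processing refresh']
```

Here the 5-second processing layer sits behind an hourly source, so tightening it further buys nothing until the source itself speeds up.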

Freshness requirements within the same business are layered.

A data big screen requiring minute-level freshness, a trending leaderboard requiring hourly, and compliance reports requiring daily: these three requirements coexisting in the same business is the norm, not the exception. Using one stream processing architecture to cover all layers means paying the cost of the strictest requirement for every scenario. Layered configuration (a separate Dynamic Table per layer, each with its own refresh interval) is more economical than a one-size-fits-all approach and easier to maintain.
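A layered setup could be described as a small mapping from table to refresh interval and rendered into DDL that follows the statement shape shown earlier. This is a sketch under assumptions: the table names are invented, the query body is left elided, and only the MINUTE unit appears in this document, so whether values such as 60 or 1440 MINUTE are accepted (or whether hour and day units are preferred) should be checked against the Dynamic Table documentation.

```python
# Statement shape copied from the example earlier in this document.
TEMPLATE = (
    "CREATE DYNAMIC TABLE {name} "
    "REFRESH INTERVAL {minutes} MINUTE VCLUSTER DEFAULT AS SELECT ...;"
)

# Hypothetical table names, one per freshness layer.
layers = {
    "dws_big_screen": 5,               # minute-level
    "dws_trending_leaderboard": 60,    # hourly
    "dws_compliance_report": 24 * 60,  # daily
}

statements = [TEMPLATE.format(name=name, minutes=minutes) for name, minutes in layers.items()]
for statement in statements:
    print(statement)
```

Keeping the intervals in one place makes it easy to review whether each layer is paying only for the freshness it needs.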