
You want Spark to process each record exactly once. That requires sources that return the same data when replayed and sinks that are idempotent, so writing the same data again leaves the result unchanged. Both matter for data integrity: you want neither duplicate records nor missing ones.
Here is a quick look at delivery semantics in streaming systems:
| Delivery Semantics | Description | 
|---|---|
| At most once | The app may see the tuple one time or not at all. This allows data loss but no extra copies. | 
| At least once | The app will always see the tuple, but there may be extra copies. | 
| Exactly once | The app sees and processes each tuple exactly one time, even across failures and retries. | 
To get exactly-once guarantees, you need the right setup for sources, sinks, and fault tolerance:
- Pick replayable sources such as Kafka or HDFS. Replayability lets you reprocess data after a failure without losing records.
- Turn on checkpointing and Write-Ahead Logs to persist your stream’s state, so Spark can restart from the last safe point without duplicating data.
- Use idempotent sinks so retries do not create duplicate records; writing the same data again leaves the result unchanged.
- Monitor your streaming jobs with tools like StreamingQueryListener to catch problems early and preserve exactly-once delivery.
- Test your streaming setup thoroughly to confirm each record is processed only once, with no loss and no duplicates.

You might ask whether Spark can really do exactly-once. Spark Structured Streaming can provide exactly-once guarantees for many workloads. It relies on checkpointing and Write-Ahead Logs to track the stream’s state and record what has already been processed. If the job fails, Spark can restart from the last checkpoint without losing or reprocessing data, which keeps your data complete and correct.
Here is a table that shows how Spark makes sure of exactly-once guarantees:
| Mechanism | Description | 
|---|---|
| Checkpointing | Saves offsets and state to a fault-tolerant checkpoint directory (e.g., on HDFS or cloud storage) so Spark can recover without repeating data. | 
| Input File Handling | Ensures every input file is read only once and skips files that are still being written. | 
| Metadata Usage | Records which files have already been processed, so only completed data is read in later batches. | 
Micro-batching also plays a key role. Spark splits the stream into small batches, and each batch has a clear start and end. This makes processing deterministic: if Spark has to rerun a batch, it produces the same result, so you can rely on the delivery semantics of your stream.
Note: foreachBatch on its own only gives at-least-once delivery. If you use foreachBatch, you need to add your own deduplication or idempotent-write logic to reach exactly-once, as sketched below.
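Here is a minimal sketch of one way to make a foreachBatch write idempotent. It assumes Delta Lake is on the classpath and that your Delta version supports the txnAppId/txnVersion idempotent-write options; the broker address, topic, and paths are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreachbatch-idempotent").getOrCreate()

# Placeholder Kafka broker, topic, and Delta paths -- adjust for your environment.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

def write_batch(batch_df, batch_id):
    # Delta's txnAppId/txnVersion options make the write idempotent: if the same
    # (appId, version) pair is written twice, the second attempt is skipped.
    (batch_df.write
        .format("delta")
        .mode("append")
        .option("txnAppId", "events-ingest")  # stable ID for this writer
        .option("txnVersion", batch_id)       # the micro-batch ID acts as the version
        .save("/delta/events"))

query = (
    events.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/chk/events-ingest")
    .start()
)
```

Because the micro-batch ID is stable across retries, a replayed batch carries the same txnVersion and the duplicate write is skipped.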
You must follow some rules to get exactly-once in Spark. You need sources that let you replay data. You must turn on checkpointing and Write-Ahead Logs. These save your stream’s state and help you fix problems if Spark fails. You also need idempotent sinks. If you write the same data two times, the result should not change. This setup gives you strong delivery rules.
Spark’s micro-batching draws clear batch boundaries. The driver records its progress (the offsets) before each batch runs, so Spark can re-execute a batch and still preserve exactly-once semantics. You see the same ideas in other streaming systems; for example, some data stacks combine Amazon Kinesis and Lambda to control delivery and achieve exactly-once processing.
You might ask what exactly-once semantics means. It is a promise that each piece of data gets handled only one time. If something fails, the system does not skip or repeat data. Spark keeps track of your data’s progress. It uses checkpoints and write-ahead logs to remember where it stopped. If there is a problem, Spark can restart from the last safe place. You do not lose any data, and you do not get extra copies. Idempotent sinks help keep your results right, even if Spark writes the same data again.
Tip: Exactly-once semantics lets you trust your results. Your data stays complete and correct, even if things go wrong.
Streaming systems have three main types of processing guarantees. Each type handles data delivery in its own way:
- At-most-once: The system may drop messages. You might miss some data, but you never see the same message twice.
- At-least-once: The system delivers every message, but you might see some messages more than once.
- Exactly-once: The system delivers each message only one time.
Here is a table that shows how they are different:
| Processing Guarantee | Description | Characteristics | Challenges | 
|---|---|---|---|
| At-Most-Once | Messages can be lost; delivery is not promised. | Messages are short-lived; missed messages are gone. | No delivery check; risk of losing data. | 
| At-Least-Once | Delivery is promised but you may get duplicate messages. | Messages stay until confirmed; messages may be sent again. | Must deal with duplicates; more work for the system. | 
| Exactly-Once | Aims for each message to be delivered and applied exactly once, which is hard to achieve. | Delivery and confirmation are coordinated; all steps are tracked. | Often requires two-phase commit or transactional coordination, which adds complexity. | 
Spark gives you at-least-once processing by default. You can reach exactly-once semantics with extra setup, such as replayable sources and idempotent sinks. Other systems, such as Apache Flink, advertise exactly-once as a built-in default, though they too depend on compatible sources and sinks. Spark gives you the choice, but you must configure it correctly to avoid missing or duplicated data.
You need reliable sources for exactly-once delivery in Spark. Reliable sources let you replay data if something fails, and fault-tolerant file systems like HDFS help Spark recover and process records only once. Kafka is also a good choice: the Kafka Direct API in Spark can deliver each record exactly once, provided the output side is also set up for exactly-once.
Replayable sources matter because you can track your read position and reprocess records after a failure. Apache Kafka and Amazon Kinesis support this. Non-replayable sources, like MemoryStream, cannot recover data once the application stops, so pick sources that let you replay records to keep your data safe.
Replayable sources like Kafka and idempotent sinks like Delta Lake are needed for exactly-once processing.
Checkpointing and write-ahead logs (WAL) help you retry without making duplicates.
The WAL works like a journal and tracks each micro-batch so Spark can safely reprocess records if something goes wrong.
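As a sketch, this is how a replayable Kafka source is typically wired up in Structured Streaming; the broker address and topic name are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-replayable-source").getOrCreate()

# Kafka is replayable: Spark records the offsets it has read in the checkpoint
# and can re-read the same offset range after a failure.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")                     # placeholder topic
    .option("startingOffsets", "earliest")  # where to begin when no checkpoint exists yet
    .option("failOnDataLoss", "true")       # fail loudly if offsets were removed by retention
    .load()
)

# Kafka delivers key and value as binary; cast them for downstream processing.
events = raw.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
```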
Checkpointing is central to exactly-once guarantees. You save your streaming job’s progress to durable storage, such as HDFS or cloud object storage, and regular checkpointing lets Spark recover from failures without losing data. The storage you choose affects how quickly and reliably Spark can recover.
For the legacy receiver-based Spark Streaming (DStreams) API, you enable Write-Ahead Logs (WAL) by setting spark.streaming.receiver.writeAheadLog.enable. The WAL persists incoming data before it is processed, so Spark can recover it after a failure. You can skip in-memory replication when the WAL is on, because the data is already stored durably, and a reliable receiver should acknowledge records only after they have been written to the WAL.
- Use metadata checkpoints to save your streaming app’s configuration and context, so Spark can recover after a driver failure.
- Data checkpoints store RDDs without their lineage (dependency) information, which saves space compared with caching.
- Checkpointing more often speeds up recovery but adds overhead, so choose an interval that balances the two.
Checkpointing lets Spark restart from the last safe point and resume from the latest saved state, so records are processed only once even after a failure.
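Below is a minimal sketch showing both settings together: the (legacy, receiver-oriented) WAL flag on the session and a checkpoint location on the query. It uses the built-in rate source and a Parquet file sink so it runs without external systems; the paths are placeholders.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("checkpointing-example")
    # This WAL flag applies to the legacy receiver-based DStreams API; Structured
    # Streaming keeps its own offset and commit logs inside the checkpoint directory.
    .config("spark.streaming.receiver.writeAheadLog.enable", "true")
    .getOrCreate()
)

# The built-in "rate" source generates rows, so this sketch runs without external systems.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

query = (
    stream.writeStream
    .format("parquet")
    .outputMode("append")
    .option("path", "/tmp/exactly_once_demo/data")               # output directory (placeholder)
    .option("checkpointLocation", "/tmp/exactly_once_demo/chk")  # durable checkpoint directory
    .start()
)

# Restarting the job with the same checkpointLocation resumes from the last
# committed batch instead of reprocessing or losing data.
```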
Idempotent sinks are needed for exactly-once delivery. These sinks let you write the same records again and the result stays the same. If Spark retries a batch, idempotent sinks stop duplicate records. You need sinks that support idempotent actions to keep your records right.
| Characteristic | Description | 
|---|---|
| Idempotent Operations | Handle reprocessing and keep data safe during failures. | 
| Replayable Sources | Help recover from problems without making extra records. | 
| Exactly-Once Processing | Make sure you process each record only once, even if something fails. | 
Idempotent sinks help with reprocessing and keep data safe.
Replayable sources are needed for recovery without making duplicates.
Both are needed for exactly-once processing in Spark Structured Streaming.
To get exactly-once, you must use transactional sinks. The source must be replayable. The sink must support idempotent actions so you can reprocess records if something fails.
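One common way to build an idempotent sink is a MERGE keyed on a unique event ID. The sketch below assumes Delta Lake is available and uses a hypothetical event_id key derived from the Kafka message key; the broker, topic, and paths are placeholders.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable  # requires the Delta Lake package

spark = SparkSession.builder.appName("idempotent-sink").getOrCreate()

# Placeholder Kafka source; event_id is assumed to uniquely identify each event.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(key AS STRING) AS event_id", "CAST(value AS STRING) AS payload")
)

def upsert_batch(batch_df, batch_id):
    # MERGE keyed on event_id: replaying the same batch updates the same rows
    # instead of appending duplicates, which makes the write idempotent.
    deduped = batch_df.dropDuplicates(["event_id"])
    target = DeltaTable.forPath(spark, "/delta/events")
    (target.alias("t")
        .merge(deduped.alias("s"), "t.event_id = s.event_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

query = (
    events.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/chk/idempotent-sink")
    .start()
)
```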
You have different choices for sinks that support exactly-once guarantees. Transactional sinks are the best. Streaming from the change data capture (CDC) feed of a Delta table is the top choice for exactly-once guarantees. Streaming straight from the Delta table is not as good.
| Recommended Option | Description | 
|---|---|
| CDC Feed | Streaming from the change data capture feed of a Delta table is best for exactly-once guarantees. | 
| Delta Table | Streaming straight from the Delta table is not as good. | 
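A sketch of reading the change data feed of a Delta table, assuming change data feed is enabled on the source table; the table name and paths are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-cdf-source").getOrCreate()

# Reads the change data capture feed of a Delta table. The source table must have
# the table property delta.enableChangeDataFeed = true; "events" is a placeholder name.
changes = (
    spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .table("events")
)

# Each change row carries _change_type, _commit_version, and _commit_timestamp
# columns, which downstream logic can use to apply the changes idempotently.
query = (
    changes.writeStream
    .format("console")
    .option("checkpointLocation", "/chk/events-cdf")
    .start()
)
```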
Transactional sinks such as databases, HDFS, S3, and Kafka help you store data safely. Note that Kafka does not give exactly-once by default; it provides at-least-once delivery. You can avoid duplicate records by giving each message a unique key and writing it to a keyed store. For example, writing messages to HBase with unique row keys means a retried message with the same key simply overwrites the earlier copy.
You can use transactional writes to save data and stop duplicates. This helps you keep only unique records in your sink.
For more about exactly-once with Kafka and S3, you can find guides from Confluent and other sources.
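The unique-key idea also applies when the sink is Kafka itself. The sketch below derives a deterministic key per record so that a compacted topic or a keyed store downstream keeps only one copy; the broker and topic names are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("keyed-kafka-sink").getOrCreate()

# The rate source stands in for a real stream of events.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Derive a deterministic key from the record's content. If Spark retries a batch,
# the duplicates carry the same key, so a compacted Kafka topic or a keyed store
# (such as HBase row keys) keeps only one copy per key.
keyed = stream.select(
    F.sha2(F.concat_ws("|", F.col("timestamp").cast("string"),
                       F.col("value").cast("string")), 256).alias("key"),
    F.col("value").cast("string").alias("value"),
)

query = (
    keyed.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("topic", "events-keyed")                   # placeholder topic
    .option("checkpointLocation", "/chk/keyed-kafka-sink")
    .start()
)
```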
You might run into problems when setting up exactly-once delivery. Here are some tips to help:
- Use checkpointing to save your streaming job’s progress in fault-tolerant storage.
- Use Write-Ahead Logs (WAL) to persist incoming data before processing, so you can recover after a failure.
- Use idempotent writes so that re-writing the same batch does not create duplicates.
If you see duplicate records, check your sink setup. Make sure your sinks support transactional commits and idempotent writes, and always test your streaming jobs to confirm each record is processed only once.

You might run into trouble with failure recovery in Spark Structured Streaming. If checkpointing is misconfigured or the checkpoint storage is unreliable, you can lose data or produce duplicates. Spark assigns each partition an ID and tracks an epoch for every micro-batch, which helps you detect and discard duplicate records after a failure. Spark’s lineage graph lets it replay steps, so you get the same results even after a crash.
Some common mistakes in recovery are:
- Ignoring event time, which leads to late or out-of-order events being handled incorrectly.
- Misconfiguring watermarks, which can drop valid data or hold back results.
- Under-provisioning resources, which slows processing and recovery.
You can make recovery more robust by using transactional sinks and keeping a separate tracking table that records the last partition and epoch you committed. Insert records into the main table only after checking the tracking table, and do both steps in a single transaction, as in the sketch below.
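A sketch of that pattern, simplified to track the micro-batch ID rather than partition and epoch, using a hypothetical PostgreSQL database via psycopg2; the connection string and the events and batch_tracker table names are assumptions.

```python
import psycopg2  # any transactional database works; PostgreSQL is just an example
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transactional-sink").getOrCreate()

# The rate source stands in for a real stream; collect() below is only safe for small batches.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

def write_transactionally(batch_df, batch_id):
    rows = batch_df.collect()
    conn = psycopg2.connect("dbname=analytics user=etl host=db")  # placeholder connection string
    try:
        with conn:  # one transaction: committed on success, rolled back on error
            with conn.cursor() as cur:
                # Skip batches that were already committed before a restart.
                cur.execute("SELECT 1 FROM batch_tracker WHERE batch_id = %s", (batch_id,))
                if cur.fetchone():
                    return
                cur.executemany(
                    "INSERT INTO events (event_time, value) VALUES (%s, %s)",
                    [(r["timestamp"], r["value"]) for r in rows],
                )
                cur.execute("INSERT INTO batch_tracker (batch_id) VALUES (%s)", (batch_id,))
    finally:
        conn.close()

query = (
    stream.writeStream
    .foreachBatch(write_transactionally)
    .option("checkpointLocation", "/chk/transactional-sink")
    .start()
)
```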
Duplicate writes can happen when a stream job is not set up correctly. For example, you might get duplicate records from a Kafka topic when you expect a single update. Late data and the wrong output mode can also cause this: pick the wrong output mode and you may write extra data.
You can remove duplicate records in a stream by assigning each event a unique ID, much like deduplicating static data. The query keeps just enough state from earlier records to filter out repeats, and you can do this with or without watermarking, as shown below.
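A sketch of watermarked deduplication; the rate source and the eventId column stand in for a real stream with a unique event ID.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-dedup").getOrCreate()

# The rate source stands in for a real stream; treat "value" as a hypothetical event ID.
events = (
    spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    .select(F.col("value").alias("eventId"), F.col("timestamp").alias("eventTime"))
)

# The watermark bounds the deduplication state: Spark only remembers IDs for
# events newer than (max event time seen - 10 minutes).
deduped = (
    events
    .withWatermark("eventTime", "10 minutes")
    .dropDuplicates(["eventId", "eventTime"])
)

query = (
    deduped.writeStream
    .format("console")
    .option("checkpointLocation", "/chk/stream-dedup")
    .start()
)
```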
Other common causes of duplicate writes and related problems are:
- Not managing Kafka offsets correctly.
- Not unpersisting cached DataFrames, which wastes memory and can destabilize long-running jobs.
- Overly complex streaming queries, which are hard to debug and fix.
Monitoring helps you find problems early and protect exactly-once delivery. Spark provides StreamingQueryListener, which lets you track how streaming queries are progressing. Correlate Spark cluster logs with logs from other systems to find problems faster; tools like Elasticsearch or Splunk help you collect logs in one place.
Good monitoring practices include:
- Using checkpointing to save offsets and state for recovery.
- Picking idempotent sinks, like Delta Lake or Kafka 0.11+, for safe retries.
- Setting up alerts with Spark’s metrics and tools like Prometheus and Grafana.
- Watching metrics through Spark’s REST API to spot problems quickly.
A solid monitoring setup keeps your stream healthy and your delivery reliable.
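A sketch of a custom listener, assuming a PySpark version recent enough to expose StreamingQueryListener in Python (the listener API has long existed in Scala).

```python
from pyspark.sql import SparkSession
from pyspark.sql.streaming import StreamingQueryListener

spark = SparkSession.builder.appName("monitoring-example").getOrCreate()

class ExactlyOnceMonitor(StreamingQueryListener):
    """Logs batch progress so you can alert on stalls or a growing input backlog."""

    def onQueryStarted(self, event):
        print(f"Query started: id={event.id} runId={event.runId}")

    def onQueryProgress(self, event):
        p = event.progress
        print(f"batch={p.batchId} inputRows={p.numInputRows} durations={p.durationMs}")

    def onQueryIdle(self, event):
        pass  # emitted by newer Spark versions when no new data arrives

    def onQueryTerminated(self, event):
        print(f"Query terminated: id={event.id} exception={event.exception}")

spark.streams.addListener(ExactlyOnceMonitor())

# Any streaming query started on this session now reports to the listener.
demo = (
    spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    .writeStream.format("console")
    .option("checkpointLocation", "/chk/monitoring-example")
    .start()
)
```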
You can get exactly-once delivery in Spark Structured Streaming if you follow some important steps. First, use sources that let you replay data. Make sure your receivers are reliable. Use the Write-Ahead Log and checkpoints to keep your work safe. Pick idempotent sinks so you do not get duplicate records. Set things up carefully and watch your stream closely to find problems early. Check your sources and sinks often. Test your streaming jobs to make sure your delivery works well.
Checklist for Exactly-Once Processing:
- Use replayable sources (like Kafka or Event Hubs)
- Enable reliable receivers
- Turn on the Write-Ahead Log
- Set up driver checkpoints
- Choose idempotent sinks
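A minimal end-to-end sketch that ties the checklist together, assuming Kafka as the replayable source and Delta Lake as the idempotent sink; the broker, topic, and paths are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("exactly-once-pipeline").getOrCreate()

# Replayable source: Kafka offsets are tracked in the checkpoint, so failed
# batches can be re-read deterministically. Broker and topic are placeholders.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .option("failOnDataLoss", "true")
    .load()
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
)

# Transactional, idempotent sink (Delta) plus a durable checkpoint directory:
# restarting with the same checkpointLocation resumes from the last committed batch.
query = (
    events.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/chk/exactly-once-pipeline")
    .start("/delta/events_clean")
)

query.awaitTermination()
```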
An idempotent sink lets you write the same data again. The result does not change if you write it twice. This stops duplicate records when Spark retries a batch. Your data stays clean and correct.
You can use Kafka for exactly-once processing if you set up transactional writes. You also need idempotent consumers. You must manage offsets carefully. This helps you avoid duplicate messages.
Checkpointing saves your streaming job’s progress. If Spark fails, you can restart from the last checkpoint. You do not lose data. You do not process records twice. This keeps your stream reliable.
If you do not use replayable sources, you cannot get lost data back after a failure. You might miss records or get duplicates. Always pick sources that let you replay data for safe processing.
You can use StreamingQueryListener to watch your stream. Set up alerts with tools like Prometheus. Track metrics and logs to catch problems early. Good monitoring helps keep your data safe.