Batch processing and stream processing are two different approaches to data processing. Batch processing involves collecting data over time and processing it in large chunks at scheduled intervals. Stream processing, on the other hand, processes data in real time as it arrives. In this blog post, we’ll illustrate both approaches with examples.
| | Batch Processing | Stream Processing |
|---|---|---|
| Implementation difficulty | Traditionally easier to implement and to handle failures, since there is built-in slack time | More complex; often requires specialized infrastructure, and failure handling and recovery are harder |
| Latency | Higher latency (minutes to hours) | Low latency (seconds or less) |
| Cost | Generally more cost-effective, especially since batch jobs can run during off-peak hours | Can be more expensive due to constant processing |
## What is batch processing?
Batch processing involves collecting data over time and processing it in large chunks at scheduled intervals. This method is ideal for handling large volumes of data where immediate results are not critical.
### Examples
- Web scraping scripts that collect data from PDF files on a fixed hourly schedule
- Processing new files from a file store every 10 minutes or hour (see the sketch after this list)
- Scheduled reading of messages from a queue (e.g., every 10 minutes)
- Manually initiated processes for handling accumulated data
- Hourly data transfers from OLTP storage or NoSQL databases to data lakes/warehouses
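To make the file-store example concrete, here’s a minimal sketch of a batch job that wakes up every 10 minutes and processes whatever new files have accumulated since the last run. The directory path and `process_file` body are hypothetical placeholders.

```python
import time
from pathlib import Path

INCOMING = Path("/data/incoming")   # hypothetical file store
BATCH_INTERVAL_SECONDS = 600        # run every 10 minutes

def process_file(path: Path) -> None:
    # Placeholder: parse, transform, and load the file somewhere.
    print(f"processed {path.name}")

seen: set[Path] = set()
while True:
    # Collect everything that arrived since the last run, then
    # process it as one discrete chunk -- the essence of batching.
    for path in sorted(p for p in INCOMING.glob("*") if p not in seen):
        process_file(path)
        seen.add(path)
    time.sleep(BATCH_INTERVAL_SECONDS)
```

Note that if the job crashes mid-run, the next run simply picks up the unprocessed files; this built-in slack is a big part of why failure handling is easier in batch systems.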
### Technologies
Batch processing is traditionally how data warehouses and ETL pipelines operate, going back to the early days of “big data” and Hadoop. Many popular orchestrators and frameworks were therefore built in the batch processing paradigm.
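One widely used orchestrator is Apache Airflow, which expresses batch pipelines as DAGs of tasks run on a schedule. As a rough sketch of the hourly OLTP-to-warehouse transfer from the examples above (the DAG id and transfer logic are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def transfer_to_warehouse():
    # Placeholder: extract the last hour of rows from the OLTP
    # database and load them into the warehouse.
    ...

with DAG(
    dag_id="hourly_oltp_to_warehouse",  # hypothetical pipeline name
    schedule_interval="@hourly",        # the batch cadence
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(
        task_id="transfer",
        python_callable=transfer_to_warehouse,
    )
```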
## What is stream processing?
Stream processing, in contrast, handles each data item in real-time as it arrives. This method is perfect for scenarios requiring immediate data processing and analysis. Anything labeled “real-time” or “live” likely involves stream processing.
It’s important to note that stream processing isn’t just batch processing at short intervals. Micro-batching still processes data in discrete chunks, even if those chunks are small; true stream processing handles each event individually as it arrives.
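The difference is easiest to see in code. Here’s a toy sketch contrasting the two shapes; the handlers are hypothetical stand-ins for real processing logic:

```python
def process_batch(events):
    print(f"processed {len(events)} events together")   # hypothetical chunk handler

def process_event(event):
    print(f"processed {event} immediately")             # hypothetical per-event handler

def micro_batch(source, batch_size=100):
    """Still batch processing: results only appear at chunk boundaries."""
    batch = []
    for event in source:
        batch.append(event)
        if len(batch) == batch_size:
            process_batch(batch)   # latency = time to fill the chunk
            batch = []
    if batch:
        process_batch(batch)       # flush the final partial chunk

def stream(source):
    """Stream processing: each event is handled the moment it arrives."""
    for event in source:
        process_event(event)       # per-event latency
```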
### Examples
- Real-time event processing through APIs and message queues
- Continuous data processing from IoT devices like cars or weather stations
- Instant processing of social media posts for immediate timeline updates
- Trigger-based processing when new data appears (e.g., file uploads; see the sketch after this list)
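To make the trigger-based example concrete, here’s a minimal sketch using the watchdog library, which reacts to new files the moment they land rather than polling on a schedule. The upload directory and handler body are hypothetical:

```python
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer  # pip install watchdog

class UploadHandler(FileSystemEventHandler):
    def on_created(self, event):
        # Fires as soon as a new file appears -- no polling interval.
        if not event.is_directory:
            print(f"processing upload: {event.src_path}")

observer = Observer()
observer.schedule(UploadHandler(), "/data/uploads", recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)  # keep the main thread alive while the observer runs
finally:
    observer.stop()
    observer.join()
```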
### Technologies
Traditionally, stream processing has been more complex to implement than batch processing: one has to deal with state management, consistency, and exactly-once processing of each event. A number of frameworks have been developed to make this easier.
#### Message brokers
Message brokers like Apache Kafka and Amazon SQS (Simple Queue Service) play a crucial role in enabling stream processing. They ingest and buffer streaming data, ensure reliable delivery of messages/events, and allow multiple consumers to read from the same stream of data.
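As a sketch of what consuming from a broker looks like, here’s a minimal loop using the kafka-python client. The broker address, topic name, and consumer group are assumptions for illustration:

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Assumes a broker at localhost:9092 and a hypothetical "events" topic.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="example-consumers",   # lets multiple consumers share the stream
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    # Each event is handled as soon as the broker delivers it.
    print(f"offset={message.offset} value={message.value}")
```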
#### Streaming frameworks
Streaming frameworks like Apache Flink and Kafka Streams provide the tools to process and analyze streaming data in real-time. They offer features like windowing, state management, and fault tolerance to handle the complexities of stream processing.
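To give a feel for what windowing means, here’s a toy tumbling-window counter in plain Python; real frameworks layer event-time semantics, watermarks, durable state, and fault tolerance on top of this basic idea:

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # fixed, non-overlapping ("tumbling") windows

def tumbling_window_counts(events):
    """events: iterable of (timestamp, key) pairs in arrival order."""
    window_start = None
    counts = defaultdict(int)
    for ts, key in events:
        if window_start is None:
            window_start = ts
        if ts - window_start >= WINDOW_SECONDS:
            yield window_start, dict(counts)  # close and emit the window
            window_start = ts
            counts = defaultdict(int)
        counts[key] += 1
    if counts:
        yield window_start, dict(counts)      # emit the final open window

# Example: three events in the first minute, one in the next.
for start, per_key in tumbling_window_counts(
    [(0, "click"), (10, "click"), (30, "view"), (70, "click")]
):
    print(start, per_key)
```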
#### Streaming databases
Streaming databases are a new category of databases designed to handle real-time data processing and analytics. They provide low-latency access to streaming data, enabling real-time dashboards, analytics, and decision-making.
## Use cases
- Feature engineering for recommender systems: batch processing (models can usually tolerate training on slightly lagged data)
- Fraud detection: stream processing (you can’t wait a day to catch fraud)
- BI / analytics: batch processing (easier to manage, and a day of data lag is generally acceptable)
- Trading desks: stream processing (traders need to react to real-time news and market data)