5 min read·Lesson 2 of 10

Batch vs Streaming

Pick the right processing pattern: scheduled batch, micro-batch, or true streaming. Trade-offs of latency, complexity, cost, and correctness.

The first design choice in any data pipeline is its temporal mode. Get this wrong and the rest is harder than it needs to be.

The Three Modes

Mode | Latency | Tools | Best for
Batch | hours / daily | Airflow, dbt, Spark, BigQuery scheduled queries | Reporting, ML training, end-of-day reconciliation
Micro-batch | 1–10 minutes | Spark Structured Streaming, Delta Live Tables, Snowflake Dynamic Tables | Near-real-time dashboards, slowly changing dimensions
Streaming | sub-second to seconds | Kafka Streams, Flink, Beam / Dataflow, Kinesis Analytics | Fraud, trading, IoT alerts, real-time personalisation

Why Batch Is Still the Default

  • Idempotent re-runs are simple — the dataset for "yesterday" is well-defined.
  • Failures are easy to recover: rerun the partition.
  • Cost is predictable: one job, one window.
  • Tooling (Airflow + dbt + warehouse) is mature and well-understood.
  • Most business decisions are made on data hours or a day old.

Default to batch unless a downstream use case clearly needs faster than that.
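
As a minimal sketch of why re-runs are simple, the job below rewrites one whole date partition each time it runs, so running it twice for the same day leaves the same single file behind. The directory layout (`dt=YYYY-MM-DD`), the file name, and the `ts` field are assumptions made for illustration, not a convention of any particular tool.

```python
import json
import tempfile
from pathlib import Path


def write_partition(root: Path, day: str, rows: list[dict]) -> Path:
    """Replace the whole partition for `day` (hypothetical dt=<day> layout)."""
    part_dir = root / f"dt={day}"
    part_dir.mkdir(parents=True, exist_ok=True)
    target = part_dir / "part-0000.json"
    tmp = part_dir / "part-0000.json.tmp"
    # Write to a temp file first, then rename over the target, so readers
    # never see a half-written partition.
    tmp.write_text(json.dumps(rows, sort_keys=True))
    tmp.replace(target)
    return target


def run_daily_job(root: Path, day: str, source: list[dict]) -> Path:
    # "Yesterday" is well-defined: keep only rows whose timestamp is on `day`.
    rows = [r for r in source if r["ts"].startswith(day)]
    return write_partition(root, day, rows)


if __name__ == "__main__":
    events = [
        {"ts": "2024-05-01T10:00:00", "amount": 10},
        {"ts": "2024-05-02T09:00:00", "amount": 20},
    ]
    with tempfile.TemporaryDirectory() as d:
        run_daily_job(Path(d), "2024-05-01", events)
        # Rerunning after a failure simply overwrites the same partition.
        path = run_daily_job(Path(d), "2024-05-01", events)
        print(path.read_text())
```

Because the output is keyed by the date being processed rather than by the time the job ran, recovery is just "rerun the partition".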

When Streaming Pays Off

  • Fraud detection — block a transaction now, not tomorrow.
  • Trading and pricing — milliseconds matter.
  • IoT alerts — temperature, pressure, equipment failure.
  • Real-time personalisation — react to a click within seconds.
  • Operational analytics — supply-chain, logistics, on-call dashboards.
  • CDC pipelines — propagate database changes to other systems with low lag.

If "minutes late" is acceptable, prefer micro-batch. True streaming is for cases where humans or automated systems will act in real time.

The Hard Parts of Streaming

Time Semantics

  • Event time — when the event actually happened.
  • Processing time — when your system saw it.
  • Late, out-of-order, and duplicate events are the rule, not the exception.
  • Watermarks define how long to wait before declaring a window "closed".
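
The interplay of event time, out-of-order arrival, and watermarks can be sketched in a few lines of plain Python. This is a toy model, not how any specific engine implements it: the class name, the "max event time seen minus allowed lateness" watermark rule, and the choice to drop (rather than side-output) late events are all assumptions.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per event-time window and closes windows via a watermark."""

    def __init__(self, window_s: int, allowed_lateness_s: int):
        self.window_s = window_s
        self.allowed_lateness_s = allowed_lateness_s
        self.open = defaultdict(int)  # window start -> running count
        self.max_event_time = 0
        self.dropped_late = 0

    def watermark(self) -> int:
        # Assumed heuristic: we believe no event older than this will arrive.
        return self.max_event_time - self.allowed_lateness_s

    def on_event(self, event_time: int) -> list[tuple[int, int]]:
        """Ingest one event; return (window_start, count) for windows it closes."""
        start = event_time - event_time % self.window_s
        if start + self.window_s <= self.watermark():
            # The event's window was already declared closed: too late.
            self.dropped_late += 1
            return []
        self.open[start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        wm = self.watermark()
        closed = sorted(s for s in self.open if s + self.window_s <= wm)
        return [(s, self.open.pop(s)) for s in closed]


if __name__ == "__main__":
    counter = TumblingWindowCounter(window_s=60, allowed_lateness_s=10)
    for t in [5, 50, 30, 65, 71, 20]:  # event times in arrival order
        print(t, counter.on_event(t))
    print("dropped:", counter.dropped_late)
```

The event at time 30 arrives out of order but is still counted, because its window is open; the event at time 20 arrives after the watermark has passed the end of its window and is dropped. A larger allowed lateness trades latency for completeness.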

Delivery Semantics

  • At most once — messages may be lost. Rarely acceptable.
  • At least once — duplicates possible; consumers must be idempotent.
  • Exactly once — engine and sinks coordinate so that each event's effect lands exactly once. Achievable with Flink plus transactional sinks, or with Kafka's exactly-once semantics (EOS), but it adds latency and operational overhead.
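
To make "consumers must be idempotent" concrete, here is a hedged sketch of a sink that tolerates at-least-once delivery by remembering which event IDs it has already applied. The event shape (`id`, `account`, `amount`) and the in-memory set are assumptions; a real sink would need to persist the seen IDs and the state change atomically, for example in one database transaction.

```python
class IdempotentSink:
    """Applies each event at most once, keyed by a unique event id."""

    def __init__(self):
        self.balances: dict[str, int] = {}
        self.seen_ids: set[str] = set()

    def apply(self, event: dict) -> bool:
        """Return True if the event changed state, False if it was a duplicate."""
        if event["id"] in self.seen_ids:
            return False  # redelivery: safe to acknowledge and ignore
        account = event["account"]
        self.balances[account] = self.balances.get(account, 0) + event["amount"]
        self.seen_ids.add(event["id"])
        return True


if __name__ == "__main__":
    sink = IdempotentSink()
    deliveries = [
        {"id": "e1", "account": "acc-1", "amount": 100},
        {"id": "e2", "account": "acc-1", "amount": -30},
        {"id": "e1", "account": "acc-1", "amount": 100},  # broker redelivers e1
    ]
    for event in deliveries:
        sink.apply(event)
    print(sink.balances)
```

With deduplication in the consumer, at-least-once delivery gives the same end state as exactly-once would, which is often the cheaper way to get there.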

State Management

Stateful operations (joins, aggregations, sessionisation) require checkpointed state. RocksDB-backed state is common; size and recovery are real engineering concerns.
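
The core idea behind checkpointed state can be sketched without any engine: snapshot the operator state together with the input position it corresponds to, and on recovery restore both and resume reading from that position. The class below is a toy stand-in under that assumption; it keeps everything in memory and is not how RocksDB-backed state is implemented.

```python
class CheckpointedCounter:
    """Keyed count whose state is snapshotted together with its input offset."""

    def __init__(self):
        self.counts: dict[str, int] = {}
        self.offset = 0  # index of the next log record to read

    def process(self, log: list[str], upto: int) -> None:
        while self.offset < upto:
            key = log[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def snapshot(self) -> dict:
        # State and offset must be captured together to stay consistent.
        return {"counts": dict(self.counts), "offset": self.offset}

    @classmethod
    def restore(cls, snap: dict) -> "CheckpointedCounter":
        job = cls()
        job.counts = dict(snap["counts"])
        job.offset = snap["offset"]
        return job


if __name__ == "__main__":
    log = ["a", "b", "a", "c", "a"]
    job = CheckpointedCounter()
    job.process(log, upto=3)
    checkpoint = job.snapshot()
    job.process(log, upto=4)  # progress made after the checkpoint is lost...
    recovered = CheckpointedCounter.restore(checkpoint)  # ...in a crash
    recovered.process(log, upto=len(log))
    print(recovered.counts)
```

Records after the checkpoint are re-read on recovery, which is why larger state and less frequent checkpoints both lengthen recovery time.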

Operational Complexity

Streaming pipelines run 24/7. Restarts, schema changes, code deploys, and recovery from corrupted state all require care. Plan for blue/green deployments and for replaying input after a failure.

Architectures That Combine Both

Lambda Architecture

Two parallel paths: a streaming path for fast, approximate results and a batch path for accurate, eventually-corrected results. A serving layer downstream merges the two.

  • Pros: best-of-both for latency and correctness.
  • Cons: two pipelines, two codebases, double the bugs.
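
One way to picture the merge step is a serving function that adds a recent "speed" view on top of the last completed batch view. The sketch below assumes simple per-key counts and a single cutoff timestamp separating what the batch path has already covered from what only the streaming path has seen; the function names are hypothetical.

```python
def count_by_key(events: list[tuple[int, str]], start: int, end: int) -> dict[str, int]:
    """Count events per key with start <= timestamp < end."""
    counts: dict[str, int] = {}
    for ts, key in events:
        if start <= ts < end:
            counts[key] = counts.get(key, 0) + 1
    return counts


def serve(batch_view: dict[str, int], speed_view: dict[str, int]) -> dict[str, int]:
    """Merge the accurate-but-stale batch view with the fresh speed view."""
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged


if __name__ == "__main__":
    events = [(10, "page_a"), (20, "page_b"), (110, "page_a"), (120, "page_c")]
    cutoff = 100  # the last batch run covered everything before t=100
    batch_view = count_by_key(events, 0, cutoff)
    speed_view = count_by_key(events, cutoff, 10**9)
    print(serve(batch_view, speed_view))
```

Note that the counting logic is shared here only because it is a toy; in practice the two paths usually run on different engines, which is where the "two codebases" cost comes from.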

Kappa Architecture

One streaming pipeline does everything. Reprocessing means replaying the retained event log through the pipeline, typically from the earliest offset, instead of running a separate batch job.

  • Pros: one codebase, one mental model.
  • Cons: stream replay can be expensive; not all sources are replayable.
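
Reprocessing by replay can be sketched as running a new version of the transformation over the retained log from offset 0 and writing to a fresh output, which then replaces the old one. The in-memory list standing in for the log and the `transform_v1` / `transform_v2` functions are hypothetical.

```python
from typing import Callable


def run_job(log: list[dict], transform: Callable[[dict], tuple[str, int]]) -> dict[str, int]:
    """Read the log from the first record and build a fresh output table."""
    out: dict[str, int] = {}
    for event in log:
        key, value = transform(event)
        out[key] = out.get(key, 0) + value
    return out


def transform_v1(event: dict) -> tuple[str, int]:
    return event["user"], event["amount"]


def transform_v2(event: dict) -> tuple[str, int]:
    # Bug fix in the new version: user names are case-insensitive.
    return event["user"].lower(), event["amount"]


if __name__ == "__main__":
    log = [
        {"user": "Ann", "amount": 5},
        {"user": "ann", "amount": 7},
        {"user": "bob", "amount": 1},
    ]
    serving = run_job(log, transform_v1)   # current output
    rebuilt = run_job(log, transform_v2)   # replay with the new code
    serving = rebuilt                      # swap once the replay has caught up
    print(serving)
```

The same code path handles live traffic and history, which is the appeal; the price is that the log must be retained long enough, and replaying it must be affordable.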

Modern Lakehouse / "Streaming Tables"

Delta Live Tables, Snowflake Dynamic Tables, and BigQuery Continuous Queries blur the line. You write SQL or DataFrame code, and the platform decides whether to run it as batch or streaming. This is often the most pragmatic choice today.

Cost

Mode | Cost driver
Batch | Compute per run × runs per day; storage
Micro-batch | Always-on cluster (or serverless triggers)
Streaming | Always-on workers + state storage + bandwidth

As a rough rule of thumb, true streaming costs 5–10× as much as equivalent batch for the same data volume, because the workers never shut down. Make sure the business value justifies it.
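
The cost drivers in the table can be turned into back-of-the-envelope arithmetic. Every number below (node counts, run length, price per node-hour) is a made-up assumption chosen only to show the shape of the calculation; state storage and bandwidth are left out, so plug in your own figures before drawing conclusions.

```python
def batch_cost_per_day(nodes: int, hours_per_run: float, runs_per_day: int,
                       price_per_node_hour: float) -> float:
    """Compute per run × runs per day."""
    return nodes * hours_per_run * runs_per_day * price_per_node_hour


def always_on_cost_per_day(nodes: int, price_per_node_hour: float) -> float:
    """Workers that never shut down are billed for all 24 hours."""
    return nodes * 24 * price_per_node_hour


if __name__ == "__main__":
    # Hypothetical: a 4-node cluster for one 1-hour nightly run,
    # versus a single always-on streaming worker at the same hourly price.
    batch = batch_cost_per_day(nodes=4, hours_per_run=1.0, runs_per_day=1,
                               price_per_node_hour=0.50)
    streaming = always_on_cost_per_day(nodes=1, price_per_node_hour=0.50)
    print(batch, streaming, streaming / batch)  # 2.0 12.0 6.0
```

Even with a quarter of the nodes, the always-on worker costs several times more per day in this example, simply because it is billed around the clock.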

Choosing in Practice

  1. Ask the consumer: "if this data were 1 hour late, would the right action change?" Most of the time, no.
  2. If yes: "if it were 5 minutes late?" If the answer is now no, micro-batch is enough. If it is still yes, you may need streaming.
  3. Estimate cost and complexity at each tier.
  4. Default to the simplest tier that meets the requirement.

Beware of "we want it real-time" without a concrete decision driving it. "Real-time" is sometimes a request for "we want fresh data" — which can often be solved with hourly batch and a confident SLA, far more cheaply.
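
The two questions above amount to a threshold check on how stale the data may be before the consumer's action would change. A minimal sketch, using the 1-hour and 5-minute cut-offs from the list; the function name and the idea of expressing the requirement as seconds of tolerable staleness are assumptions.

```python
def choose_mode(max_staleness_s: float) -> str:
    """Pick the simplest tier that still meets the freshness requirement."""
    if max_staleness_s >= 3600:   # an hour late would not change the action
        return "batch"
    if max_staleness_s >= 300:    # five minutes late would not change it
        return "micro-batch"
    return "streaming"            # even minutes of delay matter


if __name__ == "__main__":
    for name, staleness in [
        ("daily revenue report", 24 * 3600),
        ("ops dashboard", 10 * 60),
        ("card fraud check", 1),
    ]:
        print(name, "->", choose_mode(staleness))
```

The function deliberately ignores cost and complexity; it only answers which tier is the minimum that satisfies the requirement, which is the starting point for step 3.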

What's Next

The next two lessons cover the most common batch patterns: ETL versus ELT, then the orchestration tools that schedule and operate them. Streaming gets its own dedicated lesson later, after we've covered the storage layers it depends on.

Key Takeaways

  • Batch is simpler, cheaper, and the right default — use it until latency requirements force otherwise.
  • Streaming pays off when seconds-fresh data drives decisions or operations.
  • Micro-batch (Spark Structured Streaming, Delta Live Tables) sits in the middle — minute-fresh.
  • Exactly-once semantics, late data, and ordering are the hard parts of streaming.
  • Lambda runs batch and streaming paths side by side; Kappa uses a single streaming pipeline and reprocesses by replaying the stream.
