Batch vs Streaming — Data Engineering Fundamentals | CertQnA

The first design choice in any data pipeline is its temporal mode. Get this wrong and the rest is harder than it needs to be.

The Three Modes

Mode	Latency	Tools	Best for
Batch	hours / daily	Airflow, dbt, Spark, BigQuery scheduled queries	Reporting, ML training, end-of-day reconciliation
Micro-batch	1–10 minutes	Spark Structured Streaming, Delta Live Tables, Snowflake Dynamic Tables	Near-real-time dashboards, slowly changing dimensions
Streaming	sub-second to seconds	Kafka Streams, Flink, Beam / Dataflow, Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics)	Fraud, trading, IoT alerts, real-time personalisation

Why Batch Is Still the Default

Idempotent re-runs are simple — the dataset for "yesterday" is well-defined.
Failures are easy to recover: rerun the partition.
Cost is predictable: one job, one window.
Tooling (Airflow + dbt + warehouse) is mature and well-understood.
Most business decisions are made on data hours or a day old.

Default to batch unless a downstream use case clearly needs faster than that.
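The "rerun the partition" recovery model above can be sketched as follows. This is a minimal illustration, assuming a one-file-per-day layout on local disk; the function name `run_daily_partition`, the `day=...` file naming, and the event fields are hypothetical, not taken from any particular tool.

```python
import json
import os
import tempfile
from pathlib import Path


def run_daily_partition(events, day, out_dir):
    """Recompute the output for one day and overwrite that partition.

    Hypothetical helper: `events` is an iterable of dicts with
    "day" and "amount" keys.
    """
    total = sum(e["amount"] for e in events if e["day"] == day)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"day={day}.json"
    # Write to a temp file, then rename over the target. A rerun replaces
    # the partition instead of appending to it, so running the job twice
    # leaves the same result as running it once.
    fd, tmp_name = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"day": day, "total": total}, f)
    os.replace(tmp_name, target)
    return target
```

The design choice that matters here is overwrite rather than append: because the input for a given day is fixed, replacing the whole partition makes a retry safe without any deduplication logic.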

When Streaming Pays Off

Fraud detection — block a transaction now, not tomorrow.
Trading and pricing — milliseconds matter.
IoT alerts — temperature, pressure, equipment failure.
Real-time personalisation — react to a click within seconds.
Operational analytics — supply-chain, logistics, on-call dashboards.
CDC pipelines — propagate database changes to other systems with low lag.

If "minutes late" is acceptable, prefer micro-batch. True streaming is for cases where humans or automated systems will act in real time.

The Hard Parts of Streaming

Time Semantics

Event time — when the event actually happened.
Processing time — when your system saw it.
Late, out-of-order, and duplicate events are the rule, not the exception.
Watermarks define how long to wait before declaring a window "closed".
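To make the interaction between event time and watermarks concrete, here is a small sketch of a tumbling-window counter. It assumes integer event timestamps and a simple heuristic watermark (newest event time seen minus a fixed allowed lateness); the class and attribute names are hypothetical, and real engines such as Flink or Beam offer more configurable watermark strategies.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per event-time window and closes windows by watermark."""

    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.max_event_time = float("-inf")
        self.open_windows = defaultdict(int)  # window start -> running count
        self.closed = {}                      # window start -> final count
        self.dropped_late = 0

    def watermark(self):
        # Assumed heuristic: newest event time seen minus allowed lateness.
        return self.max_event_time - self.allowed_lateness

    def process(self, event_time):
        start = (event_time // self.window_size) * self.window_size
        if start + self.window_size <= self.watermark():
            # The window for this event has already been closed.
            self.dropped_late += 1
            return
        self.open_windows[start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        # Close every window whose end is at or before the watermark.
        for s in sorted(self.open_windows):
            if s + self.window_size <= self.watermark():
                self.closed[s] = self.open_windows.pop(s)


counter = TumblingWindowCounter(window_size=10, allowed_lateness=5)
for t in [1, 3, 12, 8, 17, 4]:  # 8 is out of order but in time; 4 is too late
    counter.process(t)
```

With these inputs, the event at time 8 still lands in window [0, 10) because the watermark has only reached 7, while the event at time 4 arrives after the watermark has passed 10 and is dropped. A larger allowed lateness accepts more stragglers at the price of later results.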

Delivery Semantics

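At most once: messages may be lost. Rarely acceptable.
At least once: duplicates are possible, so consumers must be idempotent.
Exactly once: the engine and sinks coordinate so that each event's effect is applied once, even if the event itself is delivered or processed more than once. Achievable with Flink plus transactional sinks, or with Kafka's exactly-once semantics (EOS), but expensive in latency and complexity.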
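The idempotent consumer that at-least-once delivery requires can be sketched like this. It is a minimal in-memory illustration, assuming every event carries a unique `id`; the `IdempotentLedger` class and the event fields are hypothetical.

```python
class IdempotentLedger:
    """Applies each payment event once, keyed by a unique event ID."""

    def __init__(self):
        self.balances = {}
        self.seen_ids = set()  # in-memory here; a real system needs durable storage

    def apply(self, event):
        if event["id"] in self.seen_ids:
            return False  # duplicate delivery: ignore it
        account = event["account"]
        self.balances[account] = self.balances.get(account, 0) + event["amount"]
        self.seen_ids.add(event["id"])
        return True


ledger = IdempotentLedger()
deliveries = [
    {"id": "e1", "account": "a", "amount": 10},
    {"id": "e2", "account": "a", "amount": 5},
    {"id": "e1", "account": "a", "amount": 10},  # redelivered after a retry
]
results = [ledger.apply(e) for e in deliveries]
```

In a real pipeline the balance update and the record of the seen ID would have to be committed atomically (for example in the same database transaction); otherwise a crash between the two steps reintroduces the duplicate.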

State Management

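Stateful operations (joins, aggregations, sessionisation) require checkpointed state so that a job can resume after a failure without losing or recomputing everything. RocksDB-backed state is common (for example in Flink and Kafka Streams); state size and recovery time are real engineering concerns.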
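The idea behind checkpointed state can be shown with a toy example. This sketch assumes the input is a replayable list and keeps the checkpoint in memory; the `CheckpointedCounter` class is hypothetical and only illustrates why state and input position must be saved together.

```python
import copy


class CheckpointedCounter:
    """Keyed counter whose state and input offset are checkpointed together."""

    def __init__(self):
        self.counts = {}
        self.offset = 0             # next input position to process
        self._checkpoint = ({}, 0)  # (state snapshot, offset at snapshot time)

    def process(self, log, upto):
        while self.offset < upto:
            key = log[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def checkpoint(self):
        self._checkpoint = (copy.deepcopy(self.counts), self.offset)

    def recover(self):
        # Restore the snapshot and rewind the input to the matching offset.
        counts, offset = self._checkpoint
        self.counts = copy.deepcopy(counts)
        self.offset = offset


log = ["a", "b", "a", "c", "a"]
job = CheckpointedCounter()
job.process(log, upto=3)
job.checkpoint()
job.process(log, upto=5)   # work done after the checkpoint...
job.recover()              # ...is discarded on failure
recovered_counts = dict(job.counts)
job.process(log, upto=5)   # and redone by replaying from the saved offset
```

Because the snapshot and the offset are taken at the same point, replaying from the saved offset reproduces the pre-failure counts without double counting. The larger the state, the longer both the snapshot and the restore take.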

Operational Complexity

Streaming pipelines run 24/7. Restarts, schema changes, code deploys, and recovery from corrupted state all require care. Plan for blue/green deployments and for replaying input.

Architectures That Combine Both

Lambda Architecture

Two parallel paths: a streaming path for fast, approximate results and a batch path for accurate results that eventually replace them. A downstream serving layer merges both.

Pros: best-of-both for latency and correctness.
Cons: two pipelines, two codebases, double the bugs.
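The downstream merge step might look like the sketch below. It assumes both paths produce views keyed by `(key, hour)` and that the batch path is authoritative for every hour before a cutoff; the function name and the view shape are hypothetical.

```python
def merge_views(batch_view, speed_view, batch_cutoff):
    """Serve batch results before the cutoff and streaming results from it on.

    Both views map (key, hour) -> count. `batch_cutoff` is the first hour
    that the batch path has NOT yet covered.
    """
    merged = {k: v for k, v in batch_view.items() if k[1] < batch_cutoff}
    for k, v in speed_view.items():
        if k[1] >= batch_cutoff:
            merged[k] = v
    return merged


batch_view = {("page", 0): 100, ("page", 1): 98}
speed_view = {("page", 1): 95, ("page", 2): 40}  # hour 1 is only approximate
merged = merge_views(batch_view, speed_view, batch_cutoff=2)
```

Here the approximate streaming count for hour 1 is superseded by the batch count, while hour 2 is served from the streaming path until the next batch run covers it.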

Kappa Architecture

One streaming pipeline does everything. Reprocessing means replaying the retained stream from the start through the updated job.

Pros: one codebase, one mental model.
Cons: stream replay can be expensive; not all sources are replayable.
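Reprocessing by replay can be sketched as running a new version of the logic over the same retained log. This is a toy illustration that assumes the full log is still available as a list; `run_job`, `transform_v1`, and `transform_v2` are hypothetical names.

```python
def run_job(log, transform, start_offset=0):
    """Fold a retained log through `transform`, starting at `start_offset`."""
    output = {}
    for record in log[start_offset:]:
        key, value = transform(record)
        output[key] = output.get(key, 0) + value
    return output


def transform_v1(record):
    # Original logic: count events per user.
    return record["user"], 1


def transform_v2(record):
    # Updated logic: sum amounts per user instead.
    return record["user"], record["amount"]


log = [
    {"user": "u1", "amount": 30},
    {"user": "u2", "amount": 5},
    {"user": "u1", "amount": 12},
]
old_table = run_job(log, transform_v1)
new_table = run_job(log, transform_v2)  # full replay with the new code
```

In practice the replay usually writes to a new output table, and consumers switch over once it has caught up. The cost of reading the whole log again is the "expensive replay" drawback listed above.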

Modern Lakehouse / "Streaming Tables"

Delta Live Tables, Snowflake Dynamic Tables, and BigQuery continuous queries blur the line. You write declarative SQL or DataFrame code, and the platform decides how to keep the result up to date, whether by batch refresh or by incremental or streaming processing. This is often the most pragmatic choice today.

Cost

Mode	Cost driver
Batch	Compute per run × runs per day; storage
Micro-batch	Always-on cluster (or serverless triggers)
Streaming	Always-on workers + state storage + bandwidth

As a rough rule of thumb, true streaming costs 5–10× as much as equivalent batch for the same data volume, because the system never sleeps. The actual multiple depends on the workload and the platform, so make sure the business value justifies it.
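The "never sleeps" effect is easy to see with back-of-the-envelope arithmetic. The sketch below compares compute only, in node-hours per day; the cluster sizes and run times are made-up illustrative inputs, not measurements, and storage, state, and bandwidth costs are ignored.

```python
def daily_node_hours_batch(nodes, hours_per_run, runs_per_day):
    """Compute used by a batch job: it only pays while a run is active."""
    return nodes * hours_per_run * runs_per_day


def daily_node_hours_streaming(nodes):
    """Compute used by an always-on streaming job: it pays for all 24 hours."""
    return nodes * 24


# Illustrative assumption: a nightly 2-hour run on 4 nodes versus
# a 2-node streaming cluster that runs around the clock.
batch_hours = daily_node_hours_batch(nodes=4, hours_per_run=2, runs_per_day=1)
streaming_hours = daily_node_hours_streaming(nodes=2)
ratio = streaming_hours / batch_hours
```

Even with half as many nodes, the always-on cluster in this example uses several times the compute of the nightly job. Plugging in your own numbers is a quick sanity check before committing to a streaming design.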

Choosing in Practice

Ask the consumer: "if this data were 1 hour late, would the right action change?" Most of the time, no.
If yes: "if it were 5 minutes late?" If no, micro-batch is enough. If still yes, you may need streaming.
Estimate cost and complexity at each tier.
Default to the simplest tier that meets the requirement.
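The two questions above reduce to a small decision function. This is only a sketch of the reasoning, assuming the consumer can answer both questions with yes or no; the function and parameter names are hypothetical, and cost and complexity still need to be weighed separately.

```python
def choose_tier(hour_late_changes_action, five_min_late_changes_action):
    """Return the simplest tier that satisfies the consumer's latency need."""
    if not hour_late_changes_action:
        return "batch"
    if not five_min_late_changes_action:
        return "micro-batch"
    return "streaming"


daily_report = choose_tier(False, False)
ops_dashboard = choose_tier(True, False)
fraud_block = choose_tier(True, True)
```

The order of the checks encodes the default: the function only moves to a more expensive tier when a concrete answer forces it to.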

Beware of "we want it real-time" without a concrete decision driving it. "Real-time" is sometimes just a request for fresh data, which can often be met far more cheaply with hourly batch and a clearly stated SLA.

What's Next

The next two lessons cover the most common batch patterns: ETL versus ELT, then the orchestration tools that schedule and operate them. Streaming gets its own dedicated lesson later, after we've covered the storage layers it depends on.