The first design choice in any data pipeline is its temporal mode. Get this wrong and the rest is harder than it needs to be.
The Three Modes
| Mode | Latency | Tools | Best for |
|---|---|---|---|
| Batch | hours / daily | Airflow, dbt, Spark, BigQuery scheduled queries | Reporting, ML training, end-of-day reconciliation |
| Micro-batch | 1–10 minutes | Spark Structured Streaming, Delta Live Tables, Snowflake Dynamic Tables | Near-real-time dashboards, slowly changing dimensions |
| Streaming | sub-second to seconds | Kafka Streams, Flink, Beam / Dataflow, Kinesis Analytics | Fraud, trading, IoT alerts, real-time personalisation |
Why Batch Is Still the Default
- Idempotent re-runs are simple — the dataset for "yesterday" is well-defined.
- Failures are easy to recover: rerun the partition.
- Cost is predictable: one job, one window.
- Tooling (Airflow + dbt + warehouse) is mature and well-understood.
- Most business decisions are made on data hours or a day old.
Default to batch unless a downstream use case clearly needs faster than that.
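The first two points can be illustrated with a minimal sketch. It assumes a hypothetical `run_daily_partition` job that recomputes one day's summary and overwrites that day's partition; the directory layout (`dt=YYYY-MM-DD`) and field names are illustrative, not taken from any particular tool.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def run_daily_partition(rows, day, out_root):
    """Recompute one day's summary and overwrite that day's partition."""
    # The input for the partition is well-defined: rows dated exactly `day`.
    selected = [r for r in rows if r["event_date"] == day.isoformat()]
    summary = {
        "day": day.isoformat(),
        "total": sum(r["amount"] for r in selected),
        "rows": len(selected),
    }

    part_dir = Path(out_root) / f"dt={day.isoformat()}"
    part_dir.mkdir(parents=True, exist_ok=True)

    # Write to a temp file, then rename over the old output, so a rerun
    # replaces the partition instead of appending to it.
    tmp = part_dir / "summary.json.tmp"
    tmp.write_text(json.dumps(summary))
    final = part_dir / "summary.json"
    tmp.replace(final)
    return final


rows = [
    {"event_date": "2024-05-01", "amount": 10},
    {"event_date": "2024-05-01", "amount": 5},
    {"event_date": "2024-05-02", "amount": 7},
]
out_root = tempfile.mkdtemp()
first = run_daily_partition(rows, date(2024, 5, 1), out_root).read_text()
second = run_daily_partition(rows, date(2024, 5, 1), out_root).read_text()
```

Running the job twice for the same day leaves the same output, which is what makes "rerun the partition" a safe recovery step.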
When Streaming Pays Off
- Fraud detection — block a transaction now, not tomorrow.
- Trading and pricing — milliseconds matter.
- IoT alerts — temperature, pressure, equipment failure.
- Real-time personalisation — react to a click within seconds.
- Operational analytics — supply-chain, logistics, on-call dashboards.
- CDC pipelines — propagate database changes to other systems with low lag.
If "minutes late" is acceptable, prefer micro-batch. True streaming is for cases where humans or automated systems will act in real time.
The Hard Parts of Streaming
Time Semantics
- Event time — when the event actually happened.
- Processing time — when your system saw it.
- Late, out-of-order, and duplicate events are the rule, not the exception.
- Watermarks define how long to wait before declaring a window "closed".
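To make the interplay of event time and watermarks concrete, here is a simplified sketch in plain Python. It is not how any specific engine implements watermarks: the window size, the allowed lateness, and the rule "watermark = largest event time seen minus allowed lateness" are assumptions chosen for illustration.

```python
from collections import defaultdict

WINDOW = 60            # tumbling window size in seconds (assumed)
ALLOWED_LATENESS = 30  # how far the watermark trails the max event time (assumed)


def windowed_counts(event_times):
    """Count events per event-time window; input is in arrival order."""
    open_windows = defaultdict(int)  # window start -> running count
    closed = {}                      # window start -> final count
    late = []                        # events whose window had already closed
    watermark = float("-inf")

    for t in event_times:
        start = (t // WINDOW) * WINDOW
        if start + WINDOW <= watermark:
            # The window was declared closed before this event arrived.
            late.append(t)
            continue
        open_windows[start] += 1
        watermark = max(watermark, t - ALLOWED_LATENESS)
        # Close every open window whose end is at or before the watermark.
        for s in sorted(w for w in open_windows if w + WINDOW <= watermark):
            closed[s] = open_windows.pop(s)

    return closed, dict(open_windows), late


# Arrival order differs from event-time order: 20, 40 and 10 arrive out of order.
closed, still_open, late = windowed_counts([5, 50, 20, 70, 40, 95, 10, 130])
# closed == {0: 4}, still_open == {60: 2, 120: 1}, late == [10]
```

Events at 20 and 40 arrive out of order but are still counted, because the watermark had not yet passed the end of their window. The event at 10 arrives after the watermark reached 65, so its window is already closed and it is set aside as late.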
Delivery Semantics
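- At most once: messages may be lost. Rarely acceptable.
- At least once: duplicates are possible, so consumers must be idempotent.
- Exactly once: the engine and its sinks coordinate so that each event's effect is applied once, even if the event is delivered more than once. Achievable with Flink plus transactional sinks, or with Kafka's exactly-once semantics (EOS), but the coordination adds latency and operational overhead.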
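An idempotent consumer under at-least-once delivery might look like the sketch below. It assumes each event carries a unique `id`; the `IdempotentLedger` class and the event fields are hypothetical. In a real system the set of seen ids and the state update would need to be committed together in one durable transaction.

```python
class IdempotentLedger:
    """Applies each event's effect once, keyed by a unique event id."""

    def __init__(self):
        self.balances = {}
        self.seen_ids = set()

    def apply(self, event):
        # A redelivered event is recognised by its id and skipped.
        if event["id"] in self.seen_ids:
            return False
        account = event["account"]
        self.balances[account] = self.balances.get(account, 0) + event["amount"]
        self.seen_ids.add(event["id"])
        return True


delivered = [
    {"id": "e1", "account": "a", "amount": 100},
    {"id": "e2", "account": "a", "amount": -30},
    {"id": "e1", "account": "a", "amount": 100},  # redelivered after a retry
]
ledger = IdempotentLedger()
applied = [ledger.apply(e) for e in delivered]
# applied == [True, True, False]; ledger.balances == {"a": 70}
```

The duplicate delivery of `e1` is ignored, so the balance reflects each event once even though the transport delivered one of them twice.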
State Management
Stateful operations (joins, aggregations, sessionisation) require checkpointed state. RocksDB-backed state is common; size and recovery are real engineering concerns.
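The idea of checkpointed state can be sketched in a few lines. This is a toy model, not an engine API: the hypothetical `CountingOperator` keeps a per-key count, periodically snapshots the counts together with the next input offset, and a JSON file stands in for a durable state backend.

```python
import json
import tempfile
from pathlib import Path


class CountingOperator:
    """Per-key counter whose state and input offset are checkpointed together."""

    def __init__(self, checkpoint_path):
        self.path = Path(checkpoint_path)
        self.counts = {}
        self.next_offset = 0

    def restore(self):
        # Load the last consistent snapshot, if one exists.
        if self.path.exists():
            snap = json.loads(self.path.read_text())
            self.counts = snap["counts"]
            self.next_offset = snap["next_offset"]

    def checkpoint(self):
        # Write-then-rename so a crash never leaves a half-written snapshot.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps({"counts": self.counts, "next_offset": self.next_offset}))
        tmp.replace(self.path)

    def process(self, log, until, checkpoint_every=2):
        while self.next_offset < min(until, len(log)):
            key = log[self.next_offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.next_offset += 1
            if self.next_offset % checkpoint_every == 0:
                self.checkpoint()


log = ["a", "b", "a", "c", "a"]
path = Path(tempfile.mkdtemp()) / "state.json"

op = CountingOperator(path)
op.process(log, until=3)  # simulated crash: offset 2 was processed but never checkpointed

recovered = CountingOperator(path)
recovered.restore()       # back to the snapshot taken at offset 2
recovered.process(log, until=len(log))
# recovered.counts == {"a": 3, "b": 1, "c": 1}
```

Because the counts and the offset are saved in the same snapshot, recovery replays only the input after the checkpoint and arrives at the same result as an uninterrupted run.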
Operational Complexity
Streaming pipelines run 24/7. Restarts, schema changes, code deploys, and recovery from corrupted state all require care. Plan for blue/green deployments and replays.
Architectures That Combine Both
Lambda Architecture
Two parallel paths: a streaming path for fast, approximate results and a batch path for accurate, eventually-corrected results. Downstream merges both.
- Pros: best-of-both for latency and correctness.
- Cons: two pipelines, two codebases, double the bugs.
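The downstream merge can be sketched as follows, assuming the batch path publishes accurate counts up to a cutoff timestamp and the streaming path holds recent raw events. The function name, the view shapes, and the cutoff convention are illustrative assumptions.

```python
def merged_count(key, batch_view, speed_events, batch_cutoff):
    """Serve a count by combining the batch view with recent streamed events.

    batch_view:   {key: count}, accurate for events with ts < batch_cutoff
    speed_events: list of (ts, key) seen by the streaming path
    """
    # Only count streamed events the batch view has not covered yet,
    # otherwise events before the cutoff would be counted twice.
    recent = sum(1 for ts, k in speed_events if k == key and ts >= batch_cutoff)
    return batch_view.get(key, 0) + recent


batch_view = {"page_a": 1000}
speed_events = [(90, "page_a"), (105, "page_a"), (110, "page_b"), (120, "page_a")]
total = merged_count("page_a", batch_view, speed_events, batch_cutoff=100)
# total == 1002
```

The cutoff is the fragile part in practice: both paths must agree on where the batch view ends, which is one source of the "double the bugs" cost.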
Kappa Architecture
One streaming pipeline does everything. Reprocessing means replaying the stream from the start.
- Pros: one codebase, one mental model.
- Cons: stream replay can be expensive; not all sources are replayable.
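Reprocessing by replay can be sketched with an in-memory list standing in for a retained, replayable log. The `build_view` helper and the two transform versions are hypothetical; the point is only that a logic change is rolled out by re-reading the same log from offset 0 with the new code.

```python
def build_view(log, transform, start_offset=0):
    """Fold the log into a keyed view, reading from start_offset onward."""
    view = {}
    for offset in range(start_offset, len(log)):
        key, value = transform(log[offset])
        view[key] = view.get(key, 0) + value
    return view


log = [
    {"user": "u1", "cents": 250},
    {"user": "u2", "cents": 100},
    {"user": "u1", "cents": 50},
]


def spend_per_user(event):      # version 1 of the logic
    return event["user"], event["cents"]


def purchases_per_user(event):  # version 2: count purchases instead
    return event["user"], 1


view_v1 = build_view(log, spend_per_user)
view_v2 = build_view(log, purchases_per_user)  # full replay with the new logic
# view_v1 == {"u1": 300, "u2": 100}; view_v2 == {"u1": 2, "u2": 1}
```

The replay cost grows with the length of the retained log, and the approach only works if the source keeps that history.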
Modern Lakehouse / "Streaming Tables"
Delta Live Tables, Snowflake Dynamic Tables, and BigQuery Continuous Queries blur the line. You write SQL or DataFrame code; the platform decides whether to run it as batch or streaming. Often the most pragmatic choice today.
Cost
| Mode | Cost driver |
|---|---|
| Batch | Compute per run × runs per day; storage |
| Micro-batch | Always-on cluster (or serverless triggers) |
| Streaming | Always-on workers + state storage + bandwidth |
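As a rough rule of thumb, true streaming costs 5–10× as much as equivalent batch for the same data volume, because the system never sleeps: workers are paid for around the clock, not only while a job runs. Actual ratios depend heavily on the workload, so make sure the business value justifies it.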
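A back-of-the-envelope comparison shows where a multiple like this can come from. Every number below is hypothetical (worker counts, run length, and the hourly rate are not real prices), and the sketch covers compute only, ignoring state storage and bandwidth.

```python
def daily_batch_cost(hours_per_run, runs_per_day, workers, rate_per_worker_hour):
    """Compute is paid for only while the job runs."""
    return hours_per_run * runs_per_day * workers * rate_per_worker_hour


def daily_always_on_cost(workers, rate_per_worker_hour):
    """Compute is paid for 24 hours a day."""
    return 24 * workers * rate_per_worker_hour


RATE = 0.50  # hypothetical price per worker-hour

# Hypothetical sizing: a large cluster for one hour vs. a small one all day.
batch = daily_batch_cost(hours_per_run=1.0, runs_per_day=1, workers=8, rate_per_worker_hour=RATE)
streaming = daily_always_on_cost(workers=2, rate_per_worker_hour=RATE)
ratio = streaming / batch
# batch == 4.0, streaming == 24.0, ratio == 6.0
```

Even with a quarter of the workers, the always-on cluster in this made-up example costs six times as much per day, because it accrues 48 worker-hours against the batch job's 8.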
Choosing in Practice
- Ask the consumer: "if this data were 1 hour late, would the right action change?" Most of the time, no, and batch is enough.
- If yes: "if it were 5 minutes late?" If still yes, you may need streaming; if no, micro-batch is usually enough.
- Estimate cost and complexity at each tier.
- Default to the simplest tier that meets the requirement.
Beware of "we want it real-time" without a concrete decision driving it. "Real-time" is sometimes a request for "we want fresh data" — which can often be solved with hourly batch and a confident SLA, far more cheaply.
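The two questions above amount to a small decision rule, sketched here as a hypothetical `choose_mode` helper. It encodes only the latency questions; cost and complexity estimates still have to be weighed separately, and "streaming" here means "streaming is worth evaluating", not a final verdict.

```python
def choose_mode(action_changes_if_1h_late, action_changes_if_5min_late=False):
    """Return the simplest tier that satisfies the consumer's answers."""
    if not action_changes_if_1h_late:
        return "batch"
    if not action_changes_if_5min_late:
        return "micro-batch"
    return "streaming"


daily_report = choose_mode(action_changes_if_1h_late=False)
ops_dashboard = choose_mode(action_changes_if_1h_late=True, action_changes_if_5min_late=False)
fraud_block = choose_mode(action_changes_if_1h_late=True, action_changes_if_5min_late=True)
# daily_report == "batch", ops_dashboard == "micro-batch", fraud_block == "streaming"
```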
What's Next
The next two lessons cover the most common batch patterns: ETL versus ELT, then the orchestration tools that schedule and operate them. Streaming gets its own dedicated lesson later, after we've covered the storage layers it depends on.