Streaming infrastructure is built on a single core abstraction: an append-only distributed log. Producers write events to it; consumers read them in order, independently and at their own pace.
The Three Big Platforms
| Platform | Vendor | Model |
|---|---|---|
| Apache Kafka | Open-source (Confluent commercial) | Self-managed clusters or Confluent Cloud / MSK / Aiven |
| Amazon Kinesis Data Streams | AWS | Fully managed, on-demand or provisioned shards |
| Google Cloud Pub/Sub | GCP | Fully managed, push or pull, global by default |
| Azure Event Hubs | Azure | Kafka-protocol-compatible managed service |
The concepts transfer across all of them; nomenclature varies.
Core Concepts
Topic / Stream
A named feed of messages. "orders", "page-views", "iot.sensor.temperature".
Partition / Shard
A topic is split into partitions for parallelism. Each partition is an ordered sequence; ordering is guaranteed within a partition, not across partitions.
Topic: orders (3 partitions)
P0: [m0, m1, m4, m7, ...]
P1: [m2, m5, m8, ...]
P2: [m3, m6, m9, ...]
Producers route messages by key (e.g. customer_id) so all events for one customer go to the same partition and stay ordered.
Offset
The position of a message in a partition. Consumers track their own offset — Kafka does not push, it lets consumers pull.
Consumer Group
A set of consumers that share work on a topic. Kafka assigns partitions to consumers in the group; each partition is consumed by exactly one member at a time.
Retention
How long messages live. Kafka can retain forever (compaction or time-based). Kinesis defaults to 24h, configurable to 365 days. Pub/Sub default 7 days.
Producer / Consumer in Code
from confluent_kafka import Producer, Consumer
import json
# Producer
p = Producer({"bootstrap.servers": "broker:9092"})
p.produce(
topic="orders",
key=str(order["customer_id"]),
value=json.dumps(order),
)
p.flush()
# Consumer
c = Consumer({
"bootstrap.servers": "broker:9092",
"group.id": "orders-warehouse-loader",
"auto.offset.reset": "earliest",
"enable.auto.commit": False,
})
c.subscribe(["orders"])
while True:
msg = c.poll(1.0)
if msg is None or msg.error():
continue
handle(json.loads(msg.value()))
c.commit(msg)
Schema Management
Untyped JSON blobs are the single biggest source of streaming pain. A schema registry enforces a contract:
- Confluent Schema Registry — Avro, JSON Schema, Protobuf; subject-based versioning.
- AWS Glue Schema Registry — same idea, AWS-flavoured.
- Pub/Sub Schemas — Avro / Protobuf, native.
Compatibility rules (backward, forward, full) prevent producer changes from silently breaking consumers. Use them.
Delivery and Ordering Guarantees
| Guarantee | Kafka | Kinesis | Pub/Sub |
|---|---|---|---|
| At-least-once | Default | Default | Default |
| Exactly-once | Yes (idempotent producer + transactions) | Limited; consumer must dedupe | Yes (with ordering keys + dedupe) |
| Ordered per key | Yes (key → same partition) | Yes (per shard) | Yes (with ordering keys) |
| Global ordering | No (1 partition only — bottleneck) | No | No |
Design consumers to be idempotent — processing the same message twice should produce the same result. Use deterministic IDs and upserts.
Stream Processing on Top
The log is just storage. To do anything with the stream you need a processor:
- Apache Flink — true streaming, sub-100ms latency, strong stateful processing, exactly-once.
- Spark Structured Streaming — micro-batch (seconds), great if you already use Spark.
- Kafka Streams / ksqlDB — JVM library / SQL on top of Kafka; small/medium use cases.
- Apache Beam / Dataflow — unified batch + streaming, GCP-native and portable.
- Cloud-native — Kinesis Data Analytics (now Managed Service for Apache Flink), Azure Stream Analytics.
Common Streaming Patterns
CDC (Change Data Capture)
Stream every insert/update/delete from a database into Kafka, then into the warehouse / lake. Tools: Debezium, AWS DMS, Datastream, Fivetran HVR.
Stream → Lakehouse
Land Kafka topics directly into Delta or Iceberg tables (Auto Loader, Kafka Connect → S3 sink, Pub/Sub → BigQuery).
Real-time aggregations
Window-based counts (orders per minute, fraud signals per user) computed by Flink or Spark and surfaced via APIs or low-latency stores.
Event-sourcing / event-driven microservices
Use Kafka as the system of record. Services subscribe to events rather than calling each other directly.
Kafka vs Cloud-Native: How to Choose
| Pick Kafka (managed) when | Pick cloud-native when |
|---|---|
| You need rich ecosystem (Connect, Streams, ksqlDB, schema registry) | You're fully on one cloud and want zero ops |
| You want portability across clouds | Workloads are simple ingest → store |
| You have heavy per-key ordering / replay needs | Pub/sub + a couple of consumers is enough |
| You're standardising across many teams | You don't want to think about partitions |
Cert Mapping
| Cert | Streaming scope |
|---|---|
| DP-203 / DP-700 | Event Hubs, Stream Analytics, Fabric eventstreams |
| AWS DEA-C01 | Kinesis Data Streams / Firehose, MSK, Managed Flink |
| GCP PDE | Pub/Sub, Dataflow, Dataflow templates |
| Confluent Certified Developer | Kafka deep — producers, consumers, Streams, KSQL |
Operational Reality
- Streaming is operationally heavier than batch — plan for it.
- Always-on means always-paying — track cost per topic.
- Late and out-of-order events are the norm. Watermarks are not optional.
- Schema breaks are the #1 cause of incidents. Use a registry from day one.
- Replay matters: keep enough retention to rebuild downstream state.
Streaming, done well, is transformative. Done badly, it is an expensive way to lose data slightly faster.