5 min read·Lesson 8 of 10

Streaming: Kafka, Kinesis, Pub/Sub

Event streaming platforms compared. Topics, partitions, offsets, and consumer groups; when to use Kafka vs cloud-native alternatives.

Streaming infrastructure is built on a single core abstraction: an append-only distributed log. Producers write events to it; consumers read them in order, independently and at their own pace.

The Major Platforms

Platform | Vendor | Model
Apache Kafka | Open-source (Confluent commercial) | Self-managed clusters or Confluent Cloud / MSK / Aiven
Amazon Kinesis Data Streams | AWS | Fully managed, on-demand or provisioned shards
Google Cloud Pub/Sub | GCP | Fully managed, push or pull, global by default
Azure Event Hubs | Azure | Kafka-protocol-compatible managed service

The concepts transfer across all of them; nomenclature varies.

Core Concepts

Topic / Stream

A named feed of messages, for example "orders", "page-views", or "iot.sensor.temperature".

Partition / Shard

A topic is split into partitions for parallelism. Each partition is an ordered sequence; ordering is guaranteed within a partition, not across partitions.

Topic: orders   (3 partitions)
  P0: [m0, m1, m4, m7, ...]
  P1: [m2, m5, m8, ...]
  P2: [m3, m6, m9, ...]

Producers route messages by key (e.g. customer_id) so all events for one customer go to the same partition and stay ordered.
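Key-based routing can be sketched in a few lines. This is an illustration only: `partition_for` is a hypothetical helper, and `zlib.crc32` stands in for whatever hash a real client uses. The point is that a stable hash of the key, taken modulo the partition count, always sends the same key to the same partition.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index with a stable hash."""
    # crc32 is deterministic across processes, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Three events, two of them for the same customer.
events = [("cust-1", "created"), ("cust-2", "created"), ("cust-1", "paid")]

partitions = {p: [] for p in range(3)}
for key, event in events:
    partitions[partition_for(key, 3)].append((key, event))

# Both cust-1 events sit in one partition, in the order they were produced.
```

Note that changing the partition count changes the mapping, which is one reason partition counts are usually chosen generously up front.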

Offset

The position of a message within a partition. Each consumer group tracks its own offset per partition. Kafka does not push messages; consumers pull from the offset they have reached.
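A toy model makes the pull-and-offset idea concrete. The classes below (`PartitionLog`, `PullConsumer`) are hypothetical and purely in-memory; they are not how any broker is implemented, only a sketch of two readers moving through the same log at different speeds.

```python
from dataclasses import dataclass, field

@dataclass
class PartitionLog:
    """Append-only list of records; the list index is the offset."""
    records: list = field(default_factory=list)

    def append(self, record) -> int:
        self.records.append(record)
        return len(self.records) - 1  # offset of the record just written

    def read(self, offset: int, max_records: int = 10) -> list:
        return self.records[offset:offset + max_records]

@dataclass
class PullConsumer:
    """Tracks its own committed offset and pulls from there."""
    log: PartitionLog
    committed: int = 0

    def poll(self, max_records: int = 10) -> list:
        return self.log.read(self.committed, max_records)

    def commit(self, count: int) -> None:
        self.committed += count

log = PartitionLog()
for m in ["m0", "m1", "m2"]:
    log.append(m)

fast, slow = PullConsumer(log), PullConsumer(log)
batch = fast.poll()
fast.commit(len(batch))            # fast has read everything
first = slow.poll(max_records=1)
slow.commit(len(first))            # slow has read one record, independently
```

Because reading never removes anything from the log, a consumer can also rewind by resetting its offset, which is what makes replay possible.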

Consumer Group

A set of consumers that share work on a topic. Kafka assigns partitions to consumers in the group; each partition is consumed by exactly one member at a time.
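The assignment rule can be illustrated with a simplified round-robin strategy. `assign_round_robin` is a hypothetical function, not a Kafka API, and real assignors are more sophisticated; the sketch only shows that each partition ends up with exactly one owner, and that members beyond the partition count sit idle.

```python
def assign_round_robin(partitions: list[int], members: list[str]) -> dict[str, list[int]]:
    """Give each partition to exactly one group member, round-robin."""
    if not members:
        raise ValueError("a consumer group needs at least one member")
    ordered = sorted(members)
    assignment = {m: [] for m in ordered}
    for i, p in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(p)
    return assignment

# Two consumers share three partitions.
before = assign_round_robin([0, 1, 2], ["consumer-a", "consumer-b"])

# Scaling to four consumers: one of them receives no partition at all.
after = assign_round_robin(
    [0, 1, 2], ["consumer-a", "consumer-b", "consumer-c", "consumer-d"]
)
```

The practical consequence is that the partition count caps the useful parallelism of a consumer group.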

Retention

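How long messages live. Kafka retains by time or size (7 days by default) and can be configured to keep data indefinitely; log compaction instead keeps the latest record per key. Kinesis defaults to 24 hours, configurable up to 365 days. Pub/Sub retains unacknowledged messages for 7 days by default.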
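Compaction is easier to picture with a small example. The `compact` function below is a hypothetical, simplified model: it keeps only the most recent record per key and treats a `None` value as a delete marker (a tombstone). Real Kafka compaction runs in the background on log segments and keeps tombstones for a while before dropping them.

```python
def compact(records):
    """Keep the latest (key, value) per key, in original offset order.

    A value of None is treated as a tombstone: the key is dropped.
    """
    latest = {}
    for offset, (key, value) in enumerate(records):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    )
    return [(key, value) for _, key, value in survivors]

log = [
    ("cust-1", "bronze"),
    ("cust-2", "silver"),
    ("cust-1", "gold"),   # supersedes the first cust-1 record
    ("cust-2", None),     # tombstone: cust-2 is deleted
]
compacted = compact(log)
```

A compacted topic therefore behaves like a changelog of current state rather than a full history.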

Producer / Consumer in Code

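import json

from confluent_kafka import Consumer, Producer

# Example payload; in a real pipeline this comes from the application.
order = {"order_id": "o-1001", "customer_id": 42, "amount": 19.99}


def handle(event):
    """Placeholder for real processing, e.g. loading into the warehouse."""
    print(event)


# Producer: key by customer_id so one customer's events share a partition.
p = Producer({"bootstrap.servers": "broker:9092"})
p.produce(
    topic="orders",
    key=str(order["customer_id"]),
    value=json.dumps(order),
)
p.flush()  # block until queued messages have been delivered

# Consumer: auto-commit is off, so offsets are committed only after handling.
c = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "orders-warehouse-loader",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
c.subscribe(["orders"])
try:
    while True:
        msg = c.poll(1.0)  # wait up to 1 second for a message
        if msg is None:
            continue
        if msg.error():
            # Surface errors instead of silently skipping them.
            print(f"Consumer error: {msg.error()}")
            continue
        handle(json.loads(msg.value()))
        c.commit(message=msg)  # at-least-once: commit after processing
finally:
    c.close()  # leave the consumer group cleanly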

Schema Management

Untyped JSON blobs are the single biggest source of streaming pain. A schema registry enforces a contract:

  • Confluent Schema Registry — Avro, JSON Schema, Protobuf; subject-based versioning.
  • AWS Glue Schema Registry — same idea, AWS-flavoured.
  • Pub/Sub Schemas — Avro / Protobuf, native.

Compatibility rules (backward, forward, full) prevent producer changes from silently breaking consumers. Use them.
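To show what a backward-compatibility rule actually checks, here is a deliberately simplified sketch. The schema representation (a dict of field name to `{"type": ..., "default": ...}`) and the function `is_backward_compatible` are assumptions made for illustration; real registries apply the full rules of Avro, Protobuf, or JSON Schema.

```python
def is_backward_compatible(old: dict, new: dict) -> bool:
    """Can a reader using `new` still decode data written with `old`?

    Simplified rules:
    - a field added in `new` must carry a default value
    - a field present in both must keep its type
    - a field removed in `new` is fine (the reader ignores it)
    """
    for name, spec in new.items():
        if name not in old:
            if "default" not in spec:
                return False
        elif old[name]["type"] != spec["type"]:
            return False
    return True

v1 = {"order_id": {"type": "string"}, "amount": {"type": "double"}}

# Adding an optional field with a default is a safe change.
v2_ok = {**v1, "currency": {"type": "string", "default": "USD"}}

# Adding a required field breaks readers of old data.
v2_bad = {**v1, "currency": {"type": "string"}}
```

A registry runs a check of this kind when a producer registers a new schema version and rejects the incompatible one before any message is published.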

Delivery and Ordering Guarantees

Guarantee | Kafka | Kinesis | Pub/Sub
At-least-once | Default | Default | Default
Exactly-once | Yes (idempotent producer + transactions) | Limited; consumer must dedupe | Yes (exactly-once delivery on pull subscriptions, within a region)
Ordered per key | Yes (key → same partition) | Yes (per shard) | Yes (with ordering keys)
Global ordering | Only with a single partition (a throughput bottleneck) | No | No

Design consumers to be idempotent — processing the same message twice should produce the same result. Use deterministic IDs and upserts.
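One way to get that property is to derive a deterministic ID from the message's coordinates and write with an upsert. The sketch below uses an in-memory SQLite table as a stand-in for the real sink; the table layout, `event_id`, and this `handle` signature are assumptions for illustration, and the `ON CONFLICT` syntax needs SQLite 3.24 or newer.

```python
import hashlib
import sqlite3

def event_id(topic: str, partition: int, offset: int) -> str:
    """Deterministic ID: the same message always yields the same ID."""
    return hashlib.sha256(f"{topic}:{partition}:{offset}".encode()).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (event_id TEXT PRIMARY KEY, customer_id TEXT, amount REAL)"
)

def handle(topic: str, partition: int, offset: int, order: dict) -> None:
    """Upsert keyed by event_id, so redelivery overwrites instead of duplicating."""
    conn.execute(
        "INSERT INTO orders (event_id, customer_id, amount) VALUES (?, ?, ?) "
        "ON CONFLICT(event_id) DO UPDATE SET "
        "customer_id = excluded.customer_id, amount = excluded.amount",
        (event_id(topic, partition, offset), order["customer_id"], order["amount"]),
    )
    conn.commit()

order = {"customer_id": "42", "amount": 19.99}
handle("orders", 0, 7, order)
handle("orders", 0, 7, order)  # simulated redelivery of the same message

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

With this shape, at-least-once delivery from the log still yields one row per message in the sink.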

Stream Processing on Top

The log is just storage. To do anything with the stream you need a processor:

  • Apache Flink — true streaming, sub-100ms latency, strong stateful processing, exactly-once.
  • Spark Structured Streaming — micro-batch (seconds), great if you already use Spark.
  • Kafka Streams / ksqlDB — JVM library / SQL on top of Kafka; small/medium use cases.
  • Apache Beam / Dataflow — unified batch + streaming, GCP-native and portable.
  • Cloud-native — Kinesis Data Analytics (now Managed Service for Apache Flink), Azure Stream Analytics.

Common Streaming Patterns

CDC (Change Data Capture)

Stream every insert/update/delete from a database into Kafka, then into the warehouse / lake. Tools: Debezium, AWS DMS, Datastream, Fivetran HVR.
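On the consuming side, CDC boils down to replaying change events against a copy of the table. The sketch below assumes a simplified Debezium-style envelope with `op`, `before`, and `after` fields (`c` create, `u` update, `d` delete, `r` snapshot read) and a primary key column called `id`; the `apply_change` helper and the sample rows are hypothetical.

```python
def apply_change(table: dict, event: dict) -> None:
    """Apply one change event to an in-memory replica keyed by primary key."""
    op = event["op"]
    if op in ("c", "u", "r"):        # create, update, snapshot read
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":                  # delete
        table.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")

changes = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 2, "status": "new"}, "after": {"id": 2, "status": "paid"}},
    {"op": "d", "before": {"id": 1, "status": "new"}, "after": None},
]

replica = {}
for event in changes:
    apply_change(replica, event)
```

Applying events in partition order matters here, which is why CDC topics are normally keyed by the source table's primary key.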

Stream → Lakehouse

Land streams in Delta or Iceberg tables or in the warehouse, for example Kafka Connect → S3 sink picked up by Auto Loader, or Pub/Sub → BigQuery.

Real-time aggregations

Window-based counts (orders per minute, fraud signals per user) computed by Flink or Spark and surfaced via APIs or low-latency stores.
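A tumbling window is the simplest case: time is cut into fixed, non-overlapping buckets and each event is counted in the bucket its timestamp falls into. The following is a batch-style sketch with a hypothetical `tumbling_counts` helper and integer epoch-second timestamps; engines such as Flink or Spark do the same bucketing incrementally and with managed state.

```python
from collections import Counter

def tumbling_counts(events, window_seconds: int = 60) -> dict:
    """Count events per (window_start, key) using fixed-size windows."""
    counts = Counter()
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # floor to the window boundary
        counts[(window_start, key)] += 1
    return dict(counts)

# (event_time_in_seconds, user_id)
events = [(0, "u1"), (59, "u1"), (60, "u1"), (61, "u2")]
per_minute = tumbling_counts(events, window_seconds=60)
```

Here the first two events share the window starting at 0, while the last two land in the window starting at 60.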

Event-sourcing / event-driven microservices

Use Kafka as the system of record. Services subscribe to events rather than calling each other directly.

Kafka vs Cloud-Native: How to Choose

Pick Kafka (managed) when | Pick cloud-native when
You need rich ecosystem (Connect, Streams, ksqlDB, schema registry) | You're fully on one cloud and want zero ops
You want portability across clouds | Workloads are simple ingest → store
You have heavy per-key ordering / replay needs | Pub/sub + a couple of consumers is enough
You're standardising across many teams | You don't want to think about partitions

Cert Mapping

Cert | Streaming scope
DP-203 / DP-700 | Event Hubs, Stream Analytics, Fabric eventstreams
AWS DEA-C01 | Kinesis Data Streams / Firehose, MSK, Managed Flink
GCP PDE | Pub/Sub, Dataflow, Dataflow templates
Confluent Certified Developer | Kafka deep: producers, consumers, Streams, KSQL

Operational Reality

  • Streaming is operationally heavier than batch — plan for it.
  • Always-on means always-paying — track cost per topic.
  • Late and out-of-order events are the norm. Watermarks are not optional.
  • Schema breaks are the #1 cause of incidents. Use a registry from day one.
  • Replay matters: keep enough retention to rebuild downstream state.
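The watermark point deserves a small illustration. In the sketch below the watermark is defined as the highest event time seen so far minus an allowed lateness; a window is closed once its end falls at or behind the watermark, and events arriving for a closed window are set aside as late. The class `WatermarkedWindows` and its parameters are hypothetical, and real engines offer richer policies (side outputs, retractions, per-partition watermarks).

```python
class WatermarkedWindows:
    """Tumbling-window counter that tolerates bounded out-of-order events."""

    def __init__(self, window_seconds: int = 60, allowed_lateness: int = 30):
        self.window_seconds = window_seconds
        self.allowed_lateness = allowed_lateness
        self.max_event_time = 0
        self.open = {}     # window_start -> running count
        self.closed = {}   # window_start -> final count
        self.late = []     # timestamps that arrived after their window closed

    def watermark(self) -> int:
        return self.max_event_time - self.allowed_lateness

    def add(self, ts: int) -> None:
        start = ts - (ts % self.window_seconds)
        if start + self.window_seconds <= self.watermark():
            self.late.append(ts)  # window already finalised
            return
        self.open[start] = self.open.get(start, 0) + 1
        self.max_event_time = max(self.max_event_time, ts)
        # Finalise every window whose end is at or behind the watermark.
        for s in sorted(self.open):
            if s + self.window_seconds <= self.watermark():
                self.closed[s] = self.open.pop(s)

w = WatermarkedWindows(window_seconds=60, allowed_lateness=30)
for ts in [10, 70, 50, 95, 20]:   # 50 is out of order; 20 arrives too late
    w.add(ts)
```

The event at 50 is still counted because the watermark had not yet passed the end of its window; the event at 20 arrives after the window closed and is recorded as late instead of silently changing a published result.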

Streaming, done well, is transformative. Done badly, it is an expensive way to lose data slightly faster.

Key Takeaways

  • Kafka, Kinesis, and Pub/Sub are append-only distributed logs — producers write, consumers read at their own pace.
  • Topics are partitioned for parallelism; offsets track consumer position.
  • Kafka is the open standard; Kinesis and Pub/Sub trade configurability for managed simplicity.
  • Schema management (Schema Registry, JSON Schema, Avro/Protobuf) prevents silent producer/consumer breaks.
  • Most streaming pipelines combine a log (Kafka) with a processor (Flink, Spark, ksqlDB) and a sink (warehouse, lake, app).
