Streaming: Kafka, Kinesis, Pub/Sub — Data Engineering Fundamentals | CertQnA

Streaming infrastructure is built on a single core abstraction: an append-only distributed log. Producers write events to it; consumers read them in order, independently and at their own pace.

The Three Big Platforms

Platform	Vendor	Model
Apache Kafka	Open-source (Confluent commercial)	Self-managed clusters or Confluent Cloud / MSK / Aiven
Amazon Kinesis Data Streams	AWS	Fully managed, on-demand or provisioned shards
Google Cloud Pub/Sub	GCP	Fully managed, push or pull, global by default
Azure Event Hubs	Azure	Kafka-protocol-compatible managed service

The concepts transfer across all of them; nomenclature varies.

Core Concepts

Topic / Stream

A named feed of messages. "orders", "page-views", "iot.sensor.temperature".

Partition / Shard

A topic is split into partitions for parallelism. Each partition is an ordered sequence; ordering is guaranteed within a partition, not across partitions.

Topic: orders   (3 partitions)
  P0: [m0, m1, m4, m7, ...]
  P1: [m2, m5, m8, ...]
  P2: [m3, m6, m9, ...]

Producers route messages by key (e.g. customer_id) so all events for one customer go to the same partition and stay ordered.

Offset

The position of a message in a partition. Consumers track their own offset — Kafka does not push, it lets consumers pull.

Consumer Group

A set of consumers that share work on a topic. Kafka assigns partitions to consumers in the group; each partition is consumed by exactly one member at a time.

Retention

How long messages live. Kafka can retain forever (compaction or time-based). Kinesis defaults to 24h, configurable to 365 days. Pub/Sub default 7 days.

Producer / Consumer in Code

from confluent_kafka import Producer, Consumer
import json

# Producer
p = Producer({"bootstrap.servers": "broker:9092"})
p.produce(
    topic="orders",
    key=str(order["customer_id"]),
    value=json.dumps(order),
)
p.flush()

# Consumer
c = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "orders-warehouse-loader",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
c.subscribe(["orders"])
while True:
    msg = c.poll(1.0)
    if msg is None or msg.error():
        continue
    handle(json.loads(msg.value()))
    c.commit(msg)

Schema Management

Untyped JSON blobs are the single biggest source of streaming pain. A schema registry enforces a contract:

Confluent Schema Registry — Avro, JSON Schema, Protobuf; subject-based versioning.
AWS Glue Schema Registry — same idea, AWS-flavoured.
Pub/Sub Schemas — Avro / Protobuf, native.

Compatibility rules (backward, forward, full) prevent producer changes from silently breaking consumers. Use them.

Delivery and Ordering Guarantees

Guarantee	Kafka	Kinesis	Pub/Sub
At-least-once	Default	Default	Default
Exactly-once	Yes (idempotent producer + transactions)	Limited; consumer must dedupe	Yes (with ordering keys + dedupe)
Ordered per key	Yes (key → same partition)	Yes (per shard)	Yes (with ordering keys)
Global ordering	No (1 partition only — bottleneck)	No	No

Design consumers to be idempotent — processing the same message twice should produce the same result. Use deterministic IDs and upserts.

Stream Processing on Top

The log is just storage. To do anything with the stream you need a processor:

Apache Flink — true streaming, sub-100ms latency, strong stateful processing, exactly-once.
Spark Structured Streaming — micro-batch (seconds), great if you already use Spark.
Kafka Streams / ksqlDB — JVM library / SQL on top of Kafka; small/medium use cases.
Apache Beam / Dataflow — unified batch + streaming, GCP-native and portable.
Cloud-native — Kinesis Data Analytics (now Managed Service for Apache Flink), Azure Stream Analytics.

Common Streaming Patterns

CDC (Change Data Capture)

Stream every insert/update/delete from a database into Kafka, then into the warehouse / lake. Tools: Debezium, AWS DMS, Datastream, Fivetran HVR.

Stream → Lakehouse

Land Kafka topics directly into Delta or Iceberg tables (Auto Loader, Kafka Connect → S3 sink, Pub/Sub → BigQuery).

Real-time aggregations

Window-based counts (orders per minute, fraud signals per user) computed by Flink or Spark and surfaced via APIs or low-latency stores.

Event-sourcing / event-driven microservices

Use Kafka as the system of record. Services subscribe to events rather than calling each other directly.

Kafka vs Cloud-Native: How to Choose

Pick Kafka (managed) when	Pick cloud-native when
You need rich ecosystem (Connect, Streams, ksqlDB, schema registry)	You're fully on one cloud and want zero ops
You want portability across clouds	Workloads are simple ingest → store
You have heavy per-key ordering / replay needs	Pub/sub + a couple of consumers is enough
You're standardising across many teams	You don't want to think about partitions

Cert Mapping

Cert	Streaming scope
DP-203 / DP-700	Event Hubs, Stream Analytics, Fabric eventstreams
AWS DEA-C01	Kinesis Data Streams / Firehose, MSK, Managed Flink
GCP PDE	Pub/Sub, Dataflow, Dataflow templates
Confluent Certified Developer	Kafka deep — producers, consumers, Streams, KSQL

Operational Reality

Streaming is operationally heavier than batch — plan for it.
Always-on means always-paying — track cost per topic.
Late and out-of-order events are the norm. Watermarks are not optional.
Schema breaks are the #1 cause of incidents. Use a registry from day one.
Replay matters: keep enough retention to rebuild downstream state.

Streaming, done well, is transformative. Done badly, it is an expensive way to lose data slightly faster.