Message Queues and Event-Driven Design — System Design Fundamentals | CertQnA

Synchronous request/response is simple — until one downstream call slows everything down, fails, or has different scaling needs. Queues and event streams are the glue that lets services move at different speeds and survive each other's failures.

Why Queues

Decoupling — producers don't know consumers; consumers can change without producer changes.
Buffering — absorb spikes; consumers process at their own rate.
Async processing — return 202 fast; do the slow work later.
Retries — failed work goes back on the queue or to a dead-letter queue.
Fan-out — one event, many consumers.

Two Models

Work queue (point-to-point)

Each message is processed by exactly one consumer in a group. Used for tasks: send email, resize image, charge card.

[ Producer ] → [ Queue ] → [ Worker ]
                          → [ Worker ]
                          → [ Worker ]
   each message goes to exactly one worker

Tools: SQS, RabbitMQ, BullMQ (Redis), Celery, Sidekiq.

Pub/sub (topic / event stream)

Each message is broadcast to all subscribers. Used for events: OrderPlaced, UserSignedUp, downstream services react.

[ Producer ] → [ Topic ] → [ Subscriber A ]   (e.g. email service)
                          → [ Subscriber B ]   (e.g. analytics)
                          → [ Subscriber C ]   (e.g. recommendations)

Tools: Kafka, Pub/Sub, SNS+SQS, EventBridge, NATS, RabbitMQ topic exchanges.

Delivery Guarantees

Guarantee	Behaviour	When
At-most-once	May lose messages	Telemetry where loss is acceptable
At-least-once	May deliver duplicates	The practical default for most queues
Exactly-once	Each message delivered once	Requires transactional broker + idempotent consumer; expensive

The pragmatic posture: assume at least once, make consumers idempotent. A duplicate charge_card is a refund waiting to happen — guard it with a deduplication key.

Idempotent Consumers

def handle(message):
    msg_id = message["id"]
    if already_processed(msg_id):
        return ack()
    do_work(message)
    mark_processed(msg_id)
    ack()

Track processed message IDs in a database, Redis, or a "processed" table with a TTL. Or design operations so re-applying is naturally safe (UPSERT, MERGE, conditional update).

Ordering

Most queues guarantee ordering only within a partition / shard.
Order events for the same user/order on the same partition (key by user_id).
Global ordering across all events is rarely possible at scale, and usually unnecessary.

Dead-Letter Queues (DLQs)

After N retries, send the failing message to a dedicated DLQ instead of looping forever. Alert on DLQ depth; investigate, fix, replay.

Without a DLQ, one poison message can clog the queue and block all other work behind it.

Backpressure

If consumers cannot keep up, queues grow. Eventually disk fills, or memory runs out, or messages exceed retention. Mitigations:

Auto-scale consumers based on queue depth.
Reject new producer requests when the queue exceeds a threshold (load shedding).
Prioritise critical messages (separate queue or priority queue).
Monitor lag (Kafka consumer lag, SQS visibility timeout violations) as a first-class SLO.

Event-Driven Architecture

Beyond simple queues, event-driven design treats events as the primary integration medium. Services emit events about facts in their domain; other services subscribe to react.

order-service ──emits──→ OrderPlaced ──→ payment-service
                                       ──→ inventory-service
                                       ──→ email-service
                                       ──→ analytics

Benefits: services don't know each other; new consumers added without changing producers; replay history to bootstrap a new service.

Costs: harder to reason about end-to-end flow; eventual consistency by default; observability requires correlation IDs across many hops.

Event Sourcing

Store the sequence of events as the system of record; current state is derived by replaying events.

+ Full audit trail; can rebuild any view; time travel.
+ Natural fit with CQRS (separate write model and read models).
− Schema evolution and event versioning are hard.
− Steeper learning curve; debugging takes practice.

Don't event-source everything. Use it where audit / regulatory / multi-view requirements justify the complexity.

Sagas: Long-Lived Distributed Transactions

You can't use a single ACID transaction across services. Sagas decompose a business transaction into a sequence of local transactions, each with a compensating action if a later step fails.

  Place order:
    1. ReserveInventory          ← compensate: ReleaseInventory
    2. ChargePayment             ← compensate: RefundPayment
    3. CreateShipment            ← compensate: CancelShipment

  If step 3 fails: run compensations in reverse.

Orchestration vs choreography

Orchestration — a central saga coordinator drives the flow, calling each service. Easier to see; the coordinator becomes a hot spot.
Choreography — each service reacts to events and emits the next event; no coordinator. More autonomous; harder to follow at a glance.

Most teams start with orchestration (clearer); some scale to choreography for autonomy.

Choosing a Broker

Need	Pick
Simple work queue, AWS	SQS
Pub/sub fan-out, AWS	SNS → SQS / EventBridge
High-throughput log, replay, streaming	Kafka / MSK / Confluent / Redpanda
Managed pub/sub, GCP	Pub/Sub
Managed pub/sub, Azure	Service Bus / Event Hubs / Event Grid
Lightweight, self-host	RabbitMQ, NATS, Redis Streams

Cert Mapping

Cert	Messaging scope
AWS SAA / DEA	SQS, SNS, EventBridge, MSK, Kinesis
Azure AZ-204 / AZ-305	Service Bus, Event Grid, Event Hubs
GCP PCA / PDE	Pub/Sub, Eventarc, Workflows

Heuristic

Start synchronous. Add async only when latency, throughput, or coupling demands it.
Default to at-least-once + idempotent consumers.
Always have a DLQ and an alert on its depth.
Track end-to-end correlation IDs from the first request through every event.
Keep events small and immutable; carry IDs, not entire objects, when possible.

The next lesson covers the resilience layer that protects all of this: CDNs, rate limiting, and circuit breakers.