Skip to content
5 min read·Lesson 8 of 10

Message Queues and Event-Driven Design

Decouple services with queues and event streams. Patterns: work queues, pub/sub, event sourcing, sagas, and choreography vs orchestration.

Synchronous request/response is simple — until one downstream call slows everything down, fails, or has different scaling needs. Queues and event streams are the glue that lets services move at different speeds and survive each other's failures.

Why Queues

  • Decoupling — producers don't know consumers; consumers can change without producer changes.
  • Buffering — absorb spikes; consumers process at their own rate.
  • Async processing — return 202 fast; do the slow work later.
  • Retries — failed work goes back on the queue or to a dead-letter queue.
  • Fan-out — one event, many consumers.

Two Models

Work queue (point-to-point)

Each message is processed by exactly one consumer in a group. Used for tasks: send email, resize image, charge card.

[ Producer ] → [ Queue ] → [ Worker ]
                          → [ Worker ]
                          → [ Worker ]
   each message goes to exactly one worker

Tools: SQS, RabbitMQ, BullMQ (Redis), Celery, Sidekiq.

Pub/sub (topic / event stream)

Each message is broadcast to all subscribers. Used for events: OrderPlaced, UserSignedUp, downstream services react.

[ Producer ] → [ Topic ] → [ Subscriber A ]   (e.g. email service)
                          → [ Subscriber B ]   (e.g. analytics)
                          → [ Subscriber C ]   (e.g. recommendations)

Tools: Kafka, Pub/Sub, SNS+SQS, EventBridge, NATS, RabbitMQ topic exchanges.

Delivery Guarantees

GuaranteeBehaviourWhen
At-most-onceMay lose messagesTelemetry where loss is acceptable
At-least-onceMay deliver duplicatesThe practical default for most queues
Exactly-onceEach message delivered onceRequires transactional broker + idempotent consumer; expensive

The pragmatic posture: assume at least once, make consumers idempotent. A duplicate charge_card is a refund waiting to happen — guard it with a deduplication key.

Idempotent Consumers

def handle(message):
    msg_id = message["id"]
    if already_processed(msg_id):
        return ack()
    do_work(message)
    mark_processed(msg_id)
    ack()

Track processed message IDs in a database, Redis, or a "processed" table with a TTL. Or design operations so re-applying is naturally safe (UPSERT, MERGE, conditional update).

Ordering

  • Most queues guarantee ordering only within a partition / shard.
  • Order events for the same user/order on the same partition (key by user_id).
  • Global ordering across all events is rarely possible at scale, and usually unnecessary.

Dead-Letter Queues (DLQs)

After N retries, send the failing message to a dedicated DLQ instead of looping forever. Alert on DLQ depth; investigate, fix, replay.

Without a DLQ, one poison message can clog the queue and block all other work behind it.

Backpressure

If consumers cannot keep up, queues grow. Eventually disk fills, or memory runs out, or messages exceed retention. Mitigations:

  • Auto-scale consumers based on queue depth.
  • Reject new producer requests when the queue exceeds a threshold (load shedding).
  • Prioritise critical messages (separate queue or priority queue).
  • Monitor lag (Kafka consumer lag, SQS visibility timeout violations) as a first-class SLO.

Event-Driven Architecture

Beyond simple queues, event-driven design treats events as the primary integration medium. Services emit events about facts in their domain; other services subscribe to react.

order-service ──emits──→ OrderPlaced ──→ payment-service
                                       ──→ inventory-service
                                       ──→ email-service
                                       ──→ analytics

Benefits: services don't know each other; new consumers added without changing producers; replay history to bootstrap a new service.

Costs: harder to reason about end-to-end flow; eventual consistency by default; observability requires correlation IDs across many hops.

Event Sourcing

Store the sequence of events as the system of record; current state is derived by replaying events.

  • + Full audit trail; can rebuild any view; time travel.
  • + Natural fit with CQRS (separate write model and read models).
  • − Schema evolution and event versioning are hard.
  • − Steeper learning curve; debugging takes practice.

Don't event-source everything. Use it where audit / regulatory / multi-view requirements justify the complexity.

Sagas: Long-Lived Distributed Transactions

You can't use a single ACID transaction across services. Sagas decompose a business transaction into a sequence of local transactions, each with a compensating action if a later step fails.

  Place order:
    1. ReserveInventory          ← compensate: ReleaseInventory
    2. ChargePayment             ← compensate: RefundPayment
    3. CreateShipment            ← compensate: CancelShipment

  If step 3 fails: run compensations in reverse.

Orchestration vs choreography

  • Orchestration — a central saga coordinator drives the flow, calling each service. Easier to see; the coordinator becomes a hot spot.
  • Choreography — each service reacts to events and emits the next event; no coordinator. More autonomous; harder to follow at a glance.

Most teams start with orchestration (clearer); some scale to choreography for autonomy.

Choosing a Broker

NeedPick
Simple work queue, AWSSQS
Pub/sub fan-out, AWSSNS → SQS / EventBridge
High-throughput log, replay, streamingKafka / MSK / Confluent / Redpanda
Managed pub/sub, GCPPub/Sub
Managed pub/sub, AzureService Bus / Event Hubs / Event Grid
Lightweight, self-hostRabbitMQ, NATS, Redis Streams

Cert Mapping

CertMessaging scope
AWS SAA / DEASQS, SNS, EventBridge, MSK, Kinesis
Azure AZ-204 / AZ-305Service Bus, Event Grid, Event Hubs
GCP PCA / PDEPub/Sub, Eventarc, Workflows

Heuristic

  1. Start synchronous. Add async only when latency, throughput, or coupling demands it.
  2. Default to at-least-once + idempotent consumers.
  3. Always have a DLQ and an alert on its depth.
  4. Track end-to-end correlation IDs from the first request through every event.
  5. Keep events small and immutable; carry IDs, not entire objects, when possible.

The next lesson covers the resilience layer that protects all of this: CDNs, rate limiting, and circuit breakers.

Key Takeaways

  • Queues decouple producers from consumers in time, throughput, and failure domain.
  • Work queues distribute tasks; pub/sub broadcasts events to many subscribers.
  • Idempotent consumers and at-least-once delivery are the practical default.
  • Event-driven systems gain flexibility but lose direct request/response simplicity.
  • Choose orchestration (one coordinator) or choreography (services react to events) consciously.

Test your knowledge

Try exam-style practice questions to reinforce what you've learned.

Practice Questions →