Day 8 Hard Distributed Systems Messaging Kafka / SQS Delivery / Backpressure / DLQ

Message Queues — Async Is the Price of Decoupling, and Its Entire ValueKafka vs RabbitMQ vs SQS · Delivery Semantics · Backpressure · Dead Letter Queue

Problem & Constraints

Design an e-commerce order event bus: when a user places an order, the order service emits an order.created event consumed independently by 5 downstreams—inventory decrement, loyalty points, recommendation updates, data-warehouse ingestion, and fraud audit. These downstreams process at wildly different speeds: inventory takes 2ms, the warehouse takes 200ms per batched record.

Hard constraints:

Scale: 50M orders/day, peak 500K events/s (including downstream fan-out).
No lost orders: dropping one order event = money that won't reconcile. Requires at-least-once delivery.
Independent scaling: the slow warehouse must not stall inventory; any downstream can crash and catch up on its backlog after recovery.
Per-order ordering: created → paid → shipped must not reorder, or the state machine breaks.
Poison-message isolation: an event that repeatedly crashes one consumer must not block the whole stream.

Today: why this scenario demands a message queue over synchronous RPC, how to choose Kafka / RabbitMQ / SQS, the essence of delivery semantics, backpressure & consumer lag, and how the DLQ contains poison messages.

High-Level Architecture

graph LR Prod["Order Service
Producer"] K["Kafka Topic
order.events
partitioned by order_id"] subgraph CG["Independent Consumer Groups (own offsets)"] C1["Inventory
fast 2ms"] C2["Loyalty Points"] C3["Recommendation"] C4["Warehouse
slow 200ms batch"] C5["Fraud Audit"] end DLQ["DLQ
order.events.dlq"] Prod -->|"ack=all
idempotent producer"| K K --> C1 K --> C2 K --> C3 K --> C4 K --> C5 C1 -.->|"retries exhausted"| DLQ C4 -.->|"poison msg"| DLQ classDef bus fill:#2a1530,stroke:#ff7ab6,color:#e8eef5 classDef dlq fill:#3a2010,stroke:#ffb450,color:#e8eef5 class K bus class DLQ dlq

Core idea: one event, many independent consumer groups, each tracking its own offset—this is pub/sub fan-out. The producer uses ack=all + idempotence to avoid loss; partitioning by order_id guarantees per-order ordering; a slow consumer's backlog affects only itself; messages whose retries are exhausted go to the DLQ without blocking the main stream.

Key Techniques

1. Selection: Log (Kafka) vs Broker (RabbitMQ) vs Managed (SQS)

Core trade-off: a retained, replayable log vs a flexible-routing smart broker vs a zero-ops managed queue.

Principle: these are three different abstractions. Kafka is a distributed append-only log: a message written to a partition is not deleted when consumed; a consumer merely advances its own offset, which naturally supports multi-subscriber fan-out, replaying history, and high-throughput sequential IO. RabbitMQ is a smart broker: messages route through an exchange by routing key to queues and are deleted once acked, strong at flexible routing (topic/fanout/header exchanges) and per-message control. SQS is AWS's fully managed queue: no ops, near-infinite elasticity, but the simplest feature set (standard queues are unordered; FIFO queues are ordered but throughput-limited).

Dimension	Kafka	RabbitMQ	SQS
Abstraction	persistent log	broker + queue	managed queue
After consume	retained (by TTL)	deleted on ack	deleted on ack
Replay history	✅ reset offset	❌	❌
Throughput	very high (M/s)	medium (10K/s)	high (auto-scales)
Routing flexibility	weak (topic+key)	✅ strong	weak
Ops	heavy (self-managed)	medium	zero

How to choose:

Event streams / many subscribers / replay needed / very high throughput → Kafka. Here the order event is consumed by 5 downstreams and must be replayable to backfill, so choose Kafka.
Complex routing / task dispatch / low-latency per-message control (priority queues, RPC reply) → RabbitMQ.
On AWS, no ops, bursty traffic, queue semantics suffice → SQS.

Real-world cases:

Cloudflare: uses Kafka as a cross-service message bus, processing over 1 trillion messages/day on a single bus across 14 clusters (~330 nodes), broadcasting resource lifecycle events (create/update/delete) as Protobuf to downstreams.
Uber: one of the largest Kafka deployments, trillions of messages and multiple PB/day, with in-house uReplicator (cross-region) and Chaperone (end-to-end audit).
RabbitMQ: widely used for task dispatch (Celery / Sidekiq backends) and microservice command buses needing flexible routing.

2. Delivery Semantics: at-most-once / at-least-once / exactly-once

Core trade-off: no loss vs no duplicates can't both hold (without idempotence); in practice "exactly-once" = at-least-once + idempotence.

Principle: delivery semantics hinge on the order of ack vs processing. At-most-once: ack (commit offset) before processing—a crash loses the message, never duplicates; fine for droppable metric sampling. At-least-once: process before ack—a crash before ack redelivers, never loses but may duplicate; the default for most business. Exactly-once: unreachable at the network layer (Two Generals problem), but achievable in effect via at-least-once delivery + consumer-side idempotence/dedup, so each message takes effect exactly once.

Trade-offs of the three:

At-most-once: ✅ fastest, zero duplicates; ❌ loses messages. Only for tolerable-loss telemetry/log sampling.
At-least-once: ✅ no loss; ❌ duplicates → consumer must be idempotent. Orders can't be lost here, so choose this.
Exactly-once: Kafka achieves EOS via idempotent producer (sequence-number dedup) + transactions (atomic multi-partition writes) for read-process-write; but it costs throughput and only closes the loop inside Kafka—once you write to an external system (DB / third-party API), you still need your own idempotence.

Pseudocode · at-least-once consume + idempotent dedup

def consume(msg):
    if dedup.exists(msg.event_id):   # dedup table / unique constraint
        commit_offset(msg); return     # duplicate delivery, skip
    process(msg)                       # business: decrement stock / add points
    dedup.insert(msg.event_id)
    commit_offset(msg)                 # process then commit = at-least-once
    # crash after process / before commit -> redeliver, dedup catches it

Real-world cases:

Confluent / Kafka: Kafka 0.11 introduced EOS—idempotent producer + transactions, enabled with a single Streams config; the docs are explicit that EOS is bounded to in-Kafka stream processing.
Payment/order systems: typically run at-least-once + idempotency key (see Day 7), pushing "exactly-once" down to business-layer dedup.

3. Backpressure & Consumer Lag: a slow consumer must not stall upstream

Core trade-off: buffer (absorb bursts) vs drop / throttle (protect the system)—queues turn synchronous backpressure into observable lag.

Principle: when produce rate > consume rate, messages pile up. In synchronous RPC this shows up as caller blocking, thread exhaustion, cascading collapse; a message queue converts it into consumer lag (offset deficit), a monitorable, bufferable metric. Two response classes: scale consumption (add consumers, but bounded by partition count—within a group a partition is consumed by only one consumer); limit production (producer throttle/reject, pushing pressure back to the source). Kafka's log model lets the backlog "sit on disk" relatively safely; a RabbitMQ queue backlog in memory/disk may trigger flow control that blocks producers.

Three responses to backlog, different costs:

Add consumers for parallelism: most direct, but max parallelism = partition count. Too few partitions = can't scale out → reserve partitions at topic creation.
Raise single-consumer throughput: batch fetch + batch process + async commit. The slow warehouse consumer especially relies on batching.
Backpressure the producer: when lag exceeds a threshold, throttle or even reject non-critical events (e.g. drop "recommendation updates" to protect "inventory decrement").

Pseudocode · monitor lag + adaptive batching

while True:
    batch = consumer.poll(max_records=500, timeout=100ms)
    lag = end_offset(partition) - current_offset
    if lag > HIGH_WATERMARK:
        scale_out_signal()        # trigger autoscaler to add consumers
    process_batch(batch)          # batching amortizes fixed overhead
    consumer.commit()             # commit offset once per batch

Real-world cases:

Uber: built uForwarder, a push-based consumer proxy, freeing pub-sub consumption from "consumers bounded by partition count"; 1000+ services onboarded, decoupling slow consumers from parallelism limits.
Kafka practice: consumer lag (monitored via Burrow / Cruise Control) is a core SLI and the primary autoscaling signal for stream processing.

4. Dead Letter Queue: the quarantine for poison messages

Core trade-off: infinite retry (blocks the queue) vs drop (lose data) vs DLQ (isolate + keep evidence).

Principle: a message that repeatedly fails due to malformed data, a downstream bug, or a permanently unavailable dependency is a poison message. Infinite retry jams the whole partition/queue (especially ordered queues—everything behind it starves); dropping loses data without a trail. The DLQ approach: after N failed retries, move the message—with its failure context—to a dedicated dead-letter queue so the main stream proceeds; later analyze the DLQ manually or automatically, fix, and redrive back for replay.

Key design points:

Retry count + backoff: only after N exponential-backoff retries does it go to the DLQ, so transient blips aren't mistaken for poison.
Distinguish retryable vs non-retryable: network timeouts should retry; JSON parse failures won't fix in a million retries—send straight to DLQ.
DLQ must carry metadata: source topic, failure reason, stack trace, retry count—otherwise the DLQ becomes an untriageable trash bin.
Alert on the DLQ: messages entering the DLQ are a "silent failure"; without alerting, data is silently lost.

Pseudocode · retry → DLQ

def handle(msg):
    try:
        process(msg)
    except NonRetryable as e:     # malformed data, retry is pointless
        to_dlq(msg, reason=e); return
    except Retryable as e:
        if msg.attempts >= MAX_RETRY:
            to_dlq(msg, reason=e, attempts=msg.attempts)
        else:
            requeue(msg, delay=backoff(msg.attempts))

Real-world cases:

AWS SQS: native DLQ + redrive—messages exceeding maxReceiveCount auto-move to the DLQ; in FIFO queues a poison message blocks the entire message group until processed or moved to the DLQ.
RabbitMQ: uses a dead-letter exchange (DLX) + TTL to implement delayed retry and dead-letter routing—the industry-standard approach.

Scaling & Optimization

Reserve partitions: partition count caps consumer parallelism, and reducing partitions is painful (it breaks the key→partition mapping). Size for future peak at topic creation—err high (but too many adds broker metadata and rebalance overhead).
The cost of ordering: global ordering = single partition = no parallelism. Here we only need per-order ordering, so partition by order_id—ordered yet horizontally scalable. This is the core "partition key = unit of parallelism" trade-off.
Tiered storage: Kafka offloads cold data to object storage (S3), so long retention (replay, compliance audit) isn't bounded by local disk.
Schema registry: use Avro/Protobuf + a schema registry for compatible evolution, so an upstream field change doesn't break every downstream (the silent killer of event buses).
Multi-region / DR: MirrorMaker 2 / uReplicator for cross-region replication; think carefully about offset and dedup under active-active.

Pitfalls & Interview Questions

1. Assuming "using Kafka = exactly-once." EOS only closes the loop inside Kafka. Once you consume and write to an external DB or call a third party, it's still at-least-once + your own idempotence required. Interviewers asking about EOS expect you to name this boundary.

2. Non-idempotent consumers. At-least-once inevitably redelivers. Without a dedup table/unique constraint, you get double charges, double shipments. This is the #1 messaging trap.

3. Trading global ordering for zero parallelism. Stuffing all messages into one partition for global order tanks consumer throughput. 99% of cases only need per-key local ordering.

4. No DLQ, relying on infinite retry. One poison message can deadlock an ordered queue, starving everything behind it. You need a DLQ + retryable-error distinction.

5. Not monitoring consumer lag. Lag is the single most important health metric. Not monitoring = you won't know the warehouse is hours behind until downstream notices missing data.

Likely follow-ups:

The essence of Kafka vs RabbitMQ? When is Kafka mandatory, when is RabbitMQ a better fit?
How does at-least-once achieve "effective exactly-once"? Where does the idempotency key live, how is the dedup table cleaned?
A topic with 100 partitions, a consumer group with 150 consumers—what happens?
A backlog of 100M messages—how do you catch up fast? What levers, what costs?
A poison message in an ordered queue deadlocks it—what do you do? How do you design the DLQ?
What do producer ack=0/1/all mean? When are messages lost?

Going Deeper

1. A topic has 100 partitions and the consumer group has 150 consumers. What happens, and why?

Result: 50 consumers sit completely idle. Kafka's unit of parallelism is the partition—within a consumer group, a partition is consumed by exactly one consumer at any moment (to guarantee no in-group duplication). 100 partitions are split among at most 100 consumers; the extra 50 grab no partition and are pure waste.

Corollary: max consumption parallelism = partition count. To scale to 150, you must first raise partitions to ≥150. But adding partitions has costs: (1) key-based hash routing changes (hash(key) % N with a new N), so a key's historical and new messages may land on different partitions, breaking ordering; (2) more partitions mean more broker metadata, file handles, and rebalance time.

So plan partition count at topic creation for future peak—one of the few "deciding upfront is far cheaper than fixing later" parameters. This is also why Uber built uForwarder—a push proxy that decouples "consumption parallelism" from "partition count."

2. The warehouse consumer is 100M messages behind and the boss wants it caught up in 1 hour. Give at least 3 levers and their costs.

First, the magnitude: 100M / 3600s ≈ a net catch-up rate of 28K/s (and above the live inflow). A single consumer at 200ms/msg does only 5/s—4 orders of magnitude short, so parallelism + batching are mandatory.

① Temporarily scale consumers to the partition cap: with 200 partitions, run 200 consumers. Cost: you need that many partitions; brief pause during rebalance.
② Batching: turn "200ms per record" into "500-record batch ingestion," amortizing fixed network/transaction overhead; single-consumer throughput can rise 10–50×. Cost: coarser retry granularity, higher latency.
③ Bypass acceleration topic: spin up a separate consumer group reading from the backlog head full-tilt into a "fast lane," keeping the original group on live traffic. Cost: temporary architectural complexity, must merge two streams.
④ Degrade non-critical work: during the backlog, only land raw data, skip heavy recompute, backfill offline later. Cost: short-term incomplete data.

Deeper: catching up in an hour presupposes that partition count and downstream write capacity already had headroom. With only 10 partitions, no amount of added consumers exceeds 10-way parallelism—backlog recovery capacity is decided at design time, not conjured during the incident. That's the value of the log model letting backlogs sit safely on disk: it buys you a recovery window, but recovery speed is still capped by the architecture.

3. If "at-least-once + idempotence = exactly-once," why does Kafka still need exactly-once transactions? What does idempotence fail to solve?

Consumer-side idempotence solves "the same message processed twice." But one class it can't solve: the atomicity of read-process-write—a consumer reads from topic A, processes, writes to topic B, then commits A's offset. If those three steps aren't atomic, different crash points break it: wrote B but didn't commit A's offset → reprocess on restart, duplicates in B; committed the offset but didn't write B → data lost in B.

Kafka transactions wrap "write B + commit A's offset" in one atomic transaction, with the idempotent producer (sequence-number dedup), achieving stream-processing EOS: all or nothing. Consumer-side idempotence can't do this, because the offset commit and the external write are two systems' actions.

But the key boundary: this only holds in the "Kafka → Kafka" loop. The moment you write to an external Postgres or call Stripe, Kafka transactions can't reach there—back to at-least-once + business idempotence. So EOS isn't a silver bullet; it precisely solves "in-Kafka stream processing," and understanding its boundary matters more than knowing it exists.

4. Synchronous RPC also decouples services—so why introduce a message queue (an "extra middleware that can lose messages") on a critical path like orders?

It's counterintuitive: doesn't adding a component increase the failure surface? The answer lies in what kind of failure you trade for what kind.

Synchronous RPC's problems: the order service must synchronously call 5 downstreams. (1) Temporal coupling: any downstream slow/down makes checkout slow/fail—the 5 downstreams' availabilities multiply, tanking overall availability; (2) cascading collapse: the slow warehouse → order-service threads pile up blocking → the order service itself dies → drags down upstream; (3) fan-out coupling: adding a 6th downstream requires changing order-service code.

The queue's conversion: the order service only needs to reliably write the event to the queue (one local + one queue write, made atomic with Outbox) and return immediately. Downstreams consume asynchronously—a downstream crash doesn't affect checkout, and it resumes from its offset on recovery; adding a downstream is just a new consumer group, zero order-service change; a slow consumer only backs up itself.

The cost: (1) eventual consistency—downstream state lags, the UI must cope ("processing"); (2) at-least-once duplicates—needs idempotence; (3) one more middleware to operate. Essentially you trade "eventual consistency + idempotence complexity" for "temporal decoupling + cascade resistance + easy extensibility." For orders—"checkout must be fast and stable, downstreams can be async"—it's a good deal; for "must get the result synchronously" cases (real-time fraud rejection), keep the synchronous call.

Problem & Constraints

High-Level Architecture

Key Techniques

1. Selection: Log (Kafka) vs Broker (RabbitMQ) vs Managed (SQS)

2. Delivery Semantics: at-most-once / at-least-once / exactly-once

3. Backpressure & Consumer Lag: a slow consumer must not stall upstream

4. Dead Letter Queue: the quarantine for poison messages

Scaling & Optimization

Pitfalls & Interview Questions

Further Reading

Going Deeper