Day 8 Hard Distributed Systems Messaging Kafka / SQS Delivery / Backpressure / DLQ

Message Queues — Async Is the Price of Decoupling, and Its Entire ValueKafka vs RabbitMQ vs SQS · Delivery Semantics · Backpressure · Dead Letter Queue

Problem & Constraints

Design an e-commerce order event bus: when a user places an order, the order service emits an order.created event consumed independently by 5 downstreams—inventory decrement, loyalty points, recommendation updates, data-warehouse ingestion, and fraud audit. These downstreams process at wildly different speeds: inventory takes 2ms, the warehouse takes 200ms per batched record.

Hard constraints:

Today: why this scenario demands a message queue over synchronous RPC, how to choose Kafka / RabbitMQ / SQS, the essence of delivery semantics, backpressure & consumer lag, and how the DLQ contains poison messages.

High-Level Architecture

graph LR Prod["Order Service
Producer"] K["Kafka Topic
order.events
partitioned by order_id"] subgraph CG["Independent Consumer Groups (own offsets)"] C1["Inventory
fast 2ms"] C2["Loyalty Points"] C3["Recommendation"] C4["Warehouse
slow 200ms batch"] C5["Fraud Audit"] end DLQ["DLQ
order.events.dlq"] Prod -->|"ack=all
idempotent producer"| K K --> C1 K --> C2 K --> C3 K --> C4 K --> C5 C1 -.->|"retries exhausted"| DLQ C4 -.->|"poison msg"| DLQ classDef bus fill:#2a1530,stroke:#ff7ab6,color:#e8eef5 classDef dlq fill:#3a2010,stroke:#ffb450,color:#e8eef5 class K bus class DLQ dlq

Core idea: one event, many independent consumer groups, each tracking its own offset—this is pub/sub fan-out. The producer uses ack=all + idempotence to avoid loss; partitioning by order_id guarantees per-order ordering; a slow consumer's backlog affects only itself; messages whose retries are exhausted go to the DLQ without blocking the main stream.

Key Techniques

1. Selection: Log (Kafka) vs Broker (RabbitMQ) vs Managed (SQS)

Core trade-off: a retained, replayable log vs a flexible-routing smart broker vs a zero-ops managed queue.

Principle: these are three different abstractions. Kafka is a distributed append-only log: a message written to a partition is not deleted when consumed; a consumer merely advances its own offset, which naturally supports multi-subscriber fan-out, replaying history, and high-throughput sequential IO. RabbitMQ is a smart broker: messages route through an exchange by routing key to queues and are deleted once acked, strong at flexible routing (topic/fanout/header exchanges) and per-message control. SQS is AWS's fully managed queue: no ops, near-infinite elasticity, but the simplest feature set (standard queues are unordered; FIFO queues are ordered but throughput-limited).

DimensionKafkaRabbitMQSQS
Abstractionpersistent logbroker + queuemanaged queue
After consumeretained (by TTL)deleted on ackdeleted on ack
Replay history✅ reset offset
Throughputvery high (M/s)medium (10K/s)high (auto-scales)
Routing flexibilityweak (topic+key)✅ strongweak
Opsheavy (self-managed)mediumzero
How to choose:
Real-world cases:

2. Delivery Semantics: at-most-once / at-least-once / exactly-once

Core trade-off: no loss vs no duplicates can't both hold (without idempotence); in practice "exactly-once" = at-least-once + idempotence.

Principle: delivery semantics hinge on the order of ack vs processing. At-most-once: ack (commit offset) before processing—a crash loses the message, never duplicates; fine for droppable metric sampling. At-least-once: process before ack—a crash before ack redelivers, never loses but may duplicate; the default for most business. Exactly-once: unreachable at the network layer (Two Generals problem), but achievable in effect via at-least-once delivery + consumer-side idempotence/dedup, so each message takes effect exactly once.

Trade-offs of the three:
Pseudocode · at-least-once consume + idempotent dedup
def consume(msg):
    if dedup.exists(msg.event_id):   # dedup table / unique constraint
        commit_offset(msg); return     # duplicate delivery, skip
    process(msg)                       # business: decrement stock / add points
    dedup.insert(msg.event_id)
    commit_offset(msg)                 # process then commit = at-least-once
    # crash after process / before commit -> redeliver, dedup catches it
Real-world cases:

3. Backpressure & Consumer Lag: a slow consumer must not stall upstream

Core trade-off: buffer (absorb bursts) vs drop / throttle (protect the system)—queues turn synchronous backpressure into observable lag.

Principle: when produce rate > consume rate, messages pile up. In synchronous RPC this shows up as caller blocking, thread exhaustion, cascading collapse; a message queue converts it into consumer lag (offset deficit), a monitorable, bufferable metric. Two response classes: scale consumption (add consumers, but bounded by partition count—within a group a partition is consumed by only one consumer); limit production (producer throttle/reject, pushing pressure back to the source). Kafka's log model lets the backlog "sit on disk" relatively safely; a RabbitMQ queue backlog in memory/disk may trigger flow control that blocks producers.

Three responses to backlog, different costs:
Pseudocode · monitor lag + adaptive batching
while True:
    batch = consumer.poll(max_records=500, timeout=100ms)
    lag = end_offset(partition) - current_offset
    if lag > HIGH_WATERMARK:
        scale_out_signal()        # trigger autoscaler to add consumers
    process_batch(batch)          # batching amortizes fixed overhead
    consumer.commit()             # commit offset once per batch
Real-world cases:

4. Dead Letter Queue: the quarantine for poison messages

Core trade-off: infinite retry (blocks the queue) vs drop (lose data) vs DLQ (isolate + keep evidence).

Principle: a message that repeatedly fails due to malformed data, a downstream bug, or a permanently unavailable dependency is a poison message. Infinite retry jams the whole partition/queue (especially ordered queues—everything behind it starves); dropping loses data without a trail. The DLQ approach: after N failed retries, move the message—with its failure context—to a dedicated dead-letter queue so the main stream proceeds; later analyze the DLQ manually or automatically, fix, and redrive back for replay.

Key design points:
Pseudocode · retry → DLQ
def handle(msg):
    try:
        process(msg)
    except NonRetryable as e:     # malformed data, retry is pointless
        to_dlq(msg, reason=e); return
    except Retryable as e:
        if msg.attempts >= MAX_RETRY:
            to_dlq(msg, reason=e, attempts=msg.attempts)
        else:
            requeue(msg, delay=backoff(msg.attempts))
Real-world cases:

Scaling & Optimization

Pitfalls & Interview Questions

1. Assuming "using Kafka = exactly-once." EOS only closes the loop inside Kafka. Once you consume and write to an external DB or call a third party, it's still at-least-once + your own idempotence required. Interviewers asking about EOS expect you to name this boundary.
2. Non-idempotent consumers. At-least-once inevitably redelivers. Without a dedup table/unique constraint, you get double charges, double shipments. This is the #1 messaging trap.
3. Trading global ordering for zero parallelism. Stuffing all messages into one partition for global order tanks consumer throughput. 99% of cases only need per-key local ordering.
4. No DLQ, relying on infinite retry. One poison message can deadlock an ordered queue, starving everything behind it. You need a DLQ + retryable-error distinction.
5. Not monitoring consumer lag. Lag is the single most important health metric. Not monitoring = you won't know the warehouse is hours behind until downstream notices missing data.

Likely follow-ups:

  1. The essence of Kafka vs RabbitMQ? When is Kafka mandatory, when is RabbitMQ a better fit?
  2. How does at-least-once achieve "effective exactly-once"? Where does the idempotency key live, how is the dedup table cleaned?
  3. A topic with 100 partitions, a consumer group with 150 consumers—what happens?
  4. A backlog of 100M messages—how do you catch up fast? What levers, what costs?
  5. A poison message in an ordered queue deadlocks it—what do you do? How do you design the DLQ?
  6. What do producer ack=0/1/all mean? When are messages lost?

Further Reading

Going Deeper

1. A topic has 100 partitions and the consumer group has 150 consumers. What happens, and why?

Result: 50 consumers sit completely idle. Kafka's unit of parallelism is the partition—within a consumer group, a partition is consumed by exactly one consumer at any moment (to guarantee no in-group duplication). 100 partitions are split among at most 100 consumers; the extra 50 grab no partition and are pure waste.

Corollary: max consumption parallelism = partition count. To scale to 150, you must first raise partitions to ≥150. But adding partitions has costs: (1) key-based hash routing changes (hash(key) % N with a new N), so a key's historical and new messages may land on different partitions, breaking ordering; (2) more partitions mean more broker metadata, file handles, and rebalance time.

So plan partition count at topic creation for future peak—one of the few "deciding upfront is far cheaper than fixing later" parameters. This is also why Uber built uForwarder—a push proxy that decouples "consumption parallelism" from "partition count."

2. The warehouse consumer is 100M messages behind and the boss wants it caught up in 1 hour. Give at least 3 levers and their costs.

First, the magnitude: 100M / 3600s ≈ a net catch-up rate of 28K/s (and above the live inflow). A single consumer at 200ms/msg does only 5/s—4 orders of magnitude short, so parallelism + batching are mandatory.

  • ① Temporarily scale consumers to the partition cap: with 200 partitions, run 200 consumers. Cost: you need that many partitions; brief pause during rebalance.
  • ② Batching: turn "200ms per record" into "500-record batch ingestion," amortizing fixed network/transaction overhead; single-consumer throughput can rise 10–50×. Cost: coarser retry granularity, higher latency.
  • ③ Bypass acceleration topic: spin up a separate consumer group reading from the backlog head full-tilt into a "fast lane," keeping the original group on live traffic. Cost: temporary architectural complexity, must merge two streams.
  • ④ Degrade non-critical work: during the backlog, only land raw data, skip heavy recompute, backfill offline later. Cost: short-term incomplete data.

Deeper: catching up in an hour presupposes that partition count and downstream write capacity already had headroom. With only 10 partitions, no amount of added consumers exceeds 10-way parallelism—backlog recovery capacity is decided at design time, not conjured during the incident. That's the value of the log model letting backlogs sit safely on disk: it buys you a recovery window, but recovery speed is still capped by the architecture.

3. If "at-least-once + idempotence = exactly-once," why does Kafka still need exactly-once transactions? What does idempotence fail to solve?

Consumer-side idempotence solves "the same message processed twice." But one class it can't solve: the atomicity of read-process-write—a consumer reads from topic A, processes, writes to topic B, then commits A's offset. If those three steps aren't atomic, different crash points break it: wrote B but didn't commit A's offset → reprocess on restart, duplicates in B; committed the offset but didn't write B → data lost in B.

Kafka transactions wrap "write B + commit A's offset" in one atomic transaction, with the idempotent producer (sequence-number dedup), achieving stream-processing EOS: all or nothing. Consumer-side idempotence can't do this, because the offset commit and the external write are two systems' actions.

But the key boundary: this only holds in the "Kafka → Kafka" loop. The moment you write to an external Postgres or call Stripe, Kafka transactions can't reach there—back to at-least-once + business idempotence. So EOS isn't a silver bullet; it precisely solves "in-Kafka stream processing," and understanding its boundary matters more than knowing it exists.

4. Synchronous RPC also decouples services—so why introduce a message queue (an "extra middleware that can lose messages") on a critical path like orders?

It's counterintuitive: doesn't adding a component increase the failure surface? The answer lies in what kind of failure you trade for what kind.

Synchronous RPC's problems: the order service must synchronously call 5 downstreams. (1) Temporal coupling: any downstream slow/down makes checkout slow/fail—the 5 downstreams' availabilities multiply, tanking overall availability; (2) cascading collapse: the slow warehouse → order-service threads pile up blocking → the order service itself dies → drags down upstream; (3) fan-out coupling: adding a 6th downstream requires changing order-service code.

The queue's conversion: the order service only needs to reliably write the event to the queue (one local + one queue write, made atomic with Outbox) and return immediately. Downstreams consume asynchronously—a downstream crash doesn't affect checkout, and it resumes from its offset on recovery; adding a downstream is just a new consumer group, zero order-service change; a slow consumer only backs up itself.

The cost: (1) eventual consistency—downstream state lags, the UI must cope ("processing"); (2) at-least-once duplicates—needs idempotence; (3) one more middleware to operate. Essentially you trade "eventual consistency + idempotence complexity" for "temporal decoupling + cascade resistance + easy extensibility." For orders—"checkout must be fast and stable, downstreams can be async"—it's a good deal; for "must get the result synchronously" cases (real-time fraud rejection), keep the synchronous call.