Design an e-commerce order event bus: when a user places an order, the order service emits an order.created event consumed independently by 5 downstreams—inventory decrement, loyalty points, recommendation updates, data-warehouse ingestion, and fraud audit. These downstreams process at wildly different speeds: inventory takes 2ms, the warehouse takes 200ms per batched record.
Hard constraints:
created → paid → shipped must not reorder, or the state machine breaks.Today: why this scenario demands a message queue over synchronous RPC, how to choose Kafka / RabbitMQ / SQS, the essence of delivery semantics, backpressure & consumer lag, and how the DLQ contains poison messages.
Core idea: one event, many independent consumer groups, each tracking its own offset—this is pub/sub fan-out. The producer uses ack=all + idempotence to avoid loss; partitioning by order_id guarantees per-order ordering; a slow consumer's backlog affects only itself; messages whose retries are exhausted go to the DLQ without blocking the main stream.
Core trade-off: a retained, replayable log vs a flexible-routing smart broker vs a zero-ops managed queue.
Principle: these are three different abstractions. Kafka is a distributed append-only log: a message written to a partition is not deleted when consumed; a consumer merely advances its own offset, which naturally supports multi-subscriber fan-out, replaying history, and high-throughput sequential IO. RabbitMQ is a smart broker: messages route through an exchange by routing key to queues and are deleted once acked, strong at flexible routing (topic/fanout/header exchanges) and per-message control. SQS is AWS's fully managed queue: no ops, near-infinite elasticity, but the simplest feature set (standard queues are unordered; FIFO queues are ordered but throughput-limited).
| Dimension | Kafka | RabbitMQ | SQS |
|---|---|---|---|
| Abstraction | persistent log | broker + queue | managed queue |
| After consume | retained (by TTL) | deleted on ack | deleted on ack |
| Replay history | ✅ reset offset | ❌ | ❌ |
| Throughput | very high (M/s) | medium (10K/s) | high (auto-scales) |
| Routing flexibility | weak (topic+key) | ✅ strong | weak |
| Ops | heavy (self-managed) | medium | zero |
Core trade-off: no loss vs no duplicates can't both hold (without idempotence); in practice "exactly-once" = at-least-once + idempotence.
Principle: delivery semantics hinge on the order of ack vs processing. At-most-once: ack (commit offset) before processing—a crash loses the message, never duplicates; fine for droppable metric sampling. At-least-once: process before ack—a crash before ack redelivers, never loses but may duplicate; the default for most business. Exactly-once: unreachable at the network layer (Two Generals problem), but achievable in effect via at-least-once delivery + consumer-side idempotence/dedup, so each message takes effect exactly once.
def consume(msg):
if dedup.exists(msg.event_id): # dedup table / unique constraint
commit_offset(msg); return # duplicate delivery, skip
process(msg) # business: decrement stock / add points
dedup.insert(msg.event_id)
commit_offset(msg) # process then commit = at-least-once
# crash after process / before commit -> redeliver, dedup catches it
Core trade-off: buffer (absorb bursts) vs drop / throttle (protect the system)—queues turn synchronous backpressure into observable lag.
Principle: when produce rate > consume rate, messages pile up. In synchronous RPC this shows up as caller blocking, thread exhaustion, cascading collapse; a message queue converts it into consumer lag (offset deficit), a monitorable, bufferable metric. Two response classes: scale consumption (add consumers, but bounded by partition count—within a group a partition is consumed by only one consumer); limit production (producer throttle/reject, pushing pressure back to the source). Kafka's log model lets the backlog "sit on disk" relatively safely; a RabbitMQ queue backlog in memory/disk may trigger flow control that blocks producers.
while True:
batch = consumer.poll(max_records=500, timeout=100ms)
lag = end_offset(partition) - current_offset
if lag > HIGH_WATERMARK:
scale_out_signal() # trigger autoscaler to add consumers
process_batch(batch) # batching amortizes fixed overhead
consumer.commit() # commit offset once per batch
consumer lag (monitored via Burrow / Cruise Control) is a core SLI and the primary autoscaling signal for stream processing.Core trade-off: infinite retry (blocks the queue) vs drop (lose data) vs DLQ (isolate + keep evidence).
Principle: a message that repeatedly fails due to malformed data, a downstream bug, or a permanently unavailable dependency is a poison message. Infinite retry jams the whole partition/queue (especially ordered queues—everything behind it starves); dropping loses data without a trail. The DLQ approach: after N failed retries, move the message—with its failure context—to a dedicated dead-letter queue so the main stream proceeds; later analyze the DLQ manually or automatically, fix, and redrive back for replay.
def handle(msg):
try:
process(msg)
except NonRetryable as e: # malformed data, retry is pointless
to_dlq(msg, reason=e); return
except Retryable as e:
if msg.attempts >= MAX_RETRY:
to_dlq(msg, reason=e, attempts=msg.attempts)
else:
requeue(msg, delay=backoff(msg.attempts))
maxReceiveCount auto-move to the DLQ; in FIFO queues a poison message blocks the entire message group until processed or moved to the DLQ.order_id—ordered yet horizontally scalable. This is the core "partition key = unit of parallelism" trade-off.Likely follow-ups:
ack=0/1/all mean? When are messages lost?Result: 50 consumers sit completely idle. Kafka's unit of parallelism is the partition—within a consumer group, a partition is consumed by exactly one consumer at any moment (to guarantee no in-group duplication). 100 partitions are split among at most 100 consumers; the extra 50 grab no partition and are pure waste.
Corollary: max consumption parallelism = partition count. To scale to 150, you must first raise partitions to ≥150. But adding partitions has costs: (1) key-based hash routing changes (hash(key) % N with a new N), so a key's historical and new messages may land on different partitions, breaking ordering; (2) more partitions mean more broker metadata, file handles, and rebalance time.
So plan partition count at topic creation for future peak—one of the few "deciding upfront is far cheaper than fixing later" parameters. This is also why Uber built uForwarder—a push proxy that decouples "consumption parallelism" from "partition count."
First, the magnitude: 100M / 3600s ≈ a net catch-up rate of 28K/s (and above the live inflow). A single consumer at 200ms/msg does only 5/s—4 orders of magnitude short, so parallelism + batching are mandatory.
Deeper: catching up in an hour presupposes that partition count and downstream write capacity already had headroom. With only 10 partitions, no amount of added consumers exceeds 10-way parallelism—backlog recovery capacity is decided at design time, not conjured during the incident. That's the value of the log model letting backlogs sit safely on disk: it buys you a recovery window, but recovery speed is still capped by the architecture.
Consumer-side idempotence solves "the same message processed twice." But one class it can't solve: the atomicity of read-process-write—a consumer reads from topic A, processes, writes to topic B, then commits A's offset. If those three steps aren't atomic, different crash points break it: wrote B but didn't commit A's offset → reprocess on restart, duplicates in B; committed the offset but didn't write B → data lost in B.
Kafka transactions wrap "write B + commit A's offset" in one atomic transaction, with the idempotent producer (sequence-number dedup), achieving stream-processing EOS: all or nothing. Consumer-side idempotence can't do this, because the offset commit and the external write are two systems' actions.
But the key boundary: this only holds in the "Kafka → Kafka" loop. The moment you write to an external Postgres or call Stripe, Kafka transactions can't reach there—back to at-least-once + business idempotence. So EOS isn't a silver bullet; it precisely solves "in-Kafka stream processing," and understanding its boundary matters more than knowing it exists.
It's counterintuitive: doesn't adding a component increase the failure surface? The answer lies in what kind of failure you trade for what kind.
Synchronous RPC's problems: the order service must synchronously call 5 downstreams. (1) Temporal coupling: any downstream slow/down makes checkout slow/fail—the 5 downstreams' availabilities multiply, tanking overall availability; (2) cascading collapse: the slow warehouse → order-service threads pile up blocking → the order service itself dies → drags down upstream; (3) fan-out coupling: adding a 6th downstream requires changing order-service code.
The queue's conversion: the order service only needs to reliably write the event to the queue (one local + one queue write, made atomic with Outbox) and return immediately. Downstreams consume asynchronously—a downstream crash doesn't affect checkout, and it resumes from its offset on recovery; adding a downstream is just a new consumer group, zero order-service change; a slow consumer only backs up itself.
The cost: (1) eventual consistency—downstream state lags, the UI must cope ("processing"); (2) at-least-once duplicates—needs idempotence; (3) one more middleware to operate. Essentially you trade "eventual consistency + idempotence complexity" for "temporal decoupling + cascade resistance + easy extensibility." For orders—"checkout must be fast and stable, downstreams can be async"—it's a good deal; for "must get the result synchronously" cases (real-time fraud rejection), keep the synchronous call.