Design an e-commerce checkout path: 50K orders/sec, p99 SLO 300ms, overall availability target 99.95% (~4.4h downtime/year). Placing an order synchronously calls roughly 30 downstream dependencies — inventory, coupons, fraud, address, payment pre-auth… Only inventory and payment are critical; the rest are non-critical.
What you really need to prevent isn't "a service died" but cascading failure: one GC pause in the coupon service pushes its response from 20ms to 5s. The caller waits synchronously, every order request pins a thread blocking on coupons — within seconds the thread pool is exhausted and even requests that never touch coupons can't get in. One non-critical dependency drags down the entire checkout. The core thesis of reliability engineering: keep local failures local. This issue covers four complementary weapons: circuit breaker, retry with backoff, bulkhead, graceful degradation.
Every downstream call passes through a resilience layer: timeout → bulkhead (isolated resource pool) → circuit breaker (tracks error rate) → retry (with backoff). If any stage decides failure, it immediately goes to a fallback (cached value / default / skip) instead of blocking. At the ingress there's also load shedding, which under overload protects critical requests first.
graph LR
IN["order request"]
LS{"load shed<br/>overload→drop"}
ORD["order svc<br/>orchestrate"]
BH["bulkhead<br/>pool per dep"]
CB{"breaker<br/>open?"}
DEP["downstream<br/>coupon/fraud..."]
FB["fallback<br/>cache/default/skip"]
IN --> LS -->|admit| ORD --> BH --> CB
LS -.->|reject 429| IN
CB -->|closed| DEP
CB -.->|open→fail fast| FB
DEP -.->|timeout/error| FB
classDef in fill:#1a2530,stroke:#64c8ff,color:#e8eef5
classDef gate fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
classDef core fill:#1a1a30,stroke:#ffb450,color:#e8eef5
classDef dep fill:#0e2030,stroke:#5eead4,color:#e8eef5
class IN in
class LS,CB gate
class ORD,BH core
class DEP,FB dep
Each dependency is isolated; failures fail fast to a fallback so one slow dep can't exhaust global resources
One-line trade-off: trade "temporary unavailability of the failing dependency" for "the caller fails fast and the downstream gets to breathe".
Principle: the electrical-breaker metaphor. A three-state machine: Closed (let traffic through, track failure rate) → failure rate exceeds threshold → Open (reject directly, send nothing, go straight to fallback) → cool down → Half-Open (admit a few probe requests) → success returns to Closed, failure goes back to Open. It turns "wait out the full timeout before failing" (wasting seconds each time while still piling pressure on a dying downstream) into fail fast, and by cutting the sustained barrage it gives the downstream a chance to recover.
# Circuit breaker state machine (pseudo-code)
# state, opened_at, cooldown, on_success, on_failure and fallback live elsewhere.
def call(dep, req):
    global state
    if state == OPEN:
        if now() - opened_at > cooldown:     # cooldown elapsed
            state = HALF_OPEN                # admit one probe
        else:
            return fallback(req)             # fail fast, don't hit dep
    # Note: a real breaker must also cap concurrent probes while HALF_OPEN.
    try:
        resp = dep.invoke(req, timeout=0.2)  # 200 ms timeout, in seconds
        on_success()                         # half-open success -> CLOSED
        return resp
    except (Timeout, ServerError):
        on_failure()                         # window error rate over threshold -> OPEN
        return fallback(req)
Reference: Netflix Hystrix wraps each remote dependency in a HystrixCommand, which trips when the error rate crosses a threshold (How it Works).
One-line trade-off: trade "extra load from retries" for "absorbing transient blips", but get it wrong and you pour gasoline on the fire.
Principle: transient faults (network blips, leader election, occasional timeouts) usually heal with one retry. But naive immediate retry is an amplifier: downstream overloads → many requests fail → all clients retry at once → downstream gets hit with double the traffic and dies for good. That's a retry storm. Two antidotes: ① exponential backoff, doubling the wait after each failure (100ms→200ms→400ms) to give the downstream room to recover; ② jitter, adding randomness so retry times spread out — otherwise clients "align" into a synchronized pulse and slam the downstream periodically.
# Full-jitter exponential backoff (AWS recommended)
# call, Retryable and retry_budget are assumed to be defined elsewhere.
import random
from time import sleep

def retry(req, max_attempts=3):
    base, cap = 0.1, 2.0                        # 100ms base, 2s cap
    for attempt in range(max_attempts):
        try:
            return call(req)                    # only retry idempotent + retryable errors
        except Retryable:
            if attempt == max_attempts - 1:     # attempts exhausted
                raise
            if not retry_budget.try_acquire():  # global retry budget spent
                raise
            backoff = min(cap, base * 2 ** attempt)
            sleep(random.uniform(0, backoff))   # full jitter
One-line trade-off: trade "lower overall resource utilization" for "a single dependency's failure can't exhaust global resources".
Principle: from shipbuilding — a hull split into watertight compartments, so a breach in one doesn't sink the ship. Mapped to systems: don't let all downstream calls share one thread/connection pool. Give each dependency its own quota (e.g. coupons gets at most 20 threads). Then if coupons slows down, it can only fill its own 20, while critical inventory and payment calls still have threads — the failure is sealed in its compartment. This is the real cure for the avalanche scenario at the top.
# Semaphore bulkhead: cap per-dependency concurrency (pseudo-code)
# deps, limit and fallback are assumed to be defined elsewhere.
from threading import Semaphore

sem = {dep: Semaphore(limit[dep]) for dep in deps}  # per-dep quota

def call(dep, req):
    if not sem[dep].acquire(blocking=False):  # full -> reject immediately, no queue
        return fallback(req)                  # compartment full -> doesn't spill to others
    try:
        return dep.invoke(req, timeout=0.2)   # 200 ms timeout, in seconds
    finally:
        sem[dep].release()
One-line trade-off: trade "feature completeness / result precision" for "keeping the critical path alive".
Principle: when a dependency fails or the system overloads, rather than returning 500 or collapsing entirely, return a degraded response: coupon service down → place the order at list price (fall back to a default); personalized recommendations down → return the popular list (fall back to cache/static). The companion at ingress is load shedding: as the system nears overload, proactively reject some requests (return 429) so the rest can complete normally — partial success beats everyone timing out. Choose what to drop by criticality: shed non-critical first, protect core checkout.
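Shedding by criticality can be sketched as an admission check with two limits. This is a minimal, single-threaded illustration under assumptions: the names `Priority` and `LoadShedder`, the in-flight counter as the overload signal, and the 80% soft limit are all hypothetical, not taken from any particular framework.

```python
from enum import Enum


class Priority(Enum):
    CRITICAL = 0      # e.g. core checkout
    NON_CRITICAL = 1  # e.g. recommendations


class LoadShedder:
    """Reject non-critical work first as in-flight load nears capacity."""

    def __init__(self, capacity, soft_ratio=0.8):
        self.capacity = capacity                      # hard cap on concurrent requests
        self.soft_limit = int(capacity * soft_ratio)  # cap for non-critical requests
        self.in_flight = 0

    def try_admit(self, priority):
        """Return True to admit; False means the caller should answer 429."""
        limit = self.capacity if priority is Priority.CRITICAL else self.soft_limit
        if self.in_flight >= limit:
            return False
        self.in_flight += 1
        return True

    def done(self):
        """Call when an admitted request finishes."""
        self.in_flight -= 1


shedder = LoadShedder(capacity=10)
admitted = [shedder.try_admit(Priority.NON_CRITICAL) for _ in range(10)]
# Non-critical traffic stops at the soft limit, leaving headroom for checkout.
critical_ok = shedder.try_admit(Priority.CRITICAL)
```

A production version would need locking or atomic counters, and would likely use a better overload signal (queue time, CPU) than a raw in-flight count.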
Amplification = 27×. Per-layer retries are multiplicative: 3×3×3. A request that should hit once becomes 27 at the bottom — exactly the mechanism by which a retry storm finishes off an overloaded downstream that was only trying to catch its breath.
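The arithmetic can be checked in a few lines. This sketch assumes "3×" means three attempts per layer across three layers, and that every attempt fails; the second function is an illustrative bound for the case where each layer caps retries at a fixed ratio of its normal traffic. Both function names are made up for this example.

```python
def worst_case_amplification(attempts_per_layer, layers):
    """Calls reaching the bottom layer when every attempt at every layer fails."""
    return attempts_per_layer ** layers


def budgeted_amplification(retry_ratio, layers):
    """Upper bound when each layer sends at most (1 + retry_ratio) x its input."""
    return (1 + retry_ratio) ** layers


print(worst_case_amplification(3, 3))              # 27
print(round(budgeted_amplification(0.10, 3), 3))   # 1.331
```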
Why retry budget is better: max-retries is a per-request local cap and can't constrain overall amplification; under broad failure every request maxes out 3×, still 27× when stacked. A retry budget uses a token bucket to cap "retries ≤ X% of normal requests" — a global ceiling, so no matter how big the failure, retry traffic can't exceed that ratio. In practice combine both: small per-request cap plus a global budget.
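One way to build such a budget is a token bucket in which normal requests earn retry credit and each retry spends it. The sketch below is an assumption-laden illustration: the class name, the integer "units" accounting (100 units = one retry, chosen to avoid floating-point drift), and the banking cap are all hypothetical. It exposes the `try_acquire()` call that the retry loop earlier relies on.

```python
import threading


class RetryBudget:
    """Token bucket: each original request earns credit, each retry spends it."""

    UNITS_PER_RETRY = 100

    def __init__(self, retry_percent=10, max_banked_retries=100):
        self.retry_percent = retry_percent                 # retries <= X% of requests
        self.max_units = max_banked_retries * self.UNITS_PER_RETRY
        self.units = 0
        self._lock = threading.Lock()

    def record_request(self):
        """Call once per original (non-retry) request."""
        with self._lock:
            self.units = min(self.max_units, self.units + self.retry_percent)

    def try_acquire(self):
        """Call before each retry; False means the global budget is spent."""
        with self._lock:
            if self.units >= self.UNITS_PER_RETRY:
                self.units -= self.UNITS_PER_RETRY
                return True
            return False


budget = RetryBudget(retry_percent=10)
for _ in range(100):
    budget.record_request()
# 100 normal requests at 10% fund 10 retries, however many are attempted.
granted = sum(budget.try_acquire() for _ in range(50))
```

However widespread the failure, only as many retries go out as the earned credit allows, which is the global ceiling that a per-request cap cannot provide.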
Risk: if at the instant cooldown ends you let all backed-up requests through to probe, you deliver another blow to a dying downstream — just as it's about to recover it's knocked back to Open, flapping in an open→half-open→open cycle, and the downstream never catches up. That's a mini thundering herd.
Design points: ① Half-Open admits only a tiny number of probes (e.g. 1 or a small fixed concurrency); everything else keeps failing fast to fallback; ② require N consecutive successes before returning to Closed (one success isn't enough — don't be fooled by a fluke); ③ a probe failure immediately reverts to Open and resets or extends the cooldown (add backoff to avoid frequent probing); ④ the cooldown itself can grow exponentially, so the longer the outage, the sparser the probes. Essence: Half-Open is "test recovery with a trickle," never "revive at full volume."
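The four design points can be sketched as a small state machine. This is a simplified illustration, not a reference implementation: it counts consecutive failures rather than a windowed error rate, it is not thread-safe, and every name and default (`CircuitBreaker`, `successes_to_close`, the cooldown values) is an assumption. The clock is injectable so the behavior can be exercised without sleeping.

```python
import time


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=5, successes_to_close=3,
                 base_cooldown=1.0, max_cooldown=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.successes_to_close = successes_to_close
        self.base_cooldown = base_cooldown
        self.max_cooldown = max_cooldown
        self.clock = clock
        self.state = self.CLOSED
        self.failures = 0            # consecutive failures while CLOSED
        self.successes = 0           # consecutive probe successes while HALF_OPEN
        self.cooldown = base_cooldown
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow(self):
        """Return True if the caller may hit the dependency right now."""
        if self.state == self.OPEN:
            if self.clock() - self.opened_at < self.cooldown:
                return False                     # still cooling down: fail fast
            self.state = self.HALF_OPEN
            self.successes = 0
        if self.state == self.HALF_OPEN:
            if self.probe_in_flight:
                return False                     # point 1: one probe at a time
            self.probe_in_flight = True
        return True

    def on_success(self):
        if self.state == self.HALF_OPEN:
            self.probe_in_flight = False
            self.successes += 1
            if self.successes >= self.successes_to_close:   # point 2
                self.state = self.CLOSED
                self.failures = 0
                self.cooldown = self.base_cooldown
        elif self.state == self.CLOSED:
            self.failures = 0

    def on_failure(self):
        if self.state == self.HALF_OPEN:
            # points 3 and 4: reopen at once and back off the next probe
            self.cooldown = min(self.max_cooldown, self.cooldown * 2)
            self._open()
        elif self.state == self.CLOSED:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._open()

    def _open(self):
        self.state = self.OPEN
        self.opened_at = self.clock()
        self.probe_in_flight = False
```

A caller would check `allow()` before each call, go to the fallback when it returns False, and report the outcome through `on_success()` or `on_failure()`.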
Consequences of an upstream timeout shorter than the downstream's (say upstream 1s, downstream 2s): the downstream is still processing normally (say it would return at 1.5s), but upstream already gave up at 1s. So: ① the downstream finishes work nobody wants, which is pure waste; ② upstream counts the "timeout" as a failure and may retry, sending another request to an already-busy downstream and amplifying load; ③ upstream's timeouts pollute the breaker's stats, tripping it on a false judgment that the downstream is unhealthy when nothing is actually broken.
Correct approach: the timeout budget decreases top-down. Ingress 300ms; after network and self overhead, give downstream 250ms; it gives its downstream 200ms… each layer tighter than its parent. Better still, use deadline propagation: pass an absolute deadline down, and a downstream that sees the deadline already passed just gives up without computing. This gives the whole chain a consistent view of "how much time is left," eliminating wasted work.
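Deadline propagation can be sketched as a helper that computes the remaining budget before every downstream call. This is a single-process illustration under assumptions: `call_downstream`, `DeadlineExceeded`, the 50ms overhead, and the `invoke(timeout=..., deadline=...)` client signature are all hypothetical. Across machines the deadline would need a shared clock or conversion to a remaining-time value, which this sketch does not cover.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when there is no time left to make a useful call."""


def remaining(deadline, clock=time.monotonic):
    """Seconds left before the absolute deadline (may be negative)."""
    return deadline - clock()


def call_downstream(invoke, deadline, overhead=0.05, clock=time.monotonic):
    """Give the downstream what is left of the budget minus our own overhead."""
    budget = remaining(deadline, clock) - overhead
    if budget <= 0:
        # The caller has already given up (or will before we finish): do no work.
        raise DeadlineExceeded("no time left; skipping the call")
    # Pass the same absolute deadline on so deeper layers can apply the same check.
    return invoke(timeout=budget, deadline=deadline)
```

With a 300ms ingress deadline and 50ms of overhead, the first hop gets roughly 250ms, matching the decreasing budget described above; a request that arrives after the deadline is rejected before any work starts.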
Goodput view: throughput is how many requests you accept; goodput is how many complete successfully within SLO. Under overload, if you admit everything, every request queues and slows, eventually all exceed the deadline — clients get nothing but timeouts, goodput approaches 0, while CPU is fully spent on doomed requests. Shed some (return 429) and the rest have ample resources to finish within SLO, so goodput is actually higher. Do a little less, accomplish more.
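A toy simulation makes the goodput argument concrete. It is a deliberately crude model with assumed numbers: time advances in ticks, the server completes a fixed number of requests per tick in FIFO order, and a request counts toward goodput only if it completes within `deadline_ticks` of arriving. The function name and parameters are illustrative.

```python
from collections import deque


def simulate(arrivals_per_tick, capacity, deadline_ticks, ticks, shed):
    """Return goodput: requests completed within `deadline_ticks` of arrival."""
    queue = deque()                        # arrival tick of each waiting request
    good = 0
    max_queue = capacity * deadline_ticks  # anything queued deeper cannot finish in time
    for tick in range(ticks):
        for _ in range(arrivals_per_tick):
            if shed and len(queue) >= max_queue:
                continue                   # reject now (429) instead of queueing
            queue.append(tick)
        for _ in range(min(capacity, len(queue))):
            arrived = queue.popleft()
            latency = tick - arrived + 1   # completes at the end of this tick
            if latency <= deadline_ticks:
                good += 1                  # otherwise capacity was spent on a doomed request
    return good


# 150 arrivals per tick against capacity 100, deadline of 3 ticks, 100 ticks.
no_shed = simulate(150, 100, 3, 100, shed=False)
with_shed = simulate(150, 100, 3, 100, shed=True)
print(no_shed, with_shed)
```

Without shedding the backlog grows every tick, so after the first few ticks the server spends all of its capacity on requests that have already missed the deadline. With shedding, every request it serves still counts.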
Death spiral: no shedding + client retries = positive feedback. Overload→slow→timeout→client retry→higher load→slower… each retry round pushes the system deeper, with no self-healing. Breaking the spiral needs: server-side load shedding + client-side throttling (adaptive throttling that reduces sends based on recent reject rate) + retry budget. Miss any one and the system "locks up" at the overload point instead of bouncing back.
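Client-side adaptive throttling can be sketched with one published rule, described in Google's SRE book chapter on handling overload: the client drops a request locally with probability max(0, (requests − K × accepts) / (requests + 1)). The class below is an illustrative assumption built around that rule; it omits the sliding time window a real client would keep, and the name `AdaptiveThrottle` and the injectable `rng` are hypothetical.

```python
import random


class AdaptiveThrottle:
    """Drop requests locally in proportion to how often the backend rejects."""

    def __init__(self, k=2.0, rng=random.random):
        self.k = k            # higher k = more tolerant before throttling starts
        self.requests = 0     # attempts in the recent window
        self.accepts = 0      # of those, accepted by the backend
        self.rng = rng

    def reject_probability(self):
        return max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))

    def should_send(self):
        """False means fail locally without touching the overloaded backend."""
        return self.rng() >= self.reject_probability()

    def record(self, accepted):
        """Record the outcome of one attempt (accepted=False for a 429)."""
        self.requests += 1
        if accepted:
            self.accepts += 1


throttle = AdaptiveThrottle(k=2.0)
for i in range(100):
    throttle.record(accepted=(i < 10))   # backend accepted only 10 of 100
print(throttle.reject_probability())
```

While the backend accepts most traffic the probability stays at zero. As rejects pile up the client sends less of its own accord, which is what breaks the positive feedback loop.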
They cannot substitute for one another; they intercept at different stages, as defense in depth:
Stitched together: the bulkhead contains the failure, the breaker stops useless calls, retry absorbs transient blips, degradation catches the user experience. Drop any one and this slow coupon call causes perceptible harm at some layer.