Day 23 Medium Reliability Circuit Breaker Backoff Bulkhead

Reliability: One flaky dependency shouldn't take down the whole chain
Circuit Breaker, Retry/Backoff, Bulkhead & Graceful Degradation

Problem & Constraints

Design an e-commerce checkout path: 50K orders/sec, p99 SLO 300ms, overall availability target 99.95% (~4.4h downtime/year). Placing an order synchronously calls roughly 30 downstream dependencies — inventory, coupons, fraud, address, payment pre-auth… Only inventory and payment are critical; the rest are non-critical.

What you really need to prevent isn't "a service died" but cascading failure. Suppose one GC pause in the coupon service pushes its response time from 20ms to 5s. Because the caller waits synchronously, every order request pins a thread blocked on coupons; within seconds the thread pool is exhausted, and even requests that never touch coupons can't get in. One non-critical dependency drags down the entire checkout. The core thesis of reliability engineering: keep local failures local. This issue covers four complementary weapons: circuit breaker, retry with backoff, bulkhead, and graceful degradation.
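To get a feel for how fast the pool drains, Little's law (threads held = arrival rate × time each request holds a thread) gives a rough estimate. The sketch below assumes every order calls coupons and that the fleet has 5,000 request threads in total; `FLEET_THREADS` is a made-up figure for illustration, not part of the design.

```python
# Little's law: threads held = arrival rate (req/s) * seconds each request holds a thread.
def threads_in_use(arrival_rate_per_s: float, latency_s: float) -> float:
    return arrival_rate_per_s * latency_s

ORDERS_PER_S = 50_000     # from the problem statement
FLEET_THREADS = 5_000     # hypothetical total thread count across the fleet

healthy = threads_in_use(ORDERS_PER_S, 0.020)   # coupons answering in 20 ms
degraded = threads_in_use(ORDERS_PER_S, 5.0)    # coupons answering in 5 s

print(f"healthy:  {healthy:,.0f} threads busy on coupons")
print(f"degraded: {degraded:,.0f} threads wanted, {FLEET_THREADS:,} available")
# Time to fill the pool if no blocked thread is released during the stall.
print(f"pool full after about {FLEET_THREADS / ORDERS_PER_S:.2f} s")
```

Under these assumptions the demand for threads grows 250-fold, which is why a single slow dependency can starve requests that never call it.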

High-Level Architecture

Every downstream call passes through a resilience layer: timeout → bulkhead (isolated resource pool) → circuit breaker (tracks error rate) → retry (with backoff). If any stage reports a failure, the call goes straight to a fallback (cached value / default / skip) instead of blocking. At the ingress there is also load shedding, which protects critical requests first when the system is overloaded.

graph LR
    IN["order request"]
    LS{"load shed<br/>overload→drop"}
    ORD["order svc<br/>orchestrate"]
    BH["bulkhead<br/>pool per dep"]
    CB{"breaker<br/>open?"}
    DEP["downstream<br/>coupon/fraud..."]
    FB["fallback<br/>cache/default/skip"]
    IN --> LS -->|admit| ORD --> BH --> CB
    LS -.->|reject 429| IN
    CB -->|closed| DEP
    CB -.->|open→fail fast| FB
    DEP -.->|timeout/error| FB
    classDef in fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef gate fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    classDef core fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef dep fill:#0e2030,stroke:#5eead4,color:#e8eef5
    class IN in
    class LS,CB gate
    class ORD,BH core
    class DEP,FB dep

Each dependency is isolated; failures fail fast to a fallback so one slow dep can't exhaust global resources

Key Techniques

1. Circuit Breaker — pull the switch on a failing dependency

One-line trade-off: trade "temporary unavailability of the failing dependency" for "the caller fails fast and the downstream gets to breathe".

Principle: the electrical-breaker metaphor. A three-state machine: Closed (let traffic through, track failure rate) → failure rate exceeds threshold → Open (reject directly, send nothing, go straight to fallback) → cool down → Half-Open (admit a few probe requests) → success returns to Closed, failure goes back to Open. It turns "wait out the full timeout before failing" (wasting seconds each time while still piling pressure on a dying downstream) into fail fast, and by cutting the sustained barrage it gives the downstream a chance to recover.

Trade-off:
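# Circuit breaker state machine (minimal sketch; a consecutive-failure
# count stands in for the sliding-window error rate)
import time

class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=30.0):   # illustrative defaults
        self.state, self.failures, self.opened_at = "CLOSED", 0, 0.0
        self.threshold, self.cooldown = threshold, cooldown

    def call(self, invoke, fallback, req):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at > self.cooldown:  # cooldown elapsed
                self.state = "HALF_OPEN"        # admit probes (not rate-limited here)
            else:
                return fallback(req)            # fail fast, don't hit dep
        try:
            resp = invoke(req, timeout=0.2)     # 200 ms per-call timeout
        except (TimeoutError, ConnectionError): # stand-ins for Timeout / ServerError
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.threshold:
                self.state, self.opened_at = "OPEN", time.monotonic()
            return fallback(req)
        self.state, self.failures = "CLOSED", 0 # success (incl. half-open probe) -> CLOSED
        return resp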
Real-world:

2. Retry with Backoff + Jitter — a double-edged sword

One-line trade-off: trade "extra load from retries" for "absorbing transient blips" — but get it wrong and you pour gasoline on the fire.

Principle: transient faults (network blips, leader election, occasional timeouts) usually heal with one retry. But naive immediate retry is an amplifier: downstream overloads → many requests fail → all clients retry at once → downstream gets hit with double the traffic and dies for good. That's a retry storm. Two antidotes: ① exponential backoff, doubling the wait after each failure (100ms→200ms→400ms) to give the downstream room to recover; ② jitter, adding randomness so retry times spread out — otherwise clients "align" into a synchronized pulse and slam the downstream periodically.

Trade-off:
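# Full-jitter exponential backoff (AWS recommended)
import random
import time

class Retryable(Exception):
    """Transient error that is safe to retry (timeout, 5xx on an idempotent call)."""

def retry(call, req, retry_budget, max_attempts=3, base=0.1, cap=2.0):
    # base = 100 ms backoff ceiling for the first retry, cap = 2 s maximum
    for attempt in range(max_attempts):
        try:
            return call(req)       # only retry idempotent + retryable errors
        except Retryable:
            if attempt == max_attempts - 1:
                raise              # attempts exhausted
            if not retry_budget.try_acquire():
                raise              # global retry budget spent
            backoff = min(cap, base * 2 ** attempt)
            time.sleep(random.uniform(0, backoff))   # full jitter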
Real-world:

3. Bulkhead — partition resources into watertight compartments

One-line trade-off: trade "lower overall resource utilization" for "a single dependency's failure can't exhaust global resources".

Principle: from shipbuilding — a hull split into watertight compartments, so a breach in one doesn't sink the ship. Mapped to systems: don't let all downstream calls share one thread/connection pool. Give each dependency its own quota (e.g. coupons gets at most 20 threads). Then if coupons slows down, it can only fill its own 20, while critical inventory and payment calls still have threads — the failure is sealed in its compartment. This is the real cure for the avalanche scenario at the top.

Trade-off:
# Semaphore bulkhead: cap per-dependency concurrency (minimal sketch)
import threading

LIMITS = {"coupon": 20, "fraud": 20}    # per-dep quota (illustrative values)
sem = {dep: threading.BoundedSemaphore(n) for dep, n in LIMITS.items()}

def call(dep, invoke, fallback, req):
    if not sem[dep].acquire(blocking=False):   # full -> reject immediately, no queue
        return fallback(req)                   # compartment full -> doesn't spill to others
    try:
        return invoke(req, timeout=0.2)        # 200 ms per-call timeout
    finally:
        sem[dep].release()                     # always free the slot
Real-world:

4. Graceful Degradation + Load Shedding — partial beats total death

One-line trade-off: trade "feature completeness / result precision" for "keeping the critical path alive".

Principle: when a dependency fails or the system overloads, rather than returning 500 or collapsing entirely, return a degraded response: coupon service down → place the order at list price (fall back to a default); personalized recommendations down → return the popular list (fall back to cache/static). The companion at ingress is load shedding: as the system nears overload, proactively reject some requests (return 429) so the rest can complete normally — partial success beats everyone timing out. Choose what to drop by criticality: shed non-critical first, protect core checkout.
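Shedding by criticality can be sketched as an admission check that gives non-critical traffic a lower ceiling than core checkout. The class name, the 80% ratio, and the in-flight counter below are illustrative assumptions; the sketch is single-threaded and would need a lock in a real server.

```python
from enum import IntEnum

class Criticality(IntEnum):
    CRITICAL = 0       # core checkout: inventory, payment
    NON_CRITICAL = 1   # coupons, recommendations, ...

class LoadShedder:
    """Rejects requests once in-flight load crosses a per-criticality limit."""

    def __init__(self, capacity: int, non_critical_ratio: float = 0.8):
        # Non-critical traffic is shed earlier, leaving headroom for checkout.
        self.limits = {
            Criticality.CRITICAL: capacity,
            Criticality.NON_CRITICAL: int(capacity * non_critical_ratio),
        }
        self.in_flight = 0

    def try_admit(self, criticality: Criticality) -> bool:
        if self.in_flight >= self.limits[criticality]:
            return False           # caller should answer HTTP 429
        self.in_flight += 1
        return True

    def done(self) -> None:
        """Call when an admitted request finishes."""
        self.in_flight -= 1
```

With `capacity=10`, the ninth concurrent non-critical request is rejected while critical requests are still admitted until all ten slots are taken.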

Trade-off:
Real-world:

Extensions & Optimizations

Common Pitfalls + Interview Follow-ups

1. Retries stack across layers → multiplicative amplification: in a chain client→A→B→C where every caller makes up to 3 attempts, the bottom service C can receive 3×3×3 = 27× the original traffic. Rule: retry only at the outermost layer, or share a retry budget end-to-end.
2. Timeout hierarchy inverted: the upstream timeout must exceed the total time its downstream calls can take (the sum of their timeouts when they run sequentially, retries included). If upstream is 1s and downstream is 2s, upstream gives up early while downstream keeps computing: wasted resources, plus possible upstream retries. Timeouts should decrease top-down.
3. Retrying without distinguishing error types: most 4xx errors (bad params, auth failure) won't succeed on retry no matter how many attempts are made; retrying them just adds load. Retry only timeouts and 5xx, and only when the operation is idempotent (a double charge is an incident).
4. The fallback itself depends on the failed component: if degradation reads from a "cache" that shares the same dead Redis as the main path, the fallback is useless. The fallback path must be simpler and more independent than the main path.
5. Guessing breaker thresholds: too sensitive → normal blips trip it (false positives); too dull → effectively absent. Base it on a minimum request count + sliding-window error rate, and think through half-open probe concurrency (don't let probes become a thundering herd).
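The rule in pitfall 3 fits in a small predicate. This is a sketch: the function name and its parameters are hypothetical, and `status` is assumed to be `None` when the call timed out before any response arrived.

```python
from typing import Optional

def should_retry(status: Optional[int], timed_out: bool, idempotent: bool) -> bool:
    """Retry only timeouts and 5xx, and only for idempotent operations."""
    if not idempotent:
        return False               # e.g. a charge without an idempotency key
    if timed_out:
        return True
    return status is not None and 500 <= status <= 599

print(should_retry(503, timed_out=False, idempotent=True))   # True
print(should_retry(400, timed_out=False, idempotent=True))   # False
print(should_retry(503, timed_out=False, idempotent=False))  # False
```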

Further Resources

Deeper Thinking

1. In a chain client→A→B→C, every caller (the client, A, and B) is configured for up to 3 attempts. When C briefly overloads, what's the actual amplification it absorbs? Why is a retry budget safer than max-retries?

Amplification = up to 27×. Per-layer retries are multiplicative: 3×3×3 across the three calling hops. A request that should hit C once becomes as many as 27 at the bottom, which is exactly the mechanism by which a retry storm finishes off an overloaded downstream that was only trying to catch its breath.

Why retry budget is better: max-retries is a per-request local cap and can't constrain overall amplification; under broad failure every request maxes out 3×, still 27× when stacked. A retry budget uses a token bucket to cap "retries ≤ X% of normal requests" — a global ceiling, so no matter how big the failure, retry traffic can't exceed that ratio. In practice combine both: small per-request cap plus a global budget.
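One way to build the `retry_budget` object that the retry pseudo-code calls `try_acquire()` on is a small credit counter: each first attempt earns a fraction of a retry, and each retry spends a whole one. The class below is a sketch under that assumption; the 10% ratio, the cap, and `record_request` are illustrative, and it is not thread-safe.

```python
class RetryBudget:
    """Caps retries at roughly `retry_percent` % of first-attempt requests."""

    COST = 100  # credits one retry consumes (integers avoid float drift)

    def __init__(self, retry_percent: int = 10, max_banked_retries: int = 10):
        self.earn = retry_percent                   # credits earned per first attempt
        self.cap = self.COST * max_banked_retries   # bound on saved-up credits
        self.credits = 0

    def record_request(self) -> None:
        """Call once per first attempt."""
        self.credits = min(self.cap, self.credits + self.earn)

    def try_acquire(self) -> bool:
        """Call before each retry; False means give up instead of retrying."""
        if self.credits < self.COST:
            return False
        self.credits -= self.COST
        return True
```

Because credits only accrue from real traffic, a broad outage cannot push retry volume past the configured ratio, however many individual requests fail.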

2. After the breaker goes Open and the cooldown elapses to Half-Open, it admits probes. What happens if probe concurrency isn't limited? How should you design Half-Open?

Risk: if at the instant cooldown ends you let all backed-up requests through to probe, you deliver another blow to a dying downstream — just as it's about to recover it's knocked back to Open, flapping in an open→half-open→open cycle, and the downstream never catches up. That's a mini thundering herd.

Design points: ① Half-Open admits only a tiny number of probes (e.g. 1 or a small fixed concurrency); everything else keeps failing fast to fallback; ② require N consecutive successes before returning to Closed (one success isn't enough — don't be fooled by a fluke); ③ a probe failure immediately reverts to Open and resets or extends the cooldown (add backoff to avoid frequent probing); ④ the cooldown itself can grow exponentially, so the longer the outage, the sparser the probes. Essence: Half-Open is "test recovery with a trickle," never "revive at full volume."
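The four design points can be sketched as a small gate that only manages the Open and Half-Open phases. All names and defaults here are assumptions for illustration; `trip()` is expected to be called by whatever component tracks the error rate, and the clock is injectable so the behaviour can be tested without sleeping.

```python
import time

class HalfOpenBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, base_cooldown=1.0, max_cooldown=60.0,
                 max_probes=1, successes_to_close=3, clock=time.monotonic):
        self.base_cooldown, self.max_cooldown = base_cooldown, max_cooldown
        self.max_probes, self.successes_to_close = max_probes, successes_to_close
        self.clock = clock
        self.state = self.CLOSED
        self.cooldown = base_cooldown
        self.opened_at = 0.0
        self.probes_in_flight = 0
        self.probe_successes = 0

    def trip(self):
        """Closed -> Open, starting from the base cooldown."""
        self.cooldown = self.base_cooldown
        self._open()

    def _open(self):
        self.state = self.OPEN
        self.opened_at = self.clock()
        self.probes_in_flight = 0
        self.probe_successes = 0

    def allow(self) -> bool:
        """True: call the dependency. False: fail fast to the fallback."""
        if self.state == self.CLOSED:
            return True
        if self.state == self.OPEN:
            if self.clock() - self.opened_at < self.cooldown:
                return False
            self.state = self.HALF_OPEN
        if self.probes_in_flight >= self.max_probes:   # (1) only a trickle of probes
            return False
        self.probes_in_flight += 1
        return True

    def record(self, success: bool):
        """Report the outcome of a call that allow() admitted."""
        if self.state != self.HALF_OPEN:
            return
        self.probes_in_flight -= 1
        if not success:
            # (3) + (4): a failed probe reopens the breaker with a doubled cooldown
            self.cooldown = min(self.max_cooldown, self.cooldown * 2)
            self._open()
            return
        self.probe_successes += 1
        if self.probe_successes >= self.successes_to_close:   # (2) N successes
            self.state = self.CLOSED
            self.cooldown = self.base_cooldown
```

Callers check `allow()` before each call and pass the outcome to `record()`; anything rejected by `allow()` goes to the fallback.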

3. Why must the upstream timeout exceed the sum of downstream timeouts? What chain reaction follows from setting upstream shorter than downstream?

Consequences of setting it short: upstream 1s, downstream 2s. The downstream is still processing normally (say it would return at 1.5s), but upstream already gave up at 1s. So: ① the downstream finishes work nobody wants — pure waste; ② upstream counts the "timeout" as a failure and may retry, sending another request to an already-busy downstream, amplifying load; ③ upstream's timeout pollutes the breaker's stats, tripping it on a false judgment that the downstream is unhealthy — when nothing is actually broken.

Correct approach: the timeout budget decreases top-down. Ingress gets 300ms; after subtracting network and its own overhead, it gives its downstream 250ms; that service gives its own downstream 200ms, and so on, each layer tighter than its parent. Better still, use deadline propagation: pass an absolute deadline down, and a downstream that sees the deadline has already passed gives up without computing. This gives the whole chain a consistent view of "how much time is left," eliminating wasted work.
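Deadline propagation can be sketched as one helper that each hop calls before doing work. `hop_timeout` and `DeadlineExceeded` are hypothetical names, and the 50ms reserve per hop is an assumption chosen to match the 300ms → 250ms → 200ms example. The clock values are only comparable inside one process; across machines a real system would typically send the remaining time instead.

```python
import time

class DeadlineExceeded(Exception):
    """Raised when no time is left to do useful work."""

def hop_timeout(deadline: float, reserve_s: float, clock=time.monotonic) -> float:
    """Timeout to hand to the next hop: time left minus this hop's own overhead."""
    left = deadline - clock() - reserve_s
    if left <= 0:
        raise DeadlineExceeded("deadline already spent; skip the work")
    return left

# Ingress fixes one absolute deadline and every hop derives its timeout from it.
deadline = time.monotonic() + 0.300               # 300 ms end to end
first_hop = hop_timeout(deadline, reserve_s=0.050)   # roughly 0.250 s
second_hop = hop_timeout(deadline, reserve_s=0.100)  # roughly 0.200 s
```

Because every layer computes its timeout from the same deadline, a slow hop automatically shrinks the budget of the hops below it rather than leaving them to run past the caller's patience.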

4. Under overload, load shedding (proactively rejecting some) counterintuitively beats "admit everything." Explain via goodput, and how retries push a no-shedding system into a death spiral.

Goodput view: throughput is how many requests you accept; goodput is how many complete successfully within SLO. Under overload, if you admit everything, every request queues and slows, eventually all exceed the deadline — clients get nothing but timeouts, goodput approaches 0, while CPU is fully spent on doomed requests. Shed some (return 429) and the rest have ample resources to finish within SLO, so goodput is actually higher. Do a little less, accomplish more.
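A toy simulation makes the goodput argument concrete. It assumes a single FIFO worker with a fixed 10ms service time (capacity 100 req/s), evenly spaced arrivals at 150 req/s, and a 300ms deadline; all of these numbers are illustrative, not taken from the checkout design.

```python
def simulate(arrivals_per_s, service_s, deadline_s, duration_s, shed):
    """Single FIFO worker. Returns how many requests finish within the deadline."""
    interval = 1.0 / arrivals_per_s
    free_at = 0.0     # time at which the worker has drained its backlog
    good = 0
    for i in range(int(arrivals_per_s * duration_s)):
        arrival = i * interval
        finish = max(arrival, free_at) + service_s
        if shed and finish - arrival > deadline_s:
            continue  # reject now (429): this request could not meet its deadline
        free_at = finish                      # request is admitted and occupies the worker
        if finish - arrival <= deadline_s:
            good += 1
    return good

no_shed = simulate(150, 0.010, 0.300, 10, shed=False)
with_shed = simulate(150, 0.010, 0.300, 10, shed=True)
print(f"goodput without shedding: {no_shed} of 1500")
print(f"goodput with shedding:    {with_shed} of 1500")
```

Without shedding, the queue grows by a few milliseconds per request, so only the earliest arrivals meet the deadline and everything after them is wasted work. With shedding, the worker stays busy on requests that can still finish in time.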

Death spiral: no shedding + client retries = positive feedback. Overload→slow→timeout→client retry→higher load→slower… each retry round pushes the system deeper, with no self-healing. Breaking the spiral needs: server-side load shedding + client-side throttling (adaptive throttling that reduces sends based on recent reject rate) + retry budget. Miss any one and the system "locks up" at the overload point instead of bouncing back.
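Client-side adaptive throttling can be sketched with the rejection-probability formula described in Google's SRE book: the client drops a request locally with probability max(0, (requests − K × accepts) / (requests + 1)), where both counters cover a recent window. The function names and the default K = 2 below are assumptions for illustration; window bookkeeping is left out.

```python
import random

def reject_probability(requests: int, accepts: int, k: float = 2.0) -> float:
    """Chance of dropping a request locally, given recent backend behaviour."""
    return max(0.0, (requests - k * accepts) / (requests + 1))

def should_send(requests: int, accepts: int, k: float = 2.0, rng=random.random) -> bool:
    """False means fail locally without adding load to the backend."""
    return rng() >= reject_probability(requests, accepts, k)

print(reject_probability(100, 100))  # healthy backend: nothing dropped locally
print(reject_probability(100, 25))   # backend accepting a quarter: about half dropped
print(reject_probability(100, 0))    # backend rejecting everything: nearly all dropped
```

A larger K lets more traffic through before the client starts dropping, trading backend protection for fewer false rejections.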

5. Circuit breaker, bulkhead, retry, degradation — can they substitute for each other? Use one "coupon service slows to 5s" incident to show where each intercepts and what breaks if it's missing.

No substitution — they intercept at different stages, as defense in depth:

  • Bulkhead intercepts "resource exhaustion": coupons slow, it fills at most its dedicated 20 threads, critical calls unaffected. Missing it → the slow dep eats all global threads, site-wide avalanche (the opening scenario).
  • Circuit breaker intercepts "sustained useless waiting": detects coupon error rate spiking and trips, so later requests fail fast instead of each waiting 5s. Missing it → even with a bulkhead, those 20 threads stay pinned by 5s slow calls and that dependency's throughput goes to zero.
  • Retry (+backoff) handles "transient" blips: a one-off slow call can be saved by backoff retry; but for sustained failure the breaker must stop retries, or retries add pressure.
  • Degradation decides "what the user gets after failure": once the breaker trips / the call fails, fall back to list-price checkout so the user still completes the core flow. Missing it → the first three all work, but the user still sees a failed order.

Stitched together: the bulkhead contains the failure, the breaker stops useless calls, retry absorbs transient blips, degradation catches the user experience. Drop any one and this slow coupon call causes perceptible harm at some layer.