Day 23 · Medium · Reliability · Circuit Breaker · Backoff · Bulkhead

Reliability: One flaky dependency shouldn't take down the whole chain

Circuit Breaker, Retry/Backoff, Bulkhead & Graceful Degradation

Problem & Constraints

Design an e-commerce checkout path: 50K orders/sec, p99 SLO 300ms, overall availability target 99.95% (~4.4h downtime/year). Placing an order synchronously calls roughly 30 downstream dependencies — inventory, coupons, fraud, address, payment pre-auth… Only inventory and payment are critical; the rest are non-critical.

What you really need to prevent isn't "a service died" but cascading failure: one GC pause in the coupon service pushes its response from 20ms to 5s. The caller waits synchronously, every order request pins a thread blocking on coupons — within seconds the thread pool is exhausted and even requests that never touch coupons can't get in. One non-critical dependency drags down the entire checkout. The core thesis of reliability engineering: keep local failures local. This issue covers four complementary weapons: circuit breaker, retry with backoff, bulkhead, graceful degradation.

Failure model: downstreams rarely die outright — they go slow. Slow is worse than dead, because it quietly drains your threads/connections.
Dependency tiering: critical vs non-critical. Non-critical deps must be skippable, or their availability becomes your floor.
Serial availability: 30 deps in series, each 99.9% → overall ≈ 99.9%³⁰ ≈ 97%. Without isolation, more deps = more fragile.
Idempotency: determines whether you can safely retry (see Day 7, Day 17).
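
The availability arithmetic above can be checked in a few lines. This is a minimal sketch that assumes failures are independent and that every dependency is 99.9% available, as in the text; the function names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def serial_availability(per_dep: float, n_deps: int) -> float:
    """Availability of a request that needs all n_deps to succeed (independent failures assumed)."""
    return per_dep ** n_deps

def downtime_hours_per_year(availability: float) -> float:
    """Expected unavailable hours per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR

all_required = serial_availability(0.999, 30)   # every dependency sits on the critical path
critical_only = serial_availability(0.999, 2)   # only inventory and payment are required

print(f"30 required deps: {all_required:.4f}")
print(f"2 required deps:  {critical_only:.4f}")
print(f"99.95% target:    {downtime_hours_per_year(0.9995):.2f} h/year")
```

Under these assumptions, making the 28 non-critical dependencies skippable lifts the ceiling from roughly 97% to roughly 99.8%. That is still below the 99.95% target, which suggests the two critical dependencies need to be better than 99.9% each.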

High-Level Architecture

Every downstream call passes through a resilience layer: timeout → bulkhead (isolated resource pool) → circuit breaker (tracks error rate) → retry (with backoff). If any stage rejects the call or gives up on it, the request goes straight to a fallback (cached value / default / skip) instead of blocking. At the ingress, load shedding protects critical requests first under overload.

graph LR
    IN["order request"]
    LS{"load shed<br/>overload→drop"}
    ORD["order svc<br/>orchestrate"]
    BH["bulkhead<br/>pool per dep"]
    CB{"breaker<br/>open?"}
    DEP["downstream<br/>coupon/fraud..."]
    FB["fallback<br/>cache/default/skip"]

    IN --> LS -->|admit| ORD --> BH --> CB
    LS -.->|reject 429| IN
    CB -->|closed| DEP
    CB -.->|open→fail fast| FB
    DEP -.->|timeout/error| FB

    classDef in fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef gate fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    classDef core fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef dep fill:#0e2030,stroke:#5eead4,color:#e8eef5
    class IN in
    class LS,CB gate
    class ORD,BH core
    class DEP,FB dep

Each dependency is isolated; failures fail fast to a fallback so one slow dep can't exhaust global resources

Key Techniques

1. Circuit Breaker — pull the switch on a failing dependency

One-line trade-off: trade "temporary unavailability of the failing dependency" for "the caller fails fast and the downstream gets to breathe".

Principle: the electrical-breaker metaphor. A three-state machine: Closed (let traffic through, track failure rate) → failure rate exceeds threshold → Open (reject directly, send nothing, go straight to fallback) → cool down → Half-Open (admit a few probe requests) → success returns to Closed, failure goes back to Open. It turns "wait out the full timeout before failing" (wasting seconds each time while still piling pressure on a dying downstream) into fail fast, and by cutting the sustained barrage it gives the downstream a chance to recover.

Trade-off:

Timeout only, no breaker: ✅ simple; ❌ when downstream is slow, every request waits out the full timeout (e.g. 1s); threads still get exhausted under load, and you keep hammering a downstream that never recovers.
Consecutive-failure count (e.g. 5 in a row): ✅ trivial to implement; ❌ at high QPS "5" is reached instantly (too sensitive); at low QPS it never trips.
Sliding-window error rate (e.g. >50% over 10s): ✅ adapts to traffic, more stable; ❌ needs windowed stats and a minimum-request threshold to avoid small-sample misfires.

# Circuit breaker state machine (pseudo-code; state, opened_at and cooldown are kept per dependency)
def call(dep, req):
    if state == OPEN:
        if now() - opened_at > cooldown:   # cooldown elapsed
            state = HALF_OPEN               # admit a probe (cap probe concurrency in practice)
        else:
            return fallback(req)            # fail fast, don't hit dep
    try:
        resp = dep.invoke(req, timeout=0.2) # 200 ms
        on_success()                        # half-open success -> CLOSED
        return resp
    except (Timeout, ServerError):
        on_failure()                        # window error rate over threshold, or failed probe -> OPEN
        return fallback(req)
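
The pseudo-code above leaves the window bookkeeping and the probe limit implicit. One possible way to fill them in is sketched below: a time-based sliding window with a minimum-request threshold and a single Half-Open probe. The class name, the default thresholds, and the injectable `clock` are assumptions for illustration, and the sketch is not thread-safe.

```python
import time
from collections import deque

class CircuitBreaker:
    """Sliding-window error-rate breaker. Defaults are illustrative, not tuned values."""

    def __init__(self, window_s=10.0, error_threshold=0.5, min_requests=20,
                 cooldown_s=5.0, clock=time.monotonic):
        self.window_s = window_s
        self.error_threshold = error_threshold
        self.min_requests = min_requests
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.probe_in_flight = False
        self.events = deque()  # (timestamp, succeeded) pairs inside the window

    def allow(self):
        """Return True if the caller may send a request to the dependency."""
        now = self.clock()
        if self.state == "OPEN":
            if now - self.opened_at < self.cooldown_s:
                return False              # still cooling down: fail fast
            self.state = "HALF_OPEN"
            self.probe_in_flight = False
        if self.state == "HALF_OPEN":
            if self.probe_in_flight:
                return False              # only one probe at a time
            self.probe_in_flight = True
        return True

    def record(self, succeeded):
        """Report the outcome of a call that allow() admitted."""
        now = self.clock()
        if self.state == "OPEN":
            return                        # late result from before the trip: ignore
        if self.state == "HALF_OPEN":
            self.probe_in_flight = False
            if succeeded:
                self.state = "CLOSED"
                self.events.clear()
            else:
                self._open(now)
            return
        self.events.append((now, succeeded))
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()         # drop samples older than the window
        failures = sum(1 for _, ok in self.events if not ok)
        if (len(self.events) >= self.min_requests
                and failures / len(self.events) > self.error_threshold):
            self._open(now)

    def _open(self, now):
        self.state = "OPEN"
        self.opened_at = now

def guarded_call(breaker, invoke, fallback):
    """Wrap a call; catching Exception is a simplification, real code would filter error types."""
    if not breaker.allow():
        return fallback()
    try:
        result = invoke()
    except Exception:
        breaker.record(False)
        return fallback()
    breaker.record(True)
    return result
```

Passing a fake `clock` makes the state transitions easy to unit-test without sleeping.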

Real-world:

Netflix Hystrix: the canonical implementation — one HystrixCommand per remote dependency, trips when error rate crosses a threshold (How it Works).
resilience4j: the mainstream JVM successor after Hystrix went into maintenance, using count- or time-based sliding windows.
Martin Fowler: CircuitBreaker is the standard definition of the pattern.

2. Retry with Backoff + Jitter — a double-edged sword

One-line trade-off: trade "extra load from retries" for "absorbing transient blips" — but get it wrong and you pour gasoline on the fire.

Principle: transient faults (network blips, leader election, occasional timeouts) usually heal with one retry. But naive immediate retry is an amplifier: downstream overloads → many requests fail → all clients retry at once → downstream gets hit with double the traffic and dies for good. That's a retry storm. Two antidotes: ① exponential backoff, doubling the wait after each failure (100ms→200ms→400ms) to give the downstream room to recover; ② jitter, adding randomness so retry times spread out — otherwise clients "align" into a synchronized pulse and slam the downstream periodically.

Trade-off:

Fixed-interval retry: ✅ simple, predictable; ❌ gives no recovery room and clients easily align into pulses.
Pure exponential backoff (no jitter): ✅ relieves pressure; ❌ everyone backs off in lockstep, still synchronizing into periodic spikes.
Exponential backoff + Full Jitter: ✅ flattens spikes into near-constant load (Marc Brooker's measured optimum); ❌ higher variance in per-request latency.
Fixed max-retries vs Retry Budget: instead of "retry up to 3 times" per request, globally cap "retries ≤ 10% of total requests" (token bucket). The former still amplifies 3× under broad failure; the latter is inherently bounded.

# Full-jitter exponential backoff (AWS recommended)
# call, Retryable and retry_budget are assumed to be defined elsewhere
import random
from time import sleep

def retry(req, max_attempts=3):
    base, cap = 0.1, 2.0          # 100ms base, 2s cap
    for attempt in range(max_attempts):
        try:
            return call(req)       # only retry idempotent + retryable errors
        except Retryable:
            if attempt == max_attempts - 1:
                raise                                 # out of attempts
            if not retry_budget.try_acquire():
                raise                                 # global retry budget spent
            backoff = min(cap, base * 2 ** attempt)
            sleep(random.uniform(0, backoff))         # full jitter
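
The `retry_budget.try_acquire()` call above is not defined in the snippet. A minimal sketch of what it could look like is a token bucket in which every original request earns a fraction of a retry and every retry spends a whole one. The class name, the 10% ratio, and the burst cap are assumptions chosen to match the "retries ≤ 10% of total requests" example.

```python
import threading

class RetryBudget:
    """Token bucket for retries, kept in integer credits to avoid floating-point drift."""

    COST = 100  # credits consumed by one retry

    def __init__(self, percent=10, max_retries_banked=10):
        self.percent = percent                          # credits earned per original request
        self.max_credits = max_retries_banked * self.COST
        self.credits = 0
        self._lock = threading.Lock()

    def record_request(self):
        """Call once per original (non-retry) request."""
        with self._lock:
            self.credits = min(self.max_credits, self.credits + self.percent)

    def try_acquire(self):
        """Call before each retry; False means the budget is spent and the caller should give up."""
        with self._lock:
            if self.credits >= self.COST:
                self.credits -= self.COST
                return True
            return False

# Worst case: 1000 requests that all fail and all ask for a retry.
budget = RetryBudget(percent=10)
granted = 0
for _ in range(1000):
    budget.record_request()
    if budget.try_acquire():
        granted += 1
print(f"retries granted: {granted} of 1000 requested")
```

Even when every request fails, the retry traffic in this sketch stays at the configured fraction of original traffic, which is the bounded behaviour the trade-off list describes.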

Real-world:

AWS: Marc Brooker's Timeouts, retries, and backoff with jitter and Exponential Backoff And Jitter are the authoritative sources; most AWS SDKs ship it by default.
Google SRE: uses a retry budget (proportional cap) rather than a fixed count, to stop retry amplification from crushing backends (SRE Book, Handling Overload).
gRPC: built-in retry policy with backoff params and a retryable-status-code allowlist.

3. Bulkhead — partition resources into watertight compartments

One-line trade-off: trade "lower overall resource utilization" for "a single dependency's failure can't exhaust global resources".

Principle: from shipbuilding — a hull split into watertight compartments, so a breach in one doesn't sink the ship. Mapped to systems: don't let all downstream calls share one thread/connection pool. Give each dependency its own quota (e.g. coupons gets at most 20 threads). Then if coupons slows down, it can only fill its own 20, while critical inventory and payment calls still have threads — the failure is sealed in its compartment. This is the real cure for the avalanche scenario at the top.

Trade-off:

No isolation (shared pool): ✅ highest utilization; ❌ one slow dep drains the whole pool = cascading failure, exactly what we're avoiding.
Thread-pool isolation: ✅ can enforce timeout interruption (the call runs on a separate thread, the main thread can "walk away") and acts as natural throttling; ❌ thread-switch overhead, one pool per dep costs memory.
Semaphore isolation: ✅ lightweight, no thread switching; ❌ cannot interrupt a blocked call (a semaphore only caps concurrency; timeouts must be enforced by the underlying network client), best for local low-latency calls.

# Semaphore bulkhead: cap per-dependency concurrency (pseudo-code; deps, limit and fallback are defined elsewhere)
from threading import BoundedSemaphore

sem = {dep: BoundedSemaphore(limit[dep]) for dep in deps}   # per-dep quota

def call(dep, req):
    if not sem[dep].acquire(blocking=False):   # full -> reject immediately, no queue
        return fallback(req)                   # compartment full -> doesn't spill to others
    try:
        return dep.invoke(req, timeout=0.2)    # 200 ms, enforced by the client itself
    finally:
        sem[dep].release()
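
For contrast with the semaphore version, here is a rough sketch of thread-pool isolation, where the caller can walk away after a timeout while the worker stays pinned inside its own compartment. `ThreadPoolBulkhead`, the pool sizes, and the timeouts are hypothetical. Python cannot interrupt the worker thread, so the slot is only freed once the slow call actually returns.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

class ThreadPoolBulkhead:
    """One worker pool per dependency; the caller waits at most timeout_s."""

    def __init__(self, max_workers: int, timeout_s: float):
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.slots = threading.BoundedSemaphore(max_workers)  # no queueing beyond the pool
        self.timeout_s = timeout_s

    def call(self, fn, fallback):
        if not self.slots.acquire(blocking=False):
            return fallback()                  # compartment full: reject immediately
        future = self.pool.submit(fn)
        # Free the slot only when the worker really finishes, even if the caller gave up.
        future.add_done_callback(lambda _f: self.slots.release())
        try:
            return future.result(timeout=self.timeout_s)
        except FutureTimeout:
            return fallback()                  # caller walks away; worker is still busy
        except Exception:
            return fallback()                  # the call itself failed

def slow_coupon_lookup():
    time.sleep(0.5)                            # simulates the slow dependency
    return "10% off"

coupons = ThreadPoolBulkhead(max_workers=1, timeout_s=0.1)
inventory = ThreadPoolBulkhead(max_workers=1, timeout_s=0.1)

first = coupons.call(slow_coupon_lookup, lambda: "list price")   # times out
second = coupons.call(lambda: "10% off", lambda: "list price")   # rejected: slot still held
stock = inventory.call(lambda: "in stock", lambda: "unknown")    # separate pool, unaffected
print(first, second, stock, sep=" | ")
```

The slow coupon call saturates only its own pool, so the inventory call still gets a thread, which is the compartment behaviour described above.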

Real-world:

Netflix: Hystrix defaults to thread-pool isolation, running 40+ thread pools and 100+ command types in production, sealing each dependency's failure in its own pool (Hystrix Wiki).
Michael Nygard, Release It!: formalized the Bulkhead stability pattern via the ship-compartment metaphor (same origin as Circuit Breaker).
Kubernetes / service mesh: resource quotas and connection-pool limits (Envoy's max connections / pending requests) implement process- and connection-level bulkheads.

4. Graceful Degradation + Load Shedding — partial beats total death

One-line trade-off: trade "feature completeness / result precision" for "keeping the critical path alive".

Principle: when a dependency fails or the system overloads, rather than returning 500 or collapsing entirely, return a degraded response: coupon service down → place the order at list price (fall back to a default); personalized recommendations down → return the popular list (fall back to cache/static). The companion at ingress is load shedding: as the system nears overload, proactively reject some requests (return 429) so the rest can complete normally — partial success beats everyone timing out. Choose what to drop by criticality: shed non-critical first, protect core checkout.

Trade-off:

No degradation, hard fail: ✅ clean semantics (right or error); ❌ a non-critical dep's availability directly becomes your floor; brittle UX.
No load shedding, admit everything: ✅ drop nothing; ❌ under overload all requests slow down until everything times out (goodput → 0), worse than dropping some.
Random shed vs priority shed: random is simple but hits critical requests too; criticality-based is better but requires propagating priority labels end-to-end.
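
Criticality-based shedding can be sketched as a per-tier utilization threshold: lower tiers are rejected earlier so that critical requests keep headroom. The tier names, the threshold values, and the `admit` signature below are assumptions for illustration; a real system would also need to propagate the criticality label with each request.

```python
from enum import IntEnum

class Criticality(IntEnum):
    SHEDDABLE = 0   # e.g. recommendations
    NORMAL = 1
    CRITICAL = 2    # e.g. core checkout

# Fraction of capacity at which each tier starts being rejected (illustrative values).
SHED_AT = {
    Criticality.SHEDDABLE: 0.70,
    Criticality.NORMAL: 0.85,
    Criticality.CRITICAL: 1.00,
}

def admit(criticality: Criticality, in_flight: int, capacity: int) -> bool:
    """Return True to admit the request, False to reject it (for example with HTTP 429)."""
    utilization = in_flight / capacity
    return utilization < SHED_AT[criticality]

for tier in Criticality:
    print(tier.name, admit(tier, in_flight=75, capacity=100))
```

At 75% utilization this sketch already rejects sheddable traffic while still admitting normal and critical requests.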

Real-world:

Google SRE: server-side load shedding + client-side throttling + request criticality tiers is the standard overload practice (Handling Overload, Addressing Cascading Failures).
Netflix: every Hystrix dependency has a fallback — if personalized data is unavailable, return non-personalized default content; better imprecise than blank.

Extensions & Optimizations

Adaptive concurrency limits: instead of hand-tuning fixed bulkhead sizes, use a TCP-congestion-control-like algorithm (Netflix concurrency-limits) that adjusts the concurrency cap based on latency changes.
Hedged requests: for idempotent reads, if the first request has not returned by the p95 latency, fire a second request in parallel and take whichever response arrives first. This sharply cuts tail latency (pair with a retry budget to prevent amplification).
End-to-end timeout budget: allocate a total budget at ingress (e.g. 300ms) and propagate it decreasing per hop (deadline propagation), so each downstream knows how much time is left and gives up when it's gone.
Chaos engineering: writing these protections doesn't mean they work — use fault injection (inject latency, kill deps) to verify the breaker actually trips and the fallback actually catches (see the upcoming chaos-engineering issue).
Regional failover: when an AZ/region fails, shift traffic to a healthy region — reliability one level above per-host isolation.

Common Pitfalls + Interview Follow-ups

1. Retries stack across layers → exponential amplification: in a chain client→A→B→C where each of the three hops makes up to 3 attempts, the bottom C can take 3×3×3 = 27× traffic. Rule: retry only at the outermost layer, or share a retry budget end-to-end.

2. Timeout hierarchy inverted: the upstream timeout must exceed the total time its downstream calls can take (the sum of their timeouts when they run in series, including any retries). If upstream is 1s and downstream is 2s, upstream gives up early while downstream keeps computing: wasted resources, plus possible upstream retries. Timeouts should decrease top-down.

3. Retrying without distinguishing error types: 4xx (bad params, auth failure) won't fix on retry no matter how many times — it just adds load. Retry only timeouts and 5xx, and only when the operation is idempotent (a double charge is an incident).

4. The fallback itself depends on the failed component: if degradation reads from a "cache" that shares the same dead Redis as the main path, the fallback is useless. The fallback path must be simpler and more independent than the main path.

5. Guessing breaker thresholds: too sensitive → normal blips trip it (false positives); too dull → effectively absent. Base it on a minimum request count + sliding-window error rate, and think through half-open probe concurrency (don't let probes become a thundering herd).

Further Resources

Amazon Builders' Library: Marc Brooker, Timeouts, retries, and backoff with jitter — the authoritative practice on timeouts/retries/backoff.
Google SRE Book: Handling Overload and Addressing Cascading Failures — load shedding, throttling, cascading failures.
Martin Fowler bliki: CircuitBreaker — the standard definition.
Netflix: Hystrix Wiki — engineering details of circuit breaker + bulkhead + fallback.
Michael Nygard, Release It! (Pragmatic Bookshelf): the original source of the stability patterns (Circuit Breaker / Bulkhead / Fail Fast / Shed Load).

Deeper Thinking

1. A→B→C, three layers, each configured "retry up to 3×". When C briefly overloads, what's the actual amplification it absorbs? Why is a retry budget safer than max-retries?

Amplification = 27×. Per-layer retries are multiplicative: 3×3×3. A request that should hit once becomes 27 at the bottom — exactly the mechanism by which a retry storm finishes off an overloaded downstream that was only trying to catch its breath.

Why retry budget is better: max-retries is a per-request local cap and can't constrain overall amplification; under broad failure every request maxes out 3×, still 27× when stacked. A retry budget uses a token bucket to cap "retries ≤ X% of normal requests" — a global ceiling, so no matter how big the failure, retry traffic can't exceed that ratio. In practice combine both: small per-request cap plus a global budget.
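
The two bounds can be compared numerically. This sketch assumes three retrying hops in front of C, that every attempt fails in the worst case, and that a budget is enforced independently at each hop as a fraction of that hop's incoming traffic; the function names are illustrative.

```python
def worst_case_amplification(attempts_per_hop: int, hops: int) -> int:
    """Requests reaching the last service per original request when every attempt fails."""
    return attempts_per_hop ** hops

def budgeted_amplification(retry_ratio: float, hops: int) -> float:
    """Upper bound when each hop caps retries at retry_ratio of its incoming traffic."""
    return (1 + retry_ratio) ** hops

print(worst_case_amplification(3, 3))             # 27
print(round(budgeted_amplification(0.10, 3), 3))  # 1.331
```

With a 10% budget per hop, the worst case under these assumptions drops from 27× to about 1.33×, because each hop can add at most 10% on top of what it receives.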

2. After the breaker goes Open and the cooldown elapses to Half-Open, it admits probes. What happens if probe concurrency isn't limited? How should you design Half-Open?

Risk: if at the instant cooldown ends you let all backed-up requests through to probe, you deliver another blow to a dying downstream — just as it's about to recover it's knocked back to Open, flapping in an open→half-open→open cycle, and the downstream never catches up. That's a mini thundering herd.

Design points: ① Half-Open admits only a tiny number of probes (e.g. 1 or a small fixed concurrency); everything else keeps failing fast to fallback; ② require N consecutive successes before returning to Closed (one success isn't enough — don't be fooled by a fluke); ③ a probe failure immediately reverts to Open and resets or extends the cooldown (add backoff to avoid frequent probing); ④ the cooldown itself can grow exponentially, so the longer the outage, the sparser the probes. Essence: Half-Open is "test recovery with a trickle," never "revive at full volume."
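
Design points ② to ④ can be captured in a small policy object that sits next to the breaker. This is a sketch under assumed defaults (3 consecutive successes, cooldown doubling from 5s up to 60s); `HalfOpenPolicy` and its method names are hypothetical, and limiting probe concurrency (point ①) is left to the breaker itself.

```python
class HalfOpenPolicy:
    """Tracks probe outcomes: N consecutive successes close the breaker, any failure reopens it."""

    def __init__(self, successes_to_close=3, base_cooldown_s=5.0, max_cooldown_s=60.0):
        self.successes_to_close = successes_to_close
        self.base_cooldown_s = base_cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.consecutive_successes = 0
        self.failed_probe_rounds = 0

    def cooldown_s(self) -> float:
        """Cooldown doubles after every failed probe round, up to a cap."""
        return min(self.max_cooldown_s,
                   self.base_cooldown_s * 2 ** self.failed_probe_rounds)

    def on_probe(self, succeeded: bool) -> str:
        """Return the next breaker state: 'CLOSED', 'HALF_OPEN', or 'OPEN'."""
        if not succeeded:
            self.consecutive_successes = 0
            self.failed_probe_rounds += 1      # longer outage -> sparser probes
            return "OPEN"
        self.consecutive_successes += 1
        if self.consecutive_successes >= self.successes_to_close:
            self.consecutive_successes = 0
            self.failed_probe_rounds = 0       # healthy again: reset the backoff
            return "CLOSED"
        return "HALF_OPEN"                     # keep probing with a trickle

policy = HalfOpenPolicy()
print(policy.on_probe(False), policy.cooldown_s())
print([policy.on_probe(True) for _ in range(3)], policy.cooldown_s())
```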

3. Why must the upstream timeout exceed the sum of downstream timeouts? What chain reaction follows from setting upstream shorter than downstream?

Consequences of setting it short: upstream 1s, downstream 2s. The downstream is still processing normally (say it would return at 1.5s), but upstream already gave up at 1s. So: ① the downstream finishes work nobody wants — pure waste; ② upstream counts the "timeout" as a failure and may retry, sending another request to an already-busy downstream, amplifying load; ③ upstream's timeout pollutes the breaker's stats, tripping it on a false judgment that the downstream is unhealthy — when nothing is actually broken.

Correct approach: the timeout budget decreases top-down. Ingress 300ms; after network and self overhead, give downstream 250ms; it gives its downstream 200ms… each layer tighter than its parent. Better still, use deadline propagation: pass an absolute deadline down, and a downstream that sees the deadline already passed just gives up without computing. This gives the whole chain a consistent view of "how much time is left," eliminating wasted work.
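
Deadline propagation inside one process can be sketched as passing an absolute deadline down the call chain and deriving each call's timeout from the time remaining. The helper names, the `reserve_s` parameter, and the injectable `clock` are assumptions. Across hosts a monotonic clock is not shared, so implementations typically send the remaining time or a wall-clock deadline on the wire instead.

```python
import time

class DeadlineExceeded(Exception):
    """Raised when the budget is already spent before a call is made."""

def remaining(deadline: float, clock=time.monotonic) -> float:
    """Seconds left until the absolute deadline (may be negative)."""
    return deadline - clock()

def call_with_deadline(fn, deadline: float, reserve_s: float = 0.0, clock=time.monotonic):
    """Invoke fn(timeout_s) with whatever time is left, minus a reserve for local work."""
    budget = remaining(deadline, clock) - reserve_s
    if budget <= 0:
        raise DeadlineExceeded("no time left; skip the call instead of doing doomed work")
    return fn(budget)

# Ingress allocates 300 ms; the order service keeps 50 ms for its own overhead.
deadline = time.monotonic() + 0.300
timeout_given = call_with_deadline(lambda timeout_s: timeout_s, deadline, reserve_s=0.050)
print(f"downstream timeout: {timeout_given * 1000:.0f} ms or slightly less")
```

Each hop repeats the same step with its own reserve, so timeouts shrink top-down without being hand-configured per service.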

4. Under overload, load shedding (proactively rejecting some) counterintuitively beats "admit everything." Explain via goodput, and how retries push a no-shedding system into a death spiral.

Goodput view: throughput is how many requests you accept; goodput is how many complete successfully within SLO. Under overload, if you admit everything, every request queues and slows, eventually all exceed the deadline — clients get nothing but timeouts, goodput approaches 0, while CPU is fully spent on doomed requests. Shed some (return 429) and the rest have ample resources to finish within SLO, so goodput is actually higher. Do a little less, accomplish more.

Death spiral: no shedding + client retries = positive feedback. Overload→slow→timeout→client retry→higher load→slower… each retry round pushes the system deeper, with no self-healing. Breaking the spiral needs: server-side load shedding + client-side throttling (adaptive throttling that reduces sends based on recent reject rate) + retry budget. Miss any one and the system "locks up" at the overload point instead of bouncing back.
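
The client-side adaptive throttling mentioned above can be sketched with the rejection-probability formula given in the Google SRE Book's Handling Overload chapter: max(0, (requests − K × accepts) / (requests + 1)). The function names and the injectable `rng` are illustrative, and the sliding window over which `requests` and `accepts` are counted is left out.

```python
import random

def reject_probability(requests: int, accepts: int, k: float = 2.0) -> float:
    """Probability of dropping a request locally, before it is sent to the backend."""
    return max(0.0, (requests - k * accepts) / (requests + 1))

def should_send(requests: int, accepts: int, k: float = 2.0, rng=random.random) -> bool:
    """Decide per request whether to send it or fail it locally."""
    return rng() >= reject_probability(requests, accepts, k)

for accepts in (100, 50, 10, 0):
    print(accepts, round(reject_probability(100, accepts), 3))
```

With K = 2 the client only starts dropping locally once the backend accepts fewer than half of its requests, and the drop rate rises as the backend rejects more.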

5. Circuit breaker, bulkhead, retry, degradation — can they substitute for each other? Use one "coupon service slows to 5s" incident to show where each intercepts and what breaks if it's missing.

No substitution — they intercept at different stages, as defense in depth:

Bulkhead intercepts "resource exhaustion": coupons slow, it fills at most its dedicated 20 threads, critical calls unaffected. Missing it → the slow dep eats all global threads, site-wide avalanche (the opening scenario).
Circuit breaker intercepts "sustained useless waiting": detects coupon error rate spiking and trips, so later requests fail fast instead of each waiting 5s. Missing it → even with a bulkhead, those 20 threads stay pinned by 5s slow calls and that dependency's throughput goes to zero.
Retry (+backoff) handles "transient" blips: a one-off slow call can be saved by backoff retry; but for sustained failure the breaker must stop retries, or retries add pressure.
Degradation decides "what the user gets after failure": once the breaker trips / the call fails, fall back to list-price checkout so the user still completes the core flow. Missing it → the first three all work, but the user still sees a failed order.

Stitched together: the bulkhead contains the failure, the breaker stops useless calls, retry absorbs transient blips, degradation catches the user experience. Drop any one and this slow coupon call causes perceptible harm at some layer.