Day 10 Medium Rate Limiting Token Bucket / GCRA Distributed Counters

Rate Limiting — Not Just "100 Requests Per Second"Token Bucket vs Leaky Bucket vs Sliding Window · Distributed Counters · The 429 Contract

Scenario & Constraints

Design the rate-limiting system for a public API platform (think Stripe / GitHub). 100k+ third-party integrations, peak 1M QPS, per-key limit 100 req/s, total platform capacity 500k QPS. One runaway script can saturate every worker and starve out paying-customer charge requests — which is why rate limiting is a survival baseline, not an anti-scraper add-on.

Today: algorithm choice, distributed implementation, accuracy vs performance, and the 429 contract.

High-Level Architecture

graph LR
    C["Client / 3rd-party app"]
    GW["API Gateway
(Envoy / Kong / homegrown)"] L1["Local token bucket
(per worker)"] L2["Central counter
Redis + Lua"] SHED["Load Shedder
(priority drop)"] APP["Business service"] C -->|"req + API key"| GW GW -->|"① fast path: 90% deny/allow"| L1 L1 -->|"② slow path: periodic sync"| L2 GW -.overload.-> SHED GW -->|✓| APP GW -.->|"✗ 429 + Retry-After"| C classDef edge fill:#0e2030,stroke:#64c8ff,color:#e8eef5 classDef store fill:#1a1a30,stroke:#ffb450,color:#e8eef5 classDef origin fill:#2a1530,stroke:#ff7ab6,color:#e8eef5 class GW,L1 edge class L2 store class APP,SHED origin

Gateway uses local token bucket for the fast path; Redis is the central source of truth. Decisions return in under 1 ms.

Key Technical Points

1. Algorithm choice: Token Bucket vs Leaky Bucket vs Sliding Window vs GCRA

Principle: the four algorithms answer different questions — are bursts allowed? is output smoothed? how does the window roll?

AlgorithmCore ideaAllows burstOutput rateComplexity
Fixed WindowReset counter every secondNoBumpyTrivial
Sliding Window LogStore every request timestampExactHigh (memory)
Sliding Window CounterCurrent + prior window pro-ratedApproximateLow
Token BucketBucket of tokens, can be saved up✓ NativelyBurstyLow
Leaky BucketQueue requests, drain at fixed rateSmoothed via queueStrictly evenMedium
GCRALeaky bucket expressed as a single "next allowed time"TunableRolling, smoothMedium (no queue)
Trade-off:
# Token bucket core (Python pseudo-code)
def allow(key, capacity, refill_rate):
    now = time.time()
    tokens, last = state.get(key, (capacity, now))
    # lazy refill: top up by elapsed time when we look
    tokens = min(capacity, tokens + (now - last) * refill_rate)
    if tokens >= 1:
        state[key] = (tokens - 1, now)
        return True
    state[key] = (tokens, now)
    return False   # deny, retry_after = (1 - tokens) / refill_rate
Real-world cases:

2. Distributed limiting: central Redis vs local share vs async aggregation

Principle: a token bucket on one machine is trivial. The hard problem is coordinating a global quota shared across the fleet.

Trade-off:
-- Atomic token bucket in Redis (Lua, Stripe-style)
-- KEYS[1]=bucket key  ARGV={capacity, refill_rate, now, cost}
local cap, rate, now, cost = tonumber(ARGV[1]), tonumber(ARGV[2]),
                              tonumber(ARGV[3]), tonumber(ARGV[4])
local data = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(data[1]) or cap
local last = tonumber(data[2]) or now
tokens = math.min(cap, tokens + (now - last) * rate)
local allowed = tokens >= cost and 1 or 0
if allowed == 1 then tokens = tokens - cost end
redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', KEYS[1], math.ceil(cap / rate) * 2)
return {allowed, tokens}     -- atomic, no race
Real-world cases:

3. Multi-dimensional limiting and priority shedding

Principle: a single dimension cannot contain real traffic. One key may be fine, but the sum of all keys saturates an endpoint; an endpoint may be fine, but its downstream DB is dying. Check multiple dimensions independently, tightest wins.

Typical stack (each layer is an independent bucket; all must pass):
# Multi-dimensional composition (pseudo-code)
def admit(req):
    keys = [
        ("key", req.api_key, 100, 1),
        ("ip", req.ip, 10, 1),
        ("ep", f"{req.api_key}:{req.path}", endpoint_limit(req.path), 1),
    ]
    for kind, k, cap, cost in keys:
        if not bucket_allow(f"{kind}:{k}", cap, refill_for(cap), cost):
            return reject(retry_after_from(cap, k), reason=kind)
    if global_inflight() > SHED_THRESHOLD and priority(req) < HIGH:
        return reject(retry_after=30, reason="overload")
    return ALLOW
Real-world cases:

4. The 429 contract: let callers back off gracefully

Principle: a limiter is not just a "deny" — it is a machine-readable backoff instruction for the caller. Without it, clients spin in a tight retry loop and pile on a second time.

Standard headers:
Crucial pitfall: clients must jitter. If 1000 callers all receive Retry-After: 30, 30 seconds later they all hit you again — thundering herd reborn. Either the server returns a jittered value (30 ± 5s), or the client adds jitter itself (Stripe SDK and AWS SDK both bake in exponential backoff + jitter).
Real-world cases:

Extensions & Optimization

Common Pitfalls / Interview Follow-ups

  1. "Why not sliding window log?" — Memory. 1M users × 100 reqs = 100M timestamps, several GB. Counter approximations get to < 5% error.
  2. "What if Redis goes down?" — Fail-open and circuit-break; Stripe states this as an iron rule.
  3. "Same user, multiple processes — how do they share a local token bucket?" — Per-process buckets don't; you need Redis, or "local sub-quota + async reconciliation".
  4. "429 vs 503 — when each?" — 429 = caller over its quota (the caller's problem); 503 = service overloaded (your problem). Confusing them breaks the caller's retry strategy.
  5. "How do you compute Retry-After?" — With a token bucket: (1 - tokens) / refill_rate. With a sliding window: milliseconds until the next second tick. And the server must add jitter.
  6. "Why is token bucket more popular than leaky bucket?" — Public-API callers naturally arrive in bursts (a cron firing 50 requests at once); token bucket allows that legitimate burst. Strict leaky-bucket smoothing punishes normal users.

Deep Resources

Deeper Questions (click to expand)

1. How do you relate token-bucket capacity and refill rate? Is capacity = refill reasonable?

Common mistake: "100 req/s, so cap=100 and rate=100." But cap=100 means the user can save 100 tokens and instantly fire 100 requests — 1 ms of peak QPS at 100,000, which flattens downstream.

The right framing: cap is the allowed burst size, rate is the long-run average. They are independent:

  • How big a burst can downstream absorb? → choose cap. Common cap = rate × 1–3 (1–3 seconds of burst).
  • What's the daily allowance per user? → choose rate.

Stripe's posture is "rate is the contract, cap is the mercy" — cap is slightly above rate so normal bursts don't error, but not so big that downstream crashes. For extreme cases (financial reconciliation) cap = rate forces strict smoothing — effectively a leaky bucket.

2. 1M QPS won't fit through one Redis node (~300k QPS ceiling). How do you squeeze 1M limit decisions per second through it?

Stack three ideas:

  • Local quick-deny: each gateway keeps a local list of "recently over-limit keys" with 100 ms TTL. Over-limit keys 429 without touching Redis. 99% of hot-key traffic is shielded locally.
  • Local sub-quota: at startup, each gateway claims N seconds of quota (say 100 tokens) from Redis and spends it locally. When empty, claim the next batch. Redis QPS drops from 1M to 1M / 100 = 10k. Cost: uneven distribution across gateways; quota drifts during scale-in/out.
  • Redis Cluster shards: hash by user_id to different shards; each shard absorbs 300k QPS. Cost: a single global quota cannot be expressed as one Redis key.

In production the combo local quick-deny + local sub-quota + Redis Cluster typically lands Redis at 50–100k QPS.

3. Clients all receive 429 + Retry-After=30s, then retry simultaneously — second wave overload. What can the server do?
  • Server returns jittered Retry-After: not a fixed 30 but 30 + uniform(0, 30) — spreads retries over an interval. Cost: some clients wait longer.
  • Stagger window resets per client: anchor the bucket's expiry to hash(api_key) so different users' window boundaries are naturally offset.
  • Force backoff + jitter in the SDK: the official SDK does retry_after × random(0.5, 1.5), covering all callers. Stripe, AWS, Google all do this.
  • Soft signals in normal responses: while X-RateLimit-Remaining is approaching 0, well-built SDKs proactively slow down — don't wait for 429 to react.
  • Continuous-recovery semantics: don't have everyone go from 0 to full quota at the same reset instant; GCRA's continuous rolling refill naturally staggers retry timing.
4. The limiter itself crashes — fail-open or fail-close? Which kind of business goes which way?

Default fail-open: the reasoning is "limiting is protective; a limiter failure should not become a business failure". Stripe and AWS both do this. Fail-open means attack traffic flows straight to the backend the moment the limiter dies — pair it with a circuit breaker so downstream can defend itself.

When fail-close is the right call:

  • Paid quota / billing quota: going over budget = you lose money (e.g., a free tier of 1000 calls / month). Better to deny than admit.
  • Irreversible downstream effects: SMS sends, debit operations — blind admit can fan out into a bulk mistake.
  • Regulatory constraints: some financial / healthcare limits are mandated; if you can't enforce, you must stop.

The pragmatic shape: DDoS layer fail-open (save the site), billing-quota layer fail-close (save the money), deployed as separate components.

5. A single GraphQL query can consume 100× the CPU. How do you define "one request" for rate-limiting?

REST's hidden assumption — "all requests cost about the same" — falls apart under GraphQL. One nested query can scan tens of thousands of rows and run a dozen joins.

The answer is cost weighting:

  • Static cost: analyze the query at submit time, summing per-field cost (list field × an assumed 100 rows), and charge the bucket that many tokens up front. GitHub's GraphQL API does exactly this — 5000 "points" per hour, not 5000 requests.
  • Dynamic cost: after execution, reconcile tokens by actual time / rows scanned. Cost: charge is retroactive, so a malicious query has already burned the CPU.
  • Hard depth / field caps: before the limiter, reject queries with depth > 10 or > 1000 fields at the parsing stage. Apollo has built-in support.
  • Persisted queries: only allow pre-reviewed query hashes; dynamic queries can't even reach your gateway. Facebook and Shopify run this at scale.

The same shape applies elsewhere: bulk APIs (one call with 100 IDs), file uploads (charge per byte), AI inference (charge per token). Any time a single request's cost can vary by > 10×, you need weights.