Day 10 Medium Rate Limiting Token Bucket / GCRA Distributed Counters

Rate Limiting — Not Just "100 Requests Per Second"Token Bucket vs Leaky Bucket vs Sliding Window · Distributed Counters · The 429 Contract

Scenario & Constraints

Design the rate-limiting system for a public API platform (think Stripe / GitHub). 100k+ third-party integrations, peak 1M QPS, per-key limit 100 req/s, total platform capacity 500k QPS. One runaway script can saturate every worker and starve out paying-customer charge requests — which is why rate limiting is a survival baseline, not an anti-scraper add-on.

Fairness: a misbehaving key must not drag down the others; high-priority calls (create charge) must not be starved by low-priority ones (export report).
Precision: a 100 req/s limit cannot legally allow 200 requests crammed into the millisecond around a second boundary.
Low overhead: the limiter latency must be far below the business latency — target p99 < 1 ms.
Multi-dimensional: limit per user / IP / endpoint / org separately, and compose them (the tightest layer wins).
Caller-friendly: 429 must be machine-retriable — Retry-After must be present and accurate.
Limiter failure cannot become business failure: when Redis is down, letting traffic through (fail-open) beats taking down the whole site.

Today: algorithm choice, distributed implementation, accuracy vs performance, and the 429 contract.

High-Level Architecture

graph LR
    C["Client / 3rd-party app"]
    GW["API Gateway
(Envoy / Kong / homegrown)"]
    L1["Local token bucket
(per worker)"]
    L2["Central counter
Redis + Lua"]
    SHED["Load Shedder
(priority drop)"]
    APP["Business service"]

    C -->|"req + API key"| GW
    GW -->|"① fast path: 90% deny/allow"| L1
    L1 -->|"② slow path: periodic sync"| L2
    GW -.overload.-> SHED
    GW -->|✓| APP
    GW -.->|"✗ 429 + Retry-After"| C

    classDef edge fill:#0e2030,stroke:#64c8ff,color:#e8eef5
    classDef store fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef origin fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class GW,L1 edge
    class L2 store
    class APP,SHED origin

Gateway uses local token bucket for the fast path; Redis is the central source of truth. Decisions return in under 1 ms.

Key Technical Points

1. Algorithm choice: Token Bucket vs Leaky Bucket vs Sliding Window vs GCRA

Principle: the four algorithms answer different questions — are bursts allowed? is output smoothed? how does the window roll?

Algorithm	Core idea	Allows burst	Output rate	Complexity
Fixed Window	Reset counter every second	No	Bumpy	Trivial
Sliding Window Log	Store every request timestamp	—	Exact	High (memory)
Sliding Window Counter	Current + prior window pro-rated	—	Approximate	Low
Token Bucket	Bucket of tokens, can be saved up	✓ Natively	Bursty	Low
Leaky Bucket	Queue requests, drain at fixed rate	Smoothed via queue	Strictly even	Medium
GCRA	Leaky bucket expressed as a single "next allowed time"	Tunable	Rolling, smooth	Medium (no queue)

Trade-off:

Fixed Window: ✅ one counter; ❌ boundary double-burst (100 at t=0.999s, 100 at t=1.001s — 200 requests in 2ms all admitted).
Sliding Window Log: ✅ exact; ❌ stores N timestamps per key — 1M keys × 100 reqs = 100M entries, Redis memory explodes.
Token Bucket: ✅ naturally supports bursts (save tokens); ❌ does not enforce smoothing — a saved-up burst can still flood downstream. Stripe picks it precisely to allow legitimate bursts.
Leaky Bucket: ✅ downstream sees a flat rate; ❌ needs a queue; a pure form introduces latency even when the queue is short but no token is ready.
GCRA: ✅ single-variable state (TAT — Theoretical Arrival Time), O(1) memory, rolling window, configurable burst; ❌ math is not intuitive. Redis module redis-cell and many Cloudflare workflows use it.

# Token bucket core (Python pseudo-code)
def allow(key, capacity, refill_rate):
    now = time.time()
    tokens, last = state.get(key, (capacity, now))
    # lazy refill: top up by elapsed time when we look
    tokens = min(capacity, tokens + (now - last) * refill_rate)
    if tokens >= 1:
        state[key] = (tokens - 1, now)
        return True
    state[key] = (tokens, now)
    return False   # deny, retry_after = (1 - tokens) / refill_rate

Real-world cases:

Stripe: combines four limiter classes (request rate, concurrent requests, fleet-wide load shedder, worker-utilization load shedder), all backed by token bucket + Redis Lua. See their engineering post "Scaling your API with rate limiters".
Cloudflare: distributed rate limiting across 300+ PoPs — local short-window decision, async aggregation between nodes (see Cloudflare's post "How we built rate limiting capable of scaling to millions of domains").
GitHub API: authenticated users get 5000 req/h (hourly token bucket); the Search API has its own much tighter bucket.
Redis official: the redis-cell module implements GCRA — a single CL.THROTTLE command does it all.

2. Distributed limiting: central Redis vs local share vs async aggregation

Principle: a token bucket on one machine is trivial. The hard problem is coordinating a global quota shared across the fleet.

Trade-off:

Central Redis (single instance): ✅ globally exact, simple (atomic Lua); ❌ single-instance ceiling (~300-500k QPS), the network RTT dominates limiter latency, Redis failure = business failure (must fail-open).
Local share (static slice): each gateway gets total_limit / N. ✅ no dependency, low latency; ❌ skewed traffic leaves some nodes saturated while others idle, overall utilization drops; quota drifts as the fleet scales.
Local decision + async aggregation: each node runs a local bucket and pushes consumption to a coordinator every 100ms-1s, which redistributes quota. Cloudflare uses a gossip-like variant. ✅ low latency and near-exact; ❌ complex; allows brief over-admit windows (~20% over budget for one period).
Tiered (local quick-deny + central slow path): obvious over-limit denied locally, the rest is verified in Redis. ✅ shields 99% of hot-key traffic from Redis; ❌ requires heartbeat sync of "currently blacklisted" state.

-- Atomic token bucket in Redis (Lua, Stripe-style)
-- KEYS[1]=bucket key  ARGV={capacity, refill_rate, now, cost}
local cap, rate, now, cost = tonumber(ARGV[1]), tonumber(ARGV[2]),
                              tonumber(ARGV[3]), tonumber(ARGV[4])
local data = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(data[1]) or cap
local last = tonumber(data[2]) or now
tokens = math.min(cap, tokens + (now - last) * rate)
local allowed = tokens >= cost and 1 or 0
if allowed == 1 then tokens = tokens - cost end
redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', KEYS[1], math.ceil(cap / rate) * 2)
return {allowed, tokens}     -- atomic, no race

Real-world cases:

Stripe: central Redis Lua counters, with "if Redis is unavailable, let the request through" as an explicit principle in their engineering post.
Cloudflare edge: local short-window decisions with gossip-style async aggregation across edge nodes — see the Cloudflare engineering post linked above.
AWS API Gateway: per-region token buckets; rate and burst map directly to refill rate and bucket capacity.

3. Multi-dimensional limiting and priority shedding

Principle: a single dimension cannot contain real traffic. One key may be fine, but the sum of all keys saturates an endpoint; an endpoint may be fine, but its downstream DB is dying. Check multiple dimensions independently, tightest wins.

Typical stack (each layer is an independent bucket; all must pass):

L1 — per API key: tenant fairness (100/s);
L2 — per IP: anti-anonymous-scrape (10/s);
L3 — per endpoint × key: protect heavy endpoints (/export 0.1/s);
L4 — global concurrency: protect downstream DB connection pool (in-flight < 5000);
L5 — priority shedding: under overload, drop low-priority first (export → list → create-payment is dropped last).

# Multi-dimensional composition (pseudo-code)
def admit(req):
    keys = [
        ("key", req.api_key, 100, 1),
        ("ip", req.ip, 10, 1),
        ("ep", f"{req.api_key}:{req.path}", endpoint_limit(req.path), 1),
    ]
    for kind, k, cap, cost in keys:
        if not bucket_allow(f"{kind}:{k}", cap, refill_for(cap), cost):
            return reject(retry_after_from(cap, k), reason=kind)
    if global_inflight() > SHED_THRESHOLD and priority(req) < HIGH:
        return reject(retry_after=30, reason="overload")
    return ALLOW

Real-world cases:

Stripe: splits traffic into critical (create charge) and non-critical (reporting / analytics); during overload, non-critical is shed first. This "protect the money path" priority shedding is called out explicitly in their engineering post.
GitHub: REST has one bucket, GraphQL another (charged by query complexity), and Search has a stricter small bucket — same token, multiple dimensions.
Twitter/X API: each endpoint has its own bucket with very different limits; the response header X-Rate-Limit-Resource tells you which bucket was triggered.

4. The 429 contract: let callers back off gracefully

Principle: a limiter is not just a "deny" — it is a machine-readable backoff instruction for the caller. Without it, clients spin in a tight retry loop and pile on a second time.

Standard headers:

Retry-After: 30 — required, seconds or HTTP-date. RFC 9110.
X-RateLimit-Limit: 100 — the quota.
X-RateLimit-Remaining: 0 — how many left.
X-RateLimit-Reset: 1735689600 — reset epoch.
X-RateLimit-Resource: search — which bucket triggered (critical when there are many).
Body: JSON with error_code and doc_url so the caller can self-serve.

Crucial pitfall: clients must jitter. If 1000 callers all receive Retry-After: 30, 30 seconds later they all hit you again — thundering herd reborn. Either the server returns a jittered value (30 ± 5s), or the client adds jitter itself (Stripe SDK and AWS SDK both bake in exponential backoff + jitter).

Real-world cases:

Stripe: 429 always includes Retry-After; the official SDK bundles idempotency keys + exponential backoff for auto-retry.
GitHub: over-limit returns 403 with X-RateLimit-Reset (epoch); the docs explicitly say "wait until reset, otherwise keep getting 403".
AWS: in addition to 429, you may see ThrottlingException; every official SDK auto-retries with jitter.

Extensions & Optimization

Hot key beyond a single Redis node: shard one high-traffic key across multiple Redis shards (key:0..15), each with a fractional quota, client hashes to shard. Cost: precision loss (local over-admit).
From hard limits to adaptive limiting: drive the limit from downstream latency / error rate (Netflix concurrency-limits, Google SRE adaptive throttling) — "let more through when healthy, fewer when sick".
Cross-region global quota: per-region local quota plus per-second async cross-region reconciliation. A strongly-consistent global quota costs too much and rarely earns its keep.
Cost-weighted requests: not every request is "one token" — weight by endpoint CPU cost (GraphQL query complexity = points charged, CPU-heavy endpoints charge 10).
Fail-open vs fail-close: when the limiter itself dies, let traffic through (a limiter failure should not become a business failure), but downstream is suddenly exposed — pair with a circuit breaker.

Common Pitfalls / Interview Follow-ups

"Why not sliding window log?" — Memory. 1M users × 100 reqs = 100M timestamps, several GB. Counter approximations get to < 5% error.
"What if Redis goes down?" — Fail-open and circuit-break; Stripe states this as an iron rule.
"Same user, multiple processes — how do they share a local token bucket?" — Per-process buckets don't; you need Redis, or "local sub-quota + async reconciliation".
"429 vs 503 — when each?" — 429 = caller over its quota (the caller's problem); 503 = service overloaded (your problem). Confusing them breaks the caller's retry strategy.
"How do you compute Retry-After?" — With a token bucket: (1 - tokens) / refill_rate. With a sliding window: milliseconds until the next second tick. And the server must add jitter.
"Why is token bucket more popular than leaky bucket?" — Public-API callers naturally arrive in bursts (a cron firing 50 requests at once); token bucket allows that legitimate burst. Strict leaky-bucket smoothing punishes normal users.

Deep Resources

Stripe Engineering Blog — "Scaling your API with rate limiters" (Paul Tarjan): four limiter classes, Redis Lua, the fail-open principle. The companion GitHub gist has runnable Ruby code.
Cloudflare Engineering Blog — "How we built rate limiting capable of scaling to millions of domains": distributed limiting at the edge — local decision plus async aggregation.
Brandur Leach — "Rate Limiting, Cells, and GCRA" (brandur.org): the clearest explainer of GCRA, with a Redis implementation.
Google SRE Book — Ch 21 "Handling Overload": adaptive throttling, priority shedding, client-side retry budgets.
Marc Brooker's distributed systems essays: the right intuition for fail-open and backpressure as system properties.

Deeper Questions (click to expand)

1. How do you relate token-bucket capacity and refill rate? Is capacity = refill reasonable?

Common mistake: "100 req/s, so cap=100 and rate=100." But cap=100 means the user can save 100 tokens and instantly fire 100 requests — 1 ms of peak QPS at 100,000, which flattens downstream.

The right framing: cap is the allowed burst size, rate is the long-run average. They are independent:

How big a burst can downstream absorb? → choose cap. Common cap = rate × 1–3 (1–3 seconds of burst).
What's the daily allowance per user? → choose rate.

Stripe's posture is "rate is the contract, cap is the mercy" — cap is slightly above rate so normal bursts don't error, but not so big that downstream crashes. For extreme cases (financial reconciliation) cap = rate forces strict smoothing — effectively a leaky bucket.

2. 1M QPS won't fit through one Redis node (~300k QPS ceiling). How do you squeeze 1M limit decisions per second through it?

Stack three ideas:

Local quick-deny: each gateway keeps a local list of "recently over-limit keys" with 100 ms TTL. Over-limit keys 429 without touching Redis. 99% of hot-key traffic is shielded locally.
Local sub-quota: at startup, each gateway claims N seconds of quota (say 100 tokens) from Redis and spends it locally. When empty, claim the next batch. Redis QPS drops from 1M to 1M / 100 = 10k. Cost: uneven distribution across gateways; quota drifts during scale-in/out.
Redis Cluster shards: hash by user_id to different shards; each shard absorbs 300k QPS. Cost: a single global quota cannot be expressed as one Redis key.

In production the combo local quick-deny + local sub-quota + Redis Cluster typically lands Redis at 50–100k QPS.

3. Clients all receive 429 + Retry-After=30s, then retry simultaneously — second wave overload. What can the server do?

Server returns jittered Retry-After: not a fixed 30 but 30 + uniform(0, 30) — spreads retries over an interval. Cost: some clients wait longer.
Stagger window resets per client: anchor the bucket's expiry to hash(api_key) so different users' window boundaries are naturally offset.
Force backoff + jitter in the SDK: the official SDK does retry_after × random(0.5, 1.5), covering all callers. Stripe, AWS, Google all do this.
Soft signals in normal responses: while X-RateLimit-Remaining is approaching 0, well-built SDKs proactively slow down — don't wait for 429 to react.
Continuous-recovery semantics: don't have everyone go from 0 to full quota at the same reset instant; GCRA's continuous rolling refill naturally staggers retry timing.

4. The limiter itself crashes — fail-open or fail-close? Which kind of business goes which way?

Default fail-open: the reasoning is "limiting is protective; a limiter failure should not become a business failure". Stripe and AWS both do this. Fail-open means attack traffic flows straight to the backend the moment the limiter dies — pair it with a circuit breaker so downstream can defend itself.

When fail-close is the right call:

Paid quota / billing quota: going over budget = you lose money (e.g., a free tier of 1000 calls / month). Better to deny than admit.
Irreversible downstream effects: SMS sends, debit operations — blind admit can fan out into a bulk mistake.
Regulatory constraints: some financial / healthcare limits are mandated; if you can't enforce, you must stop.

The pragmatic shape: DDoS layer fail-open (save the site), billing-quota layer fail-close (save the money), deployed as separate components.

5. A single GraphQL query can consume 100× the CPU. How do you define "one request" for rate-limiting?

REST's hidden assumption — "all requests cost about the same" — falls apart under GraphQL. One nested query can scan tens of thousands of rows and run a dozen joins.

The answer is cost weighting:

Static cost: analyze the query at submit time, summing per-field cost (list field × an assumed 100 rows), and charge the bucket that many tokens up front. GitHub's GraphQL API does exactly this — 5000 "points" per hour, not 5000 requests.
Dynamic cost: after execution, reconcile tokens by actual time / rows scanned. Cost: charge is retroactive, so a malicious query has already burned the CPU.
Hard depth / field caps: before the limiter, reject queries with depth > 10 or > 1000 fields at the parsing stage. Apollo has built-in support.
Persisted queries: only allow pre-reviewed query hashes; dynamic queries can't even reach your gateway. Facebook and Shopify run this at scale.

The same shape applies elsewhere: bulk APIs (one call with 100 IDs), file uploads (charge per byte), AI inference (charge per token). Any time a single request's cost can vary by > 10×, you need weights.

← Back to Index