Scenario & Constraints
Design the rate-limiting system for a public API platform (think Stripe / GitHub). 100k+ third-party integrations, peak 1M QPS, per-key limit 100 req/s, total platform capacity 500k QPS. One runaway script can saturate every worker and starve out paying-customer charge requests — which is why rate limiting is a survival baseline, not an anti-scraper add-on.
- Fairness: a misbehaving key must not drag down the others; high-priority calls (create charge) must not be starved by low-priority ones (export report).
- Precision: a 100 req/s limit cannot legally allow 200 requests crammed into the millisecond around a second boundary.
- Low overhead: the limiter latency must be far below the business latency — target p99 < 1 ms.
- Multi-dimensional: limit per user / IP / endpoint / org separately, and compose them (the tightest layer wins).
- Caller-friendly: 429 must be machine-retriable —
Retry-After must be present and accurate.
- Limiter failure cannot become business failure: when Redis is down, letting traffic through (fail-open) beats taking down the whole site.
Today: algorithm choice, distributed implementation, accuracy vs performance, and the 429 contract.
High-Level Architecture
graph LR
C["Client / 3rd-party app"]
GW["API Gateway
(Envoy / Kong / homegrown)"]
L1["Local token bucket
(per worker)"]
L2["Central counter
Redis + Lua"]
SHED["Load Shedder
(priority drop)"]
APP["Business service"]
C -->|"req + API key"| GW
GW -->|"① fast path: 90% deny/allow"| L1
L1 -->|"② slow path: periodic sync"| L2
GW -.overload.-> SHED
GW -->|✓| APP
GW -.->|"✗ 429 + Retry-After"| C
classDef edge fill:#0e2030,stroke:#64c8ff,color:#e8eef5
classDef store fill:#1a1a30,stroke:#ffb450,color:#e8eef5
classDef origin fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
class GW,L1 edge
class L2 store
class APP,SHED origin
Gateway uses local token bucket for the fast path; Redis is the central source of truth. Decisions return in under 1 ms.
Key Technical Points
1. Algorithm choice: Token Bucket vs Leaky Bucket vs Sliding Window vs GCRA
Principle: the four algorithms answer different questions — are bursts allowed? is output smoothed? how does the window roll?
| Algorithm | Core idea | Allows burst | Output rate | Complexity |
| Fixed Window | Reset counter every second | No | Bumpy | Trivial |
| Sliding Window Log | Store every request timestamp | — | Exact | High (memory) |
| Sliding Window Counter | Current + prior window pro-rated | — | Approximate | Low |
| Token Bucket | Bucket of tokens, can be saved up | ✓ Natively | Bursty | Low |
| Leaky Bucket | Queue requests, drain at fixed rate | Smoothed via queue | Strictly even | Medium |
| GCRA | Leaky bucket expressed as a single "next allowed time" | Tunable | Rolling, smooth | Medium (no queue) |
Trade-off:
- Fixed Window: ✅ one counter; ❌ boundary double-burst (100 at t=0.999s, 100 at t=1.001s — 200 requests in 2ms all admitted).
- Sliding Window Log: ✅ exact; ❌ stores N timestamps per key — 1M keys × 100 reqs = 100M entries, Redis memory explodes.
- Token Bucket: ✅ naturally supports bursts (save tokens); ❌ does not enforce smoothing — a saved-up burst can still flood downstream. Stripe picks it precisely to allow legitimate bursts.
- Leaky Bucket: ✅ downstream sees a flat rate; ❌ needs a queue; a pure form introduces latency even when the queue is short but no token is ready.
- GCRA: ✅ single-variable state (TAT — Theoretical Arrival Time), O(1) memory, rolling window, configurable burst; ❌ math is not intuitive. Redis module
redis-cell and many Cloudflare workflows use it.
# Token bucket core (Python pseudo-code)
def allow(key, capacity, refill_rate):
now = time.time()
tokens, last = state.get(key, (capacity, now))
# lazy refill: top up by elapsed time when we look
tokens = min(capacity, tokens + (now - last) * refill_rate)
if tokens >= 1:
state[key] = (tokens - 1, now)
return True
state[key] = (tokens, now)
return False # deny, retry_after = (1 - tokens) / refill_rate
Real-world cases:
- Stripe: combines four limiter classes (request rate, concurrent requests, fleet-wide load shedder, worker-utilization load shedder), all backed by token bucket + Redis Lua. See their engineering post "Scaling your API with rate limiters".
- Cloudflare: distributed rate limiting across 300+ PoPs — local short-window decision, async aggregation between nodes (see Cloudflare's post "How we built rate limiting capable of scaling to millions of domains").
- GitHub API: authenticated users get 5000 req/h (hourly token bucket); the Search API has its own much tighter bucket.
- Redis official: the
redis-cell module implements GCRA — a single CL.THROTTLE command does it all.
2. Distributed limiting: central Redis vs local share vs async aggregation
Principle: a token bucket on one machine is trivial. The hard problem is coordinating a global quota shared across the fleet.
Trade-off:
- Central Redis (single instance): ✅ globally exact, simple (atomic Lua); ❌ single-instance ceiling (~300-500k QPS), the network RTT dominates limiter latency, Redis failure = business failure (must fail-open).
- Local share (static slice): each gateway gets
total_limit / N. ✅ no dependency, low latency; ❌ skewed traffic leaves some nodes saturated while others idle, overall utilization drops; quota drifts as the fleet scales.
- Local decision + async aggregation: each node runs a local bucket and pushes consumption to a coordinator every 100ms-1s, which redistributes quota. Cloudflare uses a gossip-like variant. ✅ low latency and near-exact; ❌ complex; allows brief over-admit windows (~20% over budget for one period).
- Tiered (local quick-deny + central slow path): obvious over-limit denied locally, the rest is verified in Redis. ✅ shields 99% of hot-key traffic from Redis; ❌ requires heartbeat sync of "currently blacklisted" state.
-- Atomic token bucket in Redis (Lua, Stripe-style)
-- KEYS[1]=bucket key ARGV={capacity, refill_rate, now, cost}
local cap, rate, now, cost = tonumber(ARGV[1]), tonumber(ARGV[2]),
tonumber(ARGV[3]), tonumber(ARGV[4])
local data = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(data[1]) or cap
local last = tonumber(data[2]) or now
tokens = math.min(cap, tokens + (now - last) * rate)
local allowed = tokens >= cost and 1 or 0
if allowed == 1 then tokens = tokens - cost end
redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', KEYS[1], math.ceil(cap / rate) * 2)
return {allowed, tokens} -- atomic, no race
Real-world cases:
- Stripe: central Redis Lua counters, with "if Redis is unavailable, let the request through" as an explicit principle in their engineering post.
- Cloudflare edge: local short-window decisions with gossip-style async aggregation across edge nodes — see the Cloudflare engineering post linked above.
- AWS API Gateway: per-region token buckets; rate and burst map directly to refill rate and bucket capacity.
3. Multi-dimensional limiting and priority shedding
Principle: a single dimension cannot contain real traffic. One key may be fine, but the sum of all keys saturates an endpoint; an endpoint may be fine, but its downstream DB is dying. Check multiple dimensions independently, tightest wins.
Typical stack (each layer is an independent bucket; all must pass):
- L1 — per API key: tenant fairness (100/s);
- L2 — per IP: anti-anonymous-scrape (10/s);
- L3 — per endpoint × key: protect heavy endpoints (/export 0.1/s);
- L4 — global concurrency: protect downstream DB connection pool (in-flight < 5000);
- L5 — priority shedding: under overload, drop low-priority first (export → list → create-payment is dropped last).
# Multi-dimensional composition (pseudo-code)
def admit(req):
keys = [
("key", req.api_key, 100, 1),
("ip", req.ip, 10, 1),
("ep", f"{req.api_key}:{req.path}", endpoint_limit(req.path), 1),
]
for kind, k, cap, cost in keys:
if not bucket_allow(f"{kind}:{k}", cap, refill_for(cap), cost):
return reject(retry_after_from(cap, k), reason=kind)
if global_inflight() > SHED_THRESHOLD and priority(req) < HIGH:
return reject(retry_after=30, reason="overload")
return ALLOW
Real-world cases:
- Stripe: splits traffic into critical (create charge) and non-critical (reporting / analytics); during overload, non-critical is shed first. This "protect the money path" priority shedding is called out explicitly in their engineering post.
- GitHub: REST has one bucket, GraphQL another (charged by query complexity), and Search has a stricter small bucket — same token, multiple dimensions.
- Twitter/X API: each endpoint has its own bucket with very different limits; the response header
X-Rate-Limit-Resource tells you which bucket was triggered.
4. The 429 contract: let callers back off gracefully
Principle: a limiter is not just a "deny" — it is a machine-readable backoff instruction for the caller. Without it, clients spin in a tight retry loop and pile on a second time.
Standard headers:
Retry-After: 30 — required, seconds or HTTP-date. RFC 9110.
X-RateLimit-Limit: 100 — the quota.
X-RateLimit-Remaining: 0 — how many left.
X-RateLimit-Reset: 1735689600 — reset epoch.
X-RateLimit-Resource: search — which bucket triggered (critical when there are many).
- Body: JSON with
error_code and doc_url so the caller can self-serve.
Crucial pitfall: clients must jitter. If 1000 callers all receive Retry-After: 30, 30 seconds later they all hit you again — thundering herd reborn. Either the server returns a jittered value (30 ± 5s), or the client adds jitter itself (Stripe SDK and AWS SDK both bake in exponential backoff + jitter).
Real-world cases:
- Stripe: 429 always includes
Retry-After; the official SDK bundles idempotency keys + exponential backoff for auto-retry.
- GitHub: over-limit returns 403 with
X-RateLimit-Reset (epoch); the docs explicitly say "wait until reset, otherwise keep getting 403".
- AWS: in addition to 429, you may see
ThrottlingException; every official SDK auto-retries with jitter.
Extensions & Optimization
- Hot key beyond a single Redis node: shard one high-traffic key across multiple Redis shards (
key:0..15), each with a fractional quota, client hashes to shard. Cost: precision loss (local over-admit).
- From hard limits to adaptive limiting: drive the limit from downstream latency / error rate (Netflix concurrency-limits, Google SRE adaptive throttling) — "let more through when healthy, fewer when sick".
- Cross-region global quota: per-region local quota plus per-second async cross-region reconciliation. A strongly-consistent global quota costs too much and rarely earns its keep.
- Cost-weighted requests: not every request is "one token" — weight by endpoint CPU cost (GraphQL query complexity = points charged, CPU-heavy endpoints charge 10).
- Fail-open vs fail-close: when the limiter itself dies, let traffic through (a limiter failure should not become a business failure), but downstream is suddenly exposed — pair with a circuit breaker.
Common Pitfalls / Interview Follow-ups
- "Why not sliding window log?" — Memory. 1M users × 100 reqs = 100M timestamps, several GB. Counter approximations get to < 5% error.
- "What if Redis goes down?" — Fail-open and circuit-break; Stripe states this as an iron rule.
- "Same user, multiple processes — how do they share a local token bucket?" — Per-process buckets don't; you need Redis, or "local sub-quota + async reconciliation".
- "429 vs 503 — when each?" — 429 = caller over its quota (the caller's problem); 503 = service overloaded (your problem). Confusing them breaks the caller's retry strategy.
- "How do you compute Retry-After?" — With a token bucket:
(1 - tokens) / refill_rate. With a sliding window: milliseconds until the next second tick. And the server must add jitter.
- "Why is token bucket more popular than leaky bucket?" — Public-API callers naturally arrive in bursts (a cron firing 50 requests at once); token bucket allows that legitimate burst. Strict leaky-bucket smoothing punishes normal users.
Deep Resources
- Stripe Engineering Blog — "Scaling your API with rate limiters" (Paul Tarjan): four limiter classes, Redis Lua, the fail-open principle. The companion GitHub gist has runnable Ruby code.
- Cloudflare Engineering Blog — "How we built rate limiting capable of scaling to millions of domains": distributed limiting at the edge — local decision plus async aggregation.
- Brandur Leach — "Rate Limiting, Cells, and GCRA" (brandur.org): the clearest explainer of GCRA, with a Redis implementation.
- Google SRE Book — Ch 21 "Handling Overload": adaptive throttling, priority shedding, client-side retry budgets.
- Marc Brooker's distributed systems essays: the right intuition for fail-open and backpressure as system properties.
Deeper Questions (click to expand)
1. How do you relate token-bucket capacity and refill rate? Is capacity = refill reasonable?
Common mistake: "100 req/s, so cap=100 and rate=100." But cap=100 means the user can save 100 tokens and instantly fire 100 requests — 1 ms of peak QPS at 100,000, which flattens downstream.
The right framing: cap is the allowed burst size, rate is the long-run average. They are independent:
- How big a burst can downstream absorb? → choose cap. Common cap = rate × 1–3 (1–3 seconds of burst).
- What's the daily allowance per user? → choose rate.
Stripe's posture is "rate is the contract, cap is the mercy" — cap is slightly above rate so normal bursts don't error, but not so big that downstream crashes. For extreme cases (financial reconciliation) cap = rate forces strict smoothing — effectively a leaky bucket.
2. 1M QPS won't fit through one Redis node (~300k QPS ceiling). How do you squeeze 1M limit decisions per second through it?
Stack three ideas:
- Local quick-deny: each gateway keeps a local list of "recently over-limit keys" with 100 ms TTL. Over-limit keys 429 without touching Redis. 99% of hot-key traffic is shielded locally.
- Local sub-quota: at startup, each gateway claims
N seconds of quota (say 100 tokens) from Redis and spends it locally. When empty, claim the next batch. Redis QPS drops from 1M to 1M / 100 = 10k. Cost: uneven distribution across gateways; quota drifts during scale-in/out.
- Redis Cluster shards: hash by user_id to different shards; each shard absorbs 300k QPS. Cost: a single global quota cannot be expressed as one Redis key.
In production the combo local quick-deny + local sub-quota + Redis Cluster typically lands Redis at 50–100k QPS.
3. Clients all receive 429 + Retry-After=30s, then retry simultaneously — second wave overload. What can the server do?
- Server returns jittered Retry-After: not a fixed 30 but
30 + uniform(0, 30) — spreads retries over an interval. Cost: some clients wait longer.
- Stagger window resets per client: anchor the bucket's expiry to
hash(api_key) so different users' window boundaries are naturally offset.
- Force backoff + jitter in the SDK: the official SDK does
retry_after × random(0.5, 1.5), covering all callers. Stripe, AWS, Google all do this.
- Soft signals in normal responses: while
X-RateLimit-Remaining is approaching 0, well-built SDKs proactively slow down — don't wait for 429 to react.
- Continuous-recovery semantics: don't have everyone go from 0 to full quota at the same reset instant; GCRA's continuous rolling refill naturally staggers retry timing.
4. The limiter itself crashes — fail-open or fail-close? Which kind of business goes which way?
Default fail-open: the reasoning is "limiting is protective; a limiter failure should not become a business failure". Stripe and AWS both do this. Fail-open means attack traffic flows straight to the backend the moment the limiter dies — pair it with a circuit breaker so downstream can defend itself.
When fail-close is the right call:
- Paid quota / billing quota: going over budget = you lose money (e.g., a free tier of 1000 calls / month). Better to deny than admit.
- Irreversible downstream effects: SMS sends, debit operations — blind admit can fan out into a bulk mistake.
- Regulatory constraints: some financial / healthcare limits are mandated; if you can't enforce, you must stop.
The pragmatic shape: DDoS layer fail-open (save the site), billing-quota layer fail-close (save the money), deployed as separate components.
5. A single GraphQL query can consume 100× the CPU. How do you define "one request" for rate-limiting?
REST's hidden assumption — "all requests cost about the same" — falls apart under GraphQL. One nested query can scan tens of thousands of rows and run a dozen joins.
The answer is cost weighting:
- Static cost: analyze the query at submit time, summing per-field cost (list field × an assumed 100 rows), and charge the bucket that many tokens up front. GitHub's GraphQL API does exactly this — 5000 "points" per hour, not 5000 requests.
- Dynamic cost: after execution, reconcile tokens by actual time / rows scanned. Cost: charge is retroactive, so a malicious query has already burned the CPU.
- Hard depth / field caps: before the limiter, reject queries with depth > 10 or > 1000 fields at the parsing stage. Apollo has built-in support.
- Persisted queries: only allow pre-reviewed query hashes; dynamic queries can't even reach your gateway. Facebook and Shopify run this at scale.
The same shape applies elsewhere: bulk APIs (one call with 100 IDs), file uploads (charge per byte), AI inference (charge per token). Any time a single request's cost can vary by > 10×, you need weights.