The Problem
Your home page has to return in under 200ms. A DB query takes 80ms and the page needs seven of them — without a cache, you're dead. But caching is never "just throw Redis at it." Facebook lost 2.5 hours of global availability in 2010 to a Memcached invalidation storm. Discord found LRU ruining their hot-channel message list in 2017 and switched to LFU, lifting hit rate from 78% to 96%. GitHub's PR page uses Russian-doll caching at 99.3% hit rate.
This issue covers three things: where caches live, how they evict, and how they handle writes. The hardest of these is invalidation — Phil Karlton famously called it one of the two hard problems in CS (the other being naming).
Requirements and Constraints (the interview opener)
- Read/write ratio: 100:1 is great for a cache; 1:1 with write-heavy workload usually doesn't pay (hit rate is too low).
- Hotness distribution: Pareto 80/20? Long-tail traffic caches poorly. Severe hot spots produce single-key storms.
- Staleness tolerance: a user avatar that's an hour old is fine; a one-second-old balance is an incident.
- Object size: 1KB or 1MB? Memcached's default slab maxes at 1MB; bigger needs splitting or Redis.
- Hit-rate target: under 80% usually means the cache layer isn't earning its place; over 95% to actually offload the backend.
- Invalidation strategy: TTL, explicit invalidate, or write-through?
High-Level Architecture (Multi-Layer Cache)
graph LR
B["Browser cache
localStorage / SW"]
CDN["CDN Edge
Cloudflare / Akamai"]
RP["Reverse Proxy
Varnish / Nginx"]
APP["App + L1
in-proc · Caffeine"]
L2["Redis / Memcached
shared L2"]
DB[("DB + buffer pool")]
B -->|①| CDN -->|②| RP -->|③| APP -->|④| DB
APP <-.->|miss → backfill| L2
L2 -.miss.-> DB
classDef client fill:#1a2530,stroke:#64c8ff,color:#e8eef5
classDef edge fill:#0e2030,stroke:#5eead4,color:#e8eef5
classDef cache fill:#1a1a30,stroke:#ffb450,color:#e8eef5
classDef origin fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
class B client
class CDN,RP edge
class APP,L2 cache
class DB origin
Hit rate increases left to right; on return, each layer backfills so the next request short-circuits earlier
Key Technical Points
1. Cache Layers: Browser / CDN / App / DB Buffer Pool
Principle: each layer solves a different latency. Browser cache saves the network round trip (0ms vs 50ms). CDN puts assets within 50 km of users at PoPs, p50 around 10ms. In-process cache (Guava, Caffeine, Python's functools.lru_cache) saves a Redis RTT (0.1ms vs 1ms). Redis is shared across processes; on hit, it skips the DB (1ms vs 50ms). DB buffer pool (Postgres shared_buffers, InnoDB buffer pool) saves disk I/O.
Trade-off:
- In-process cache: ✅ ~0.1ms latency, no network failures; ❌ N processes = N copies (memory waste), invalidation requires broadcast (hard to do right).
- Distributed cache (Redis): ✅ one view, clean invalidation; ❌ one extra RTT, Redis itself needs HA and scaling.
- CDN: ✅ global reach, DDoS protection; ❌ only useful for publicly shareable content, invalidation is slow (Cloudflare global purge ~30s).
Real cases:
- Facebook: TAO (in-memory graph cache) plus a Memcached ocean (tens of thousands of nodes), >99% hit rate, a billion+ reads per second. The paper "Scaling Memcache at Facebook" is required reading.
- Twitter: a user's entire timeline page is Redis-cached; for top accounts it's pre-generated via fanout.
- GitHub Rails: Russian-doll caching (nested fragment caches) gives the PR detail page 99.3% hit rate.
- Stack Overflow: one data center, nine servers handling global traffic, riding five cache layers (HTTP, Redis, SQL Server buffer, .NET CLR, client-side).
2. Eviction Policies: LRU vs LFU vs ARC vs TinyLFU
Principle: cache capacity is bounded, so you need a rule for "who dies first."
| Policy | What gets evicted | Best for | Hit rate |
| LRU | least recently accessed | strong temporal locality | medium |
| LFU | least frequently accessed | frequency-locality, stable hot set | high (steady state) |
| ARC | adaptive LRU+LFU | shifting access patterns, no tuning | high |
| W-TinyLFU | LFU with admission control | bursty traffic, scan-resistant | very high |
| FIFO/Random | first-in / random | O(1) minimalism, pure streaming | low |
Trade-off:
- LRU: ✅ trivial to implement (hash map + doubly linked list = O(1)); ❌ poisoned by one-off scans (e.g. a crawler hitting every page) — poor scan resistance.
- LFU: ✅ scan-resistant; ❌ old entries "ossify," and new hot keys can't push through (needs aging / decay).
- ARC: ✅ adaptive; ❌ IBM patent (Linux uses 2Q instead), complex implementation.
- W-TinyLFU (used by Caffeine): ✅ Count-Min Sketch estimates frequency, admission control keeps rare keys out; ❌ probabilistic structure has error.
# Minimal Python LRU (OrderedDict)
from collections import OrderedDict
class LRU:
def __init__(self, cap): self.cap, self.d = cap, OrderedDict()
def get(self, k):
if k not in self.d: return None
self.d.move_to_end(k) # recent access → tail
return self.d[k]
def put(self, k, v):
if k in self.d: self.d.move_to_end(k)
self.d[k] = v
if len(self.d) > self.cap:
self.d.popitem(last=False) # evict head (least-recent)
# Redis uses *approximate* LRU: sample 5 random keys, evict the oldest
# maxmemory-policy allkeys-lru / allkeys-lfu / volatile-ttl
Real cases:
- Redis: default is
allkeys-lru, but implemented as approximate LRU (samples 5–10 keys to compare, saving the overhead of maintaining a true LRU list). Redis 4.0+ supports LFU.
- Caffeine (Java): uses W-TinyLFU, the default cache in Cassandra, HBase, and Kafka Streams. 5–15% higher hit rate than LRU.
- Linux page cache: uses 2Q (active/inactive lists) — approximate LRU but scan-resistant.
- Discord: early message cache used LRU, but hot channels had bad "least-recent" misjudgments. Switching to LFU lifted hit rate from 78% to 96%.
3. Write Strategies: Cache-aside vs Write-through vs Write-back
Principle: on a write, the order in which you touch cache and DB — and who owns the write — determines consistency and complexity.
| Cache-aside (lazy) | Write-through | Write-back |
| Read | miss → DB → backfill cache | always cache hit | cache hit |
| Write | write DB → delete cache | write cache → sync to DB | write cache only → async to DB |
| Consistency | eventual (small window) | strong (cache = DB) | weak (crash loses unflushed) |
| Write latency | low (DB only) | high (two writes) | very low (memory only) |
| Fits | Redis + DB, the default | finance, strong consistency reads | CPU cache, SSD FTL |
# Cache-aside canonical form (Python)
def get_user(uid):
user = redis.get(f"u:{uid}")
if user: return json.loads(user)
user = db.query("SELECT * FROM users WHERE id=%s", uid)
redis.setex(f"u:{uid}", 3600, json.dumps(user)) # TTL safety net
return user
def update_user(uid, data):
db.update("UPDATE users SET ... WHERE id=%s", uid, data)
redis.delete(f"u:{uid}") # ⚠️ delete, not update — avoids concurrent-write disorder
# ⚠️ another concern: a read right after delete may backfill the old value → double-delete / delayed double-delete
Trade-off:
- Cache-aside: ✅ simple, write path independent of cache health; ❌ first miss is slow, you have to invalidate by hand (easy to get wrong).
- Write-through: ✅ cache is always truth; ❌ every write pays double; cache down = write down.
- Write-back: ✅ fastest; ❌ cache crash = data lost. Never for account balances. But CPU L1/L2 and SSD FTL run on it.
Real cases:
- Instagram: feed is cache-aside on Redis; posting deletes the related cache keys.
- LinkedIn: messaging center is write-through to Voldemort (their KV store), so reads always see the latest.
- MySQL InnoDB: buffer pool is write-back, with redo log + checkpoint guaranteeing durability.
- Apple iCloud Photos: thumbnails are write-through to S3 so the CDN always finds something to serve.
4. The Hardest Part: TTL, Thundering Herd, Avalanche, Penetration
Principle: four classic invalidation traps, every one of which has burned production somewhere.
- Thundering herd (cache stampede): a hot key expires; 10k concurrent requests blast through to the DB. Fixes: mutex (single-flight), logical expiration (never truly expires; refresh in the background), proactive renewal.
- Cache avalanche: many keys expire at the same instant, the DB takes the full hit. Fix: add jitter to TTLs (
3600 + random(0, 300)).
- Cache penetration: lookups for non-existent keys keep falling through to the DB. Fixes: cache the null result (short TTL); a Bloom filter in front to short-circuit.
- Stale-while-revalidate: return the stale value while refreshing in the background (CDNs, HTTP, Next.js ISR all do this) — zero latency, eventual consistency.
# Single-flight to prevent stampede (Go)
var g singleflight.Group
func GetHot(key string) (any, error) {
v, err, _ := g.Do(key, func() (any, error) {
if v := redis.Get(key); v != nil { return v, nil }
v := db.Query(key)
redis.SetEX(key, v, ttlWithJitter(3600))
return v, nil
})
return v, err
}
Real cases:
- Facebook Memcached lease: on a miss, only one client gets a lease token to go to the origin; others wait. Detailed in the paper.
- Cloudflare: stale-while-revalidate, serves expired content for another 60s while refreshing in the background — almost no p99 spikes.
- Pinterest: a 30-minute outage in their history was a Memcached cluster-wide TTL avalanche; the postmortem added jitter.
- Vercel ISR: Next.js Incremental Static Regeneration is stale-while-revalidate productized.
Scaling Up: What to Do After Growth
- Hit rate won't climb: analyze the access distribution. Long tail? Consider pre-warming or larger capacity. Hot spot? Shard the hot key across multiple sub-keys (
user:123:part:{0..15}).
- One Redis instance isn't enough: move to Redis Cluster (16384 slots) or a proxy layer (Twemproxy / Codis); hot writes need consistent hashing.
- Cross-region latency: per-region independent Redis clusters, async invalidation broadcasts on DB writes (Facebook uses mcsqueal).
- Objects too large: switch to Redis (supports 512MB values) or split; consider caching only an ID list and going to DB for details (saves memory).
- Finding the bottleneck: monitor hit rate, p99 latency, and origin QPS. A drop in hit rate is usually a new feature that wasn't cached or an invalidation bug.
Common Pitfalls (Even Senior Engineers Slip Here)
1. Update cache, or delete cache? Always delete. Updates create write ordering hazards (A writes v1 → B writes v2 → B updates cache → A updates cache; cache stays at v1 forever).
2. Write DB first, or delete cache first? Standard is "DB first, then delete cache." Deleting first leaves a window where a read can backfill the old value. Worst case use "delayed double delete": delete → write DB → sleep 500ms → delete again.
3. The cache-DB consistency window is never zero. Acknowledge that; define the staleness budget at the business level.
4. Don't treat Redis as a DB: Redis persistence (RDB / AOF) isn't built for it as the design goal — crash RPO is at least a few seconds. For account balances, Redis is a cache, not the source of truth.
5. Ignoring cold start: a restart or cluster swap drops hit rate to 0% and the DB takes the full load. Pre-warm or ramp traffic in slowly.
Sample Interview Questions
- Design Instagram's home feed cache: 1B users, p99 < 200ms.
- One Redis hot key (a celebrity's follower list) at 1M QPS — how do you survive?
- What's your cache consistency model? Discuss the failure cases for "DB first then delete cache" both ways.
- Implementation details for LRU vs LFU, time complexity, and which fits which scenario?
- How do you stop cache penetration (malicious lookups for non-existent keys)? Compute a Bloom filter's false positive rate.
Key Resources
- Designing Data-Intensive Applications, Ch.1 + Ch.5 (Kleppmann): caching and consistency from first principles.
- "Scaling Memcache at Facebook" (NSDI 2013): the bible of distributed caching — leases, gutter pool, cross-region replication, all in one paper.
- Caffeine wiki: design details for W-TinyLFU; author Ben Manes's benchmarks are excellent.
English Summary
Caching is a multi-layer art spanning browser, CDN, reverse proxy, in-process, distributed (Redis/Memcached), and DB buffer pools. Choose eviction policy by access pattern: LRU for recency, LFU for frequency, W-TinyLFU (Caffeine) when in doubt. Write strategies trade consistency for performance: cache-aside is the default, write-through ensures correctness, write-back maximizes throughput at the cost of durability. The hardest part is invalidation — beware thundering herd, avalanche (use TTL jitter), penetration (bloom filter), and the eternal question of delete vs update cache (always delete).
Going Deeper (click to expand)
1. Hit rate drops from 95% to 90%. How many times more load does the DB see — and why isn't it simply 2×?
Naive math: miss rate 5% → 10%, origin QPS doubles, DB load = 2×.
Reality is worse, at least three amplifiers:
- Misses also cause writes: every miss is followed by a cache SET; under high concurrency you also get a thundering herd — 1000 concurrent requests for the same key all hit the DB once each, not collectively once.
- The DB's own caches collapse in turn: Postgres shared_buffers / MySQL buffer pool hit rate sinks too, disk I/O spikes, p99 isn't just 2× anymore.
- Connection-pool contention: the pool is sized for 95% hit rate; doubling the QPS exhausts it, new requests queue, p99 avalanches.
Real numbers: at 90% hit rate, DB load is usually 3–5× nominal. Hence the rule of thumb: a 5-point drop in hit rate is a P2.
2. Cache-aside "write DB then delete cache" still occasionally returns stale data in production. List three plausible causes and the matching fix.
- Concurrent read interleaved with write: A reads, cache miss → A reads DB (gets v1) → B writes DB (v2) → B deletes cache → A backfills with v1. Cache stuck at v1. Fix: delayed double-delete (B deletes → sleep 500ms → delete again); or single-flight to limit concurrent backfills.
- Replication lag: write goes to primary, cache deleted, but a follow-up read routes to a not-yet-replicated replica and returns stale data. Fix: read from primary for a window after delete, or "write primary → wait for replica → then delete cache."
- Delete failed: DB write succeeded but Redis delete dropped due to network jitter, with no retry. Fix: outbox / CDC pattern — durably emit the invalidation event (Debezium → Kafka → consumer deletes cache), at-least-once delivery.
- Multi-region desync: US-East deleted, US-West didn't. Fix: cross-region invalidation broadcast (Facebook's mcsqueal).
- Client-side local cache: in-process Caffeine still serves the old value even after Redis is invalidated. Fix: shorter local TTL or Redis pub/sub to push invalidation.
3. Redis Cluster doesn't support MULTI / transactions across slots. How does that affect key design? What does the hash tag ({user:123}) solve, and what does it introduce?
The constraint: user:123:profile and user:123:posts can hash to different slots on different nodes; a MULTI across them gets rejected (CROSSSLOT). Same goes for cross-slot MGET/MSET.
Hash tag usage: write keys as {user:123}:profile and {user:123}:posts; the content inside {} is the hash input, forcing both keys into the same slot.
What it introduces:
- Skew: all of a user's keys land on one node; a celebrity user becomes a single-node hot spot with pegged CPU.
- Big key risk: binding multiple collections together means one heavy user is a memory blowup on one node.
- Painful resharding: cluster resize has to migrate the whole group as a unit; writes to those keys stall during migration.
- Anti-pattern temptation: developers tag every "might be read together" key, until the cluster degenerates into a single-node Redis.
Field rule: only use hash tags when atomicity is truly required (cart + inventory). For multi-key display pages, do concurrent GETs from the app rather than rely on transactions.
4. A hot key sees 1M QPS at a single Redis node, which can't handle it. List three solutions across different axes and their costs.
- ① Key sharding (split on write, merge on read): turn
celebrity:123:followers_count into celebrity:123:followers_count:{0..15}; on write, increment a random shard; on read, SUM 16 shards. Cost: 16× read amplification, weak consistency (transient sum may dip), only works for aggregable data.
- ② Multi-replica reads (local replica): replicate the hot key to several Redis nodes; clients pick one at random. Cost: write must fan out to N replicas, weak consistency from async replication. Redis Cluster's READONLY mode does this.
- ③ Multi-layer cache: in-process L1: each app instance keeps the hot key in local Caffeine with a 1s TTL. 100 app servers × 1 origin call/s = 100 QPS to Redis. Cost: up to 1s of staleness, memory waste (100 copies), invalidation pain.
- ④ CDN / edge cache: for public data (celebrity follower count), put it on the CDN; edges absorb 1M QPS. Cost: only for publicly cacheable content, slow invalidation (30s+).
- ⑤ Pre-compute + push: don't let users query at all — pre-generate into each user's timeline cache (Twitter fanout-on-write). Cost: huge write amplification, unrealistic for 100M-follower celebrities.
Production combo: usually ① + ③ together — shards take writes, local L1 takes reads.
5. On restart, hit rate is 0% and the DB melts. How do you pre-warm? And how do you keep pre-warming itself from killing the DB?
When to pre-warm:
- Before a new release (blue-green: warm green, then swing);
- During a Redis cluster swap (old dump → new import);
- The night before a big sale (load predictable hot data ahead of time).
How to pre-warm:
- Offline analysis + bulk load: from the last 7 days' access logs, compute top 100k keys, dump KV pairs offline, write them to Redis via RESTORE or a pipeline (not one SET at a time).
- Slow-start in production: a new instance gets 1% traffic on launch, cache fills gradually, ramp to 100% over 5 minutes. Envoy's slow_start_config does this out of the box.
- Dual-cache switchover: old Redis still serves; mirror a shadow stream to the new Redis to warm it; cut over once hit rate is high.
Don't melt the DB while pre-warming:
- Rate-limit: the pre-warm script uses a token bucket (e.g., 5k QPS cap) so DB connections don't peg.
- Read replica: warm from replicas to spare the primary's transaction capacity.
- Off-peak: warm during the trough (3–5am); or piggyback on natural traffic ("first user triggers origin, everyone else benefits").
- Single-flight: pre-warming needs herd protection too — one worker per key.
Real story: Pinterest melted the DB for 30 minutes during a 2014 Memcached cluster swap with no pre-warm; the fix was to run old and new in parallel for an hour.