The interviewer says: "Design a photo-sharing feed (Instagram-like) for 100M DAU." Before you draw a single box, answer one question: is this a one-machine job or a thousand-machine job? The answer dictates the architecture — single DB vs sharding, synchronous fanout vs async, how large the cache needs to be. Capacity estimation (back-of-the-envelope, BOTE) uses a few assumptions and order-of-magnitude constants to land QPS, storage, bandwidth and memory within 10× of the truth in two minutes, collapsing countless candidate architectures down to one or two.
Its purpose is not precision — precision is the job of load tests and real metrics. Its purpose is to eliminate the obviously infeasible: if you compute 200 GB/s of write bandwidth, the "single machine + local disk" plan is dead on the spot. Fix the input assumptions for this issue:
50:2 = 25:1 — read-heavy, as nearly all feed/social systems are.Capacity estimation is not a pile of numbers — it is a directed derivation chain: every quantity flows from two root assumptions, DAU and "per-user behavior." Each arrow below is one multiply/divide. Memorize the chain and you never miss a term.
graph TD
DAU["DAU = 100M
+ per-user behavior"]
W["writes/day
200M"]
R["reads/day
5B"]
AQ["avg QPS
÷ 1e5 s/day"]
PQ["peak QPS
× peak factor 2~3"]
ST["storage
writes × obj size × retention"]
BW["bandwidth
QPS × payload"]
MEM["cache memory
hot data × 80/20"]
DAU --> W --> AQ
DAU --> R --> AQ
AQ --> PQ
W --> ST
R --> BW
R --> MEM
classDef root fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
classDef mid fill:#1a2530,stroke:#64c8ff,color:#e8eef5
classDef out fill:#0e2030,stroke:#5eead4,color:#e8eef5
class DAU root
class W,R,AQ mid
class PQ,ST,BW,MEM out
Two root assumptions (scale + per-user behavior) decide everything; QPS sets compute, storage/bandwidth/memory set the shape
Core trade-off: memorize a tiny set of constants → mental-math speed, at the cost of ±2× error. The bedrock is three groups of "free" numbers. First, powers of two and data units: 2^10≈10^3=KB, 2^20=MB, 2^30=GB, 2^40=TB, 2^50=PB. Second, time constants: a day ≈ 86400 ≈ 10^5 seconds (this approximation turns "N times per day" into "÷100k = QPS," the single most-used step in estimation). Third, Jeff Dean's latency numbers: they tell you which step is the bottleneck.
| Operation | Magnitude | Meaning |
|---|---|---|
| L1 / main-memory reference | ~1 ns / ~100 ns | CPU-local, essentially free |
| Sequential memory read 1 MB | ~microseconds (µs) | memory bandwidth is huge |
| SSD random read | ~16 µs | 100× slower than memory |
| Same-datacenter round trip (RTT) | ~0.5 ms | floor for one RPC |
| Disk seek / SSD seq read 1 MB | ~ms | batch beats random |
| Cross-continent RTT (CA↔NL) | ~150 ms | speed of light, can't optimize |
Core trade-off: estimating with average QPS is easy, but a system provisioned for the average dies at peak. Three steps: ① requests/day = DAU × times-per-user; ② average QPS = requests/day ÷ 10^5; ③ peak QPS = average × peak factor. The peak factor comes from intraday unevenness — evening peaks, lunch, viral spikes; social apps typically use 2~3×, while flash sales/livestream openings can exceed 10×.
# This issue's numbers (100M DAU)
write: 1e8 × 2 = 2e8 /day → avg 2,000 QPS → peak ~5k write QPS
read: 1e8 × 50 = 5e9 /day → avg 50k QPS → peak ~100k read QPS
read:write ≈ 25:1 → the read path is the design center (cache, replicas, fanout)
Core trade-off: estimate "metadata" and "large objects/media" separately, or one order-of-magnitude slip wrecks the whole table. Each quantity has a formula; the key is picking the right multiplier.
# storage = writes/day × object size × retention
metadata: 2e8/day × 1KB × 365 × 5yr ≈ 365 TB → shard + hot/cold tiering
media: assume 30% of posts carry an image, 200KB each
2e8 × 0.3 × 200KB ≈ 12 TB/day → object store (S3) + CDN, not in DB
# bandwidth = peak QPS × bytes per response
read egress: 100k QPS × (20 posts × 1KB) = 2 GB/s ≈ 16 Gbps (metadata only)
media bandwidth rides the CDN, absorbed at the edge, not origin
# cache memory = hot data × hit target
hot timelines: 20% DAU = 20M active × 200 ids/user × ~50B
≈ 200 GB → fits a mid-size Redis cluster
Core trade-off: an estimate's credibility lies not in decimals but in whether the assumptions are stated and survive "what if this number were off by 2×?" A good estimate writes every input explicitly with its source: which are given (DAU), which are industry common sense (read:write), which you simply guessed (peak factor). Then run a sensitivity check — which assumption, if off by 2×, flips the conclusion? That one is the one to confirm with the interviewer, or to monitor closely in production.
# Assumption list (the real deliverable of estimation)
DAU = 1e8 [given]
posts/user/d = 2 [assumed · IG-like median user]
views/user/d = 50 [assumed · on the high side, a sensitive item] ← 2× off → read QPS doubles
peak factor = 2.5 [assumed · evening-peak heuristic]
object size = 1 KB [metadata; media counted separately]
retention = 5 yr [product decision, to confirm]
sensitivity: read QPS is linear in "views/user" → verify first
storage is linear in "retention" → confirm with product
Common follow-ups: ① Roughly how many seconds in a day, and why approximate to 10^5? ② Can this run on one machine — how do you tell at a glance? ③ If DAU grows 10×, which quantity breaks first and how does the architecture change? ④ Where did your peak factor come from, and how does the conclusion change if it's off by 2×? ⑤ Should storage include replicas and retention, and how do you estimate hot/cold tiering?
The truth is 86400 ≈ 8.64×10^4; using 10^5 overstates by ~16%. In estimation this is benignly wrong: dividing by 10^5 makes average QPS ~16% low, but we immediately multiply by a peak factor of 2~3, and that bias is utterly drowned by the peak factor's own uncertainty — your grip on the peak factor isn't anywhere near 16% precision. So at the order-of-magnitude level, zero impact.
Where it bites: when traffic is not spread across the day but highly concentrated — e.g. a flash sale that runs 5 minutes at 8am. Then "÷10^5" average QPS is meaningless; the real instantaneous QPS = day's total ÷ 300 seconds, possibly hundreds of times the "average." Lesson: amortizing over a day only holds for roughly uniform traffic; concentrated loads must use the actual active window as the denominator.
The root is read:write ratio compounded by amplification. First, views/user (50) is naturally an order of magnitude higher than writes/user (2). Second, one "read feed" on the backend is usually a fanout — one request → pull N posts + counts + follow graph — so read amplification means even more backend read ops. Third, writes have a natural ceiling (a human can't post hundreds of times a day), while reads can be scrolled endlessly — the product shape makes reads effectively uncapped.
So social/content systems are almost always "easy write DB, strained read path." This is exactly why their core weapons are caching, read replicas, precomputed timelines (fanout-on-write) — not write sharding. The counterexample is IoT/log ingestion — writes far exceed reads, and the architecture's center flips entirely to Kafka + LSM sharded writes. In one line: the read:write ratio decides which side you pile machines on.
You can't just divide 365 TB by per-disk capacity. First inflate logical volume into physical footprint: ① ×3 replicas (HA) → ~1.1 PB; ② add indexes, WAL/redo, fragmentation and reserved space, often another ×1.3~2 in practice; ③ disks can't be filled — leave 30% headroom. Physical need can approach the 2 PB range.
Then divide by per-node effective capacity — and the bottleneck may not be capacity: a single DB node usually hits IOPS/connections/memory limits before disk space. Stuffing 10 TB on one node that can't sustain the QPS is routine. So the right answer is "size node count by whichever of capacity and QPS is tighter, then ×replicas." Quoting only "365 TB ÷ 16 TB ≈ 23 nodes" is a classic point-loser — it drops replicas, overhead and the QPS constraint all at once.
Because architecture options often differ by more than an order of magnitude, and the decision only needs to tell magnitudes apart. "One machine vs a thousand," "single write primary suffices vs needs distributed storage," "data fits in memory vs must spill to tiered disk" — these forks are 10×, 100× gulfs, and estimation's ±3× error can't cross them, so it's plenty to push you to the right side.
Conversely, estimation can't answer fine-tuning within a magnitude: 8 nodes or 12, cache at 50 GB or 80 GB — those differences are indistinguishable within estimation's precision and must go to load tests and production metrics. So the correct use is: use estimation to pick the architecture shape (coarse decision), use load tests and monitoring to fix the exact capacity (fine decision). Treating estimation as a precise budget, or refusing it for being imprecise, are both misuses.
Total users = 10k × 500 = 5M, two orders smaller than consumer — but the center of gravity shifts. First, B2B "per-user behavior" is wholly different: high-frequency during work hours, near-zero at night, so the peak factor is more extreme (concentrated 9–18 on weekdays), yet spread across time zones smooths the peak — correct by tenant time-zone distribution. Second, the binding constraint moves from "total volume" to "per-tenant ceiling and tenant isolation": the largest tenant may be 100× the median (power-law), so estimating by the average lets a giant tenant crush shared resources — estimate the "largest tenant," not the "average tenant." Third, multi-tenant storage must count per-tenant overhead (the fixed cost of a dedicated schema/index × 10k); for small tenants, the sum of fixed overheads can exceed the data itself. In one line: consumer estimation looks at total volume and peak; B2B estimation looks at the tail of the tenant distribution and the cost of isolation (echoing Day 27, multi-tenancy and cost engineering).