The Problem
Day one of your product: 1k DAU, one EC2 box running Postgres and Node, everything fine. Six months later you're at 1M DAU and 3k QPS, p99 latency has gone from 80ms to 2.4s, and the DB drops out occasionally. This story is almost universal: Instagram reached 14M users with three engineers within about a year of its 2010 launch; Notion moved from a single Postgres instance to a sharded fleet around 2021. Same arc, every time.
This issue isn't "throw more boxes at it." It's about figuring out which loads should be scaled up vs out, when you actually have to go horizontal, what hidden costs come with that, and why stateless services are the price of admission for horizontal scale.
Requirements and Constraints (the interview opener)
- Read/write ratio: 100:1 (social feeds) or 1:1 (instant messaging)? Determines whether you scale reads with replicas first, or shard writes.
- QPS magnitude: 100, 10k, or 1M? Below 10k QPS you almost never need distributed anything; at 1M you're forced into sharding and CDN.
- Data volume and growth curve: 1TB or 1PB? A single Postgres instance is comfortable up to about 10TB in practice.
- Latency SLO: p99 at 100ms (search) or 10ms (ad bidding)? The latter rules out any cross-region hop.
- Consistency requirements: bank-transfer strong, or like-button eventual?
- Budget: an m5.large is roughly $70 a month on demand; a c7gn.16xlarge is thousands. Often overlooked as a constraint.
High-Level Architecture (one box to N tiers)
graph TD
    C[Client]
    DNS["DNS / GeoDNS"]
    L4["L4 LB · LVS<br/>~1M QPS ingress"]
    L7["L7 LB · Nginx/Envoy<br/>routing · TLS · rate limit"]
    A1["App-1<br/>stateless"]
    A2["App-2<br/>stateless"]
    AN["App-N<br/>stateless"]
    R[("Redis<br/>Cluster")]
    PG[("PG Primary")]
    R1[("Replica-1<br/>read replica")]
    R2[("Replica-2<br/>read replica")]
    C --> DNS --> L4 --> L7
    L7 --> A1 & A2 & AN
    A1 & A2 & AN --> R
    A1 & A2 & AN --> PG
    PG -.replicate.-> R1
    PG -.replicate.-> R2
    classDef edge fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef app fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef data fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    class C,DNS,L4,L7 edge
    class A1,A2,AN app
    class R,PG,R1,R2 data
Key Technical Points
1. Vertical vs Horizontal Scaling
Principle: vertical scaling = bigger machine (more CPU / RAM / NVMe); horizontal scaling = more machines. Vertical scaling stays roughly linear in price-performance until you hit memory-bandwidth or NUMA limits. A single cloud instance today can reach 192 vCPUs (for example, c7i.48xlarge) or 2 TiB of RAM (x2idn.32xlarge), so don't go distributed too early.
Trade-off:
- Vertical: ✅ no code changes, ACID intact, simple ops; ❌ single point of failure, non-linear pricing (96 → 192 vCPU costs 2.3×), hard physical ceiling.
- Horizontal: ✅ linear cost, fault tolerance; ❌ forces statelessness, introduces network partitions, debugging is N× harder, you need distributed locks or transactions.
Real cases: Stack Overflow's 2016 architecture writeup describes 9 web servers and 4 SQL Servers (each a large multi-core machine with hundreds of GB of RAM), handling a few thousand requests per second. Their philosophy: "Hardware is cheap, programmers are expensive." They kept that vertically scaled design for years afterward rather than sharding. The counter-example: Pinterest in 2012 rode MySQL vertical scaling until the master couldn't take it, then did an emergency sharding migration under fire.
2. Load Balancer: L4 vs L7
Principle: L4 (TCP/UDP) only inspects the connection five-tuple (source and destination IP and port, plus protocol), forwards without unpacking the payload, and can do 2M+ QPS per box (LVS/IPVS, AWS NLB). L7 (HTTP) can route by path, header, or cookie and handle TLS termination, compression, and rate limiting, but the CPU cost is much higher (Envoy does roughly 50k QPS per core). In production you typically see L4 in front, L7 behind.
Load balancing algorithms:
- Round Robin: simple, but ignores actual backend load; long-lived connections skew quickly.
- Least Connections: the choice for long connections (WebSocket, gRPC).
- Consistent Hashing: required when you need session affinity or cache locality (CDNs, Redis proxies).
- EWMA / P2C: "Power of Two Choices": pick two backends at random, take the less loaded one. Envoy's least-request policy works this way (its default policy is round robin). Minimal logic, near-optimal behavior.
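# P2C sketch (the idea behind Envoy's least-request policy)
import random

def pick_backend(backends):
    # Sample two distinct backends, keep the one with fewer in-flight requests.
    a, b = random.sample(backends, 2)
    return a if a.active_requests <= b.active_requests else b
# Math: with N backends, the maximum load drops from Θ(log N / log log N)
# under purely random assignment to Θ(log log N) with two choices.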
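The load-spreading claim is easy to check empirically. The sketch below is a toy simulation, not a model of any real load balancer: it assumes requests never finish (so "load" is a running count per backend), and the function names are made up for illustration. It compares purely random assignment with two-choice assignment.

```python
import random

def max_load_random(n_backends, n_requests, rng):
    """Send each request to one uniformly random backend; return the peak load."""
    load = [0] * n_backends
    for _ in range(n_requests):
        load[rng.randrange(n_backends)] += 1
    return max(load)

def max_load_p2c(n_backends, n_requests, rng):
    """Sample two distinct backends per request and pick the less loaded one."""
    load = [0] * n_backends
    for _ in range(n_requests):
        a, b = rng.sample(range(n_backends), 2)
        pick = a if load[a] <= load[b] else b
        load[pick] += 1
    return max(load)

def compare(n_backends=100, n_requests=10_000, seed=42):
    """Return (peak load with random choice, peak load with P2C)."""
    rng = random.Random(seed)
    return (
        max_load_random(n_backends, n_requests, rng),
        max_load_p2c(n_backends, n_requests, rng),
    )
```

Calling `compare()` returns both peaks; the perfectly even share here is 100 requests per backend, and the two-choice peak should sit much closer to it than the random one.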
Real cases: Cloudflare built its own L4 LB, Unimog, to spread traffic across the servers in each data center. AWS ALB is a managed L7 service; NLB is L4, runs on Hyperplane, and adds very little latency.
3. Stateless Services (the ticket to horizontal scale)
Principle: service instances hold no "state needed for the next request" — sessions, caches, counters all live externally in Redis or a DB. Then the LB can route any request to any instance, and a crash hurts no one.
Trade-off: stateless ≠ no caching. Local caches (Caffeine, in-process LRU) are still fine, but you accept that they disagree across instances. Pushing everything to Redis adds about a 0.5ms hop and a single-point risk. The usual compromise: local cache + 30s TTL + pub/sub invalidation.
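The "local cache + TTL + invalidation" compromise might look roughly like this. It is a minimal single-threaded sketch: `LocalTTLCache`, `loader`, and the injectable `clock` are illustrative names, and the pub/sub wiring is left out. The assumption is that whatever subscriber you run calls `invalidate()` when another instance changes a key.

```python
import time

class LocalTTLCache:
    """In-process cache: entries expire after a TTL or on explicit invalidation."""

    def __init__(self, loader, ttl_seconds=30.0, clock=time.monotonic):
        self._loader = loader          # fetches the value from Redis or the DB
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so tests can control time
        self._entries = {}             # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]            # fresh local hit, no network hop
        value = self._loader(key)      # miss or expired: go to the shared store
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Hook for the pub/sub subscriber: drop a key another instance changed."""
        self._entries.pop(key, None)
```

The TTL bounds how long instances can disagree if an invalidation message is lost; the invalidation hook shortens that window in the common case.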
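# Anti-pattern: sessions in process memory
sessions = {}  # ❌ lost on restart, scale-up drops users, A's session invisible to B

# Better: JWT (stateless) or Redis (shared state)
import jwt  # PyJWT; assumes RS256-signed tokens

def auth(req):
    # Expects "Authorization: Bearer <token>"
    token = req.headers["Authorization"].removeprefix("Bearer ")
    # Raises jwt.InvalidTokenError on a bad signature or an expired token
    return jwt.decode(token, PUBLIC_KEY, algorithms=["RS256"])  # zero server-side state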
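To see why a signed token needs no server-side lookup, here is a standard-library sketch of the idea. It is deliberately not a real JWT: the token format, `SECRET`, `issue_token`, and `verify_token` are all invented for illustration, and it uses a shared HMAC key where production setups often use asymmetric signatures.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret-change-me"  # hypothetical key shared by all instances

def _b64(data):
    return base64.urlsafe_b64encode(data).decode("ascii")

def issue_token(user_id, ttl_seconds=900, now=None):
    """Return 'payload.signature'; any instance holding SECRET can verify it."""
    now = time.time() if now is None else now
    payload = json.dumps({"sub": user_id, "exp": now + ttl_seconds}).encode("utf-8")
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return _b64(payload) + "." + _b64(sig)

def verify_token(token, now=None):
    """Return the claims dict, or None if the token is malformed, forged, or expired."""
    now = time.time() if now is None else now
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong number of parts or bad base64
        return None
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(payload)
    if claims["exp"] <= now:
        return None
    return claims
```

The cost shows up in revocation: nothing here can invalidate a token before `exp`, which is the usual argument for short lifetimes or for Redis-backed sessions instead.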
Real cases: Netflix mandates stateless microservices, with releases handled by Spinnaker red-black deploys — bring up a new fleet, swing traffic, kill the old one. Counter-example: Discord's voice service has to be stateful (a voice channel is anchored to a single server). They solve it with consistent hashing plus a state-migration protocol, and a 2020 talk admitted that's the hardest part of their architecture.
4. Capacity Planning and "Good Enough"
Principle: don't reach for Kubernetes and microservices on day one. Shopify handled a Black Friday peak of roughly 76M requests per minute (about 1.27M per second) with a monolithic Rails app on a large pod fleet. Stay monolithic as long as you can; add machines before sharding. Back-of-envelope math is core interview material:
# Example: designing a Twitter timeline
# 300M DAU × 100 refreshes/day = 3e10 reads/day
# = 3e10 / 86400 ≈ 350k QPS average, peak ~3× ≈ 1M QPS
# A 4-core box handles ~5k QPS of static reads
# → 1M / 5k = 200 read instances at peak, ~400 with 2× headroom
Interview tip: the interviewer will almost certainly ask you to estimate QPS, storage, and bandwidth. Memorize three numbers: 1 day ≈ 1e5 seconds, 1 QPS × 1 year ≈ 30M rows, 1 Gbps ≈ 125 MB/s.
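The same estimate as a reusable function, as a sketch: `estimate_read_fleet` and its parameter names are made up for illustration, and the inputs are the ones from the timeline example (3× peak factor, 5k QPS per instance, 2× headroom). It uses exact arithmetic, so the counts land slightly above the round-number version.

```python
import math

SECONDS_PER_DAY = 86_400

def estimate_read_fleet(dau, reads_per_user_per_day, peak_factor,
                        qps_per_instance, headroom):
    """Back-of-envelope fleet sizing from daily active users."""
    avg_qps = dau * reads_per_user_per_day / SECONDS_PER_DAY
    peak_qps = avg_qps * peak_factor
    return {
        "avg_qps": avg_qps,
        "peak_qps": peak_qps,
        # Instances needed to just barely serve the peak.
        "instances_at_peak": math.ceil(peak_qps / qps_per_instance),
        # Instances needed with the headroom multiplier applied.
        "instances_with_headroom": math.ceil(peak_qps * headroom / qps_per_instance),
    }

timeline = estimate_read_fleet(
    dau=300_000_000,
    reads_per_user_per_day=100,
    peak_factor=3,
    qps_per_instance=5_000,
    headroom=2,
)
```

In an interview the rounded mental version is what matters; a function like this is mostly useful for checking how sensitive the answer is to each assumption.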
Scaling Up: Where the Next Bottlenecks Live
- The first bottleneck is usually DB writes: reads relax with replicas; writes either scale vertically or shard. When you see p99 write latency > 100ms, it's time.
- The second is cache hot keys: one popular key saturates a Redis shard. You need a local cache plus multiple replicas.
- The third is the network: cross-AZ traffic is billed per GB (~$0.01/GB in each direction on AWS); overly fine-grained microservices pay for it in transfer fees and RTT.
- Multi-region: when users span more than two continents, you introduce GeoDNS, keep each request's call chain inside a single region, and replicate across regions asynchronously.
Common Pitfalls
- Going distributed too early: 1k QPS on Kubernetes with eight microservices, and debugging takes 10× longer than the monolith would have.
- The sticky-session trap: binding instances by cookie hash means scale-up or restart drops many users, and a hot instance has no relief valve.
- LB with no health checks: a half-dead instance (accepts connections, doesn't respond) still gets traffic, and tail latency spikes. Always configure active health checks plus outlier ejection (in Envoy, outlier detection is configured separately from the load-balancing policy).
- No capacity drills: not load-testing before Black Friday, then learning your connection pool ceiling during the actual event. Netflix's Chaos Monkey culture exists for this reason.
- Ignoring the thundering herd: when a cache expires, 1000 instances hit the DB at once. Solve with singleflight, mutexes, or probabilistic early refresh.
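The singleflight idea from the last pitfall can be sketched in a few lines of threaded Python. This is an in-process illustration only (the name is borrowed from Go's singleflight package; `SingleFlight` and `_Call` here are hypothetical): it collapses concurrent loads of the same key inside one instance, so across 1000 instances you would still pair it with a distributed mutex or early refresh.

```python
import threading

class _Call:
    """Bookkeeping for one in-flight load."""
    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None

class SingleFlight:
    """Collapse concurrent calls for the same key into a single execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> _Call currently in flight

    def do(self, key, fn):
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call
        if leader:
            try:
                call.result = fn()          # only the leader hits the backend
            except Exception as exc:
                call.error = exc            # followers see the same failure
            finally:
                with self._lock:
                    del self._calls[key]
                call.done.set()             # wake everyone waiting on this key
        else:
            call.done.wait()                # followers block until the leader is done
        if call.error is not None:
            raise call.error
        return call.result
```

A cache-miss path would call `flights.do(cache_key, load_from_db)` so that a burst of misses on one hot key turns into one database query per instance.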
Sample Interview Questions
- "You're at 5k QPS p99 200ms; you need 50k QPS p99 100ms. What's your five-step plan?" (Expect to ask about read/write ratio and current bottleneck before proposing solutions.)
- "Why not just buy one 192-core box?" (Tests your grasp of vertical scaling ceilings, blast radius, and SPOF.)
- "Where do L4 and L7 LBs go, and why not just use L7 everywhere?"
- "If you move sessions to stateless, do you pick JWT or Redis sessions?" (Tests trade-offs around revocation, payload size, security.)
- "How does round robin fall apart with long-lived connections? How do you fix it?" (Leads into least-conn / P2C.)
Key Resources
- 📕 Designing Data-Intensive Applications (Kleppmann), Ch. 1: Reliable, Scalable, and Maintainable Applications.
- 📝 Stack Overflow Architecture (2016, Nick Craver) — the canonical case for vertical scaling to its limit.
nickcraver.com/blog/2016/02/17/stack-overflow-the-architecture-2016-edition/
- 📝 High Scalability blog: the classic "Power of Two Random Choices" article.
English Summary
Scaling means surviving load growth without rewriting from scratch. Vertical scaling (bigger box) is underrated — Stack Overflow runs billions of pageviews on a handful of fat SQL servers. Horizontal scaling demands stateless services, smart load balancing (L4 for throughput, L7 for routing; P2C beats round-robin under skewed load), and capacity planning via back-of-envelope math. The cardinal sin is premature distribution: complexity grows superlinearly, debuggability collapses. Scale when you must, not when you can.
Going Deeper
1. Service goes from 1k to 100k QPS, and adding machines makes p99 worse. List at least three possible causes at different layers.
- A shared bottleneck: the DB connection pool, a single Redis node, a file lock — adding app servers just lengthens the queue. Classic signs: DB CPU at 100%, connection pool exhausted.
- The LB itself: an L7 LB (Envoy/Nginx) tops out at roughly 50k QPS per core; TLS handshakes or bad keepalive settings can make the LB the bottleneck. Check LB CPU and file descriptors.
- Distributed coordination cost: distributed locks (Redlock), cross-node cache invalidation, service-discovery gossip. When every node talks to every other, coordination overhead can grow as O(N²).
- Downstream amplification: each app instance opens 20 DB connections; 100 instances = 2000 connections, but Postgres's default max_connections=100 starts rejecting.
- Cross-AZ networking: scaling out moves traffic from intra-AZ to cross-AZ, adding ~0.5ms per hop. Ten hops = 5ms of pure overhead.
Diagnostic order: start with the metrics that should scale linearly with machine count. If per-instance CPU drops as you add machines but latency still rises, a shared resource is saturating.
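The downstream-amplification arithmetic is worth doing before you scale out, not after. A small sketch, with hypothetical function names and a `reserved` parameter that assumes you hold back a few connections for admin access (3 mirrors Postgres's default `superuser_reserved_connections`):

```python
def total_connections(instances, pool_size):
    """Connections the app fleet will try to open against the database."""
    return instances * pool_size

def max_pool_size(max_connections, instances, reserved=3):
    """Largest per-instance pool that still fits under the server limit."""
    usable = max_connections - reserved
    return max(usable // instances, 0)

# The case from the list above: 100 instances x 20 connections each.
demand = total_connections(instances=100, pool_size=20)      # 2000
budget = max_pool_size(max_connections=100, instances=100)   # 0
```

A per-instance budget of zero means no pool size works at that fleet size; the usual answers are a connection pooler such as PgBouncer in front of Postgres, or fewer, larger app instances.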
2. Why is statelessness the ticket to horizontal scale? When the business needs sessions (game rooms, long transactions), how do you keep state and still scale?
Why: stateless means any request can land on any instance; the LB doesn't need memory, a crashed instance loses nothing, scaling is instant. Stateful means the LB needs sticky sessions, scaling requires state migration, and a failed node means dropped users.
Solutions for stateful services:
- Externalize the state: move sessions to Redis or DynamoDB; the service itself stays stateless. Most common pattern.
- Sticky routing + state sharding: consistent-hash route (user_id → node); failure has only local impact. Game rooms typically pin to one server. Cost: scale-up has to rehash a fraction of users.
- Actor model: Erlang, Akka, Orleans. Each actor manages its own state and the framework handles placement and migration. Discord uses Elixir, with one process per guild (server).
- Distributed state machine + Raft: small amounts of strongly consistent state on a Raft cluster (etcd, TiKV pattern) — multi-replica, leader election, automatic failover.
Field guide: 90% of businesses are fine with "Redis externalized"; gaming and IM use sticky routing; financial ledgers reach for Raft.
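The "scale-up has to rehash a fraction of users" cost of sticky routing is easiest to see on a small hash ring. A minimal sketch, assuming SHA-256 as the hash and 100 virtual nodes per server; `HashRing` and its methods are illustrative names, not any particular library's API.

```python
import bisect
import hashlib

def _hash(key):
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key):
        """Return the node owning the first ring position at or after the key."""
        if not self._ring:
            raise LookupError("ring has no nodes")
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring
```

Going from four nodes to five moves only the keys the new node takes over (about one fifth on average); with plain `hash(user_id) % N`, most keys would move.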
3. Do you really have to choose between L4 and L7? Why is "L4 outside, L7 inside" the canonical big-tech layout — and what breaks if you reverse it?
It's not a binary choice. Big tech runs the canonical "L4 → L7 → App" sandwich.
Why L4 is at the edge:
- L4 (LVS, AWS NLB, Google Maglev) sees only the TCP five-tuple, can sustain millions of QPS and hundreds of Gbps per box, and is your first wall against DDoS.
- L4 doesn't terminate TLS or parse HTTP — CPU is nearly free, perfect for an anycast ingress.
- L4 with DSR (Direct Server Return) lets responses bypass the LB entirely, so the LB carries only inbound traffic, which is usually far smaller than the responses.
Why L7 sits inside:
- L7 (Envoy, Nginx) has to terminate TLS, parse HTTP headers, and run routing/rate-limit logic, which costs about 50k QPS per core. It can't take the public flood on its own.
- L7 gives you rich features (path routing, retry, circuit breakers) — but only valuable on already-filtered traffic.
If you reverse it: L7 at the edge gets its connection table blown apart by the first SYN flood, and TLS handshake CPU pegs at 100%.
4. One c7g.metal (64 vCPU) vs sixteen c7g.xlarge (4 vCPU each): same total vCPUs at roughly the same cost. What workload fits each, and how do their blast radii compare?
One big box fits:
- Memory-heavy workloads that don't shard cleanly: in-memory databases (Redis, Memcached), OLTP databases with large buffer pools (Postgres, MySQL).
- Latency-critical services that need NUMA-local access: high-frequency trading, ad bidding.
- Workloads with high shared-cache hit rates: one large JVM heap beats 16 small ones for hot-key sharing.
Sixteen small boxes fit:
- Stateless, CPU-bound, easily parallel: rendering, transcoding, ML inference.
- HA requirements: a rolling deploy takes out 6.25% of capacity per node; restarting the big box takes out 100%.
Blast radius: the big box dies = 100% outage (unless you hot-spare, doubling cost). 1 of 16 dies = 6.25% capacity gone; can the remaining 15 absorb it? Plan ~10% headroom.
Bottom line: CPU-bound and stateless → many small. Memory/IO heavy with locality → one big (but at least two for HA).
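The headroom question has a one-line answer: if `failures` nodes die, each survivor has to absorb its share of their load. A sketch, with `required_headroom` as an illustrative name and evenly spread load as the assumption:

```python
def required_headroom(nodes, failures=1):
    """Fraction of spare capacity each node needs so survivors can absorb the loss."""
    survivors = nodes - failures
    if survivors <= 0:
        raise ValueError("no surviving nodes to absorb the load")
    return failures / survivors

small_fleet = required_headroom(16)   # 1/15, about 6.7% per node
big_pair = required_headroom(2)       # 1.0: each big box must carry everything
```

That is the blast-radius trade in numbers: sixteen small boxes need under 7% spare each to survive one loss, while an HA pair of big boxes needs 100% spare, which is the "hot spare doubles the cost" point above.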
5. Black Friday is expected at 5× normal traffic. How do you capacity-plan? How much headroom? Why can't you just "add 5× machines"?
Headroom is typically 30–50%: traffic isn't uniform; p95 peaks are 2–3× the mean, and flash-sale peaks can be 10× the mean. So "5× traffic" in practice means provisioning for roughly 6.5–7.5× normal capacity.
Why "just add machines" doesn't work:
- DBs don't scale linearly: 5× app means 5× DB QPS, but DBs scale vertically or via read replicas (with replication-lag risk).
- Cache hit rates change: new users and new products mean more cache misses, so DB pressure is worse than expected.
- Downstream dependencies: payments, fraud, third-party APIs each have their own rate limits and will start rejecting you.
- Cold starts: new instances have cold JIT, empty connection pools, empty caches — the first few minutes are slower, not faster. Use slow-start ramp.
- Cost runaway: keeping 5× provisioned daily = 5× spend. You need auto-scaling that shrinks the fleet promptly after the event.
Operational drill: 5× load test (offline + shadow traffic) a week ahead, degradation switches ready (turn off non-core features to free resources), CDN/cache pre-warming, scripted scale-up playbook ("scale N×" as one command), tightened monitoring thresholds for the event.
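The slow-start ramp mentioned under cold starts can be sketched as a weight that grows with instance age. This is a simple linear ramp under assumed parameters (60-second window, 10% floor); real load balancers expose their own knobs and curves, and `slow_start_weight` and `pick_instance` are hypothetical names.

```python
import random

def slow_start_weight(age_seconds, window_seconds=60.0, min_weight=0.1):
    """Traffic weight for an instance that has been up for age_seconds."""
    if age_seconds >= window_seconds:
        return 1.0                      # fully warmed: normal share of traffic
    ramp = max(age_seconds, 0.0) / window_seconds
    return max(min_weight, ramp)        # never starve a new instance completely

def pick_instance(instance_ages, rng=random):
    """Weighted random pick; instance_ages maps instance name -> seconds since start."""
    names = list(instance_ages)
    weights = [slow_start_weight(instance_ages[name]) for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

With one warm instance and one that just started, the new one receives about 9% of requests at first (0.1 out of a total weight of 1.1) and reaches an equal share after the window.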