Design the backend for a 200M DAU video-on-demand platform with tens of millions of concurrent plays (Netflix, YouTube, Disney+). Chat (Day 15) is connection amplification; Feed (Day 14) is read amplification; video streaming is the double whammy of bandwidth + compute amplification: one steady 4K stream is ~15–25 Mbps, so ten million concurrent is hundreds of Tbps of egress — larger than any backbone can carry, so you cannot serve it from a central data center. Meanwhile a 2-hour movie must be transcoded into dozens of bitrate/resolution/codec variants — a heavy CPU/GPU job. The hard part isn't "storing video," it's pushing the same content to every screen worldwide, using the least bandwidth, smoothly, under fluctuating networks.
The key is to fully separate "upload+transcode" from "playback+delivery." The left is the offline/async transcoding pipeline: chop the source into chunks, encode the bitrate ladder in parallel, package as HLS/DASH, write to object storage — slow is fine, transcode once, play a billion times. The right is the online playback path: the client fetches the manifest, then pulls segments according to network conditions; 99% of bytes are served by CDN edge, with origin hit only on edge miss. The control plane (playback API, DRM licensing, recommendations) carries metadata only, never video bytes.
Left: offline transcode (once). Right: online playback (a billion times). All bandwidth on the CDN edge; origin only backstops.
Principle: each source must be transcoded into a set of (resolution, bitrate, codec) combos — the bitrate ladder — so the player can pick by network speed. The source is split into multi-second chunks at shot/GOP boundaries, fanned out to hundreds–thousands of machines for parallel encoding, then quality-checked (QC) and packaged. Key insight: compression difficulty varies wildly by content — animation/flat scenes are perfect at 1 Mbps, sports/explosions are still mushy at 8 Mbps. A one-size-fits-all fixed ladder wastes bandwidth on easy content and under-delivers quality on hard content.
| Approach | How | Cost | Bandwidth gain |
|---|---|---|---|
| Fixed ladder | Same bitrates for all titles (e.g. fixed 235k→5800k rungs) | Simple, fast transcode | baseline |
| Per-Title | Pick optimal ladder per title's complexity (convex hull) | Needs complexity analysis | ↓~20%+ |
| Per-Shot | Per-shot resolution/QP, VMAF-optimal | Encode units ×10, compute explodes | ↓~30% |
# Parallel chunked transcode (pseudo-code)
chunks = split_by_shot(source) # split at shot/GOP boundaries → independently decodable
ladder = per_title_ladder(source) # complexity analysis → convex hull picks rungs
for c in chunks: # fan-out to hundreds of workers
for rung in ladder:
enqueue(encode_task(c, rung)) # (chunk, resolution, bitrate, codec)
# after all done, reduce: concat + QC (VMAF spot-check) + package HLS/DASH
assemble_and_package(wait_all()) # any failed chunk only needs that one re-run
Principle: networks fluctuate constantly, so the player must decide per segment which bitrate to pull next, balancing "high quality" vs "no rebuffering" in real time. This is a client-side decision (the server only offers multiple segment rungs). Two schools: throughput-based uses recent download speeds to predict how high to go; buffer-based (e.g. BOLA) looks only at the current buffer level — more buffer → go up, less buffer → go down. Modern players are hybrid: at startup the buffer is empty, so probe with throughput; in steady state use buffer to resist jitter.
# Hybrid ABR rung selection (simplified, one decision per segment)
def pick_bitrate(buf_sec, recent_bw, ladder):
safe_bw = 0.85 * ema(recent_bw) # safety margin against jitter
cap = max(r for r in ladder if r <= safe_bw)
if buf_sec < LOW: return min(ladder) # buffer critical → drop to floor, mush over stall
if buf_sec > HIGH: return up_one(cap) # buffer healthy → step up boldly
return cap # steady state follows bandwidth
Principle: 99%+ of video bytes must be served from the edge node nearest the user, with origin only as backstop. The hard parts are hit rate and fill strategy: when a hot new show launches, rather than wait for users to miss and hammer the origin, you push-prefill (warm) it to edges in advance. Cold long-tail content is pulled on demand. Netflix takes this to the extreme — placing cache servers (Open Connect Appliances) directly inside ISP facilities, so video never traverses the public backbone; ISP and Netflix both win (saves ISP transit fees, faster for users).
Principle: startup time (TTFF) = DNS/TLS + fetch manifest + get DRM license + download and decode the first segment. To get under 2s, squeeze every step: pre-warm connections (warm CDN connection), slim the manifest, start the first segment at a low bitrate (show something, then quietly step up), and front-load keyframes. Segment duration is the core knob: short (2s) switches responsively and starts fast but means more requests and lower coding efficiency (dense keyframes); long (6–10s) is efficient but makes ABR sluggish and startup slow. VOD usually settles on 4–6s.
Common follow-ups: ① Estimate the egress bandwidth for ten million concurrent, and the order-of-magnitude transcode/storage cost of one movie. ② How does a video go from upload to playable worldwide? Draw the pipeline. ③ Failure modes of throughput-based vs buffer-based ABR — how do you hybridize? ④ How do you avoid an origin storm on a hot premiere? ⑤ VOD vs live architectural differences (real-time transcode, latency, buffering strategy)?
10M × 4 Mbps = 4×10⁷ Mbps = 40 Tbps (peak 4K is more, up to hundreds of Tbps). For comparison: a single mega data center's total egress is typically a few to ~teens of Tbps; a major submarine cable trunk is also Tbps-class. 40 Tbps means you must spread traffic across hundreds–thousands of edge points worldwide, each carrying tens–hundreds of Gbps — no single center can aggregate that egress. That's why CDN is an architectural prerequisite, not an optimization — and at third-party CDN per-GB pricing, this bandwidth bill would eat all profit, the economic motive behind Netflix building Open Connect. When costing it, remember: transcode is one-time CPU, bandwidth is a forever monthly bill, so trade compute for permanent bandwidth.
It's about the amortization ratio of transcode cost vs bandwidth savings. A Netflix hit is played hundreds of millions of times; per-shot saving 30% bitrate = permanently saving 30% of enormous bandwidth, while the one-time extra encode compute is amortized to negligible across a billion plays. A startup differs on two fronts: ① low play counts — the absolute bandwidth saved is small and can't amortize the extra encode cost; ② engineering complexity — per-shot needs analysis encodes, convex-hull maintenance, and scheduling/retry/QC of a huge number of encode units, a dedicated team's job. So the rational order is: fixed ladder to ship → per-title once volume grows (the ROI inflection) → per-shot only for top content. This is a classic premature optimization anti-example: burning money on per-shot before PMF, the bandwidth saved won't even cover the engineers' salaries.
Pure throughput: at the moment of the drop it still picks a high rung based on "last segment was 50M," that segment can't finish → buffer drains → rebuffer; on recovery it aggressively steps up, hits jitter, rebuffers again, quality oscillates. Pure buffer-based (BOLA): it only watches buffer level, so on the drop it reacts as the buffer falls and steps down smoothly — better jitter resistance; but the cost is lagging reaction, and at startup the buffer is at 0 with no history, so it's conservative and slow to start. Hybrid takes both strengths: at startup, buffer is empty, use throughput to probe fast and get the first frame up; in steady state, switch to buffer-based for jitter resistance; and keep a safety margin on bandwidth estimates (e.g. ×0.85) so you don't ride the edge and rebuffer. The essence: "different phases carry different amounts of information — use the most reliable signal for the moment."
Pure pull: the new show's segments aren't on any edge node yet, so millions of requests all miss → all go to origin → origin and the origin links are instantly hammered (this is the cache version of a thundering herd / avalanche, see Day 2). Result: startup timeouts, mass play failures — right when you most need stability. Prevention: ① Push prefill — known to be hot, prefill all bitrate variants to global edges during off-peak nights before launch (Netflix Open Connect's core play); ② add single-flight on origin, letting only one origin request per segment through while others wait; ③ tier the edge so a regional mid-tier absorbs a wave before origin, converging the origin fan-out; ④ stagger the launch with phased rollout or a countdown to spread the click peak. The core idea matches cache warming: predictable hotspots should be filled proactively, don't wait for the miss.
The conflict: buffer is the ammunition against jitter, but deeper buffer means further behind the live edge. Traditional HLS must wait for a full segment (e.g. 6s) to be generated and written before it can be pulled — that step alone accumulates a segment's worth of latency. LL-HLS's trick: split a segment further into partial segments (hundreds of ms); the encoder emits as it encodes, the CDN forwards via chunked transfer as it receives, and the player gets the latest small chunk without waiting for the whole segment — latency drops from "whole segment" to "chunk" level. It also uses preload hints / blocking playlist reload so the player precisely prefetches the next chunk and avoids polling churn. The key elegance: it's still an HTTP + HLS framework — reusing existing CDN caching and the multi-rung ABR mechanism, unlike WebRTC which spins up a separate, uncacheable real-time channel. The cost is more frequent requests, a CDN requirement for chunked transfer support, and a thinner buffer with less jitter margin — fundamentally it moves to a new point on the "latency vs smoothness" trade-off rather than eliminating the conflict.