Day 16 Hard Video Streaming HLS/DASH CDN Transcoding

Video Streaming: 200M DAU VOD, 2s Startup, <0.5% Rebuffer
Transcoding Pipeline, Adaptive Bitrate, CDN Design, Playback Optimization

Scenario + Requirements

Design the backend for a 200M DAU video-on-demand platform with tens of millions of concurrent plays (Netflix, YouTube, Disney+). Chat (Day 15) is connection amplification; Feed (Day 14) is read amplification; video streaming is the double whammy of bandwidth + compute amplification: one steady 4K stream is ~15–25 Mbps, so ten million concurrent is hundreds of Tbps of egress — larger than any backbone can carry, so you cannot serve it from a central data center. Meanwhile a 2-hour movie must be transcoded into dozens of bitrate/resolution/codec variants — a heavy CPU/GPU job. The hard part isn't "storing video," it's pushing the same content to every screen worldwide, using the least bandwidth, smoothly, under fluctuating networks.

Scale: 200M DAU, tens of millions concurrent at peak; catalog of millions of hours; each source explodes into dozens of variants → PB-scale storage.
Bandwidth: peak egress tens to hundreds of Tbps — the core constraint, forcing self-built/embedded CDN rather than central serving.
Experience SLO: startup time (TTFF) p90 < 2s; rebuffer ratio < 0.5%; play failure < 0.1% — buffering and slow startup are the #1 churn drivers.
Heterogeneity: phone / TV / Web / old devices; codecs (H.264/HEVC/AV1), resolutions, networks (5G down to 1 Mbps) — full spectrum.
Non-goal: VOD doesn't need ultra-low latency (a few seconds of buffer is an asset); only live needs to push end-to-end down to 2–5s (see Extensions).

High-Level Architecture

The key is to fully separate "upload+transcode" from "playback+delivery." The left is the offline/async transcoding pipeline: chop the source into chunks, encode the bitrate ladder in parallel, package as HLS/DASH, write to object storage — slow is fine, transcode once, play a billion times. The right is the online playback path: the client fetches the manifest, then pulls segments according to network conditions; 99% of bytes are served by CDN edge, with origin hit only on edge miss. The control plane (playback API, DRM licensing, recommendations) carries metadata only, never video bytes.

graph LR
  UP[Creator/Studio upload] --> ING[Ingest]
  ING --> PIPE[Transcode Pipeline<br/>chunk·parallel encode·QC·package]
  PIPE --> OBJ[(Object store S3<br/>HLS/DASH segments)]
  OBJ -->|prefill popular| EDGE[CDN Edge<br/>embedded in ISPs]
  subgraph Playback Path
    CLI[Player<br/>ABR engine] -->|1 get manifest| API[Playback API]
    CLI -->|2 pull segments| EDGE
    EDGE -.miss to origin.-> OBJ
    CLI -->|DRM license| DRM[License Server]
  end
  classDef off fill:#1a1a30,stroke:#ffb450,color:#e8eef5
  classDef on fill:#0e2030,stroke:#5eead4,color:#e8eef5
  classDef origin fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
  class ING,PIPE off
  class CLI,API,EDGE,DRM on
  class OBJ origin

Left: offline transcode (once). Right: online playback (a billion times). All bandwidth on the CDN edge; origin only backstops.

Key Technical Points

1. Transcoding Pipeline & Bitrate Ladder — fixed ladder saves CPU, content-adaptive saves bandwidth

Principle: each source must be transcoded into a set of (resolution, bitrate, codec) combos — the bitrate ladder — so the player can pick by network speed. The source is split into multi-second chunks at shot/GOP boundaries, fanned out to hundreds–thousands of machines for parallel encoding, then quality-checked (QC) and packaged. Key insight: compression difficulty varies wildly by content — animation/flat scenes are perfect at 1 Mbps, sports/explosions are still mushy at 8 Mbps. A one-size-fits-all fixed ladder wastes bandwidth on easy content and under-delivers quality on hard content.

Approach	How	Cost	Bandwidth gain
Fixed ladder	Same bitrates for all titles (e.g. fixed 235k→5800k rungs)	Simple, fast transcode	baseline
Per-Title	Pick optimal ladder per title's complexity (convex hull)	Needs complexity analysis	↓~20%+
Per-Shot	Per-shot resolution/QP, VMAF-optimal	Encode units ×10, compute explodes	↓~30%

Trade-off: bandwidth is a video platform's biggest cost, and transcode compute is one-time while bandwidth is forever, so "spend more CPU to permanently save bandwidth" is almost always worth it — exactly the per-title / per-shot logic. The cost is pipeline complexity: per-shot runs an analysis encode before the production encode, the number of encode units jumps an order of magnitude, and scheduling/retries/QC all get harder. Quality must be measured with VMAF (perceptual, ~0.9+ correlation with human eyes), not PSNR, or you "save bitrate but users see mush."
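A minimal sketch of the per-title idea, assuming trial encodes of one title have already been scored with VMAF. The `Encode` tuple, the `min_gain` threshold, and every number below are hypothetical; the frontier walk is a simplification of convex-hull selection, not Netflix's published algorithm.

```python
from typing import NamedTuple

class Encode(NamedTuple):
    height: int    # vertical resolution, e.g. 1080
    kbps: int      # target bitrate of the trial encode
    vmaf: float    # measured quality of the trial encode (0-100)

def per_title_ladder(trials, min_gain=2.0):
    """Keep only trial encodes on the quality/bitrate frontier.

    Walk trials from cheapest to most expensive and keep one only if it
    beats the best quality kept so far by at least `min_gain` VMAF points.
    """
    ladder = []
    best = float("-inf")
    for t in sorted(trials, key=lambda e: (e.kbps, -e.vmaf)):
        if t.vmaf >= best + min_gain:
            ladder.append(t)
            best = t.vmaf
    return ladder

# Hypothetical trial encodes for an "easy" title (e.g. flat animation)
trials = [
    Encode(480, 500, 62.0), Encode(720, 500, 55.0),
    Encode(480, 1500, 74.0), Encode(720, 1500, 80.0), Encode(1080, 1500, 71.0),
    Encode(720, 3000, 86.0), Encode(1080, 3000, 90.0),
    Encode(1080, 6000, 91.0),   # doubles the bitrate for +1 VMAF: not worth a rung
]
ladder = per_title_ladder(trials)
for rung in ladder:
    print(rung)
```

With these made-up scores the 6000 kbps rung is dropped because it buys almost no perceptual quality, which is the bandwidth a fixed ladder would have wasted on this title.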

# Parallel chunked transcode (pseudo-code)
chunks = split_by_shot(source)            # split at shot/GOP boundaries → independently decodable
ladder = per_title_ladder(source)         # complexity analysis → convex hull picks rungs
for c in chunks:                          # fan-out to hundreds of workers
    for rung in ladder:
        enqueue(encode_task(c, rung))     # (chunk, resolution, bitrate, codec)
# after all done, reduce: concat + QC (VMAF spot-check) + package HLS/DASH
assemble_and_package(wait_all())          # any failed chunk only needs that one re-run
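The fan-out/reduce pattern above can be sketched in runnable form, with a thread pool standing in for the worker fleet. `encode_task` is a hypothetical stub that simulates one worker crash rather than calling a real encoder, and the chunk names and rungs are made up; the point is only that a failed (chunk, rung) unit is retried on its own.

```python
from concurrent.futures import ThreadPoolExecutor

def encode_task(chunk, rung, attempt):
    """Hypothetical stand-in for an encoder call; crashes once for one unit."""
    if chunk == "shot-2" and rung == 3000 and attempt == 0:
        raise RuntimeError("simulated worker crash")
    return f"{chunk}@{rung}k"

def encode_with_retry(chunk, rung, max_attempts=3):
    """Retry only this (chunk, rung) unit; other units are unaffected."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return encode_task(chunk, rung, attempt)
        except RuntimeError as exc:
            last_error = exc
    raise RuntimeError(f"{chunk}@{rung}k failed after {max_attempts} attempts") from last_error

def transcode(chunks, ladder):
    # Fan-out: one independent unit of work per (chunk, rung) pair
    units = [(c, r) for c in chunks for r in ladder]
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(lambda u: encode_with_retry(*u), units))
    # Reduce: regroup by rung in chunk order, ready to concatenate and package
    return {r: [res for (_, rr), res in zip(units, results) if rr == r] for r in ladder}

renditions = transcode(["shot-1", "shot-2", "shot-3"], [500, 3000])
print(renditions)
```

A production pipeline would use a durable task queue rather than an in-process pool, so that a retry survives the loss of the scheduler itself.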

Real-world:

Netflix Per-Title / Dynamic Optimizer: per-title launched 2015, evolved to per-shot "Dynamic Optimizer"; published x264/VP9/x265 average savings of 28%/38%/34% at equal quality.
Netflix VMAF: in-house perceptual quality metric, open-sourced, now the de-facto industry standard for transcode quality.
YouTube: massive UGC — on upload, quickly transcode a low-res playable version, then backfill higher bitrates/AV1; only popular videos are worth re-encoding to AV1.
Netflix microservices pipeline: refactored the monolithic encoding system into microservices for elasticity and scale (TechBlog "Rebuilding Netflix Video Processing Pipeline").

2. Adaptive Bitrate (ABR) — throughput reacts fast but jitters, buffer-based is stable but starts slow

Principle: networks fluctuate constantly, so the player must decide per segment which bitrate to pull next, balancing "high quality" vs "no rebuffering" in real time. This is a client-side decision (the server only offers multiple segment rungs). Two schools: throughput-based uses recent download speeds to predict how high to go; buffer-based (e.g. BOLA) looks only at the current buffer level — more buffer → go up, less buffer → go down. Modern players are hybrid: at startup the buffer is empty, so probe with throughput; in steady state use buffer to resist jitter.

Trade-off:

Pure throughput: ✅ fast reaction, fast startup; ❌ bandwidth estimates jitter badly on shared networks, causing frequent rung switches (quality oscillates, poor experience).
Pure buffer-based (BOLA): ✅ stable, with theoretical optimality; ❌ at startup the buffer is at 0 with no signal, so it's conservative and slow to start.
HLS vs DASH: HLS (Apple ecosystem, TS/CMAF) has the best compatibility; DASH (codec-agnostic, open standard) is more flexible but needs workarounds on Apple devices. The trend is CMAF unifying packaging — one segment feeds both HLS and DASH, halving storage.
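To make the buffer-based school concrete, here is a minimal sketch of a linear buffer-to-rung map. It is written in the spirit of buffer-based ABR and is not BOLA's actual utility formulation; the `reservoir` and `cushion` thresholds and the middle rungs of `LADDER` are assumptions chosen for illustration.

```python
def buffer_based_rung(buf_sec, ladder, reservoir=5.0, cushion=25.0):
    """Pick a rung from buffer level alone (ladder sorted ascending, in kbps)."""
    if buf_sec <= reservoir:
        return ladder[0]            # nearly empty: protect against a stall
    if buf_sec >= cushion:
        return ladder[-1]           # plenty buffered: afford the top rung
    frac = (buf_sec - reservoir) / (cushion - reservoir)
    return ladder[int(frac * (len(ladder) - 1))]

LADDER = [235, 750, 1750, 3000, 5800]   # kbps; end points from the fixed-ladder example

for buf in (0, 10, 15, 30):
    print(buf, "s buffered ->", buffer_based_rung(buf, LADDER), "kbps")
```

Note that at startup (`buf_sec == 0`) the rule can only return the floor, which is the slow-start weakness described above and the reason hybrids probe with throughput first.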

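# Hybrid ABR rung selection (simplified, one decision per segment)
LOW, HIGH = 5.0, 20.0                        # buffer thresholds in seconds (illustrative values)

def pick_bitrate(buf_sec, recent_bw, ladder):
    """ladder: ascending bitrates; recent_bw: non-empty list of throughput samples, same unit."""
    est = recent_bw[0]
    for bw in recent_bw[1:]:
        est = 0.7 * est + 0.3 * bw           # EMA damps single-sample spikes (0.3 is illustrative)
    safe_bw = 0.85 * est                     # safety margin against jitter
    cap = max((r for r in ladder if r <= safe_bw), default=ladder[0])  # floor if nothing fits
    if buf_sec < LOW:    return ladder[0]    # buffer critical → drop to floor, mush over stall
    if buf_sec > HIGH:                       # buffer healthy → step up one rung above cap
        return ladder[min(ladder.index(cap) + 1, len(ladder) - 1)]
    return cap                               # steady state follows bandwidth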

Real-world:

dash.js / BOLA: BOLA (Lyapunov optimization) makes its decisions purely from buffer level; the reference player ships it alongside a throughput rule, and its default dynamic strategy switches between the two depending on buffer state.
Netflix: client ABR combines network prediction with the per-title ladder, aiming to minimize rebuffering while maximizing VMAF.
YouTube: default DASH, selecting rungs dynamically by device/network/data-saver settings, with a mobile data-saver cap.
Twitch (live): low-latency mode shortens segments and tightens buffer, trading some jitter resistance for 2–5s latency.

3. CDN Design — self-built is cheap but heavy investment, third-party is fast to launch but expensive and uncontrollable

Principle: 99%+ of video bytes must be served from the edge node nearest the user, with origin only as backstop. The hard parts are hit rate and fill strategy: when a hot new show launches, rather than wait for users to miss and hammer the origin, you push-prefill (warm) it to edges in advance. Cold long-tail content is pulled on demand. Netflix takes this to the extreme — placing cache servers (Open Connect Appliances) directly inside ISP facilities, so video never traverses the public backbone; ISP and Netflix both win (saves ISP transit fees, faster for users).

Trade-off:

Third-party CDN (Akamai/CloudFront/Fastly): ✅ instant, global coverage; ❌ per-GB pricing is brutally expensive at video volume, cache policy is uncontrollable, hot premieres miss easily.
Self-built + embedded ISP (Open Connect): ✅ marginal bandwidth cost → near zero, fully controllable, pre-fillable; ❌ huge upfront investment, must negotiate deployment with thousands of ISPs, only pays off at massive scale.
Push prefill vs Pull on origin miss: push popular content during off-peak nights (known to be hot, fill proactively); pull long-tail on demand (save storage). Hybrid is standard.
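The push-versus-pull distinction can be sketched with a toy LRU edge cache. `EdgeCache`, its method names, and the segment keys are hypothetical, and a real edge would key on full URLs and track bytes rather than object counts.

```python
from collections import OrderedDict

class EdgeCache:
    """Toy LRU edge node that supports both push prefill and pull-on-miss."""

    def __init__(self, capacity, origin):
        self.capacity = capacity          # max number of cached segments
        self.origin = origin              # dict standing in for the object store
        self.store = OrderedDict()
        self.origin_fetches = 0           # pulls caused by viewer-facing misses

    def _put(self, key, data):
        self.store[key] = data
        self.store.move_to_end(key)
        while len(self.store) > self.capacity:
            self.store.popitem(last=False)    # evict least recently used

    def prefill(self, keys):
        """Push: copy known-hot segments off-peak, before any viewer asks."""
        for key in keys:
            self._put(key, self.origin[key])

    def get(self, key):
        """Pull: serve from cache, falling back to origin on a miss."""
        if key in self.store:
            self.store.move_to_end(key)
            return self.store[key]
        self.origin_fetches += 1
        data = self.origin[key]
        self._put(key, data)
        return data

origin = {f"premiere/seg{i}": b"x" for i in range(3)}
origin["longtail/seg0"] = b"y"
edge = EdgeCache(capacity=10, origin=origin)

edge.prefill(["premiere/seg0", "premiere/seg1", "premiere/seg2"])
for _ in range(1000):
    edge.get("premiere/seg0")             # premiere traffic never reaches origin
premiere_fetches = edge.origin_fetches

edge.get("longtail/seg0")                 # first long-tail play pulls once
edge.get("longtail/seg0")                 # second play is an edge hit
print(premiere_fetches, edge.origin_fetches)
```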

Real-world:

Netflix Open Connect: self-built CDN, OCAs deployed into thousands of ISPs and IXPs worldwide; tiered into large Storage Appliances (near-full catalog) + ISP-embedded Edge (regional hits), prefilled nightly.
YouTube (Google Global Cache): similarly places cache nodes inside ISPs, reusing Google's backbone.
Disney+ / most providers: multi-CDN (several at once), scheduling/switching between CDNs by real-time quality/cost.
Twitch: self-built PoPs + relays; live demands more on edge ingest and transcode proximity.

4. Playback & Startup Optimization — fuller manifest means slower start, shorter segments switch better but cost more

Principle: startup time (TTFF) = DNS/TLS + fetch manifest + get DRM license + download and decode the first segment. To get under 2s, squeeze every step: open the CDN connection before the user clicks (pre-warming), slim the manifest, start the first segment at a low bitrate (show something, then quietly step up), and front-load keyframes. Segment duration is the core knob: short (2s) switches responsively and starts fast but means more requests and lower coding efficiency (dense keyframes); long (6–10s) is efficient but makes ABR sluggish and startup slow. VOD usually settles on 4–6s.

Trade-off:

Short segment (2s): ✅ fast startup, responsive ABR, low live latency; ❌ HTTP requests multiply, more keyframes lower coding efficiency.
Long segment (10s): ✅ high coding efficiency, fewer requests; ❌ slow recovery after a stall, sluggish ABR switching.
Conservative vs aggressive startup: lowest rung first → fastest start but mushy first frame; rung by estimated bandwidth → crisp first frame but occasionally slow start. Most pick "low then step up."
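The TTFF sum can be turned into a quick budget check. This is a rough sketch: every latency figure, the 5 Mbps link, and the `decode` default are assumptions for illustration, and the steps are modelled as strictly sequential even though real players overlap some of them.

```python
def ttff_seconds(connect, manifest, drm, first_kbps, seg_sec, bw_kbps, decode=0.05):
    """Startup time as the sum of sequential steps (all inputs in seconds or kbps)."""
    download = first_kbps * seg_sec / bw_kbps    # time to fetch the first segment
    return connect + manifest + drm + download + decode

BW = 5000  # kbps, hypothetical viewer link

# Top rung first, long segment: crisp first frame, far over the 2s budget
aggressive = ttff_seconds(0.3, 0.2, 0.3, first_kbps=5800, seg_sec=6, bw_kbps=BW)

# Lowest rung first, 4s segment: the "low then step up" strategy
conservative = ttff_seconds(0.3, 0.2, 0.3, first_kbps=235, seg_sec=4, bw_kbps=BW)

# Same, with the connection already pre-warmed at app launch
prewarmed = ttff_seconds(0.0, 0.2, 0.3, first_kbps=235, seg_sec=4, bw_kbps=BW)

print(f"{aggressive:.2f}s  {conservative:.2f}s  {prewarmed:.2f}s")
```

Under these assumed numbers the first-segment download dominates the aggressive case, while in the conservative case the fixed round trips dominate, which is why pre-connecting and slimming the manifest matter once the first rung is low.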

Real-world:

Netflix: app pre-connects to CDN and prefetches the manifest at launch, so the first segment is in flight the instant you click; low-bitrate first segment for instant start, then step up.
YouTube: pre-connect + first-segment priority, mobile startup often <1s; predictively prefetches the next likely-clicked video.
Apple LL-HLS: uses partial segments + preload hints to push live latency to ~2s while keeping HLS ABR.

Extensions & Optimization (as you grow)

Live: transcode must be real-time (encode slower than playback = collapse); adopt LL-HLS / low-latency DASH with chunked CMAF + chunked transfer to emit sub-segment chunks as they encode, end-to-end 2–5s.
New-codec dividend: each step in H.264→HEVC→AV1 saves ~30–50% bitrate, but encoding is costly and old devices can't decode the newer codecs, so re-encode only frequently played popular content to AV1 and leave the long tail on H.264.
Multi-CDN scheduling: route per request by each CDN's rebuffer/cost/availability in real time; auto-failover on single-CDN outage.
QoE loop: clients continuously report startup/rebuffer/switch events → real-time dashboards → feed back into ABR tuning and CDN scheduling, locating issues in minutes.
Storage tiering: keep popular variants on SSD/edge, sink long-tail to cheap object storage or even "transcode on demand" (generate rare variants only on first play).
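As one possible shape for the multi-CDN scheduling item above, the sketch below picks the cheapest healthy CDN that meets the rebuffer SLO and falls back to the least-bad one otherwise. `CdnStats`, `pick_cdn`, the CDN names, and the prices are hypothetical; only the 0.5% rebuffer target comes from the requirements.

```python
from dataclasses import dataclass

@dataclass
class CdnStats:
    name: str
    rebuffer_ratio: float   # fraction of play time stalled, from client QoE reports
    cost_per_gb: float      # USD, hypothetical
    healthy: bool           # result of availability probing

def pick_cdn(cdns, rebuffer_slo=0.005):
    """Cheapest healthy CDN within the rebuffer SLO, else the least-bad healthy one."""
    candidates = [c for c in cdns if c.healthy]
    if not candidates:
        raise RuntimeError("no healthy CDN available")
    within_slo = [c for c in candidates if c.rebuffer_ratio < rebuffer_slo]
    if within_slo:
        return min(within_slo, key=lambda c: c.cost_per_gb)
    return min(candidates, key=lambda c: c.rebuffer_ratio)

cdns = [
    CdnStats("cdn-a", 0.002, 0.020, True),
    CdnStats("cdn-b", 0.004, 0.010, True),
    CdnStats("cdn-c", 0.001, 0.005, False),   # cheapest, but currently down
]
first_choice = pick_cdn(cdns).name

cdns[1].rebuffer_ratio = 0.02                 # cdn-b degrades past the SLO
after_degradation = pick_cdn(cdns).name
print(first_choice, "->", after_degradation)
```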

Pitfalls + Interview Questions

1. Serving streams from a central data center. Ten million concurrent × a few Mbps = tens of Tbps (hundreds of Tbps at 4K bitrates), and no central egress can carry it; CDN/edge is not an optimization, it's a prerequisite. Not mentioning CDN is an instant fail.

2. Putting ABR on the server. Only the client knows its own buffer level and instantaneous bandwidth in real time; ABR must live in the player; the server only offers multiple segment rungs.

3. One-size-fits-all fixed ladder. Wastes bandwidth on easy content, under-delivers on hard. A senior answer names per-title/per-shot + VMAF.

4. Ignoring startup and rebuffer metrics. Users churn on "won't open / keeps spinning," not "not crisp enough." Anchor SLOs to TTFF and rebuffer ratio, not average bitrate.

Common follow-ups: ① Estimate the egress bandwidth for ten million concurrent, and the order-of-magnitude transcode/storage cost of one movie. ② How does a video go from upload to playable worldwide? Draw the pipeline. ③ Failure modes of throughput-based vs buffer-based ABR — how do you hybridize? ④ How do you avoid an origin storm on a hot premiere? ⑤ VOD vs live architectural differences (real-time transcode, latency, buffering strategy)?

Deep-Dive Resources

Netflix TechBlog — Per-Title Encode Optimization: the seminal work on content-adaptive ladders and the convex hull.
Netflix TechBlog — Dynamic Optimizer (per-shot): per-shot encoding optimization framework.
Netflix TechBlog — VMAF perceptual quality metric: why VMAF over PSNR.
Netflix TechBlog — Rebuilding Video Processing Pipeline with Microservices: engineering the transcode pipeline.
Netflix Open Connect + Apple HTTP Live Streaming: self-built CDN and HLS/LL-HLS official docs.

Deep Thinking

1. Estimate: 10M concurrent, 4 Mbps average — how much egress? Why does that number alone rule out "serve from a central data center"?

10M × 4 Mbps = 4×10⁷ Mbps = 40 Tbps (4K-heavy peaks are higher, up to hundreds of Tbps). For comparison: a single mega data center's total egress is typically a few to ~teens of Tbps; a major submarine cable trunk is also Tbps-class. 40 Tbps means you must spread traffic across hundreds–thousands of edge points worldwide, each carrying tens–hundreds of Gbps; no single center can aggregate that egress. That's why CDN is an architectural prerequisite, not an optimization. And at third-party CDN per-GB pricing this bandwidth bill would eat all profit, which is the economic motive behind Netflix building Open Connect. When costing it, remember: transcode is one-time CPU, bandwidth is a forever monthly bill, so trade compute for permanent bandwidth.
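The estimate is simple enough to check in a few lines. The 1,000-site edge count is an assumed round number inside the "hundreds–thousands" range quoted above.

```python
def egress_tbps(concurrent_streams, avg_mbps):
    """Aggregate egress in Tbps (1 Tbps = 1,000,000 Mbps)."""
    return concurrent_streams * avg_mbps / 1e6

mixed = egress_tbps(10_000_000, 4)        # average mixed-quality stream
uhd_low = egress_tbps(10_000_000, 15)     # everyone on 4K, low end of 15-25 Mbps
uhd_high = egress_tbps(10_000_000, 25)    # everyone on 4K, high end

EDGE_SITES = 1000                          # assumption for illustration
per_site_gbps = mixed * 1000 / EDGE_SITES  # Tbps -> Gbps, spread evenly

print(f"{mixed} Tbps mixed, {uhd_low}-{uhd_high} Tbps all-4K, {per_site_gbps} Gbps per edge site")
```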

2. Per-shot encoding raises encode units by an order of magnitude, exploding pipeline complexity. Why is this worth it for Netflix but maybe not for a startup?

It's about the amortization ratio of transcode cost vs bandwidth savings. A Netflix hit is played hundreds of millions of times; per-shot saving 30% bitrate = permanently saving 30% of an enormous bandwidth bill, while the one-time extra encode compute is amortized to negligible across those plays. A startup differs on two fronts: ① low play counts: the absolute bandwidth saved is small and can't amortize the extra encode cost; ② engineering complexity: per-shot needs analysis encodes, convex-hull maintenance, and scheduling/retry/QC of a huge number of encode units, which is a dedicated team's job. So the rational order is: fixed ladder to ship → per-title once volume grows (the ROI inflection) → per-shot only for top content. Doing it the other way round is classic premature optimization: burn money on per-shot before product-market fit and the bandwidth saved won't even cover the engineers' salaries.
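The amortization argument reduces to a break-even play count. Every input below (extra encode cost, GB per play, price per GB) is a made-up placeholder rather than a real figure; only the 30% saving echoes the per-shot number used above.

```python
def break_even_plays(extra_encode_usd, gb_per_play, usd_per_gb, saving):
    """Plays needed before bandwidth savings repay the extra encode compute."""
    saved_per_play = gb_per_play * usd_per_gb * saving
    return extra_encode_usd / saved_per_play

# Hypothetical title: $500 extra compute, 3 GB per play, $0.01/GB, 30% saving
plays = break_even_plays(extra_encode_usd=500, gb_per_play=3, usd_per_gb=0.01, saving=0.30)
print(f"break-even after about {round(plays):,} plays")
```

Under these placeholder numbers a title needs tens of thousands of plays just to repay the compute, before counting any engineering cost, so the same encode that is trivially profitable for a global hit never pays back on a long-tail title.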

3. A user on the subway drops from 50 Mbps to 1 Mbps, then recovers. How do pure-throughput vs pure-buffer ABR behave? Why is hybrid better?

Pure throughput: at the moment of the drop it still picks a high rung because the last segment downloaded at 50 Mbps; that segment can't finish → buffer drains → rebuffer. On recovery it steps up aggressively, hits jitter, and rebuffers again, so quality oscillates. Pure buffer-based (BOLA): it only watches buffer level, so on the drop it reacts as the buffer falls and steps down smoothly, which gives better jitter resistance; the cost is a lagging reaction, and at startup the buffer is at 0 with no history, so it's conservative and slow to start. Hybrid takes both strengths: at startup, when the buffer is empty, use throughput to probe fast and get the first frame up; in steady state, switch to buffer-based for jitter resistance; and keep a safety margin on bandwidth estimates (e.g. ×0.85) so you don't ride the edge and rebuffer. The essence: "different phases carry different amounts of information, so use the most reliable signal for the moment."

4. A hugely anticipated show launches Friday 8pm; millions click instantly. What happens with pure pull-on-demand? How do you prevent it?

Pure pull: the new show's segments aren't on any edge node yet, so millions of requests all miss → all go to origin → origin and the origin links are instantly hammered (this is the cache version of a thundering herd / avalanche, see Day 2). Result: startup timeouts, mass play failures — right when you most need stability. Prevention: ① Push prefill — known to be hot, prefill all bitrate variants to global edges during off-peak nights before launch (Netflix Open Connect's core play); ② add single-flight on origin, letting only one origin request per segment through while others wait; ③ tier the edge so a regional mid-tier absorbs a wave before origin, converging the origin fan-out; ④ stagger the launch with phased rollout or a countdown to spread the click peak. The core idea matches cache warming: predictable hotspots should be filled proactively, don't wait for the miss.
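Prevention ② (single-flight on origin) can be sketched as follows. The `SingleFlight` class and `fetch_from_origin` stub are hypothetical, and the sketch only deduplicates requests that are in flight at the same time; a real edge would also store the result in its cache so that later arrivals hit locally.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor

class SingleFlight:
    """Collapse concurrent requests for the same key into one upstream fetch."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._lock = threading.Lock()
        self._inflight = {}                 # key -> Future shared by all waiters

    def get(self, key):
        with self._lock:
            fut = self._inflight.get(key)
            leader = fut is None
            if leader:                      # first caller becomes the leader
                fut = Future()
                self._inflight[key] = fut
        if leader:
            try:
                fut.set_result(self._fetch(key))
            except Exception as exc:        # propagate the failure to every waiter
                fut.set_exception(exc)
            finally:
                with self._lock:
                    del self._inflight[key]
        return fut.result()                 # followers block here until the leader is done

origin_calls = 0

def fetch_from_origin(key):
    """Stub origin: counts calls and simulates a slow fetch."""
    global origin_calls
    origin_calls += 1
    time.sleep(0.2)
    return f"bytes of {key}"

N = 20
barrier = threading.Barrier(N)              # release all viewers at the same instant
sf = SingleFlight(fetch_from_origin)

def viewer(_):
    barrier.wait()
    return sf.get("premiere/seg0")

with ThreadPoolExecutor(max_workers=N) as pool:
    results = list(pool.map(viewer, range(N)))

print(origin_calls, "origin fetch for", len(results), "simultaneous viewers")
```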

5. VOD wants "deeper buffer = smoother," live wants "low latency" — directly opposed buffering strategies. How does LL-HLS lower latency without giving up HLS's ABR and CDN compatibility?

The conflict: buffer is the ammunition against jitter, but deeper buffer means further behind the live edge. Traditional HLS must wait for a full segment (e.g. 6s) to be generated and written before it can be pulled — that step alone accumulates a segment's worth of latency. LL-HLS's trick: split a segment further into partial segments (hundreds of ms); the encoder emits as it encodes, the CDN forwards via chunked transfer as it receives, and the player gets the latest small chunk without waiting for the whole segment — latency drops from "whole segment" to "chunk" level. It also uses preload hints / blocking playlist reload so the player precisely prefetches the next chunk and avoids polling churn. The key elegance: it's still an HTTP + HLS framework — reusing existing CDN caching and the multi-rung ABR mechanism, unlike WebRTC which spins up a separate, uncacheable real-time channel. The cost is more frequent requests, a CDN requirement for chunked transfer support, and a thinner buffer with less jitter margin — fundamentally it moves to a new point on the "latency vs smoothness" trade-off rather than eliminating the conflict.
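A back-of-the-envelope sketch of why smaller units lower latency. It assumes the player stays roughly three media units behind the live edge and that encode plus CDN delivery adds a fixed one second; both are illustrative assumptions, as is the 0.33s part duration chosen from the "hundreds of ms" range above.

```python
def live_latency_sec(unit_sec, units_held_back=3, pipeline_sec=1.0):
    """Rough glass-to-glass latency: held-back media plus encode/delivery overhead."""
    return unit_sec * units_held_back + pipeline_sec

classic_hls = live_latency_sec(unit_sec=6.0)    # player waits on whole 6s segments
ll_hls = live_latency_sec(unit_sec=0.33)        # player waits on partial segments

print(f"classic HLS ~{classic_hls:.0f}s, LL-HLS ~{ll_hls:.1f}s")
```

The same three-unit cushion shrinks from about 18s of media to about 1s, which is the thinner jitter margin the paragraph above describes as the price of low latency.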
