Day 16 Hard Video Streaming HLS/DASH CDN Transcoding

Video Streaming — 200M DAU VOD, 2s Startup, <0.5% RebufferTranscoding Pipeline, Adaptive Bitrate, CDN Design, Playback Optimization

Scenario + Requirements

Design the backend for a 200M DAU video-on-demand platform with tens of millions of concurrent plays (Netflix, YouTube, Disney+). Chat (Day 15) is connection amplification; Feed (Day 14) is read amplification; video streaming is the double whammy of bandwidth + compute amplification: one steady 4K stream is ~15–25 Mbps, so ten million concurrent is hundreds of Tbps of egress — larger than any backbone can carry, so you cannot serve it from a central data center. Meanwhile a 2-hour movie must be transcoded into dozens of bitrate/resolution/codec variants — a heavy CPU/GPU job. The hard part isn't "storing video," it's pushing the same content to every screen worldwide, using the least bandwidth, smoothly, under fluctuating networks.

High-Level Architecture

The key is to fully separate "upload+transcode" from "playback+delivery." The left is the offline/async transcoding pipeline: chop the source into chunks, encode the bitrate ladder in parallel, package as HLS/DASH, write to object storage — slow is fine, transcode once, play a billion times. The right is the online playback path: the client fetches the manifest, then pulls segments according to network conditions; 99% of bytes are served by CDN edge, with origin hit only on edge miss. The control plane (playback API, DRM licensing, recommendations) carries metadata only, never video bytes.

graph LR UP[Creator/Studio upload] --> ING[Ingest] ING --> PIPE[Transcode Pipeline
chunk·parallel encode·QC·package] PIPE --> OBJ[(Object store S3
HLS/DASH segments)] OBJ -->|prefill popular| EDGE[CDN Edge
embedded in ISPs] subgraph Playback Path CLI[Player
ABR engine] -->|1 get manifest| API[Playback API] CLI -->|2 pull segments| EDGE EDGE -.miss to origin.-> OBJ CLI -->|DRM license| DRM[License Server] end classDef off fill:#1a1a30,stroke:#ffb450,color:#e8eef5 classDef on fill:#0e2030,stroke:#5eead4,color:#e8eef5 classDef origin fill:#2a1530,stroke:#ff7ab6,color:#e8eef5 class ING,PIPE off class CLI,API,EDGE,DRM on class OBJ origin

Left: offline transcode (once). Right: online playback (a billion times). All bandwidth on the CDN edge; origin only backstops.

Key Technical Points

1. Transcoding Pipeline & Bitrate Ladder — fixed ladder saves CPU, content-adaptive saves bandwidth

Principle: each source must be transcoded into a set of (resolution, bitrate, codec) combos — the bitrate ladder — so the player can pick by network speed. The source is split into multi-second chunks at shot/GOP boundaries, fanned out to hundreds–thousands of machines for parallel encoding, then quality-checked (QC) and packaged. Key insight: compression difficulty varies wildly by content — animation/flat scenes are perfect at 1 Mbps, sports/explosions are still mushy at 8 Mbps. A one-size-fits-all fixed ladder wastes bandwidth on easy content and under-delivers quality on hard content.

ApproachHowCostBandwidth gain
Fixed ladderSame bitrates for all titles (e.g. fixed 235k→5800k rungs)Simple, fast transcodebaseline
Per-TitlePick optimal ladder per title's complexity (convex hull)Needs complexity analysis↓~20%+
Per-ShotPer-shot resolution/QP, VMAF-optimalEncode units ×10, compute explodes↓~30%
Trade-off: bandwidth is a video platform's biggest cost, and transcode compute is one-time while bandwidth is forever, so "spend more CPU to permanently save bandwidth" is almost always worth it — exactly the per-title / per-shot logic. The cost is pipeline complexity: per-shot runs an analysis encode before the production encode, the number of encode units jumps an order of magnitude, and scheduling/retries/QC all get harder. Quality must be measured with VMAF (perceptual, ~0.9+ correlation with human eyes), not PSNR, or you "save bitrate but users see mush."

# Parallel chunked transcode (pseudo-code)
chunks = split_by_shot(source)            # split at shot/GOP boundaries → independently decodable
ladder = per_title_ladder(source)         # complexity analysis → convex hull picks rungs
for c in chunks:                          # fan-out to hundreds of workers
    for rung in ladder:
        enqueue(encode_task(c, rung))     # (chunk, resolution, bitrate, codec)
# after all done, reduce: concat + QC (VMAF spot-check) + package HLS/DASH
assemble_and_package(wait_all())          # any failed chunk only needs that one re-run
Real-world:

2. Adaptive Bitrate (ABR) — throughput reacts fast but jitters, buffer-based is stable but starts slow

Principle: networks fluctuate constantly, so the player must decide per segment which bitrate to pull next, balancing "high quality" vs "no rebuffering" in real time. This is a client-side decision (the server only offers multiple segment rungs). Two schools: throughput-based uses recent download speeds to predict how high to go; buffer-based (e.g. BOLA) looks only at the current buffer level — more buffer → go up, less buffer → go down. Modern players are hybrid: at startup the buffer is empty, so probe with throughput; in steady state use buffer to resist jitter.

Trade-off:
# Hybrid ABR rung selection (simplified, one decision per segment)
def pick_bitrate(buf_sec, recent_bw, ladder):
    safe_bw = 0.85 * ema(recent_bw)          # safety margin against jitter
    cap = max(r for r in ladder if r <= safe_bw)
    if buf_sec < LOW:    return min(ladder)   # buffer critical → drop to floor, mush over stall
    if buf_sec > HIGH:   return up_one(cap)   # buffer healthy → step up boldly
    return cap                                # steady state follows bandwidth
Real-world:

3. CDN Design — self-built is cheap but heavy investment, third-party is fast to launch but expensive and uncontrollable

Principle: 99%+ of video bytes must be served from the edge node nearest the user, with origin only as backstop. The hard parts are hit rate and fill strategy: when a hot new show launches, rather than wait for users to miss and hammer the origin, you push-prefill (warm) it to edges in advance. Cold long-tail content is pulled on demand. Netflix takes this to the extreme — placing cache servers (Open Connect Appliances) directly inside ISP facilities, so video never traverses the public backbone; ISP and Netflix both win (saves ISP transit fees, faster for users).

Trade-off:
Real-world:

4. Playback & Startup Optimization — fuller manifest means slower start, shorter segments switch better but cost more

Principle: startup time (TTFF) = DNS/TLS + fetch manifest + get DRM license + download and decode the first segment. To get under 2s, squeeze every step: pre-warm connections (warm CDN connection), slim the manifest, start the first segment at a low bitrate (show something, then quietly step up), and front-load keyframes. Segment duration is the core knob: short (2s) switches responsively and starts fast but means more requests and lower coding efficiency (dense keyframes); long (6–10s) is efficient but makes ABR sluggish and startup slow. VOD usually settles on 4–6s.

Trade-off:
Real-world:

Extensions & Optimization (as you grow)

Pitfalls + Interview Questions

1. Serving streams from a central data center. Ten million concurrent × a few Mbps = hundreds of Tbps — no central egress can carry it; CDN/edge is not an optimization, it's a prerequisite. Not mentioning CDN is an instant fail.
2. Putting ABR on the server. Only the client knows its own buffer level and instantaneous bandwidth in real time; ABR must live in the player; the server only offers multiple segment rungs.
3. One-size-fits-all fixed ladder. Wastes bandwidth on easy content, under-delivers on hard. A senior answer names per-title/per-shot + VMAF.
4. Ignoring startup and rebuffer metrics. Users churn on "won't open / keeps spinning," not "not crisp enough." Anchor SLOs to TTFF and rebuffer ratio, not average bitrate.

Common follow-ups: ① Estimate the egress bandwidth for ten million concurrent, and the order-of-magnitude transcode/storage cost of one movie. ② How does a video go from upload to playable worldwide? Draw the pipeline. ③ Failure modes of throughput-based vs buffer-based ABR — how do you hybridize? ④ How do you avoid an origin storm on a hot premiere? ⑤ VOD vs live architectural differences (real-time transcode, latency, buffering strategy)?

Deep-Dive Resources

Deep Thinking (click to expand)

1. Estimate: 10M concurrent, 4 Mbps average — how much egress? Why does that number alone rule out "serve from a central data center"?

10M × 4 Mbps = 4×10⁷ Mbps = 40 Tbps (peak 4K is more, up to hundreds of Tbps). For comparison: a single mega data center's total egress is typically a few to ~teens of Tbps; a major submarine cable trunk is also Tbps-class. 40 Tbps means you must spread traffic across hundreds–thousands of edge points worldwide, each carrying tens–hundreds of Gbps — no single center can aggregate that egress. That's why CDN is an architectural prerequisite, not an optimization — and at third-party CDN per-GB pricing, this bandwidth bill would eat all profit, the economic motive behind Netflix building Open Connect. When costing it, remember: transcode is one-time CPU, bandwidth is a forever monthly bill, so trade compute for permanent bandwidth.

2. Per-shot encoding raises encode units by an order of magnitude, exploding pipeline complexity. Why is this worth it for Netflix but maybe not for a startup?

It's about the amortization ratio of transcode cost vs bandwidth savings. A Netflix hit is played hundreds of millions of times; per-shot saving 30% bitrate = permanently saving 30% of enormous bandwidth, while the one-time extra encode compute is amortized to negligible across a billion plays. A startup differs on two fronts: ① low play counts — the absolute bandwidth saved is small and can't amortize the extra encode cost; ② engineering complexity — per-shot needs analysis encodes, convex-hull maintenance, and scheduling/retry/QC of a huge number of encode units, a dedicated team's job. So the rational order is: fixed ladder to ship → per-title once volume grows (the ROI inflection) → per-shot only for top content. This is a classic premature optimization anti-example: burning money on per-shot before PMF, the bandwidth saved won't even cover the engineers' salaries.

3. A user on the subway drops from 50 Mbps to 1 Mbps, then recovers. How do pure-throughput vs pure-buffer ABR behave? Why is hybrid better?

Pure throughput: at the moment of the drop it still picks a high rung based on "last segment was 50M," that segment can't finish → buffer drains → rebuffer; on recovery it aggressively steps up, hits jitter, rebuffers again, quality oscillates. Pure buffer-based (BOLA): it only watches buffer level, so on the drop it reacts as the buffer falls and steps down smoothly — better jitter resistance; but the cost is lagging reaction, and at startup the buffer is at 0 with no history, so it's conservative and slow to start. Hybrid takes both strengths: at startup, buffer is empty, use throughput to probe fast and get the first frame up; in steady state, switch to buffer-based for jitter resistance; and keep a safety margin on bandwidth estimates (e.g. ×0.85) so you don't ride the edge and rebuffer. The essence: "different phases carry different amounts of information — use the most reliable signal for the moment."

4. A hugely anticipated show launches Friday 8pm; millions click instantly. What happens with pure pull-on-demand? How do you prevent it?

Pure pull: the new show's segments aren't on any edge node yet, so millions of requests all miss → all go to origin → origin and the origin links are instantly hammered (this is the cache version of a thundering herd / avalanche, see Day 2). Result: startup timeouts, mass play failures — right when you most need stability. Prevention: ① Push prefill — known to be hot, prefill all bitrate variants to global edges during off-peak nights before launch (Netflix Open Connect's core play); ② add single-flight on origin, letting only one origin request per segment through while others wait; ③ tier the edge so a regional mid-tier absorbs a wave before origin, converging the origin fan-out; ④ stagger the launch with phased rollout or a countdown to spread the click peak. The core idea matches cache warming: predictable hotspots should be filled proactively, don't wait for the miss.

5. VOD wants "deeper buffer = smoother," live wants "low latency" — directly opposed buffering strategies. How does LL-HLS lower latency without giving up HLS's ABR and CDN compatibility?

The conflict: buffer is the ammunition against jitter, but deeper buffer means further behind the live edge. Traditional HLS must wait for a full segment (e.g. 6s) to be generated and written before it can be pulled — that step alone accumulates a segment's worth of latency. LL-HLS's trick: split a segment further into partial segments (hundreds of ms); the encoder emits as it encodes, the CDN forwards via chunked transfer as it receives, and the player gets the latest small chunk without waiting for the whole segment — latency drops from "whole segment" to "chunk" level. It also uses preload hints / blocking playlist reload so the player precisely prefetches the next chunk and avoids polling churn. The key elegance: it's still an HTTP + HLS framework — reusing existing CDN caching and the multi-rung ABR mechanism, unlike WebRTC which spins up a separate, uncacheable real-time channel. The cost is more frequent requests, a CDN requirement for chunked transfer support, and a thinner buffer with less jitter margin — fundamentally it moves to a new point on the "latency vs smoothness" trade-off rather than eliminating the conflict.