Day 34 Hard Real-time UDP/WebRTC Game Netcode CRDT

Real-time Systems: Keeping the World Consistent Within 100ms
Transport, Game Netcode, Authoritative State, Collaborative Sync

Problem Scenario + Constraints

Design the backend for a real-time multiplayer battle with 100k concurrent players (.io-style / lightweight FPS): players in the same room move and shoot, and everyone must see everyone else's actions "almost instantly." This is a different species from a request-response web backend. Its SLO isn't "p99 < 200ms" — it's end-to-end p95 < 100ms with bounded jitter, because the human eye is extremely sensitive to input lag above ~50ms and to position teleports.

Latency SLO: local input response < 16ms (one frame); a peer's action visible < 100ms. Jitter is deadlier than absolute latency.
Update rate: server tick 20–60Hz, each room broadcasts state 20–60 times/sec; 20–100 players per room.
Scale: 100k players ÷ 50 per room = 2000 concurrent rooms; 30Hz × 50 players ≈ 1500 outbound packets/sec/room, millions of pps globally.
Consistency: games need an authoritative server (anti-cheat); collaborative editing needs eventual consistency (never drop a keystroke). The two flavors of real-time have completely different consistency models.
Reliability: dropping one position packet is fine (next tick overwrites it); dropping a "fire" event is an incident — reliable and unreliable channels must be separated.
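The scale figures above follow from simple arithmetic. A back-of-envelope sketch, where the 40-byte payload is an assumed illustrative value that the text does not specify:

```python
# Back-of-envelope capacity estimate for the constraints above.
PLAYERS = 100_000
PLAYERS_PER_ROOM = 50
TICK_HZ = 30
PAYLOAD_BYTES = 40        # assumption: illustrative per-packet snapshot payload
HEADER_BYTES = 28         # IPv4 header (20) + UDP header (8), no options

rooms = PLAYERS // PLAYERS_PER_ROOM
pps_per_room = TICK_HZ * PLAYERS_PER_ROOM        # one snapshot per player per tick
pps_global = rooms * pps_per_room
mbps_global = pps_global * (PAYLOAD_BYTES + HEADER_BYTES) * 8 / 1e6

print(rooms, pps_per_room, pps_global, round(mbps_global))
```

With small payloads the fixed headers are a large share of every packet, which is why packet rate, not just byte rate, is worth budgeting.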

High-Level Architecture

graph TD
    C["Client<br/>local prediction + render interp."]
    MM["Matchmaker<br/>HTTPS · assign room/region"]
    GW["Edge Gateway<br/>terminate UDP/WebRTC · nearest PoP"]
    GS["Authoritative Game Server<br/>room process · tick loop"]
    R[("Redis<br/>session/room state")]
    DB[("Persistent DB<br/>results/leaderboard")]

    C -->|① find match HTTPS| MM
    MM -->|② return server+token| C
    C <-->|③ realtime bidir UDP| GW
    GW <-->|④ forward| GS
    GS -.room state.-> R
    GS -.async persist.-> DB

    classDef client fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef edge fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef core fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef store fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class C client
    class MM,GW edge
    class GS core
    class R,DB store

The control plane (matchmaking, login) goes over HTTPS; the data plane (real-time state) uses a dedicated low-latency channel. The two paths are decoupled.

Responsibilities: the Matchmaker is a plain stateless web service that assigns players to rooms in the nearest region by geography and skill, returning a server address + auth token. The Edge Gateway terminates UDP/WebRTC within ~30ms of the user, handles NAT traversal and DDoS protection. The Game Server is a stateful room process running a fixed-rate tick loop: collect all player inputs this tick → advance physics → broadcast a new snapshot. State lives in memory; Redis only does crash recovery and cross-process coordination; the DB asynchronously records results. The key is to fully separate the "real-time data plane" from the "durable control plane."
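The tick loop described above (collect inputs, advance the simulation, broadcast) can be sketched as follows. This is a minimal illustration, not a production loop: `Room`, `run`, the one-dimensional positions, and the injectable `clock`/`sleep`/`broadcast` parameters are all hypothetical names chosen for the example.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Room:
    tick_hz: int = 30
    tick: int = 0
    positions: dict = field(default_factory=dict)   # player_id -> x (1-D for brevity)
    inbox: list = field(default_factory=list)       # queued (player_id, dx) inputs

    def step(self):
        """One tick: drain queued inputs, advance state, return a snapshot."""
        inputs, self.inbox = self.inbox, []
        for player_id, dx in inputs:
            self.positions[player_id] = self.positions.get(player_id, 0.0) + dx
        self.tick += 1
        return {"tick": self.tick, "positions": dict(self.positions)}

def run(room, ticks, clock=time.monotonic, sleep=time.sleep, broadcast=print):
    """Fixed-rate loop: deadlines advance by exactly one tick period each time."""
    dt = 1.0 / room.tick_hz
    deadline = clock()
    for _ in range(ticks):
        broadcast(room.step())
        deadline += dt
        delay = deadline - clock()
        if delay > 0:
            sleep(delay)      # if a tick overran, skip sleeping and catch up
```

Scheduling against an accumulated deadline rather than sleeping a fixed `dt` after each tick keeps the average rate stable even when individual ticks take uneven time.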

Key Technical Points

1. Transport Choice: Why Real-time Systems Flee TCP

Core trade-off: TCP's reliable ordering is precisely the poison for real-time systems.

Principle: TCP guarantees the byte stream arrives in order — which means once a packet is lost, packets that have already arrived must queue in the kernel buffer waiting for retransmission; the application can't read them. This is Head-of-Line (HoL) blocking. In a game, if position packet #100 is lost, #101 and #102 have arrived but are stuck; by the time #100 is retransmitted, they are already stale — you paid an RTT of stutter for data nobody wanted. A real-time system would rather drop stale data and move on than stay in order. So native games use UDP (implementing selective reliability on top), browsers use WebRTC DataChannel (UDP-based, configurable unreliable/unordered) or the newer WebTransport (QUIC) — QUIC implements multiple streams in user space, so loss on one stream doesn't block others, sidestepping kernel TCP's HoL blocking.
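To make the head-of-line stall concrete, the sketch below compares when each packet becomes readable under in-order delivery versus unordered delivery. The arrival times are made-up illustrative values (packet #100 is lost and its retransmission lands at 110 ms); the function names are hypothetical.

```python
def in_order_delivery(arrivals):
    """TCP-like: packet n is readable only once every packet <= n has arrived.

    arrivals maps sequence number -> arrival time in ms.
    """
    delivered, latest = {}, 0.0
    for seq in sorted(arrivals):
        latest = max(latest, arrivals[seq])
        delivered[seq] = latest
    return delivered

def unordered_delivery(arrivals):
    """UDP-like: each packet is readable the moment it arrives."""
    return dict(arrivals)

# Assumed timeline: #100 was lost, its retransmission arrives at 110 ms;
# #101 and #102 arrive on schedule at 43 ms and 76 ms.
arrivals = {100: 110, 101: 43, 102: 76}

ordered = in_order_delivery(arrivals)
unordered = unordered_delivery(arrivals)
stall_101 = ordered[101] - unordered[101]   # time #101 sat unread in the buffer
print(ordered, unordered, stall_101)
```

Under in-order delivery all three packets surface together at 110 ms, by which point #100 and #101 describe positions that #102 has already superseded.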

Trade-off:

Channel	Reliable/Ordered	HoL Blocking	Use case
TCP / WebSocket	reliable+ordered	yes (fatal)	chat, collab editing, control plane
UDP (raw)	neither	none	native games, self-built reliability
WebRTC DataChannel	configurable	can disable	browser real-time games/P2P
WebTransport (QUIC)	per-stream independent	none across streams	modern browser real-time data

The practical approach is dual channels: position/orientation over unreliable UDP (drop = no problem, next tick overwrites); "fire/pickup/death" critical events over a reliable channel (custom ACK retransmit, or WebRTC's reliable mode).
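A minimal sketch of the two channels, assuming a raw datagram `send` callable supplied by the caller. `ReliableEvents` and `LatestState` are hypothetical names, and the fixed 100 ms resend interval is an illustrative choice rather than a recommended value.

```python
class ReliableEvents:
    """Sender side of a simple ACK-and-retransmit layer over an unreliable link."""

    def __init__(self, send, resend_after_ms=100):
        self.send = send                    # callable(seq, payload): raw datagram send
        self.resend_after_ms = resend_after_ms
        self.next_seq = 0
        self.unacked = {}                   # seq -> (payload, last_sent_ms)

    def emit(self, payload, now_ms):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = (payload, now_ms)
        self.send(seq, payload)
        return seq

    def on_ack(self, seq):
        self.unacked.pop(seq, None)         # duplicate ACKs are harmless

    def poll(self, now_ms):
        """Retransmit every event whose ACK is overdue."""
        for seq, (payload, sent) in list(self.unacked.items()):
            if now_ms - sent >= self.resend_after_ms:
                self.unacked[seq] = (payload, now_ms)
                self.send(seq, payload)


class LatestState:
    """Receiver side of the unreliable channel: keep only the newest tick."""

    def __init__(self):
        self.tick, self.state = -1, None

    def on_packet(self, tick, state):
        if tick > self.tick:                # stale or duplicate packets are dropped
            self.tick, self.state = tick, state
```

Because retransmission can deliver the same event twice, the receiving side of the reliable channel would also need to deduplicate by sequence number before applying an event.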

Real-world cases:

Valve Source engine: a custom UDP-based protocol distinguishing reliable/unreliable data, with authoritative server-side simulation (Source Multiplayer Networking).
Discord voice: uses WebRTC (UDP underneath); the client is a C++ media engine built on the WebRTC native library, because voice equally cannot tolerate HoL blocking (Discord Engineering).
QUIC/HTTP3: one of QUIC's design goals was eliminating TCP's HoL blocking; today's WebTransport opens this capability to real-time web apps.

2. Client-Side Prediction + Server Reconciliation: Hiding the Speed of Light

Core trade-off: trade optimistic local simulation (that may need rollback) for zero-perceived-latency feel.

Principle: even at 50ms RTT, "press → send to server → wait for reply → then move" feels sluggish. The fix is a trio. ① Client-side prediction: move locally immediately on input without waiting for the server, while sending the sequence-numbered input. ② Server reconciliation: the server is authoritative; its reply carries "processed up to input #N + authoritative position." The client takes that position as the baseline and replays all local inputs after #N that haven't yet been acknowledged, arriving at a corrected current position. If the prediction was right, the replayed result matches what is already on screen and nothing visibly moves; if it was wrong (e.g. the server says you hit a wall), the client corrects toward the authoritative result, ideally blending over a few frames rather than snapping. ③ Entity interpolation: other players' states arrive discretely per tick; the client deliberately renders ~100ms behind, interpolating between two known snapshots for smoothness instead of teleporting. The cost: the "others" you see are always ~100ms in the past. Lag compensation addresses this: when resolving a shot, the server rewinds the target to where the shooter actually saw it before computing the hit.

# Client prediction + reconciliation (pseudocode)
pending = []                      # unacknowledged local inputs
def on_input(cmd):
    cmd.seq = next_seq()
    apply_local(cmd)              # ① move locally now, zero-latency feel
    pending.append(cmd)
    net.send(cmd)                 # send seq-numbered input to authoritative server

def on_server_snapshot(snap):
    global pending                # the list is rebound below, so declare it global
    me.pos = snap.pos             # ② trust authoritative position
    # drop inputs already confirmed by the server
    pending = [c for c in pending if c.seq > snap.last_processed_seq]
    for c in pending:             # replay still-unconfirmed inputs
        apply_local(c)            # → right prediction: no jump; wrong: position is corrected

# other players: buffer two frames, interpolate at render_time = now - 100ms (③)
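Step ③ can be sketched as a small snapshot buffer sampled at `now - 100ms`. This is an illustrative version under simple assumptions (2-D positions, linear interpolation, hold the last known position when no newer snapshot exists); `InterpolationBuffer` is a hypothetical name.

```python
import bisect

class InterpolationBuffer:
    """Buffers one remote entity's snapshots and samples them in the past."""

    def __init__(self, delay_ms=100):
        self.delay_ms = delay_ms
        self.times = []        # snapshot timestamps in ms, ascending
        self.positions = []    # matching (x, y) positions

    def add(self, t_ms, pos):
        i = bisect.bisect(self.times, t_ms)     # tolerate out-of-order arrival
        self.times.insert(i, t_ms)
        self.positions.insert(i, pos)

    def sample(self, now_ms):
        """Position at render time (now - delay), or None if nothing is buffered."""
        if not self.times:
            return None
        t = now_ms - self.delay_ms
        i = bisect.bisect_right(self.times, t)
        if i == 0:
            return self.positions[0]            # render time precedes all data
        if i == len(self.times):
            return self.positions[-1]           # no newer snapshot: hold last
        t0, t1 = self.times[i - 1], self.times[i]
        (x0, y0), (x1, y1) = self.positions[i - 1], self.positions[i]
        a = (t - t0) / (t1 - t0)                # 0 at the older snapshot, 1 at the newer
        return (x0 + (x1 - x0) * a, y0 + (y1 - y0) * a)
```

Holding the last position when data runs out is the simplest policy; some games extrapolate briefly instead, at the risk of visible snap-back when the next snapshot disagrees.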

Real-world cases:

Gabriel Gambetta's series is the canonical tutorial for this paradigm, demonstrating prediction/reconciliation/interpolation/lag compensation in <500 lines of JS with a live demo (Fast-Paced Multiplayer).
Valve: lag compensation in CS/Source games keeps ~1 second of player position history and rewinds the world by the shooter's ping when resolving hits (Valve: Lag Compensation) — hit detection must be server-authoritative, or the client can cheat.
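The rewind step of lag compensation can be sketched as a per-player position history that the server queries at the shooter's view time. This is a simplified illustration, not Valve's implementation: positions are one-dimensional, `PositionHistory` and `hit` are hypothetical names, and the 200 ms rewind cap is an illustrative value.

```python
from collections import deque

class PositionHistory:
    """Server-side record of one player's recent positions for hit rewinds."""

    def __init__(self, keep_ms=1000, max_rewind_ms=200):
        self.keep_ms = keep_ms
        self.max_rewind_ms = max_rewind_ms   # illustrative cap on how far to rewind
        self.samples = deque()               # (t_ms, x), oldest first

    def record(self, t_ms, x):
        self.samples.append((t_ms, x))
        while self.samples and t_ms - self.samples[0][0] > self.keep_ms:
            self.samples.popleft()           # forget anything older than keep_ms

    def rewind(self, now_ms, shooter_latency_ms):
        """Newest recorded position at or before the shooter's view time."""
        lag = min(shooter_latency_ms, self.max_rewind_ms)
        view_time = now_ms - lag
        best = self.samples[0][1] if self.samples else None
        for t, x in self.samples:
            if t <= view_time:
                best = x
            else:
                break
        return best

def hit(shot_x, target_x, radius=0.5):
    """Toy hit test: the shot lands within `radius` of the target."""
    return abs(shot_x - target_x) <= radius
```

The server would call `rewind` for each potential target before running the hit test, so a shot aimed at where the target appeared on the shooter's screen is judged against that past position.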

3. Authoritative Server & State Sync: Ticks, Snapshots, Interest Management

Core trade-off: sync fidelity vs bandwidth/CPU — sending "the whole world" to everyone every tick gets killed by both bandwidth and CPU.

Principle: the authoritative server runs a fixed-rate tick loop (e.g. 30Hz): absorb all inputs this tick → advance deterministic physics → produce new world state → broadcast. The broadcast has three layers of optimization: ① Delta compression — send only changes relative to the last snapshot this client acknowledged; static objects cost no bandwidth. ② Interest management (AOI, Area of Interest) — players only need nearby entities; using a grid/quadtree spatial index, each player receives only updates within their view. A 100-player room broadcasting to all is O(n²)=10k messages/tick; with AOI each player sees only ~10 nearby, dropping to O(n·k). ③ Priority/rate tiering — nearby entities update at high frequency, distant ones lower or not at all. This turns "full state every tick" from a bandwidth disaster into an engineerable, stable stream.
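Interest management with a uniform grid can be sketched as below. It is a minimal illustration under the assumption that "nearby" means the 3x3 block of cells around a player; `GridAOI` and its methods are hypothetical names.

```python
from collections import defaultdict

class GridAOI:
    """Uniform-grid spatial index for area-of-interest queries."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(set)     # (cx, cy) -> entity ids in that cell
        self.where = {}                   # entity id -> (cx, cy)

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def move(self, entity_id, x, y):
        new = self._cell(x, y)
        old = self.where.get(entity_id)
        if old == new:
            return                        # still in the same cell: nothing to update
        if old is not None:
            self.cells[old].discard(entity_id)
        self.cells[new].add(entity_id)
        self.where[entity_id] = new

    def visible_to(self, entity_id):
        """Entities in the 3x3 block of cells around entity_id, excluding itself."""
        cx, cy = self.where[entity_id]
        seen = set()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                seen |= self.cells.get((cx + dx, cy + dy), set())
        seen.discard(entity_id)
        return seen
```

Each tick the server would build a player's update from `visible_to(player)` only, so per-player work scales with local density rather than with room size.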

Trade-off (sync models):

Lockstep: sync only inputs; each side deterministically re-simulates the same world. ✅ tiny bandwidth (RTS with thousands of units only sends commands); ❌ requires bit-exact determinism (even floats must match), one laggy player stalls the room, and cheaters can see the whole map.
State sync: server computes state and pushes it down. ✅ authoritative anti-cheat, tolerates packet loss, free join/leave; ❌ bandwidth grows with entity count, compressed via delta+AOI. Mainstream for FPS/MMO.
Full snapshot vs Delta: full is simple and recovers from any loss but costly; delta saves bandwidth but must track "which frame each client acknowledged" as a baseline — more complex.
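The baseline bookkeeping that delta compression needs can be sketched as follows: the server remembers what it sent per tick and diffs against the last tick the client acknowledged. This is an illustrative structure with hypothetical names (`make_delta`, `apply_delta`, `ClientStream`), using whole-entity state as the unit of change.

```python
def make_delta(baseline, current):
    """Entities that changed or appeared since baseline, plus removed ids."""
    changed = {eid: state for eid, state in current.items()
               if baseline.get(eid) != state}
    removed = [eid for eid in baseline if eid not in current]
    return {"changed": changed, "removed": removed}

def apply_delta(baseline, delta):
    """Client side: rebuild the current state from a baseline and a delta."""
    state = dict(baseline)
    state.update(delta["changed"])
    for eid in delta["removed"]:
        state.pop(eid, None)
    return state

class ClientStream:
    """Tracks, for one client, which sent snapshot it last acknowledged."""

    def __init__(self):
        self.sent = {}             # tick -> full state sent at that tick
        self.acked_tick = None

    def encode(self, tick, current):
        self.sent[tick] = dict(current)
        if self.acked_tick is None:            # no baseline yet: send everything
            return {"tick": tick, "base": None, "full": dict(current)}
        base = self.sent[self.acked_tick]
        return {"tick": tick, "base": self.acked_tick, **make_delta(base, current)}

    def on_ack(self, tick):
        if self.acked_tick is None or tick > self.acked_tick:
            self.acked_tick = tick
            # older baselines can never be referenced again, so drop them
            self.sent = {t: s for t, s in self.sent.items() if t >= tick}
```

Diffing against the last acknowledged tick, rather than the last sent one, is what lets a delta stream survive packet loss: a dropped delta simply means the next one is computed from an older baseline.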

Real-world cases:

Discord voice uses an SFU (Selective Forwarding Unit) rather than a P2P mesh: in large rooms the P2P connection count is O(n²) and explodes, so a server selectively forwards streams instead, letting one server handle huge concurrency. 850+ voice servers across 13 regions support 2.6M concurrent voice users (Discord Engineering). This is the "central fan-out + interest pruning" idea applied to audio/video.
Valve Source: the server maintains the authoritative world per tick and sends incremental updates to each client, which does prediction and interpolation (Source Networking).

4. Collaborative Sync: The Other Real-time — Eventual, Not Authoritative

Core trade-off: OT's centralized simplicity vs CRDT's decentralized freedom.

Principle: games want "authoritative + anti-cheat," but Figma/Notion/Google Docs collaborative editing is a different real-time: it must work offline, allow concurrent edits to the same document, never drop a keystroke, and eventually converge everyone to the same state. Two schools: OT (Operational Transformation) — transform concurrent operations against each other to preserve intent (you insert at position 5, I delete at position 3, OT adjusts your index); relies heavily on a central server for ordering, complex logic but compact state (Google Docs). CRDT (Conflict-free Replicated Data Type) — design the data so it is mathematically commutative and converges regardless of apply order, naturally supporting P2P and offline, but with large metadata overhead (each character carries a unique ID and tombstone). LWW (Last-Write-Wins) is the simplest CRDT form: each field independent, the last writer wins.
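The OT example in the paragraph above (an insert at position 5 concurrent with a delete at position 3) can be worked through in a few lines. This is a toy sketch limited to single-character inserts and deletes, with hypothetical function names; real OT systems handle many more operation pairs and tie-breaking rules.

```python
def transform_insert(ins_pos, del_pos):
    """Shift an insert position to account for a concurrent one-character delete."""
    return ins_pos - 1 if del_pos < ins_pos else ins_pos

def transform_delete(del_pos, ins_pos):
    """Shift a delete position to account for a concurrent one-character insert."""
    return del_pos + 1 if ins_pos <= del_pos else del_pos

def insert(text, pos, ch):
    return text[:pos] + ch + text[pos:]

def delete(text, pos):
    return text[:pos] + text[pos + 1:]

doc = "abcdefgh"

# Replica A applies its own insert first, then B's delete after transforming it.
a = delete(insert(doc, 5, "X"), transform_delete(3, 5))

# Replica B applies its own delete first, then A's insert after transforming it.
b = insert(delete(doc, 3), transform_insert(5, 3), "X")

print(a, b)
```

Both replicas end with the same text even though they applied the two operations in opposite orders, which is the convergence property the transformation exists to provide.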

# LWW register (simplest CRDT): a total order on (ts, node), i.e. (timestamp, replica id), decides
def merge(local, remote):
    if (remote.ts, remote.node) > (local.ts, local.node):
        return remote          # "newer" write wins; merge is commutative/idempotent
    return local
# The hard part isn't merge, it's the GRANULARITY of LWW:
# whole object? edits overwrite each other and get lost;
# per-attribute? concurrent edits to different attrs never conflict — Figma chose the latter
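The granularity point can be illustrated by keeping one LWW register per property and merging property by property. This sketch only demonstrates the per-property idea; it is not Figma's implementation, and `Write`, `merge_register`, and `merge_object` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Write:
    value: object
    ts: int        # logical or wall-clock timestamp
    node: str      # replica id, used only to break timestamp ties

def merge_register(local, remote):
    """LWW on the total order (ts, node)."""
    return remote if (remote.ts, remote.node) > (local.ts, local.node) else local

def merge_object(local, remote):
    """Merge two property maps, resolving each property independently."""
    merged = dict(local)
    for prop, write in remote.items():
        merged[prop] = merge_register(merged[prop], write) if prop in merged else write
    return merged

base = {"x": Write(0, 1, "s"), "fill": Write("red", 1, "s")}
alice = {**base, "x": Write(10, 2, "alice")}         # Alice moves the object
bob = {**base, "fill": Write("blue", 2, "bob")}      # Bob recolors it concurrently

merged_ab = merge_object(alice, bob)
merged_ba = merge_object(bob, alice)
```

Both merge orders keep Alice's move and Bob's recolor; with a single whole-object register, one of the two edits would have been overwritten.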

Real-world cases:

Figma: by their own account "inspired by CRDTs but not a true CRDT" — a central server + a per-object, per-property LWW register, so concurrent edits to different properties of an object don't interfere, while edits to the same property are "last writer wins"; tree structure is kept consistent via a parent pointer + the server rejecting cycle-creating operations; transport is WebSocket (How Figma's multiplayer technology works).
Google Docs is the OT exemplar, ordering operations via a central server. CRDTs are common in local-first/offline-first and collaboration libraries (e.g. the Yjs/Automerge ecosystem).

Scaling & Optimization

Room sharding & nearest access: route players to the geographically nearest region (GeoDNS / Anycast ingress); cross-region matches use relays. A room is a natural shard unit — a single room can't be split, so scale horizontally by adding machines for more rooms.
Elasticity of stateful services: game servers are stateful; you can't kill processes freely like stateless web. Use draining: stop assigning new rooms and let existing matches finish before retiring.
Crash recovery: when a room process crashes, all in-memory state is lost. Periodically checkpoint critical state (score, key events) to Redis to rebuild an approximate state; pure position state is acceptable to lose.
Anti-cheat: never trust the client. Validate movement server-side (speed caps, wall-clip detection); hit detection is server-authoritative, the client only "renders + predicts."
If bandwidth keeps growing: beyond delta compression, add quantization (positions as fixed-point integers instead of floats), bit-packing, and distance-based precision reduction.
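Position quantization can be sketched with the standard `struct` module. The world bounds below are an assumption made for the example; with a 2048-unit range and 16 bits per axis, the step size is about 0.03 units.

```python
import struct

WORLD_MIN, WORLD_MAX = -1024.0, 1024.0     # assumption: illustrative world bounds
SCALE = 65535 / (WORLD_MAX - WORLD_MIN)    # map the range onto 0..65535

def quantize(v):
    """Clamp to the world bounds and round to the nearest 16-bit step."""
    v = min(max(v, WORLD_MIN), WORLD_MAX)
    return round((v - WORLD_MIN) * SCALE)

def dequantize(q):
    return q / SCALE + WORLD_MIN

def pack_position(x, y):
    """Two unsigned 16-bit integers, little-endian: 4 bytes total."""
    return struct.pack("<HH", quantize(x), quantize(y))

def unpack_position(data):
    qx, qy = struct.unpack("<HH", data)
    return dequantize(qx), dequantize(qy)
```

Two 32-bit floats would take 8 bytes for the same pair, so this halves the position payload at the cost of a bounded rounding error of half a step.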

Common Pitfalls + Interview Follow-ups

1. Using TCP/WebSocket for high-frequency game sync. HoL blocking creates cascading stutter on loss. Low-frequency turn-based is fine; a 30Hz action game must use a UDP-class channel. Chat/collab editing, conversely, should use a reliable channel.

2. Trusting the client. "If the client says it hit, it hit" = cheater paradise. Authoritative decisions live on the server; client prediction is only for feel.

3. Treating absolute latency as the only metric. A steady 80ms beats a connection jittering between 30–150ms. Monitor jitter and loss, and smooth with a receiver-side jitter buffer.

4. Full broadcast + O(n²) fan-out. A 100-player room without AOI/delta blows up bandwidth and CPU. Spatial index + interest management is the lifeline for large rooms.

5. Whole-document LWW for collaborative editing. Too coarse a granularity overwrites and loses edits. Define merge at field/character level and think through the conflict semantics (Figma per-property, text editors per-character).
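For pitfall 3, jitter has to be measured before it can be monitored. One common estimator is the smoothed interarrival jitter defined for RTP in RFC 3550; the sketch below applies that formula to timestamped game packets. `JitterEstimator` is a hypothetical name, and the sample delays are made up.

```python
class JitterEstimator:
    """Smoothed interarrival jitter, following the RFC 3550 estimator."""

    def __init__(self):
        self.jitter = 0.0
        self.prev_transit = None

    def on_packet(self, sent_ms, received_ms):
        # Only differences of transit times are used, so a constant clock
        # offset between sender and receiver cancels out.
        transit = received_ms - sent_ms
        if self.prev_transit is not None:
            d = abs(transit - self.prev_transit)
            self.jitter += (d - self.jitter) / 16
        self.prev_transit = transit
        return self.jitter

steady = JitterEstimator()
noisy = JitterEstimator()
for i in range(50):
    sent = i * 33
    steady.on_packet(sent, sent + 80)                        # constant 80 ms delay
    noisy.on_packet(sent, sent + (30 if i % 2 else 150))     # alternating 30/150 ms
```

The steady 80 ms link reports zero jitter while the 30 to 150 ms link reports a large value, even though the second link has the lower best-case latency.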

Frequent interview questions: ① Why don't games use TCP? How exactly does HoL blocking hurt? ② How do you correct a wrong prediction without it looking jarring? ③ How do you keep bandwidth from exploding with 100 players per room per tick? ④ Why must hit detection be server-authoritative, and how does lag compensation stay fair? ⑤ How do the consistency models of Figma collaboration and an FPS match differ, and why?

Deeper Resources

Gabriel Gambetta — Fast-Paced Multiplayer series: prediction/reconciliation/interpolation/lag compensation in four steps, with a runnable demo (gabrielgambetta.com).
Valve Developer Community — Source Multiplayer Networking & Lag Compensation: first-hand docs on industrial-grade authoritative-server netcode.
Discord Engineering — Handling 2.5M Concurrent Voice Users with WebRTC: SFU, regionalization, silence optimization, large-scale real-time A/V experience.
Figma Blog — How Figma's multiplayer technology works: LWW register, tree consistency, why not a true CRDT.
Designing Data-Intensive Applications (Kleppmann), Ch. 5: replication and concurrent write conflicts — the foundation for understanding LWW/CRDT convergence.

Deeper Thinking

1. Prediction lets "me" move with zero latency, but the others I see are 100ms in the past. When I shoot, whose view should win? What fairness trade-off is hiding here?

This is the core tension of lag compensation. The server has two choices: (a) judge against the "now" world — but the shooter aimed where the target was 100ms ago, so they often miss, and high-ping players are penalized; (b) rewind the world to what the shooter actually saw at the moment of firing (Valve's approach, keeping ~1s of position history) — "what you see is what you get" for the shooter, which is fair to the shooting side.

The cost shifts to the victim: you've already ducked behind a wall, yet because of the opponent's high ping you were still standing in the open in their "past moment," so you get "shot behind cover." So this doesn't eliminate unfairness — it transfers it from shooter to victim. Most action games choose (b) because "if I aimed, it should hit" matters more to feel. Designs cap the rewind window (e.g. ≤200ms) to bound the unfairness from extreme ping.

2. Raising the tick rate from 30Hz to 60Hz — how do latency and bandwidth each change? Is higher always better?

Latency: the server processes input only once per tick, so the average "input waiting to be processed" time is about half a tick. 30Hz≈33ms/tick → ~16ms avg wait; 60Hz≈16ms → ~8ms. Raising the rate does cut this component, making it feel more responsive.

Bandwidth & CPU: broadcast frequency doubles, so outbound packets and pps roughly double (each packet has fixed IP/UDP header overhead, amortized worse when small packets dominate); the server runs twice the physics ticks/sec, so CPU doubles too. 2000 rooms × doubling = a meaningful jump in operating cost.

Not always better: diminishing returns — 30→60Hz cutting 8ms is worth it, 60→120Hz cuts only 4 more ms while doubling cost again. And client rendering/network jitter is often the larger latency source. Competitive FPS often use 64/128 tick; casual games are fine at 20–30Hz. This is the classic feel vs cost engineering trade-off, decided by genre.
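The numbers in this answer can be reproduced with a few lines of arithmetic. The room size and room count are the figures from the problem scenario; the function name is hypothetical.

```python
def tick_tradeoff(tick_hz, players_per_room=50, rooms=2000):
    """Average input wait and packet rates for a given server tick rate."""
    tick_ms = 1000 / tick_hz
    return {
        "avg_input_wait_ms": tick_ms / 2,      # inputs wait half a tick on average
        "pps_per_room": tick_hz * players_per_room,
        "pps_global": tick_hz * players_per_room * rooms,
    }

for hz in (30, 60, 120):
    t = tick_tradeoff(hz)
    print(hz, round(t["avg_input_wait_ms"], 1), t["pps_per_room"], t["pps_global"])

gain_30_to_60 = tick_tradeoff(30)["avg_input_wait_ms"] - tick_tradeoff(60)["avg_input_wait_ms"]
gain_60_to_120 = tick_tradeoff(60)["avg_input_wait_ms"] - tick_tradeoff(120)["avg_input_wait_ms"]
```

Each doubling of the tick rate doubles the packet rate but only halves the remaining wait, which is the diminishing return described above.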

3. CRDTs naturally support offline & P2P and never conflict, so why did Figma — a "must be strongly collaborative" product — choose a central server + LWW instead of pure CRDT?

Several practical reasons: ① Metadata overhead — general CRDTs (especially sequences/text) must carry per-element unique IDs, version vectors, and tombstones (deletes can't truly delete, or concurrent merges break), so metadata bloats as the document is edited — costly for Figma's massive vector-object scenario.

② With a central server, the hardest problem vanishes: CRDT complexity mostly exists to handle "no authoritative orderer." Figma already has an online server that can arbitrate a total order, so most fields degrade to a simple LWW register — no need to carry CRDT's full baggage.

③ Controllable conflict semantics matter more: Figma does LWW "per object per property," so concurrent edits to different properties don't interfere and same-property edits are last-writer-wins — this deterministic, user-understandable outcome is more controllable than a CRDT's auto-merged "semantically correct but surprising" result. The tree uses parent pointers + server rejection of cycles — again trading central authority for simplicity.

Conclusion: CRDTs solve the "no-center" problem; as long as you're willing and able to keep a center, OT/LWW is often cheaper and more controllable. CRDT's true home is local-first / offline-first.

4. A room process crashes and all 50 players' in-memory match state is lost. How do you design so you neither lose critical data nor let checkpointing drag down the 30Hz tick?

The key is to distinguish "must persist" from "discardable" state, tiered by importance:

Position/orientation (high-frequency, droppable): don't checkpoint at all. After a crash and reconnect, clients re-report and it converges within a few frames; the loss is negligible.
Critical events/score (low-frequency, must not drop): events that "change the outcome" (hits, scores, item pickups) are written asynchronously to Redis or an event log rather than persisted synchronously in the tick main loop; a separate IO thread/queue keeps the writes from blocking the 30Hz physics.
Periodic checkpoint: every few seconds, snapshot the compressed critical state to Redis; after a crash, the new process loads the latest checkpoint + replays the subsequent event log to rebuild an approximate match.

This is essentially tiered RPO: position state can have "infinite" RPO (just recompute), score state needs RPO near 0. Keeping expensive persistence off the tick hot path is the universal technique for stateful real-time services that mustn't drop frames. Competitive scenarios also demand "a crash doesn't break judgment fairness," so authoritative state should be made deterministically replayable.
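Keeping persistence off the tick hot path can be sketched with a queue and a worker thread. This is a minimal illustration: `EventLog` is a hypothetical name, and `write` stands in for whatever durable write (Redis, an append-only log) the deployment actually uses.

```python
import queue
import threading

class EventLog:
    """Hands critical events to a background writer so the tick never waits on IO."""

    def __init__(self, write):
        self.write = write                  # callable(event): the slow durable write
        self.q = queue.Queue()              # unbounded, so put() does not block
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def record(self, event):
        """Called from the tick loop: enqueue and return immediately."""
        self.q.put(event)

    def backlog(self):
        """Approximate number of events not yet written; useful as a metric."""
        return self.q.qsize()

    def _drain(self):
        while True:
            event = self.q.get()
            if event is None:               # sentinel: shut down
                break
            self.write(event)

    def close(self):
        """Flush everything queued so far, then stop the worker."""
        self.q.put(None)
        self.worker.join()
```

An unbounded queue trades memory for never stalling the tick, so the backlog should be monitored; events still in the queue at crash time are lost, which bounds how close to zero the RPO for critical state can get with this design.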

5. Both are "real-time," so why should chat/collab editing use TCP/WebSocket while action games must use UDP? What happens if you swap the channels?

The dividing line is which matters more: the data's timeliness or its undroppability:

Action games: position data is high-frequency and self-overwriting — frame N's position is garbage once stale, replaced by the next frame. Here "drop a stale packet and move on" beats "wait for retransmit of a value nobody wants." TCP creates cascading stutter on loss, wrecking feel.
Collab editing/chat: each operation is undroppable and order-sensitive — lose a character or reorder operations and the document is wrong, never converging. Here you want exactly TCP's reliable ordering; the occasional tens of ms of latency is imperceptible to humans (typing isn't a 60Hz operation).

Consequences of swapping: a game on TCP → stutters on loss, gets outplayed; collab editing on raw UDP → dropped op = lost text, reordering = corrupted document, and you'd have to reimplement TCP's reliable ordering at the app layer anyway — wasted effort. So "real-time" isn't a single requirement; first ask "is dropping one datum a non-issue or an incident?" — the answer directly decides the channel. This is why many game backends use both channels: position over UDP, critical events/chat over a reliable one.
