Design the backend for a real-time multiplayer battle with 100k concurrent players (.io-style / lightweight FPS): players in the same room move and shoot, and everyone must see everyone else's actions "almost instantly." This is a different species from a request-response web backend. Its SLO isn't "p99 < 200ms" — it's end-to-end p95 < 100ms with bounded jitter, because the human eye is extremely sensitive to input lag above ~50ms and to position teleports.
graph TD
C["Client
local prediction + render interp."]
MM["Matchmaker
HTTPS · assign room/region"]
GW["Edge Gateway
terminate UDP/WebRTC · nearest PoP"]
GS["Authoritative Game Server
room process · tick loop"]
R[("Redis
session/room state")]
DB[("Persistent DB
results/leaderboard")]
C -->|① find match HTTPS| MM
MM -->|② return server+token| C
C <-->|③ realtime bidir UDP| GW
GW <-->|④ forward| GS
GS -.room state.-> R
GS -.async persist.-> DB
classDef client fill:#1a2530,stroke:#64c8ff,color:#e8eef5
classDef edge fill:#0e2030,stroke:#5eead4,color:#e8eef5
classDef core fill:#1a1a30,stroke:#ffb450,color:#e8eef5
classDef store fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
class C client
class MM,GW edge
class GS core
class R,DB store
Control plane (matchmaking, login) goes over HTTPS; the data plane (real-time state) uses a dedicated low-latency channel — the two paths are decoupled
Responsibilities: the Matchmaker is a plain stateless web service that assigns players to rooms in the nearest region by geography and skill, returning a server address + auth token. The Edge Gateway terminates UDP/WebRTC within ~30ms of the user, handles NAT traversal and DDoS protection. The Game Server is a stateful room process running a fixed-rate tick loop: collect all player inputs this tick → advance physics → broadcast a new snapshot. State lives in memory; Redis only does crash recovery and cross-process coordination; the DB asynchronously records results. The key is to fully separate the "real-time data plane" from the "durable control plane."
Core trade-off: TCP's reliable ordering is precisely the poison for real-time systems.
Principle: TCP guarantees the byte stream arrives in order — which means once a packet is lost, packets that have already arrived must queue in the kernel buffer waiting for retransmission; the application can't read them. This is Head-of-Line (HoL) blocking. In a game, if position packet #100 is lost, #101 and #102 have arrived but are stuck; by the time #100 is retransmitted, they are already stale — you paid an RTT of stutter for data nobody wanted. A real-time system would rather drop stale data and move on than stay in order. So native games use UDP (implementing selective reliability on top), browsers use WebRTC DataChannel (UDP-based, configurable unreliable/unordered) or the newer WebTransport (QUIC) — QUIC implements multiple streams in user space, so loss on one stream doesn't block others, sidestepping kernel TCP's HoL blocking.
| Channel | Reliable/Ordered | HoL Blocking | Use case |
|---|---|---|---|
| TCP / WebSocket | reliable+ordered | yes (fatal) | chat, collab editing, control plane |
| UDP (raw) | neither | none | native games, self-built reliability |
| WebRTC DataChannel | configurable | can disable | browser real-time games/P2P |
| WebTransport (QUIC) | per-stream independent | none across streams | modern browser real-time data |
The practical approach is dual channels: position/orientation over unreliable UDP (drop = no problem, next tick overwrites); "fire/pickup/death" critical events over a reliable channel (custom ACK retransmit, or WebRTC's reliable mode).
Core trade-off: trade optimistic local simulation (that may need rollback) for zero-perceived-latency feel.
Principle: even at 50ms RTT, "press → send to server → wait for reply → then move" feels sluggish. The fix is a trio: ① Client-side prediction — move locally immediately on input without waiting for the server, while sending the sequence-numbered input. ② Server reconciliation — the server is authoritative; its reply carries "processed up to input #N + authoritative position." The client takes that position as the baseline and replays all local inputs after #N that haven't yet been acknowledged, arriving at a corrected current position. If the prediction was right, the screen doesn't move; if interrupted (e.g. hit a wall), it corrects smoothly. ③ Entity interpolation — other players' states arrive discretely per tick; the client deliberately renders ~100ms behind, interpolating between two known snapshots for smoothness instead of teleporting. The cost: the "others" you see are always 100ms in the past — which is what lag compensation solves: when resolving a shot, the server rewinds the target to where the shooter actually saw it before computing the hit.
# Client prediction + reconciliation (pseudocode)
pending = [] # unacknowledged local inputs
def on_input(cmd):
cmd.seq = next_seq()
apply_local(cmd) # ① move locally now, zero-latency feel
pending.append(cmd)
net.send(cmd) # send seq-numbered input to authoritative server
def on_server_snapshot(snap):
me.pos = snap.pos # ② trust authoritative position
# drop inputs already confirmed by the server
pending = [c for c in pending if c.seq > snap.last_processed_seq]
for c in pending: # replay still-unconfirmed inputs
apply_local(c) # → right prediction: no jump; wrong: smooth correction
# other players: buffer two frames, interpolate at render_time = now - 100ms (③)
Core trade-off: sync fidelity vs bandwidth/CPU — sending "the whole world" to everyone every tick gets killed by both bandwidth and CPU.
Principle: the authoritative server runs a fixed-rate tick loop (e.g. 30Hz): absorb all inputs this tick → advance deterministic physics → produce new world state → broadcast. The broadcast has three layers of optimization: ① Delta compression — send only changes relative to the last snapshot this client acknowledged; static objects cost no bandwidth. ② Interest management (AOI, Area of Interest) — players only need nearby entities; using a grid/quadtree spatial index, each player receives only updates within their view. A 100-player room broadcasting to all is O(n²)=10k messages/tick; with AOI each player sees only ~10 nearby, dropping to O(n·k). ③ Priority/rate tiering — nearby entities update at high frequency, distant ones lower or not at all. This turns "full state every tick" from a bandwidth disaster into an engineerable, stable stream.
Core trade-off: OT's centralized simplicity vs CRDT's decentralized freedom.
Principle: games want "authoritative + anti-cheat," but Figma/Notion/Google Docs collaborative editing is a different real-time: it must work offline, allow concurrent edits to the same document, never drop a keystroke, and eventually converge everyone to the same state. Two schools: OT (Operational Transformation) — transform concurrent operations against each other to preserve intent (you insert at position 5, I delete at position 3, OT adjusts your index); relies heavily on a central server for ordering, complex logic but compact state (Google Docs). CRDT (Conflict-free Replicated Data Type) — design the data so it is mathematically commutative and converges regardless of apply order, naturally supporting P2P and offline, but with large metadata overhead (each character carries a unique ID and tombstone). LWW (Last-Write-Wins) is the simplest CRDT form: each field independent, the last writer wins.
# LWW register (simplest CRDT): a total order on (timestamp, replica_id) decides
def merge(local, remote):
if (remote.ts, remote.node) > (local.ts, local.node):
return remote # "newer" write wins; merge is commutative/idempotent
return local
# The hard part isn't merge, it's the GRANULARITY of LWW:
# whole object? edits overwrite each other and get lost;
# per-attribute? concurrent edits to different attrs never conflict — Figma chose the latter
Frequent interview questions: ① Why don't games use TCP? How exactly does HoL blocking hurt? ② How do you correct a wrong prediction without it looking jarring? ③ How do you keep bandwidth from exploding with 100 players per room per tick? ④ Why must hit detection be server-authoritative, and how does lag compensation stay fair? ⑤ How do the consistency models of Figma collaboration and an FPS match differ, and why?
This is the core tension of lag compensation. The server has two choices: (a) judge against the "now" world — but the shooter aimed where the target was 100ms ago, so they often miss, and high-ping players are penalized; (b) rewind the world to what the shooter actually saw at the moment of firing (Valve's approach, keeping ~1s of position history) — "what you see is what you get" for the shooter, which is fair to the shooting side.
The cost shifts to the victim: you've already ducked behind a wall, yet because of the opponent's high ping you were still standing in the open in their "past moment," so you get "shot behind cover." So this doesn't eliminate unfairness — it transfers it from shooter to victim. Most action games choose (b) because "if I aimed, it should hit" matters more to feel. Designs cap the rewind window (e.g. ≤200ms) to bound the unfairness from extreme ping.
Latency: the server processes input only once per tick, so the average "input waiting to be processed" time is about half a tick. 30Hz≈33ms/tick → ~16ms avg wait; 60Hz≈16ms → ~8ms. Raising the rate does cut this component, making it feel more responsive.
Bandwidth & CPU: broadcast frequency doubles, so outbound packets and pps roughly double (each packet has fixed IP/UDP header overhead, amortized worse when small packets dominate); the server runs twice the physics ticks/sec, so CPU doubles too. 2000 rooms × doubling = a meaningful jump in operating cost.
Not always better: diminishing returns — 30→60Hz cutting 8ms is worth it, 60→120Hz cuts only 4 more ms while doubling cost again. And client rendering/network jitter is often the larger latency source. Competitive FPS often use 64/128 tick; casual games are fine at 20–30Hz. This is the classic feel vs cost engineering trade-off, decided by genre.
Several practical reasons: ① Metadata overhead — general CRDTs (especially sequences/text) must carry per-element unique IDs, version vectors, and tombstones (deletes can't truly delete, or concurrent merges break), so metadata bloats as the document is edited — costly for Figma's massive vector-object scenario.
② With a central server, the hardest problem vanishes: CRDT complexity mostly exists to handle "no authoritative orderer." Figma already has an online server that can arbitrate a total order, so most fields degrade to a simple LWW register — no need to carry CRDT's full baggage.
③ Controllable conflict semantics matter more: Figma does LWW "per object per property," so concurrent edits to different properties don't interfere and same-property edits are last-writer-wins — this deterministic, user-understandable outcome is more controllable than a CRDT's auto-merged "semantically correct but surprising" result. The tree uses parent pointers + server rejection of cycles — again trading central authority for simplicity.
Conclusion: CRDTs solve the "no-center" problem; as long as you're willing and able to keep a center, OT/LWW is often cheaper and more controllable. CRDT's true home is local-first / offline-first.
The key is to distinguish "must persist" from "discardable" state, tiered by importance:
This is essentially tiered RPO: position state can have "infinite" RPO (just recompute), score state needs RPO near 0. Keeping expensive persistence off the tick hot path is the universal technique for stateful real-time services that mustn't drop frames. Competitive scenarios also demand "a crash doesn't break judgment fairness," so authoritative state should be made deterministically replayable.
The dividing line is which matters more: the data's timeliness or its undroppability:
Consequences of swapping: a game on TCP → stutters on loss, gets outplayed; collab editing on raw UDP → dropped op = lost text, reordering = corrupted document, and you'd have to reimplement TCP's reliable ordering at the app layer anyway — wasted effort. So "real-time" isn't a single requirement; first ask "is dropping one datum a non-issue or an incident?" — the answer directly decides the channel. This is why many game backends use both channels: position over UDP, critical events/chat over a reliable one.