Day 15 Hard Chat System WebSocket Delivery Guarantee E2E Encryption

Chat System — Real-time Delivery for 1B Users and 50M Concurrent ConnectionsWebSocket vs Long Polling, Message Storage, Delivery Guarantees, E2E Encryption

Problem & Constraints

Design the backend for an instant messenger with 1 billion users and 50 million concurrent connections (WhatsApp, Discord, WeChat). Feed (Day 14) is a read-amplification problem — one post read by millions; chat is the inverse, a connection-amplification problem — tens of millions of never-closing long connections that must be pushed to at any moment. The hard part isn't "storing messages" but maintaining a sea of long connections and guaranteeing every message is delivered exactly once, in order, without loss.

High-level Architecture

The core idea is to decouple "connection" from "logic": the Gateway layer only maintains long connections and shuttles bytes; the business layer is stateless and scales horizontally. The key component is the Session Registry — a routing table of "user → which Gateway node they're connected to" — the addressing core that lets a message find its recipient. The sender's message first hits durable storage (so it's never lost), then the Registry is queried to route it to the recipient's Gateway for push; if the recipient is offline, it stays in storage to be pulled on reconnect.

graph LR A[Sender App] -->|WebSocket| GW1[Gateway A
holds connections] GW1 --> CH[Chat Service
stateless logic] CH -->|1 persist| DB[(Message Store
ScyllaDB/Cassandra)] CH -->|2 lookup route| REG[(Session Registry
user→Gateway)] CH -->|3 deliver| MQ[Message Router/Queue] MQ --> GW2[Gateway B] GW2 -->|WebSocket push| B[Receiver App] GW2 -.offline.-> DB

Key Techniques

1. Connection Layer: WebSocket Persistent Connection vs Polling

One-line trade-off: pay the "complexity of holding tens of millions of stateful connections" to get "true bidirectional, low-latency, low-idle-overhead real-time push".

Principle: chat requires the server to push proactively. Three options: short polling (periodic HTTP pull) has high latency and floods the server with empty requests; long polling (hold the request until data arrives) saves some but still pays a full HTTP round-trip per message with frequent connection rebuilds; WebSocket upgrades one handshake into a full-duplex long connection, after which bidirectional push costs almost nothing — the modern default. The cost: the Gateway becomes stateful — each connection holds memory + fds, and node restarts/scaling drop huge numbers of connections. So the connection layer needs a lightweight high-concurrency runtime — WhatsApp uses Erlang/BEAM, Discord uses Elixir, both based on the actor model: each connection is an independent lightweight process (not an OS thread), scheduled by the VM, letting a single box hold millions of connections.

Trade-off:
ApproachReal-timeIdle overheadCost
Short pollingPoor (interval-bound)Very high (empty floods)Simple but unscalable
Long pollingMediumMedium (frequent rebuilds)HTTP-friendly stopgap
WebSocketGreat (<200ms)Low (connection reuse)Stateful gateway, hard to scale

Another key piece is presence and heartbeats: the server detects liveness via periodic heartbeats, and a dropped connection must be promptly removed from the Registry — otherwise messages get routed to dead connections. But presence broadcast is an amplification bomb: a single user going online/offline must notify all friends/group members, so it must be aggregated and rate-limited.

Real-world:

2. Message Storage & Fanout: Partition Design and Hot Spots

One-line trade-off: pay a "partition key bucketed by channel + time" to "avoid a huge group hammering a single partition into a hot spot".

Principle: messages must be persisted before delivery (never just in memory, or a node crash loses them). The read pattern is "pull the latest N messages / messages after a timestamp for a conversation", a natural fit for wide-column stores (Cassandra/ScyllaDB): partition key channel_id, clustering key a time-ordered message ID (Snowflake, echoing Day 11), so a conversation's messages are physically adjacent and time-sorted. The trap is hot partitions: a huge channel (a million people in one channel) crams every message into a single channel_id partition, pinning all read/write to one replica set and spiking p99. The fix is bucketing — change the key to (channel_id, time_bucket), splitting by time window (e.g. every N days) to spread a big partition horizontally. 1:1 and group use the same model: a 1:1 is just a two-person channel.

Trade-off: write fanout vs read fanout (mirror of Day 14)
# Bucketed message table (CQL-style pseudo-code)
CREATE TABLE messages (
  channel_id   bigint,
  bucket       int,        # time bucket, avoids hot mega-partitions
  message_id   bigint,     # Snowflake: time-ordered, doubles as sort key
  author_id    bigint,
  payload      blob,       # ciphertext under E2E
  PRIMARY KEY ((channel_id, bucket), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
# Read latest: locate current bucket, take N by message_id DESC
Real-world:

3. Delivery Guarantees: Dedup, ACK, and Ordering

One-line trade-off: pay "at-least-once + client-side idempotent dedup" to get "something far simpler than exactly-once that still loses nothing and shows nothing twice".

Principle: networks drop packets and clients reconnect-and-resend; true exactly-once is brutally expensive in distributed systems (echoing Day 8). The standard answer is at-least-once delivery + endpoint idempotency: the sender attaches a client-generated idempotency ID (client message id) to each message, and the server dedups on it — retries never create two. Delivery rides an ACK chain: each hop (server received, peer received, peer read) returns an ACK; no ACK means resend. Ordering rides a per-conversation monotonic seq: the client sorts by seq and detects gaps (got 5 but not 4 → proactively fetch 4), offloading the "server guarantees total order" burden to per-conversation client-side ordering and avoiding expensive global ordering.

# Send + idempotent dedup + ACK retry (pseudo-code)
def send(msg):
    msg.client_id = uuid()              # client idempotency key
    while not acked(msg.client_id):
        conn.push(msg)
        if wait_ack(timeout=5s): break  # only durable after server ACK
    # on reconnect dedup by client_id; server: INSERT IF NOT EXISTS

def on_receive(msg):                     # receiver
    if msg.seq <= last_seen: return      # duplicate, drop
    if msg.seq > last_seen + 1:
        fetch_gap(last_seen+1, msg.seq)  # detect gap, backfill
    deliver(msg); send_ack(msg.seq); last_seen = msg.seq
Key insight: the authority for "no message lost" is the durable store, not any one WebSocket. A connection can drop at any time, so you must "persist before ACK" — once the client gets an ACK, the server is guaranteed to remember it and can re-serve it on reconnect. Defining delivery success as "written to the peer's socket" is the classic mistake.
Real-world:

4. E2E Encryption: Double Ratchet and Group Sender Keys

One-line trade-off: pay "the server can't read plaintext at all, plus complex key management" to get "forward secrecy + post-compromise recovery strong privacy".

Principle: under E2E the server is just a ciphertext courier. The industry standard is the Signal Protocol: X3DH negotiates an initial shared key between two parties (one may be offline), then the Double Ratchet continually evolves it — each message uses a one-time message key, discarded after use. Two ratchets: a DH ratchet (new DH public key exchanged each round) + a KDF ratchet (chained derivation). This yields two core security properties: forward secrecy (a key leak today can't decrypt yesterday's messages) + post-compromise security (after a leak, a fresh DH restores safety). Groups can't pair everyone N², so they use Sender Keys: each member has a sending-chain key distributed to the group, encrypting once and reusing within the group, cutting fanout from N² to N.

Trade-off: the cost of E2E
Real-world:

Scaling & Optimization

Pitfalls & Interview Questions

1. Treating the WebSocket connection as a reliable delivery channel. Connections drop anytime. "Delivered" must mean "persisted to storage", not "written to the socket". Persist first, then ACK.
2. Ignoring the Session Registry. Interviewers love asking "A sends to B — how does the system know which box B is connected to?" — no routing layer and the whole design is castles in the air.
3. Chasing exactly-once. The right answer is almost always at-least-once + client-side idempotent dedup (client message id). Insisting on exactly-once exposes a misunderstanding of distributed delivery semantics.
4. Write-fanning a huge group to every inbox. A million-person group will blow up writes. Big groups should use a shared conversation log, pulled on read, with unread tracked separately.
5. Hot partitions with no bucketing. Using channel_id alone as the partition key lets active channels hammer a single partition. Use (channel_id, bucket).

Common follow-ups: ① How do you hold 50M connections in memory, and what's the bottleneck? ② A sends to an offline B — what's the message's full lifecycle? ③ With multi-device login, how are messages synced across devices without dup or loss? ④ A million-member group — how does one message fan out without blowing up? ⑤ With E2E on, how do server-side search and moderation still work?

Deep Resources

Going Deeper (click to expand)

1. One Gateway holds 2M idle long connections, the CPU is barely busy, yet it can still fall over first. Where's the real bottleneck, and how do you estimate the limit?

The bottleneck is memory and file descriptors, not CPU. Idle connections burn no compute, but each one holds: kernel socket buffers (a few KB to tens of KB each way) + application session state + an fd. 2M connections of kernel buffers alone could be tens of GB. Estimate: at ~10KB combined per connection, 2M × 10KB ≈ 20GB — before any business state. So you ① tune the kernel (fd limits, socket buffers, ephemeral ports); ② pick a lightweight connection runtime (BEAM processes, not OS threads); ③ shrink per-connection app state. This is exactly why WhatsApp tuned FreeBSD and used Erlang — squeezing the "fixed cost per connection" to a minimum is what lets one box hold millions.

2. A messages B while B is offline. Walk the full lifecycle from persistence to B seeing it on reconnect. Which steps, if wrong, lose or duplicate the message?

Flow: ① A's message reaches the Chat Service, is persisted (with the client_id idempotency key) → an ACK returns to A (A shows "delivered to server" ✓); ② the Registry shows B has no active connection → the message stays in storage / B's pending queue; ③ B comes online, opens a WebSocket, the Registry records B→GatewayX; ④ B pulls offline messages "after my last seq" → delivery → B ACKs → the server marks delivered. Failure points: if step ① "ACKs before persisting", a crash loses it while A thinks it succeeded; if B doesn't dedup by seq after pulling, a reconnect-repull shows duplicates; if the Registry doesn't promptly record B online, new messages may briefly fail to route. Every step must be retryable + idempotent as a backstop.

3. A million-member group — how does one message reach all online members without blowing up the system? Is this the same problem as Day 14's Feed fanout?

Structurally it's the same push/pull choice as Feed. Writing a copy to each of a million inboxes = write-fanout explosion, infeasible. Big groups use a shared conversation log (one stored copy) + pull on read: members each pull deltas "after my last-read seq"; online members are found via the Registry and pushed a notification (not a full copy of the message). The difference: Feed is often async, seconds-tolerant, while chat wants real-time push, so "notify online members in real time + each pulls" is the compromise. Further optimization: aggregate online members of the same group by Gateway, so a Gateway receives one notification then locally fans to that node's members, cutting cross-node traffic — Discord's guild process is exactly this aggregation point.

4. With E2E on, the server can't read plaintext. So how do "search keywords across all chats" and "cloud multi-device history sync" still work? Why does Discord just skip E2E?

Search: the server can't build an inverted index (ciphertext is meaningless) → only the client can decrypt, build a local index, and search locally; switch devices and old history isn't searchable. Multi-device sync: either each device independently joins key negotiation (its own session), or session keys/history are securely transferred between devices — both far more complex than plaintext sync. Discord's choice: its core value is server-side capability — full-text search, content moderation, spam/minor-safety protection, bot ecosystem, AI features — all of which need the server to read plaintext. E2E fundamentally conflicts with these. So whether to do E2E is a product positioning decision: privacy-first (Signal/WhatsApp) vs server-capability-first (Discord/Slack). There's no purely technical "more correct".

5. A system upgrade restarts a batch of Gateways; millions of connections drop and reconnect at once. What happens, and how do you design to avoid a meltdown?

The risk is a thundering herd on reconnect: millions of clients drop → reconnect simultaneously → handshakes, auth, Registry writes, and offline-message pulls all flood in at once, potentially knocking over the just-recovered Gateway and backends again, oscillating in a "connect then get crushed" loop. Mitigations: ① client exponential backoff with jitter on reconnect (echoing Day 23's retry+jitter), spreading reconnects across the timeline; ② rolling restart of Gateways instead of all-at-once (Day 22), dropping only a small batch each time; ③ reconnect auth and offline pulls go through rate-limiting/queuing to protect backends; ④ use connection migration / graceful shutdown, telling clients ahead of time to move to another node rather than hard-dropping. Core idea: never let a huge client population's behavior synchronize — jitter is the key tool for breaking up that synchrony.