Day 15 Hard Chat System WebSocket Delivery Guarantee E2E Encryption

Chat System — Real-time Delivery for 1B Users and 50M Concurrent ConnectionsWebSocket vs Long Polling, Message Storage, Delivery Guarantees, E2E Encryption

Problem & Constraints

Design the backend for an instant messenger with 1 billion users and 50 million concurrent connections (WhatsApp, Discord, WeChat). Feed (Day 14) is a read-amplification problem — one post read by millions; chat is the inverse, a connection-amplification problem — tens of millions of never-closing long connections that must be pushed to at any moment. The hard part isn't "storing messages" but maintaining a sea of long connections and guaranteeing every message is delivered exactly once, in order, without loss.

Scale: 1B users, peak 50M concurrent long connections, tens of billions of messages/day (WhatsApp has publicly cited 100B+/day), send peaks in the millions of QPS.
Connection profile: the vast majority of connections are idle — held open waiting for pushes. The bottleneck is memory and file descriptors (socket + buffers per connection), not CPU.
Latency SLO: online-to-online end-to-end p99 < 200ms; offline messages backfilled within seconds of coming online.
Delivery semantics: messages must not be lost, not be reordered, not be shown twice — the biggest difference from Feed, since users have zero tolerance for lost messages.
Privacy: 1:1 and group chats need E2E encryption — the server routes ciphertext but cannot read plaintext.

High-level Architecture

The core idea is to decouple "connection" from "logic": the Gateway layer only maintains long connections and shuttles bytes; the business layer is stateless and scales horizontally. The key component is the Session Registry — a routing table of "user → which Gateway node they're connected to" — the addressing core that lets a message find its recipient. The sender's message first hits durable storage (so it's never lost), then the Registry is queried to route it to the recipient's Gateway for push; if the recipient is offline, it stays in storage to be pulled on reconnect.

Key Techniques

1. Connection Layer: WebSocket Persistent Connection vs Polling

One-line trade-off: pay the "complexity of holding tens of millions of stateful connections" to get "true bidirectional, low-latency, low-idle-overhead real-time push".

Principle: chat requires the server to push proactively. Three options: short polling (periodic HTTP pull) has high latency and floods the server with empty requests; long polling (hold the request until data arrives) saves some but still pays a full HTTP round-trip per message with frequent connection rebuilds; WebSocket upgrades one handshake into a full-duplex long connection, after which bidirectional push costs almost nothing — the modern default. The cost: the Gateway becomes stateful — each connection holds memory + fds, and node restarts/scaling drop huge numbers of connections. So the connection layer needs a lightweight high-concurrency runtime — WhatsApp uses Erlang/BEAM, Discord uses Elixir, both based on the actor model: each connection is an independent lightweight process (not an OS thread), scheduled by the VM, letting a single box hold millions of connections.

Trade-off:

Approach	Real-time	Idle overhead	Cost
Short polling	Poor (interval-bound)	Very high (empty floods)	Simple but unscalable
Long polling	Medium	Medium (frequent rebuilds)	HTTP-friendly stopgap
WebSocket	Great (<200ms)	Low (connection reuse)	Stateful gateway, hard to scale

Another key piece is presence and heartbeats: the server detects liveness via periodic heartbeats, and a dropped connection must be promptly removed from the Registry — otherwise messages get routed to dead connections. But presence broadcast is an amplification bomb: a single user going online/offline must notify all friends/group members, so it must be aggregated and rate-limited.

Real-world:

WhatsApp: uses Erlang + heavily tuned FreeBSD to reach 2M+ concurrent connections per box, with one Erlang process per connection delivering peer-to-peer (High Scalability's breakdown of the WhatsApp architecture).
Discord: used Elixir to scale its WebSocket gateway to 5M concurrent users, with one GenServer session process per connection talking to remote nodes and guild processes (Discord eng blog, "How Discord Scaled Elixir to 5,000,000 Concurrent Users").

2. Message Storage & Fanout: Partition Design and Hot Spots

One-line trade-off: pay a "partition key bucketed by channel + time" to "avoid a huge group hammering a single partition into a hot spot".

Principle: messages must be persisted before delivery (never just in memory, or a node crash loses them). The read pattern is "pull the latest N messages / messages after a timestamp for a conversation", a natural fit for wide-column stores (Cassandra/ScyllaDB): partition key channel_id, clustering key a time-ordered message ID (Snowflake, echoing Day 11), so a conversation's messages are physically adjacent and time-sorted. The trap is hot partitions: a huge channel (a million people in one channel) crams every message into a single channel_id partition, pinning all read/write to one replica set and spiking p99. The fix is bucketing — change the key to (channel_id, time_bucket), splitting by time window (e.g. every N days) to spread a big partition horizontally. 1:1 and group use the same model: a 1:1 is just a two-person channel.

Trade-off: write fanout vs read fanout (mirror of Day 14)

Shared conversation log (one copy per conversation, pull on read): ✅ write once, storage-cheap, big-group friendly; ❌ each member's "unread/inbox" must be tracked separately. Discord chose this — groups can reach millions, so copying to each person is infeasible.
Write-fanout to each inbox (one copy per recipient): ✅ pulling your own inbox is fast, unread is naturally per-user; ❌ write amplification explodes for big groups. Fits small-group + strong-unread IM.

# Bucketed message table (CQL-style pseudo-code)
CREATE TABLE messages (
  channel_id   bigint,
  bucket       int,        # time bucket, avoids hot mega-partitions
  message_id   bigint,     # Snowflake: time-ordered, doubles as sort key
  author_id    bigint,
  payload      blob,       # ciphertext under E2E
  PRIMARY KEY ((channel_id, bucket), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
# Read latest: locate current bucket, take N by message_id DESC

Real-world:

Discord: the message table uses a (channel_id, bucket) composite partition key precisely to stop active channels forming hot partitions; later migrated from Cassandra to ScyllaDB (C++, no GC), shrinking 177 nodes to 72 and bringing historical-read p99 from 40–125ms down to a steady ~15ms (Discord eng blog, "How Discord Stores Trillions of Messages").

3. Delivery Guarantees: Dedup, ACK, and Ordering

One-line trade-off: pay "at-least-once + client-side idempotent dedup" to get "something far simpler than exactly-once that still loses nothing and shows nothing twice".

Principle: networks drop packets and clients reconnect-and-resend; true exactly-once is brutally expensive in distributed systems (echoing Day 8). The standard answer is at-least-once delivery + endpoint idempotency: the sender attaches a client-generated idempotency ID (client message id) to each message, and the server dedups on it — retries never create two. Delivery rides an ACK chain: each hop (server received, peer received, peer read) returns an ACK; no ACK means resend. Ordering rides a per-conversation monotonic seq: the client sorts by seq and detects gaps (got 5 but not 4 → proactively fetch 4), offloading the "server guarantees total order" burden to per-conversation client-side ordering and avoiding expensive global ordering.

# Send + idempotent dedup + ACK retry (pseudo-code)
def send(msg):
    msg.client_id = uuid()              # client idempotency key
    while not acked(msg.client_id):
        conn.push(msg)
        if wait_ack(timeout=5s): break  # only durable after server ACK
    # on reconnect dedup by client_id; server: INSERT IF NOT EXISTS

def on_receive(msg):                     # receiver
    if msg.seq <= last_seen: return      # duplicate, drop
    if msg.seq > last_seen + 1:
        fetch_gap(last_seen+1, msg.seq)  # detect gap, backfill
    deliver(msg); send_ack(msg.seq); last_seen = msg.seq

Key insight: the authority for "no message lost" is the durable store, not any one WebSocket. A connection can drop at any time, so you must "persist before ACK" — once the client gets an ACK, the server is guaranteed to remember it and can re-serve it on reconnect. Defining delivery success as "written to the peer's socket" is the classic mistake.

Real-world:

WhatsApp: offline messages live on the server until the recipient comes online, then are delivered and removed; the multiple ACKs (delivered ✓✓, read blue ticks) are essentially the ACK chain surfaced as product.
Message-queue semantics: at-least-once + consumer idempotency is the industry-standard recipe for "no loss, no dup" (echoing Day 8 delivery semantics).

4. E2E Encryption: Double Ratchet and Group Sender Keys

One-line trade-off: pay "the server can't read plaintext at all, plus complex key management" to get "forward secrecy + post-compromise recovery strong privacy".

Principle: under E2E the server is just a ciphertext courier. The industry standard is the Signal Protocol: X3DH negotiates an initial shared key between two parties (one may be offline), then the Double Ratchet continually evolves it — each message uses a one-time message key, discarded after use. Two ratchets: a DH ratchet (new DH public key exchanged each round) + a KDF ratchet (chained derivation). This yields two core security properties: forward secrecy (a key leak today can't decrypt yesterday's messages) + post-compromise security (after a leak, a fresh DH restores safety). Groups can't pair everyone N², so they use Sender Keys: each member has a sending-chain key distributed to the group, encrypting once and reusing within the group, cutting fanout from N² to N.

Trade-off: the cost of E2E

Limited server capability: can't read plaintext → server-side search, content moderation, cloud multi-device sync all need other approaches (client-side index / inter-device key sharing).
Multi-device problem: each device has its own keys, and adding a new device requires securely syncing session state.
No E2E: ✅ server-side search/moderation/AI features all work; ❌ a server breach = all history leaked. Discord defaults to no E2E (it needs server-side search/moderation), WhatsApp/Signal default to E2E — this is a product positioning choice, not a purely technical one.

Real-world:

Signal / WhatsApp: both use the Signal Protocol's Double Ratchet for 1:1 E2E and Sender Keys for groups; WhatsApp rolled it out to everyone in 2016 (Signal's official "The Double Ratchet Algorithm" spec).
Reach: the Signal Protocol has been adopted by WhatsApp, Google Messages (RCS), Messenger, and others, covering billions of daily users.

Scaling & Optimization

Scale gateway and logic separately: more connections → add Gateway nodes; more business load → add stateless Chat Services. Decoupled, each scales independently without dragging the other.
Geo-local ingress + multi-region: place Gateways near users to cut handshake latency; cross-region conversations route via async replication (echoing Day 5).
Session Registry availability: the routing table is the addressing lifeline — shard + replicate it; on node failure, connection migration and Registry cleanup must be fast, or messages go to dead connections.
Tiered storage: hot (recent) messages in fast storage + cache, cold ones archived to cheap media — chat history is rarely re-read but must be kept forever.
Bottleneck spotting: monitor per-box connections/memory, presence-broadcast amplification, hot-partition p99, offline-message backlog, Registry lookup latency — hot mega-groups and presence storms are the most common tail-latency root causes.

Pitfalls & Interview Questions

1. Treating the WebSocket connection as a reliable delivery channel. Connections drop anytime. "Delivered" must mean "persisted to storage", not "written to the socket". Persist first, then ACK.

2. Ignoring the Session Registry. Interviewers love asking "A sends to B — how does the system know which box B is connected to?" — no routing layer and the whole design is castles in the air.

3. Chasing exactly-once. The right answer is almost always at-least-once + client-side idempotent dedup (client message id). Insisting on exactly-once exposes a misunderstanding of distributed delivery semantics.

4. Write-fanning a huge group to every inbox. A million-person group will blow up writes. Big groups should use a shared conversation log, pulled on read, with unread tracked separately.

5. Hot partitions with no bucketing. Using channel_id alone as the partition key lets active channels hammer a single partition. Use (channel_id, bucket).

Common follow-ups: ① How do you hold 50M connections in memory, and what's the bottleneck? ② A sends to an offline B — what's the message's full lifecycle? ③ With multi-device login, how are messages synced across devices without dup or loss? ④ A million-member group — how does one message fan out without blowing up? ⑤ With E2E on, how do server-side search and moderation still work?

Deep Resources

Discord eng blog, "How Discord Stores Trillions of Messages": first-hand material on bucketed partition keys, the Cassandra→ScyllaDB migration, and hot-partition mitigation.
Discord eng blog, "How Discord Scaled Elixir to 5,000,000 Concurrent Users": actor-model WebSocket gateways, session processes, cross-node routing in practice.
High Scalability, "The WhatsApp Architecture": the classic breakdown of Erlang/FreeBSD millions-of-connections-per-box and a tiny team supporting a billion users.
Signal's "The Double Ratchet Algorithm" (Perrin & Marlinspike): the authoritative spec for E2E forward secrecy / post-compromise recovery.
"Designing Data-Intensive Applications" (Kleppmann): a systematic framework for derived data, ordered logs, and delivery semantics.

Going Deeper (click to expand)

1. One Gateway holds 2M idle long connections, the CPU is barely busy, yet it can still fall over first. Where's the real bottleneck, and how do you estimate the limit?

The bottleneck is memory and file descriptors, not CPU. Idle connections burn no compute, but each one holds: kernel socket buffers (a few KB to tens of KB each way) + application session state + an fd. 2M connections of kernel buffers alone could be tens of GB. Estimate: at ~10KB combined per connection, 2M × 10KB ≈ 20GB — before any business state. So you ① tune the kernel (fd limits, socket buffers, ephemeral ports); ② pick a lightweight connection runtime (BEAM processes, not OS threads); ③ shrink per-connection app state. This is exactly why WhatsApp tuned FreeBSD and used Erlang — squeezing the "fixed cost per connection" to a minimum is what lets one box hold millions.

2. A messages B while B is offline. Walk the full lifecycle from persistence to B seeing it on reconnect. Which steps, if wrong, lose or duplicate the message?

Flow: ① A's message reaches the Chat Service, is persisted (with the client_id idempotency key) → an ACK returns to A (A shows "delivered to server" ✓); ② the Registry shows B has no active connection → the message stays in storage / B's pending queue; ③ B comes online, opens a WebSocket, the Registry records B→GatewayX; ④ B pulls offline messages "after my last seq" → delivery → B ACKs → the server marks delivered. Failure points: if step ① "ACKs before persisting", a crash loses it while A thinks it succeeded; if B doesn't dedup by seq after pulling, a reconnect-repull shows duplicates; if the Registry doesn't promptly record B online, new messages may briefly fail to route. Every step must be retryable + idempotent as a backstop.

3. A million-member group — how does one message reach all online members without blowing up the system? Is this the same problem as Day 14's Feed fanout?

Structurally it's the same push/pull choice as Feed. Writing a copy to each of a million inboxes = write-fanout explosion, infeasible. Big groups use a shared conversation log (one stored copy) + pull on read: members each pull deltas "after my last-read seq"; online members are found via the Registry and pushed a notification (not a full copy of the message). The difference: Feed is often async, seconds-tolerant, while chat wants real-time push, so "notify online members in real time + each pulls" is the compromise. Further optimization: aggregate online members of the same group by Gateway, so a Gateway receives one notification then locally fans to that node's members, cutting cross-node traffic — Discord's guild process is exactly this aggregation point.

4. With E2E on, the server can't read plaintext. So how do "search keywords across all chats" and "cloud multi-device history sync" still work? Why does Discord just skip E2E?

Search: the server can't build an inverted index (ciphertext is meaningless) → only the client can decrypt, build a local index, and search locally; switch devices and old history isn't searchable. Multi-device sync: either each device independently joins key negotiation (its own session), or session keys/history are securely transferred between devices — both far more complex than plaintext sync. Discord's choice: its core value is server-side capability — full-text search, content moderation, spam/minor-safety protection, bot ecosystem, AI features — all of which need the server to read plaintext. E2E fundamentally conflicts with these. So whether to do E2E is a product positioning decision: privacy-first (Signal/WhatsApp) vs server-capability-first (Discord/Slack). There's no purely technical "more correct".

5. A system upgrade restarts a batch of Gateways; millions of connections drop and reconnect at once. What happens, and how do you design to avoid a meltdown?

The risk is a thundering herd on reconnect: millions of clients drop → reconnect simultaneously → handshakes, auth, Registry writes, and offline-message pulls all flood in at once, potentially knocking over the just-recovered Gateway and backends again, oscillating in a "connect then get crushed" loop. Mitigations: ① client exponential backoff with jitter on reconnect (echoing Day 23's retry+jitter), spreading reconnects across the timeline; ② rolling restart of Gateways instead of all-at-once (Day 22), dropping only a small batch each time; ③ reconnect auth and offline pulls go through rate-limiting/queuing to protect backends; ④ use connection migration / graceful shutdown, telling clients ahead of time to move to another node rather than hard-dropping. Core idea: never let a huge client population's behavior synchronize — jitter is the key tool for breaking up that synchrony.

← Back to index