Design a message ID scheme for a global chat product (think Discord / Slack / WhatsApp). Scale: 300M DAU, peak 1M messages/sec, messages sharded across hundreds of Cassandra nodes by channel. Each message needs an ID that determines its physical placement in the DB, its sort order on the client, and how it looks in URLs.
Key requirements:
WHERE id < cursor must hit an index.id_today - id_yesterday for DAU — a real Twitter early-days privacy incident.
graph TD
ZK["ZooKeeper / etcd
assigns worker_id"]
NTP["NTP / PTP clock sync
detects backward jumps"]
APP1["App instance #1
embedded ID lib
worker_id=42"]
APP2["App instance #2
embedded ID lib
worker_id=43"]
APPN["App instance #N
worker_id=...
(max 2¹⁰=1024)"]
CLIENT["Mobile client
UUIDv7 generated locally
for offline messages"]
CASS[("Cassandra
partition by channel
cluster by id DESC")]
ZK -.assigns.-> APP1 & APP2 & APPN
NTP -.syncs.-> APP1 & APP2 & APPN
APP1 & APP2 & APPN --> CASS
CLIENT -->|with client_id| APP1
classDef coord fill:#0e2030,stroke:#5eead4,color:#e8eef5
classDef app fill:#1a2530,stroke:#64c8ff,color:#e8eef5
classDef client fill:#1a1a30,stroke:#ffb450,color:#e8eef5
classDef store fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
class ZK,NTP coord
class APP1,APP2,APPN app
class CLIENT client
class CASS store
ID generation is in library mode — no standalone service that can die; ZK is only touched at startup
Components: ZooKeeper assigns a worker_id (10 bits = 1024 slots) per app instance at boot; the slot is returned on shutdown. NTP keeps clocks aligned (typically < 10ms drift) — clock skew is the Achilles heel of Snowflake-style schemes. The ID lib runs in-process, zero network hop. Mobile clients generate UUIDv7 locally for offline sends and the server accepts the client-supplied ID (idempotent send).
Principle: UUID is 128 bits. v4 is 122 bits of pure randomness — collision probability ~2⁻¹²², zero coordination, but disastrous for B-tree indexes: inserts hit random leaf pages, causing page splits and buffer-pool misses. v7 (RFC 9562, finalized 2024) puts a 48-bit millisecond Unix timestamp in the prefix, with 74 bits of randomness/monotonic counter after. Lexicographic order equals time order: new inserts land on the right edge of the B-tree, page splits go nearly to zero.
# Minimal UUIDv7 generator (Python, RFC 9562)
import os, time
def uuid7() -> bytes:
ts_ms = int(time.time() * 1000) # 48 bits
rand_a = int.from_bytes(os.urandom(2), 'big') & 0x0FFF # 12 bits
rand_b = int.from_bytes(os.urandom(8), 'big') & ((1<<62)-1) # 62 bits
# Layout: ts(48) | ver=7(4) | rand_a(12) | var=10(2) | rand_b(62)
n = (ts_ms << 80) | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b
return n.to_bytes(16, 'big')
# Postgres 17+ native gen_random_uuid() is still v4; community uuidv7() extensions exist
ch_/pi_ prefixes + 16+ bytes of randomness — v4-style (KV stores don't suffer from random insertion, and strong unguessability is required).Principle: Twitter open-sourced Snowflake in 2010. 64 bits = 1 sign + 41-bit millisecond timestamp (custom epoch, ~69-year span) + 10-bit worker_id (1024 machines) + 12-bit sequence (4096 IDs per worker per ms). 4096 × 1024 = 4.19M IDs/ms = 4.19B/sec — far beyond any realistic single system. Half the bytes of UUID, naturally roughly time-ordered.
| Scheme | Width | Peak/sec | Coordination | Fatal Flaw |
|---|---|---|---|---|
| DB auto_increment | BIGINT 64 | ~100k/instance | None | Single point + leaks volume |
| UUIDv4 | 128 | Unlimited | None | Slow PK inserts |
| Snowflake | 64 | 4.19B | worker_id + clock | Clock skew |
| UUIDv7 | 128 | Unlimited | None | 2× the space |
| DB segment (batch) | 64 | Depends on batch | DB grants 1000 at a time | Restart wastes segment |
# Core Snowflake (Python pseudocode)
class Snowflake:
EPOCH = 1672531200000 # 2023-01-01 custom epoch
def __init__(self, worker_id):
assert 0 <= worker_id < 1024
self.worker_id = worker_id
self.seq = 0
self.last_ts = -1
def next_id(self):
ts = int(time.time() * 1000)
if ts < self.last_ts:
raise ClockBackwardError(self.last_ts - ts) # see next section
if ts == self.last_ts:
self.seq = (self.seq + 1) & 0xFFF
if self.seq == 0:
ts = self._wait_next_ms(self.last_ts) # wait next millisecond
else:
self.seq = 0
self.last_ts = ts
return ((ts - self.EPOCH) << 22) | (self.worker_id << 12) | self.seq
Principle: Snowflake's uniqueness depends entirely on monotonic timestamps. A 100ms NTP backward jump means IDs generated in that window can collide with IDs from the past. Leap seconds, VM pause-and-resume, ops typos when setting the clock — all have caused production incidents.
if ts < last_ts: sleep(last_ts - ts). Cost: large skew blocks ID generation = blocks writes; throw an error above ~5ms so the upper layer can degrade. Twitter's original approach.logical_ts = max(system_ts, last_ts + 1) in memory, never goes backward. Cost: drifts from real time; good for short skews.-x so NTP "drifts time slowly" instead of "jumping" — physically eliminates backward skew. Production standard, but code must still defend.# Robust handler for clock skew
def next_id(self):
ts = now_ms()
delta = self.last_ts - ts
if delta > 0:
if delta <= 5: # small skew — wait
time.sleep(delta / 1000)
ts = now_ms()
elif delta <= 2000: # medium — logical clock
ts = self.last_ts + 1
else: # large — refuse service
raise ClockBackwardError(delta)
# ... normal seq allocation
Principle: all three aim at "lexicographic = time order, fully distributed, no coordinator." They differ in width, encoding, and strictness of monotonicity.
| UUIDv7 | ULID | KSUID | |
|---|---|---|---|
| Width | 128 | 128 | 160 |
| Time precision | ms | ms | sec |
| Random bits | 74 | 80 | 128 |
| Text encoding | hex+dashes (36 chars) | Crockford Base32 (26 chars) | Base62 (27 chars) |
| Monotonicity | Within-ms optional | Within-ms strict (spec) | None (sec granularity) |
| Standardization | IETF RFC 9562 | Community spec (GitHub) | Segment open source |
| Best for | DB primary keys | User-facing short IDs | Event streams, logs |
# String forms of each
UUIDv4: 6ba7b810-9dad-11d1-80b4-00c04fd430c8 ← 36 chars
UUIDv7: 018f1c2e-a31b-7c12-9a4b-3e1c2d4f5a6b ← 36 chars, first 48 bits are time
ULID: 01HXY8K5T4ZNQ9P2M3RVW6S0BC ← 26 chars
KSUID: 2QqRwSdEf8VnT1bAaCpYlGmHkXjZ ← 27 chars
Snowflake: 1773294523890106368 ← 19 decimal digits
cus_, ch_) — not any standard, but the gold reference for "readable + type-safe + unguessable" product-grade IDs.region(3) + dc(3) + worker(4), letting the ID also encode physical location. Twitter does this internally.Likely interview follow-ups:
It's not CPU, not raw disk bandwidth — it's buffer pool hit rate collapse. InnoDB uses a clustered index — the PK determines the physical position of the row.
How v7 fixes it: the first 48 bits are a timestamp, so recently generated IDs are lexicographically close, all landing in a handful of right-side leaf pages — these stay hot, buffer pool essentially always hits. Percona / EnterpriseDB benchmarks at the billion-row scale show v7 inserts 5-10× faster than v4, with an order of magnitude less write amplification.
Implication: "v4 + a separate BIGINT auto-inc PK + UUID as unique key" was a historical compromise; with v7 you can go clean.
The bottleneck isn't the 4096 ceiling, it's the "wait until next ms" path. When the sequence runs out within a millisecond, the code has to sleep until next ms. But:
last_ts + seq. Lock contention bites long before 4.1M/sec — typically around 300k/sec.gettimeofday() on Linux is fast (vDSO), but per next_id call adds up — at 300k/sec it's already eating significant CPU.Engineering fixes:
next_batch(1000) returns a contiguous chunk to the caller, amortizing syscalls.Discord, Baidu's UidGenerator, and Meituan's Leaf have all done these optimizations at this scale.
The classic "primary-key evolution" problem. A direct PK swap = ALTER TABLE lock + index rebuild + every foreign key updated — hours of downtime. Zero-downtime path:
snowflake_id BIGINT UNIQUE; new writes populate both IDs. Backfill old rows in batches (by PK range) offline.parent_snowflake_id column, backfills, switches reads, drops the old FK. Most painful step — touches every JOIN.pt-online-schema-change / gh-ost shadow tables avoids table locks.Key pitfalls:
GET /v1/legacy_id/123 → snowflake).The whole process typically takes 3-6 months; the key discipline is not rushing to drop the old ID — keep dual writes 1-2 months to observe.
Core problem: the client's UUIDv7 timestamp is a week ago; when the message reaches the server, sorting by ID places it a week back in channel history — the UI never shows it.
Layered solution:
client_id (client-generated, for idempotent dedup) and server_ts (server receipt time, for ordering). Sort by server_ts, dedup by client_id.(channel_id, client_id); the client sending three retries of the same message succeeds only once.Deeper design wisdom: never let a single field carry "unique identifier," "time order," and "causal order" simultaneously — their semantics differ (generation time / arrival time / event time). Slack's ts field is the server_ts, client_msg_id is a UUID — clearly separated.
User-visible IDs have completely different requirements from backend PKs. Five dimensions matter:
/order/123 in the URL lets anyone enumerate others' orders. Visible IDs need ≥ 64 bits of randomness. UUIDv7's 74 random bits are sufficient, but the 48-bit timestamp prefix leaks order time — unacceptable for some financial scenarios.cus_ + 14 random chars hits the sweet spot.ord_, inv_) let support staff identify the ID type at a glance, avoiding "invoice ID looked up as order ID" mistakes. This is Stripe's most underrated engineering decision.Recommended: Stripe model — {type_prefix}_{base62(random)}, 22-30 chars. Backend PK is UUIDv7 / Snowflake, never exposed externally; add a mapping layer. The cost is a mapping table (or use the prefix + random directly as the PK), in exchange for API evolution freedom.
Counter-example: early Twitter URLs containing tweet IDs were auto-inc then Snowflake — exposed tweet volume; the timestamp is still extractable today. Product launch identifier design deserves caution.