Day 11 Medium Unique ID Snowflake UUIDv7 KSUID/ULID

Unique ID Generation — The Triangle of Time, Randomness, and IndexUUID, Snowflake, KSUID, ULID

Problem & Requirements

Design a message ID scheme for a global chat product (think Discord / Slack / WhatsApp). Scale: 300M DAU, peak 1M messages/sec, messages sharded across hundreds of Cassandra nodes by channel. Each message needs an ID that determines its physical placement in the DB, its sort order on the client, and how it looks in URLs.

Key requirements:

Globally unique: across data centers, no collisions at 1M/sec peak.
Roughly time-ordered: paginating channel history with WHERE id < cursor must hit an index.
Compact: one ID per message — 128-bit vs 64-bit is tens of TB/year in bandwidth + storage + index.
No central bottleneck: a single ID service going down cannot block all writes.
No business-volume leak: auto-increment IDs let competitors compute id_today - id_yesterday for DAU — a real Twitter early-days privacy incident.
Client-generable: offline messages get an ID locally; server accepts it as-is for idempotent send.

High-Level Architecture

graph TD
    ZK["ZooKeeper / etcd
assigns worker_id"]
    NTP["NTP / PTP clock sync
detects backward jumps"]
    APP1["App instance #1
embedded ID lib
worker_id=42"]
    APP2["App instance #2
embedded ID lib
worker_id=43"]
    APPN["App instance #N
worker_id=...
(max 2¹⁰=1024)"]
    CLIENT["Mobile client
UUIDv7 generated locally
for offline messages"]
    CASS[("Cassandra
partition by channel
cluster by id DESC")]

    ZK -.assigns.-> APP1 & APP2 & APPN
    NTP -.syncs.-> APP1 & APP2 & APPN
    APP1 & APP2 & APPN --> CASS
    CLIENT -->|with client_id| APP1

    classDef coord fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef app fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef client fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef store fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class ZK,NTP coord
    class APP1,APP2,APPN app
    class CLIENT client
    class CASS store

ID generation is in library mode — no standalone service that can die; ZK is only touched at startup

Components: ZooKeeper assigns a worker_id (10 bits = 1024 slots) per app instance at boot; the slot is returned on shutdown. NTP keeps clocks aligned (typically < 10ms drift) — clock skew is the Achilles heel of Snowflake-style schemes. The ID lib runs in-process, zero network hop. Mobile clients generate UUIDv7 locally for offline sends and the server accepts the client-supplied ID (idempotent send).

Key Techniques

1. UUID Versions: v4 Random vs v7 Time-Ordered

Principle: UUID is 128 bits. v4 is 122 bits of pure randomness — collision probability ~2⁻¹²², zero coordination, but disastrous for B-tree indexes: inserts hit random leaf pages, causing page splits and buffer-pool misses. v7 (RFC 9562, finalized 2024) puts a 48-bit millisecond Unix timestamp in the prefix, with 74 bits of randomness/monotonic counter after. Lexicographic order equals time order: new inserts land on the right edge of the B-tree, page splits go nearly to zero.

Trade-off:

v4: ✅ no coordination, client-generable; ❌ in Postgres / MySQL with v4 as PK, insert throughput collapses as the table grows past buffer pool capacity; 16 bytes binary or 36 bytes as char — 2-4× a BIGINT.
v7: ✅ index-friendly (sequential B-tree inserts), keeps UUID ecosystem (128-bit, same string APIs), language libraries already shipping; ❌ leaks the creation time, careful in privacy-sensitive contexts; clock backward still needs handling.
v1 (MAC + time): basically dead — leaks MAC, and the timestamp is in the low bits, so still effectively random for insertion.

# Minimal UUIDv7 generator (Python, RFC 9562)
import os, time
def uuid7() -> bytes:
    ts_ms = int(time.time() * 1000)          # 48 bits
    rand_a = int.from_bytes(os.urandom(2), 'big') & 0x0FFF  # 12 bits
    rand_b = int.from_bytes(os.urandom(8), 'big') & ((1<<62)-1)  # 62 bits
    # Layout: ts(48) | ver=7(4) | rand_a(12) | var=10(2) | rand_b(62)
    n = (ts_ms << 80) | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b
    return n.to_bytes(16, 'big')

# Postgres 17+ native gen_random_uuid() is still v4; community uuidv7() extensions exist

Real-world cases:

Postgres community benchmarks: inserting 10M+ rows into the same table, UUIDv4 runs 3-10× slower than UUIDv7 (v7 is sequential write, v4 is random write); BRIN indexes work well on v7 and are useless on v4.
Stripe API object IDs: use ch_/pi_ prefixes + 16+ bytes of randomness — v4-style (KV stores don't suffer from random insertion, and strong unguessability is required).
OpenJDK: JEP added UUIDv7 support, marking v7's entry into mainstream.

2. Snowflake: 64-bit Three-Segment Layout

Principle: Twitter open-sourced Snowflake in 2010. 64 bits = 1 sign + 41-bit millisecond timestamp (custom epoch, ~69-year span) + 10-bit worker_id (1024 machines) + 12-bit sequence (4096 IDs per worker per ms). 4096 × 1024 = 4.19M IDs/ms = 4.19B/sec — far beyond any realistic single system. Half the bytes of UUID, naturally roughly time-ordered.

Trade-off vs UUIDv7 / DB sequence:

Scheme	Width	Peak/sec	Coordination	Fatal Flaw
DB auto_increment	BIGINT 64	~100k/instance	None	Single point + leaks volume
UUIDv4	128	Unlimited	None	Slow PK inserts
Snowflake	64	4.19B	worker_id + clock	Clock skew
UUIDv7	128	Unlimited	None	2× the space
DB segment (batch)	64	Depends on batch	DB grants 1000 at a time	Restart wastes segment

# Core Snowflake (Python pseudocode)
class Snowflake:
    EPOCH = 1672531200000  # 2023-01-01 custom epoch
    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024
        self.worker_id = worker_id
        self.seq = 0
        self.last_ts = -1
    def next_id(self):
        ts = int(time.time() * 1000)
        if ts < self.last_ts:
            raise ClockBackwardError(self.last_ts - ts)  # see next section
        if ts == self.last_ts:
            self.seq = (self.seq + 1) & 0xFFF
            if self.seq == 0:
                ts = self._wait_next_ms(self.last_ts)  # wait next millisecond
        else:
            self.seq = 0
        self.last_ts = ts
        return ((ts - self.EPOCH) << 22) | (self.worker_id << 12) | self.seq

Real-world cases:

Twitter: original version, went live on 2010-09-21, all tweet IDs became non-sequential; switched to embedded-library mode after 2014.
Discord: every message/user/channel is a Snowflake, with slightly different bit allocation; the front end decodes the creation timestamp directly from the ID (no DB lookup needed).
Instagram: 41-bit time + 13-bit shard_id + 10-bit seq, generated inside Postgres via stored procedure — no external service dependency. The canonical "Snowflake inside the database" pattern.
Sony's Sonyflake: 39-bit 10-ms-precision timestamp + 8-bit seq + 16-bit machine_id — trades QPS for 174-year lifespan and 65536 machines, ideal when machine count is high but per-machine QPS is modest.

3. Clock Skew: The Snowflake Killer

Principle: Snowflake's uniqueness depends entirely on monotonic timestamps. A 100ms NTP backward jump means IDs generated in that window can collide with IDs from the past. Leap seconds, VM pause-and-resume, ops typos when setting the clock — all have caused production incidents.

Three mainstream remedies:

① Wait for time to catch up (simplest): if ts < last_ts: sleep(last_ts - ts). Cost: large skew blocks ID generation = blocks writes; throw an error above ~5ms so the upper layer can degrade. Twitter's original approach.
② "Logical time" instead of system time: keep logical_ts = max(system_ts, last_ts + 1) in memory, never goes backward. Cost: drifts from real time; good for short skews.
③ Burn worker_id to escape (Sonyflake / Baidu UidGenerator): on detected skew, restart with a new worker_id (ZK has free slots), let the old worker_id cool down. Cost: reserve worker_id pool capacity.
④ Disable NTP step, allow only slew (ops mitigation): use -x so NTP "drifts time slowly" instead of "jumping" — physically eliminates backward skew. Production standard, but code must still defend.

# Robust handler for clock skew
def next_id(self):
    ts = now_ms()
    delta = self.last_ts - ts
    if delta > 0:
        if delta <= 5:          # small skew — wait
            time.sleep(delta / 1000)
            ts = now_ms()
        elif delta <= 2000:     # medium — logical clock
            ts = self.last_ts + 1
        else:                   # large — refuse service
            raise ClockBackwardError(delta)
    # ... normal seq allocation

Real incident patterns: VM restored from snapshot jumps hours back; one misconfigured NTP source rolls the entire fleet back 30 seconds simultaneously, generating IDs that collide with historical ones — the only fix is take the instance out + wait until real time exceeds last_ts.

4. ULID / KSUID / UUIDv7 Comparison

Principle: all three aim at "lexicographic = time order, fully distributed, no coordinator." They differ in width, encoding, and strictness of monotonicity.

	UUIDv7	ULID	KSUID
Width	128	128	160
Time precision	ms	ms	sec
Random bits	74	80	128
Text encoding	hex+dashes (36 chars)	Crockford Base32 (26 chars)	Base62 (27 chars)
Monotonicity	Within-ms optional	Within-ms strict (spec)	None (sec granularity)
Standardization	IETF RFC 9562	Community spec (GitHub)	Segment open source
Best for	DB primary keys	User-facing short IDs	Event streams, logs

How to choose:

Greenfield DB PK: UUIDv7 (standardized, ecosystem, index-friendly).
Visible in URL / UI: ULID (26 chars, Base32 without I/L/O/U avoids confusion) or KSUID (27 chars Base62).
Hundreds to thousands of IDs/sec: KSUID is plenty; second-level granularity sidesteps clock-skew sensitivity.
Peak 1M/sec, strict monotonic: Snowflake (64 bits, 50% storage savings) or ULID (spec guarantees within-ms monotonicity).

# String forms of each
UUIDv4:    6ba7b810-9dad-11d1-80b4-00c04fd430c8   ← 36 chars
UUIDv7:    018f1c2e-a31b-7c12-9a4b-3e1c2d4f5a6b   ← 36 chars, first 48 bits are time
ULID:      01HXY8K5T4ZNQ9P2M3RVW6S0BC             ← 26 chars
KSUID:     2QqRwSdEf8VnT1bAaCpYlGmHkXjZ           ← 27 chars
Snowflake: 1773294523890106368                    ← 19 decimal digits

Real-world cases:

Segment: the KSUID repo states trillions of KSUIDs in their event pipeline; 20-byte layout (4 bytes timestamp + 16 bytes random), second precision is enough for event dedup.
ULID: the canonical spec mandates within-ms monotonicity (implementations must use a monotonic factory); MongoDB's ObjectID is a similar idea (12 bytes = 4 time + 5 random + 3 counter).
Stripe: type-prefixed random strings (cus_, ch_) — not any standard, but the gold reference for "readable + type-safe + unguessable" product-grade IDs.

Extensions & Optimizations

Running out of worker_id: 10 bits = 1024 limit, hit fast in container era. Discord / Sonyflake reduce timestamp precision (10ms granularity) to free bits for worker_id.
Multi-region: split bits as region(3) + dc(3) + worker(4), letting the ID also encode physical location. Twitter does this internally.
client_id segment: for offline-generated IDs on mobile, use local UUIDv7 + a client token for idempotency; server doesn't re-allocate. WhatsApp uses this so retries don't dup.
Migrating UUIDv4 → UUIDv7: swap the generator, keep old data — v7's version nibble differs from v4 so no type clash; index performance gradually recovers as v7 proportion grows.
Monitoring: IDs/sec (approaching 4096/ms/worker?), clock drift, worker_id collision alerts.

Common Pitfalls + Interview Questions

Pitfall 1: UUIDv4 as PK on large tables. "UUID is globally unique, must be safe" — sounds reasonable, but B-tree insert collapse usually only surfaces past 50M rows, by which time it's late. New projects: just use v7.

Pitfall 2: Assuming Snowflake is globally monotonic. It's only roughly time-ordered — within the same millisecond, IDs from different workers are ordered by worker_id, not by anything real. Don't write code that depends on "newer ID > older ID."

Pitfall 3: Treating the ID as a timestamp. The time embedded in a Snowflake is the generation time, not the event time; for offline clients these can differ by days.

Pitfall 4: worker_id reuse. Machine drift or container scheduling can give two processes the same worker_id → guaranteed dup IDs. ZK ephemeral nodes + heartbeat is the standard fix.

Pitfall 5: Leaking business volume. Auto-increment IDs let competitors compute "how many messages sent today." Early Twitter exposed growth numbers this way. Any externally visible ID needs at least a random segment.

Likely interview follow-ups:

Peak 10M/sec — is 64-bit Snowflake enough? How would you re-allocate bits?
Generating IDs across data centers — how do you avoid collisions? How many bits for datacenter_id?
What's wrong with UUIDv4 as a MySQL primary key? How does v7 fix it? Give an order-of-magnitude benchmark.
When clock skew occurs, how does your system respond? At what skew size do you fail open?
Can clients generate IDs locally? What preconditions must hold?

Deep Dive Questions

1. UUIDv4 as a MySQL PK collapses insert throughput. The root cause isn't "randomness" itself. Where exactly does it slow down, and how does UUIDv7 fix it?

It's not CPU, not raw disk bandwidth — it's buffer pool hit rate collapse. InnoDB uses a clustered index — the PK determines the physical position of the row.

Sequential PK (auto-inc / UUIDv7): new inserts always land on the rightmost leaf page; as long as that page is in the buffer pool, zero disk IO and page splits are rare.
Random PK (UUIDv4): inserts hit random leaf pages across the entire table. Once the table exceeds buffer pool (e.g. 100GB table, 16GB buffer pool), nearly every insert misses → read the page from disk first, then write → write amplification + IOPS saturation.
Page splits compound: random inserts frequently trigger leaf-page splits, fragmenting the index, growing B-tree height, further hurting buffer pool utilization.

How v7 fixes it: the first 48 bits are a timestamp, so recently generated IDs are lexicographically close, all landing in a handful of right-side leaf pages — these stay hot, buffer pool essentially always hits. Percona / EnterpriseDB benchmarks at the billion-row scale show v7 inserts 5-10× faster than v4, with an order of magnitude less write amplification.

Implication: "v4 + a separate BIGINT auto-inc PK + UUID as unique key" was a historical compromise; with v7 you can go clean.

2. Snowflake's 12-bit sequence supports 4096/ms = 4.1M/sec. But in production "single process throws errors above 300k/sec" is common. Why?

The bottleneck isn't the 4096 ceiling, it's the "wait until next ms" path. When the sequence runs out within a millisecond, the code has to sleep until next ms. But:

Thread contention: one Snowflake instance shared across threads needs to lock / CAS on last_ts + seq. Lock contention bites long before 4.1M/sec — typically around 300k/sec.
Syscall overhead: gettimeofday() on Linux is fast (vDSO), but per next_id call adds up — at 300k/sec it's already eating significant CPU.
Burst vs average: real traffic is spiky; one specific millisecond can saturate 4096 and block, with the next millisecond empty — average looks like 300k/sec but the burst already saturated.

Engineering fixes:

Per-thread Snowflake instance (further subdivide worker_id across threads) — eliminates the lock.
Batch pre-generation: next_batch(1000) returns a contiguous chunk to the caller, amortizing syscalls.
Cache the timestamp: business threads read the last ts; a dedicated ticker coroutine updates it every ms, saving syscalls.

Discord, Baidu's UidGenerator, and Meituan's Leaf have all done these optimizations at this scale.

3. A system has both ID schemes coexisting (old BIGINT auto-inc + new Snowflake). How do you migrate without downtime?

The classic "primary-key evolution" problem. A direct PK swap = ALTER TABLE lock + index rebuild + every foreign key updated — hours of downtime. Zero-downtime path:

Phase 1: dual write. Add snowflake_id BIGINT UNIQUE; new writes populate both IDs. Backfill old rows in batches (by PK range) offline.
Phase 2: dual read. Code prefers reading by snowflake_id; reads by old ID translate internally. External APIs accept either format (distinguish by prefix or width).
Phase 3: migrate foreign keys. Every table referencing the old ID grows a parent_snowflake_id column, backfills, switches reads, drops the old FK. Most painful step — touches every JOIN.
Phase 4: retire old ID. After old-ID access drops to zero, drop the column and swap the PK. PK swap via pt-online-schema-change / gh-ost shadow tables avoids table locks.

Key pitfalls:

Caches still hold old IDs → clear related cache after dual write begins.
Clients / third parties still hold old IDs → publish a permanent redirect API (GET /v1/legacy_id/123 → snowflake).
BI / analytics joins on old ID → the data warehouse must migrate too.

The whole process typically takes 3-6 months; the key discipline is not rushing to drop the old ID — keep dual writes 1-2 months to observe.

4. In a chat system, what happens when the sender's client generates an ID on a plane, comes online a week later, and sends? How do you handle it?

Core problem: the client's UUIDv7 timestamp is a week ago; when the message reaches the server, sorting by ID places it a week back in channel history — the UI never shows it.

Layered solution:

Dual-timestamp model: the message schema stores both client_id (client-generated, for idempotent dedup) and server_ts (server receipt time, for ordering). Sort by server_ts, dedup by client_id.
created_at in the message body: UI shows "send time" using the client's created_at (so the user sees "written a week ago"), but the message's position in the channel list uses server_ts. WhatsApp, iMessage both do this.
Idempotency: server upserts on (channel_id, client_id); the client sending three retries of the same message succeeds only once.
Collision boundary: even if two clients are offline at the same time, same channel, same millisecond, UUIDv7 collision probability is still ~2⁻⁷⁴ — effectively zero.

Deeper design wisdom: never let a single field carry "unique identifier," "time order," and "causal order" simultaneously — their semantics differ (generation time / arrival time / event time). Slack's ts field is the server_ts, client_msg_id is a UUID — clearly separated.

5. To design a "user-visible" ID (in URLs, receipts, support tickets), how would you choose between UUIDv7, ULID, or a custom Base62?

User-visible IDs have completely different requirements from backend PKs. Five dimensions matter:

Unguessability: /order/123 in the URL lets anyone enumerate others' orders. Visible IDs need ≥ 64 bits of randomness. UUIDv7's 74 random bits are sufficient, but the 48-bit timestamp prefix leaks order time — unacceptable for some financial scenarios.
Resistance to input errors: when a customer reads an ID to support, O/0 and I/1 must be distinguishable. Crockford Base32 (used by ULID) was designed for this — drops I, L, O, U; Base62 doesn't have this property.
Length: URLs and receipts want short. UUIDv7's 36 chars is too long; ULID's 26 chars is acceptable; Stripe-style cus_ + 14 random chars hits the sweet spot.
Type identifiability: prefixes (ord_, inv_) let support staff identify the ID type at a glance, avoiding "invoice ID looked up as order ID" mistakes. This is Stripe's most underrated engineering decision.
Time leak: B2B scenarios may want time exposed; B2C privacy scenarios cannot.

Recommended: Stripe model — {type_prefix}_{base62(random)}, 22-30 chars. Backend PK is UUIDv7 / Snowflake, never exposed externally; add a mapping layer. The cost is a mapping table (or use the prefix + random directly as the PK), in exchange for API evolution freedom.

Counter-example: early Twitter URLs containing tweet IDs were auto-inc then Snowflake — exposed tweet volume; the timestamp is still extractable today. Product launch identifier design deserves caution.

← Back to index

Problem & Requirements

High-Level Architecture

Key Techniques

1. UUID Versions: v4 Random vs v7 Time-Ordered

2. Snowflake: 64-bit Three-Segment Layout

3. Clock Skew: The Snowflake Killer

4. ULID / KSUID / UUIDv7 Comparison

Extensions & Optimizations

Common Pitfalls + Interview Questions

Further Reading

Deep Dive Questions