Day 11 Medium Unique ID Snowflake UUIDv7 KSUID/ULID

Unique ID Generation — The Triangle of Time, Randomness, and IndexUUID, Snowflake, KSUID, ULID

Problem & Requirements

Design a message ID scheme for a global chat product (think Discord / Slack / WhatsApp). Scale: 300M DAU, peak 1M messages/sec, messages sharded across hundreds of Cassandra nodes by channel. Each message needs an ID that determines its physical placement in the DB, its sort order on the client, and how it looks in URLs.

Key requirements:

High-Level Architecture

graph TD
    ZK["ZooKeeper / etcd
assigns worker_id"] NTP["NTP / PTP clock sync
detects backward jumps"] APP1["App instance #1
embedded ID lib
worker_id=42"] APP2["App instance #2
embedded ID lib
worker_id=43"] APPN["App instance #N
worker_id=...
(max 2¹⁰=1024)"] CLIENT["Mobile client
UUIDv7 generated locally
for offline messages
"] CASS[("Cassandra
partition by channel
cluster by id DESC
")] ZK -.assigns.-> APP1 & APP2 & APPN NTP -.syncs.-> APP1 & APP2 & APPN APP1 & APP2 & APPN --> CASS CLIENT -->|with client_id| APP1 classDef coord fill:#0e2030,stroke:#5eead4,color:#e8eef5 classDef app fill:#1a2530,stroke:#64c8ff,color:#e8eef5 classDef client fill:#1a1a30,stroke:#ffb450,color:#e8eef5 classDef store fill:#2a1530,stroke:#ff7ab6,color:#e8eef5 class ZK,NTP coord class APP1,APP2,APPN app class CLIENT client class CASS store

ID generation is in library mode — no standalone service that can die; ZK is only touched at startup

Components: ZooKeeper assigns a worker_id (10 bits = 1024 slots) per app instance at boot; the slot is returned on shutdown. NTP keeps clocks aligned (typically < 10ms drift) — clock skew is the Achilles heel of Snowflake-style schemes. The ID lib runs in-process, zero network hop. Mobile clients generate UUIDv7 locally for offline sends and the server accepts the client-supplied ID (idempotent send).

Key Techniques

1. UUID Versions: v4 Random vs v7 Time-Ordered

Principle: UUID is 128 bits. v4 is 122 bits of pure randomness — collision probability ~2⁻¹²², zero coordination, but disastrous for B-tree indexes: inserts hit random leaf pages, causing page splits and buffer-pool misses. v7 (RFC 9562, finalized 2024) puts a 48-bit millisecond Unix timestamp in the prefix, with 74 bits of randomness/monotonic counter after. Lexicographic order equals time order: new inserts land on the right edge of the B-tree, page splits go nearly to zero.

Trade-off:
# Minimal UUIDv7 generator (Python, RFC 9562)
import os, time
def uuid7() -> bytes:
    ts_ms = int(time.time() * 1000)          # 48 bits
    rand_a = int.from_bytes(os.urandom(2), 'big') & 0x0FFF  # 12 bits
    rand_b = int.from_bytes(os.urandom(8), 'big') & ((1<<62)-1)  # 62 bits
    # Layout: ts(48) | ver=7(4) | rand_a(12) | var=10(2) | rand_b(62)
    n = (ts_ms << 80) | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b
    return n.to_bytes(16, 'big')

# Postgres 17+ native gen_random_uuid() is still v4; community uuidv7() extensions exist
Real-world cases:

2. Snowflake: 64-bit Three-Segment Layout

Principle: Twitter open-sourced Snowflake in 2010. 64 bits = 1 sign + 41-bit millisecond timestamp (custom epoch, ~69-year span) + 10-bit worker_id (1024 machines) + 12-bit sequence (4096 IDs per worker per ms). 4096 × 1024 = 4.19M IDs/ms = 4.19B/sec — far beyond any realistic single system. Half the bytes of UUID, naturally roughly time-ordered.

Trade-off vs UUIDv7 / DB sequence:
SchemeWidthPeak/secCoordinationFatal Flaw
DB auto_incrementBIGINT 64~100k/instanceNoneSingle point + leaks volume
UUIDv4128UnlimitedNoneSlow PK inserts
Snowflake644.19Bworker_id + clockClock skew
UUIDv7128UnlimitedNone2× the space
DB segment (batch)64Depends on batchDB grants 1000 at a timeRestart wastes segment
# Core Snowflake (Python pseudocode)
class Snowflake:
    EPOCH = 1672531200000  # 2023-01-01 custom epoch
    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024
        self.worker_id = worker_id
        self.seq = 0
        self.last_ts = -1
    def next_id(self):
        ts = int(time.time() * 1000)
        if ts < self.last_ts:
            raise ClockBackwardError(self.last_ts - ts)  # see next section
        if ts == self.last_ts:
            self.seq = (self.seq + 1) & 0xFFF
            if self.seq == 0:
                ts = self._wait_next_ms(self.last_ts)  # wait next millisecond
        else:
            self.seq = 0
        self.last_ts = ts
        return ((ts - self.EPOCH) << 22) | (self.worker_id << 12) | self.seq
Real-world cases:

3. Clock Skew: The Snowflake Killer

Principle: Snowflake's uniqueness depends entirely on monotonic timestamps. A 100ms NTP backward jump means IDs generated in that window can collide with IDs from the past. Leap seconds, VM pause-and-resume, ops typos when setting the clock — all have caused production incidents.

Three mainstream remedies:
# Robust handler for clock skew
def next_id(self):
    ts = now_ms()
    delta = self.last_ts - ts
    if delta > 0:
        if delta <= 5:          # small skew — wait
            time.sleep(delta / 1000)
            ts = now_ms()
        elif delta <= 2000:     # medium — logical clock
            ts = self.last_ts + 1
        else:                   # large — refuse service
            raise ClockBackwardError(delta)
    # ... normal seq allocation
Real incident patterns: VM restored from snapshot jumps hours back; one misconfigured NTP source rolls the entire fleet back 30 seconds simultaneously, generating IDs that collide with historical ones — the only fix is take the instance out + wait until real time exceeds last_ts.

4. ULID / KSUID / UUIDv7 Comparison

Principle: all three aim at "lexicographic = time order, fully distributed, no coordinator." They differ in width, encoding, and strictness of monotonicity.

UUIDv7ULIDKSUID
Width128128160
Time precisionmsmssec
Random bits7480128
Text encodinghex+dashes (36 chars)Crockford Base32 (26 chars)Base62 (27 chars)
MonotonicityWithin-ms optionalWithin-ms strict (spec)None (sec granularity)
StandardizationIETF RFC 9562Community spec (GitHub)Segment open source
Best forDB primary keysUser-facing short IDsEvent streams, logs
How to choose:
# String forms of each
UUIDv4:    6ba7b810-9dad-11d1-80b4-00c04fd430c8   ← 36 chars
UUIDv7:    018f1c2e-a31b-7c12-9a4b-3e1c2d4f5a6b   ← 36 chars, first 48 bits are time
ULID:      01HXY8K5T4ZNQ9P2M3RVW6S0BC             ← 26 chars
KSUID:     2QqRwSdEf8VnT1bAaCpYlGmHkXjZ           ← 27 chars
Snowflake: 1773294523890106368                    ← 19 decimal digits
Real-world cases:

Extensions & Optimizations

Common Pitfalls + Interview Questions

Pitfall 1: UUIDv4 as PK on large tables. "UUID is globally unique, must be safe" — sounds reasonable, but B-tree insert collapse usually only surfaces past 50M rows, by which time it's late. New projects: just use v7.
Pitfall 2: Assuming Snowflake is globally monotonic. It's only roughly time-ordered — within the same millisecond, IDs from different workers are ordered by worker_id, not by anything real. Don't write code that depends on "newer ID > older ID."
Pitfall 3: Treating the ID as a timestamp. The time embedded in a Snowflake is the generation time, not the event time; for offline clients these can differ by days.
Pitfall 4: worker_id reuse. Machine drift or container scheduling can give two processes the same worker_id → guaranteed dup IDs. ZK ephemeral nodes + heartbeat is the standard fix.
Pitfall 5: Leaking business volume. Auto-increment IDs let competitors compute "how many messages sent today." Early Twitter exposed growth numbers this way. Any externally visible ID needs at least a random segment.

Likely interview follow-ups:

  1. Peak 10M/sec — is 64-bit Snowflake enough? How would you re-allocate bits?
  2. Generating IDs across data centers — how do you avoid collisions? How many bits for datacenter_id?
  3. What's wrong with UUIDv4 as a MySQL primary key? How does v7 fix it? Give an order-of-magnitude benchmark.
  4. When clock skew occurs, how does your system respond? At what skew size do you fail open?
  5. Can clients generate IDs locally? What preconditions must hold?

Further Reading

Deep Dive Questions

1. UUIDv4 as a MySQL PK collapses insert throughput. The root cause isn't "randomness" itself. Where exactly does it slow down, and how does UUIDv7 fix it?

It's not CPU, not raw disk bandwidth — it's buffer pool hit rate collapse. InnoDB uses a clustered index — the PK determines the physical position of the row.

  • Sequential PK (auto-inc / UUIDv7): new inserts always land on the rightmost leaf page; as long as that page is in the buffer pool, zero disk IO and page splits are rare.
  • Random PK (UUIDv4): inserts hit random leaf pages across the entire table. Once the table exceeds buffer pool (e.g. 100GB table, 16GB buffer pool), nearly every insert misses → read the page from disk first, then write → write amplification + IOPS saturation.
  • Page splits compound: random inserts frequently trigger leaf-page splits, fragmenting the index, growing B-tree height, further hurting buffer pool utilization.

How v7 fixes it: the first 48 bits are a timestamp, so recently generated IDs are lexicographically close, all landing in a handful of right-side leaf pages — these stay hot, buffer pool essentially always hits. Percona / EnterpriseDB benchmarks at the billion-row scale show v7 inserts 5-10× faster than v4, with an order of magnitude less write amplification.

Implication: "v4 + a separate BIGINT auto-inc PK + UUID as unique key" was a historical compromise; with v7 you can go clean.

2. Snowflake's 12-bit sequence supports 4096/ms = 4.1M/sec. But in production "single process throws errors above 300k/sec" is common. Why?

The bottleneck isn't the 4096 ceiling, it's the "wait until next ms" path. When the sequence runs out within a millisecond, the code has to sleep until next ms. But:

  • Thread contention: one Snowflake instance shared across threads needs to lock / CAS on last_ts + seq. Lock contention bites long before 4.1M/sec — typically around 300k/sec.
  • Syscall overhead: gettimeofday() on Linux is fast (vDSO), but per next_id call adds up — at 300k/sec it's already eating significant CPU.
  • Burst vs average: real traffic is spiky; one specific millisecond can saturate 4096 and block, with the next millisecond empty — average looks like 300k/sec but the burst already saturated.

Engineering fixes:

  • Per-thread Snowflake instance (further subdivide worker_id across threads) — eliminates the lock.
  • Batch pre-generation: next_batch(1000) returns a contiguous chunk to the caller, amortizing syscalls.
  • Cache the timestamp: business threads read the last ts; a dedicated ticker coroutine updates it every ms, saving syscalls.

Discord, Baidu's UidGenerator, and Meituan's Leaf have all done these optimizations at this scale.

3. A system has both ID schemes coexisting (old BIGINT auto-inc + new Snowflake). How do you migrate without downtime?

The classic "primary-key evolution" problem. A direct PK swap = ALTER TABLE lock + index rebuild + every foreign key updated — hours of downtime. Zero-downtime path:

  • Phase 1: dual write. Add snowflake_id BIGINT UNIQUE; new writes populate both IDs. Backfill old rows in batches (by PK range) offline.
  • Phase 2: dual read. Code prefers reading by snowflake_id; reads by old ID translate internally. External APIs accept either format (distinguish by prefix or width).
  • Phase 3: migrate foreign keys. Every table referencing the old ID grows a parent_snowflake_id column, backfills, switches reads, drops the old FK. Most painful step — touches every JOIN.
  • Phase 4: retire old ID. After old-ID access drops to zero, drop the column and swap the PK. PK swap via pt-online-schema-change / gh-ost shadow tables avoids table locks.

Key pitfalls:

  • Caches still hold old IDs → clear related cache after dual write begins.
  • Clients / third parties still hold old IDs → publish a permanent redirect API (GET /v1/legacy_id/123 → snowflake).
  • BI / analytics joins on old ID → the data warehouse must migrate too.

The whole process typically takes 3-6 months; the key discipline is not rushing to drop the old ID — keep dual writes 1-2 months to observe.

4. In a chat system, what happens when the sender's client generates an ID on a plane, comes online a week later, and sends? How do you handle it?

Core problem: the client's UUIDv7 timestamp is a week ago; when the message reaches the server, sorting by ID places it a week back in channel history — the UI never shows it.

Layered solution:

  • Dual-timestamp model: the message schema stores both client_id (client-generated, for idempotent dedup) and server_ts (server receipt time, for ordering). Sort by server_ts, dedup by client_id.
  • created_at in the message body: UI shows "send time" using the client's created_at (so the user sees "written a week ago"), but the message's position in the channel list uses server_ts. WhatsApp, iMessage both do this.
  • Idempotency: server upserts on (channel_id, client_id); the client sending three retries of the same message succeeds only once.
  • Collision boundary: even if two clients are offline at the same time, same channel, same millisecond, UUIDv7 collision probability is still ~2⁻⁷⁴ — effectively zero.

Deeper design wisdom: never let a single field carry "unique identifier," "time order," and "causal order" simultaneously — their semantics differ (generation time / arrival time / event time). Slack's ts field is the server_ts, client_msg_id is a UUID — clearly separated.

5. To design a "user-visible" ID (in URLs, receipts, support tickets), how would you choose between UUIDv7, ULID, or a custom Base62?

User-visible IDs have completely different requirements from backend PKs. Five dimensions matter:

  • Unguessability: /order/123 in the URL lets anyone enumerate others' orders. Visible IDs need ≥ 64 bits of randomness. UUIDv7's 74 random bits are sufficient, but the 48-bit timestamp prefix leaks order time — unacceptable for some financial scenarios.
  • Resistance to input errors: when a customer reads an ID to support, O/0 and I/1 must be distinguishable. Crockford Base32 (used by ULID) was designed for this — drops I, L, O, U; Base62 doesn't have this property.
  • Length: URLs and receipts want short. UUIDv7's 36 chars is too long; ULID's 26 chars is acceptable; Stripe-style cus_ + 14 random chars hits the sweet spot.
  • Type identifiability: prefixes (ord_, inv_) let support staff identify the ID type at a glance, avoiding "invoice ID looked up as order ID" mistakes. This is Stripe's most underrated engineering decision.
  • Time leak: B2B scenarios may want time exposed; B2C privacy scenarios cannot.

Recommended: Stripe model{type_prefix}_{base62(random)}, 22-30 chars. Backend PK is UUIDv7 / Snowflake, never exposed externally; add a mapping layer. The cost is a mapping table (or use the prefix + random directly as the PK), in exchange for API evolution freedom.

Counter-example: early Twitter URLs containing tweet IDs were auto-inc then Snowflake — exposed tweet volume; the timestamp is still extractable today. Product launch identifier design deserves caution.