Day 4 Hard Database Sharding Consistent Hashing Resharding

Database Sharding — Where Single-Box Ends, Regret BeginsSharding: Hash vs Range, Consistent Hashing, Hot Spots, Resharding

The Problem

"Add a box, take 10× the traffic" — the rookie fantasy of sharding. The reality: sharding is irreversible architectural debt. Once you shard, you lose cross-shard transactions, you lose multi-table joins, you lose automatic secondary indexes. What you gain is "scale forever" and "every schema change is a nightmare."

So why do it? Because single boxes have hard ceilings: Postgres writes ~50k QPS at ~10TB; MySQL InnoDB buffer pool tops out at a few hundred GB; vertical CPU/RAM/SSD pricing curves go nonlinear (r6i.32xlarge runs ~$8k/month, and there's no SKU above that). When the single box won't do it, sharding becomes inevitable. The question isn't whether — it's when, how, and how you'll redo it.

Counter-examples: Foursquare 2010, a MongoDB shard ran out of space; a node OOMed during migration and the whole cluster cascaded — 11 hours down. Slack's first reshard in 2017 took 9 months (sharded by team_id; hot workspaces stayed skewed). Notion's 2021 split from one Postgres to 32 shards took half a year, and in 2023 they split again from 32 to 480.

Requirements and Constraints (the interview opener)

High-Level Architecture

graph TD
    APP["Application"]
    PROXY["Shard Router
Vitess vtgate / app layer / Citus coordinator"] META["Shard Metadata
ZK / etcd / config DB"] S1[("Shard 0
user_id % 16 == 0")] S2[("Shard 1
user_id % 16 == 1")] S3[("...")] SN[("Shard 15
user_id % 16 == 15")] R1[("Shard 0
Replica")] R2[("Shard 1
Replica")] APP --> PROXY PROXY -.read routing table.-> META PROXY --> S1 PROXY --> S2 PROXY --> S3 PROXY --> SN S1 -.async repl.-> R1 S2 -.async repl.-> R2 classDef app fill:#1a2530,stroke:#64c8ff,color:#e8eef5 classDef proxy fill:#2a1530,stroke:#ff7ab6,color:#e8eef5 classDef meta fill:#1a1a30,stroke:#ffb450,color:#e8eef5 classDef shard fill:#0e2030,stroke:#5eead4,color:#e8eef5 class APP app class PROXY proxy class META meta class S1,S2,S3,SN,R1,R2 shard

Three layers: application → router (looks up metadata to pick a shard) → physical shards (each with its own primary/replica)

Key Technical Points

1. Hash Sharding vs Range Sharding: The Triangle of Performance, Hot Spots, and Cross-Shard Queries

Principle: the two basic strategies, deciding how data maps to physical nodes.

Hash ShardingRange Sharding
Mappingshard = hash(key) % Nby key range ([a-f] → S0, [g-m] → S1, …)
Distribution✅ near-uniform with a good hash❌ easy to skew (all new users land in the last shard)
Point lookup✅ O(1) hash directly✅ binary search the routing table
Range scan❌ must fan out across all shards✅ contiguous ranges stay in one shard
Hot spotssingle hot key (superuser) = single hot shardmonotonic writes (time series) = last shard is hot
Reshardingnaive mod rehash is painful; consistent hash mitigatesrange splits are relatively easy
TypicalDynamoDB (partition key), Cassandra, MemcachedHBase, Bigtable, MongoDB (optional), CockroachDB
How to choose:
# Simple hash sharding (application layer)
import hashlib
NUM_SHARDS = 16

def shard_of(user_id: int) -> int:
    h = hashlib.md5(str(user_id).encode()).digest()
    return int.from_bytes(h[:4], 'big') % NUM_SHARDS

def get_conn(user_id):
    return shard_pool[shard_of(user_id)]

# ⚠️ don't use Python's built-in hash() — inconsistent across processes (PYTHONHASHSEED)
# ⚠️ don't do user_id % N directly — if user_id is monotonically generated and N is small,
#    new users cluster into a few shards
Real cases:

2. Consistent Hashing: So Adding a Node Doesn't Move Everything

Principle: naive hash(key) % N moves almost every key when N changes (N: 8→9 remaps ~88% of keys). Consistent hashing visualizes the hash space as a ring (0 to 2^32); nodes and keys both map onto it; a key walks clockwise to the first node. Adding or removing a node only affects the slice between two neighbors.

graph LR
    subgraph Ring["Hash Ring (2^32)"]
        direction LR
        N1["Node A
vnode 0..200"] N2["Node B
vnode 201..400"] N3["Node C
vnode 401..600"] N4["Node D (new)
only takes over
this slice"] end K1["key1 hash=150"] --> N1 K2["key2 hash=350"] --> N2 K3["key3 hash=480"] --> N3 K4["key4 hash=270"] --> N4 classDef node fill:#0e2030,stroke:#5eead4,color:#e8eef5 classDef key fill:#1a1a30,stroke:#ffb450,color:#e8eef5 classDef new fill:#2a1530,stroke:#ff7ab6,color:#e8eef5 class N1,N2,N3 node class K1,K2,K3,K4 key class N4 new

But naive consistent hashing skews when nodes aren't evenly distributed. Virtual nodes (vnodes) fix this — each physical node owns 100–200 virtual points on the ring, statistically smoother; removing a node spreads its load evenly to all others rather than dumping on a single neighbor.

Trade-off:
# Simplified consistent hashing (with vnodes)
import bisect, hashlib

class ConsistentHash:
    def __init__(self, nodes, vnodes=150):
        self.ring = []        # sorted list of (hash, node)
        self.vnodes = vnodes
        for n in nodes:
            self.add_node(n)

    def _h(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            h = self._h(f"{node}#{i}")
            bisect.insort(self.ring, (h, node))

    def get(self, key):
        h = self._h(key)
        i = bisect.bisect(self.ring, (h, ''))
        return self.ring[i % len(self.ring)][1]

# Add one node (N → N+1), remap ratio ≈ 1/(N+1)
# Naive mod: remap ratio ≈ N/(N+1)
Real cases:

3. Hot Spots: Why "Uniform" Sharding Still Gets Crushed

Principle: "statistically uniform" ≠ "uniform in production." Three classes of hot spot:

  1. Key hot: one key takes too much traffic (a celebrity with 100M followers, a viral product at 100k QPS). No matter how good the hash, that key stays on one shard.
  2. Range hot: under range sharding, "the newest data" concentrates (timestamp shard key → all writes go to the last shard).
  3. Tenant hot: in multi-tenant systems, big customers (Salesforce in Slack, Adobe in Notion) take 90% of one shard.

Mitigations:

StrategyHowFitsCost
Saltingappend a random/enumerable suffix: celebrity_42#{0..15}extremely write-hot single keyN× read amplification (concurrent query + merge)
Cache in fronthot key in Redis / CDN; DB only takes missesextremely read-hotcache consistency / TTL jitter
Dedicated shardbig tenant gets its own shard / instancetenant hotrequires detection + migration + capacity planning
Dynamic splitSpanner / DynamoDB / CockroachDB auto-detect and split hot regionswhen the system supports ittransient throttle during split
Reversed timestampprefix rowkey with MAX_LONG - timestamp or a hashrange time-series tailloses natural ordering, app must flip
Read replica fan-outadd more read replicas to the hot shardread hotwrites still single-point, replica lag
# Salting example: follower table for a write-heavy celebrity account
# Original schema:
#   follower(celebrity_id, follower_id, ts)  partition by celebrity_id
# Problem: celebrity_id=42 (Elon) writes at 50k QPS, far past the 1000 WCU per-partition cap

# Fix: append a hash suffix
shard_suffix = random.randint(0, 15)
write(table, partition=f"42#{shard_suffix}", row={...})

# On read, fan out
results = []
for i in range(16):
    results += query(partition=f"42#{i}")
return merge_sort_by_ts(results)

# Trade-off: 16× read amplification; but the write problem is solved
# Fits write >> read; not the reverse
Real cases:

4. Resharding: The Engineering Nightmare from 32 to 480 Shards

Principle: resharding is among the most painful distributed-systems operations because it must: (a) move data from the old topology to the new; (b) not interrupt the business; (c) be strongly consistent across cutover; (d) be reversible if it goes wrong. Any resharding without shadow read/write and a rollback plan is gambling.

sequenceDiagram
    participant App
    participant Router
    participant Old as Old Shard (16)
    participant New as New Shard (32)

    Note over Router: Phase 1 · dual write
    App->>Router: write k1
    Router->>Old: write k1 (primary)
    Router->>New: write k1 (shadow, async)

    Note over Router: Phase 2 · backfill history
    Router-->>New: background copy: snapshot + CDC

    Note over Router: Phase 3 · shadow-read validation
    App->>Router: read k1
    Router->>Old: read (returned to user)
    Router->>New: read (compare, not returned)
    Note right of Router: diff > threshold → halt cutover

    Note over Router: Phase 4 · cutover (gradual %)
    App->>Router: write k1
    Router->>New: write k1 (primary)
    Router->>Old: write k1 (mirror, backup)

    Note over Router: Phase 5 · teardown / drop old data
Resharding strategies compared:
# Resharding key idea: idempotent dual write + version stamp
def write(key, value):
    version = current_ms()
    # dual write with version, so old values can't overwrite new ones
    write_old(key, value, version)
    if reshard_phase >= DUAL_WRITE:
        try:
            write_new(key, value, version)  # if-newer
        except Exception as e:
            log_drift(key, e)   # alert, do not block the main path

def read(key):
    if reshard_phase == SHADOW_READ:
        v_old = read_old(key)
        v_new = read_new(key)
        if v_old != v_new:
            metrics.diff_count.inc()
            log_diff(key, v_old, v_new)
        return v_old  # still return old until cutover
    elif reshard_phase >= CUTOVER:
        return read_new(key)
    else:
        return read_old(key)
Real cases:

Scaling Up: What to Do After Growth

Common Pitfalls

1. Shard key on a monotonic field: hashing on auto_increment id still concentrates new data in a few shards (auto_increment in sharded settings often has prefix bias). Timestamp as range key puts every new write at the tail.
2. Using % as the shard function — adding a node = disaster: N from 16 → 32 moves ~93.75% of keys. Use consistent hashing or logical-over-physical.
3. Forcing 2PC across shards for transactions: performance collapses, the coordinator becomes a SPOF. Switch to saga / outbox / app-layer compensation.
4. Cutting over with no shadow read: you think it's consistent; users see data loss or duplicates. Run shadow read for at least 1–2 weeks of diff data before cutting.
5. Ignoring connection explosion: N shards, every app instance connecting to all shards = N × instances of connections, instantly past Postgres's 5k cap. Pool via PgBouncer / proxy.
6. Cross-shard unique constraints: unique(email) in a sharded table can't be enforced by the DB (the same email succeeds in two shards). You need a central unique service (a small KV store or one centralized table).
7. Sharding too early: single-box Postgres on r6i.32xlarge with a good schema handles 50k QPS / 5TB. Many teams shard at 1k QPS, adding complexity for no reason. Scale up, add cache, add replicas — shard last.

Sample Interview Questions

  1. Designing Twitter: how do you shard the tweets table — user_id or tweet_id? Why? What problem does each have?
  2. Explain consistent hashing and vnodes; use an "adding a node" scenario to contrast mod hash vs consistent hash migration ratios.
  3. A DynamoDB partition suddenly throttles (hot partition) and you can't change the schema. How many mitigations can you list?
  4. Designing a multi-tenant SaaS, one customer spikes to 30% of platform traffic. How do you "move them out" invisibly?
  5. What are the key phases of online resharding? How do you detect data drift during dual write?
  6. Why provision many more logical shards than physical instances? Use a concrete scale-out scenario to show the benefit.

Key Resources

English Summary

Sharding is irreversible architectural debt — you trade away cross-shard transactions, joins, and global secondary indexes for unbounded horizontal scale. Two basic strategies: hash sharding distributes evenly but kills range scans; range sharding preserves locality but creates hotspots on monotonically increasing keys. Consistent hashing with virtual nodes minimizes data movement when adding/removing nodes — N→N+1 moves only 1/N of keys versus ~N/(N+1) with naive mod-hash. Hot spots are inevitable: celebrity accounts, tenant whales, and time-series tails all break statistical uniformity. Mitigations include salting, dedicated shards, adaptive capacity, and caching. Modern best practice is logical-over-physical sharding: provision many more logical shards (e.g., 1024) than physical instances, so scaling out is just migrating logical shards without changing the routing hash. Online resharding requires dual writes, shadow reads, and gradual cutover; expect 3–9 months of engineering. Default to vertical scale and read replicas first — only shard when you must.

Going Deeper (click to expand)

1. Notion's logical-over-physical (1024 logical / 32 physical) means "scale-out without changing the routing function." But if even logical shards run out (need 1024 → 4096), what do they face? Have they just deferred the problem?

The essence: logical-over-physical decouples the "shard function" from the "physical topology." The former stays stable, the latter is tunable. It solves "physical capacity expansion" but not "single logical shard saturation."

  • When do you need more logical shards: a single logical shard saturates (e.g., one workspace is too big to fit on a single logical shard).
  • This really does require rehashing: 1024 → 4096 changes the routing hash, remapping almost all keys. Cost equals first-time sharding.
  • Mitigation A: over-provision logical shards from day one. Notion picked 480, leaving headroom; Figma went straight to 4096. One logical shard's metadata is ~1MB; 4096 is ~4MB — effectively free.
  • Mitigation B: split-style resharding. Split shard k into k_a and k_b, hash range half-and-half, affecting only one shard's data. CockroachDB / Spanner range split natively; app-layer sharding can mimic it, though the routing table grows.
  • Mitigation C: heterogeneous sharding. Move large tenants to dedicated shards (physical isolation) outside the hash pool. Slack's vitess pods do exactly this.

Conclusion: logical-over-physical isn't a silver bullet, but it drives "routine scale-out" to near-zero cost and defers the pain of true resharding by years. That "deferral" is huge engineering value — the business has time to evolve, the team has time to mature, tooling has time to develop. Paying complexity only when necessary is core wisdom in distributed-systems design.

2. Pinterest co-locates a user's boards and pins on the same shard (by user_id). Sounds great — but if users "follow" another user's board, the follow relationship crosses shards. How do they handle the cross-shard "follow + feed generation"?

Pinterest (and Twitter / Instagram) all face the "social graph crosses shards" problem. The common idea: abandon real-time joins on OLTP, switch to precomputation + async pipelines.

  • Where the follow relationship lives: a dedicated "relationship table" sharded by follower_id (same key as user shard), storing "who follows whom"; mirrored by followee_id for fanout (dual write).
  • Feed generation: fanout-on-write. A posts a new pin → look up A's followers → append one row to each follower's feed table. Each follower's feed lives on their own shard; reads are O(1).
  • Celebrity problem (10M followers): fanout of 10M is too expensive. Switch to hybrid: regular users continue fanout-on-write; celebrities switch to fanout-on-read (followers pull celeb's recent pins at read time and merge).
  • Cross-shard strong consistency is gone: the moment A follows B, the feed doesn't immediately show B's pins. Eventually consistent in seconds to minutes — product-acceptable.
  • Reverse index (who pinned this image): separate reverse-index service (sharded by pin_id), async from the main table via CDC.

The pattern: in sharded systems, cross-shard relationships are converted to shard-local queries through async fanout / materialized views. Twitter timeline, Instagram feed, Facebook News Feed all use this. The cost is write amplification (one fanout = N writes) and short-term inconsistency, in exchange for O(1) reads and horizontal scale.

Interview extension: if asked "how does a social feed work," answering with hybrid fanout-on-write/read + celebrity special-casing is already a senior answer. Day 14 (Feed Systems) covers this in depth.

3. You designed a SaaS backend, sharded by tenant_id. A year in: 80% of tenants are < 1GB, but the top 5 hold 60% of all data (classic power law). How do you evolve?

Classic "tenant skew." Almost every multi-tenant SaaS hits it. A single sharding strategy can't handle two different capacity tiers — you need a tiered architecture.

  • Identify tiers: do data analysis first, classify tenants into three buckets: (a) long-tail (< 1GB, millions); (b) mid (1GB–100GB, thousands); (c) whale (> 100GB, dozens).
  • Long-tail: shared shard pool. One physical instance hosts thousands of tenants by hash; this is the default.
  • Mid: fixed logical shard. Each tenant pinned to one logical shard; multiple logical shards share a physical instance, migratable independently as needed.
  • Whale: dedicated instance. Each whale gets its own instance (sometimes its own cluster) — separately tunable, separate SLAs, separate billing. Slack puts Salesforce / Verizon-tier customers on dedicated vitess pods.
  • Encapsulate in the routing layer: app code sees no difference; get_shard(tenant_id) queries metadata to pick the right tier.
  • Transition mechanism: when a tenant crosses a tier (mid → whale), online migration is required: shadow read + dual write + cutover; hours to days.
  • Pricing benefit: whales on dedicated instances → cost is traceable → you can charge enterprise tier. Snowflake, Databricks, MongoDB Atlas all use this pattern: "shared → dedicated" is the physical foundation of product tiering.

Anti-patterns:

  • "Hash all tenants uniformly": a whale blows up a shard; long-tail wastes capacity.
  • "Dedicated DB per tenant from day one": millions of long-tail instances explodes ops.

Core idea: physical resources should mirror the shape of the data, not be force-fit into a uniform model. That's also why "tenant_id as shard key" is a starting point, not an ending one.

4. Consistent hashing moves only 1/N of data when adding a node — beautiful. But Notion picked logical-over-physical, not consistent hashing. Why? What are the real engineering differences?

On the surface both solve "don't move much data when adding a node," but several real engineering differences make logical-over-physical friendlier for OLTP databases.

  • (a) Migration granularity. Consistent hashing moves a "slice of keys on the ring" — those keys are scattered through all historical writes; migration is fundamentally scan + filter, random I/O, slow. Logical shard migration moves "an entire logical shard as a unit" — file-level / table-level copy, sequential I/O, fast, atomic.
  • (b) Operational visibility. A logical shard is a "named entity" — you can see "shard #237 on instance-3, X GB, Y QPS." With consistent hash, a key range is an abstract ring segment; monitoring, debug, and capacity planning have no concrete handle.
  • (c) Cross-shard operations have stable boundaries. Logical shard boundaries are stable; cross-shard operations are predictable. With consistent hash, the same set of keys may end up in different segments after adding nodes — the physical topology of cross-segment queries changes dynamically.
  • (d) Fault isolation. A logical shard failure only affects its tenants. With consistent hash, a failed ring segment can affect multiple unrelated workloads.
  • (e) Secondary indexes, foreign keys, constraints. OLTP unique / FK constraints typically bind to a specific table / shard. Logical shards offer stable containers; a flowing ring range doesn't.
  • (f) Consistent hash fits KV / cache better. Memcached, DynamoDB — schemaless, no cross-row constraints, frequent node churn — are consistent hash heaven. Postgres is not.

Conclusion: consistent hashing's "minimum migration" optimization is huge for KV cache systems (where nodes change often); for OLTP databases (rare, planned node changes), that savings doesn't matter much, and "observability + operational simplicity" matters more. That's why DynamoDB and Cassandra (KV) use consistent hash, while Notion / Figma / Slack (OLTP) use logical-over-physical. The choice isn't about which theory is elegant; it's about which matches the rate of change in the business and the team's operational mental model.

5. Capacity estimate: design an e-commerce platform, projecting 100M users, 100 orders/year each, 2KB/order, 5 years of data. Reason through "do we shard," "how many shards," "shard key," and "cross-shard queries."

Step 1: total scale

  • 5-year order count = 1e8 × 100 × 5 = 5e10 = 50 billion rows.
  • Raw data = 500e9 × 2KB = 1000 TB = 1 PB.
  • With indexes (2–3 secondary, ~50% of data) + MVCC bloat = real ~2 PB.
  • Average write QPS: 5e10 / (5 × 365 × 86400) ≈ 317 QPS. Black-Friday peak (20×) ≈ ~6.3k QPS.
  • Read QPS: each user views orders 10 times/day → 1e8 × 10 / 86400 ≈ 11.6k QPS, peak ~200k.

Step 2: do we shard?

  • 2PB of data is far past single-box capacity (even r6i.32xlarge has ~10TB of practical SSD) → must shard.
  • QPS could be handled by a single box, but 2PB of data makes index maintenance, backup/restore, and vacuum impractical → data volume alone forces sharding.
  • Alternative: distributed SQL (TiDB / Spanner / CockroachDB), but cost may be 3–5× — trade-off territory.

Step 3: how many shards

  • Per-shard target: < 1TB of data + < 5k QPS (50% headroom).
  • 2PB / 1TB = 2000 shards.
  • 2000 physical instances is too expensive → logical-over-physical: 2048 logical shards on 64–128 physical instances (16–32 logical shards per instance).
  • Future scale-out moves logical shards to new instances without changing the routing hash.

Step 4: shard key

  • Option 1: user_id. User queries their own orders (90% of traffic) on one shard ✅. Problem: merchant-dimension queries fan out across shards.
  • Option 2: order_id (Snowflake-like, with timestamp). Order detail page is O(1) ✅. Problem: "user's order list" scans all shards ❌.
  • Option 3: (merchant_id, user_id) composite. Both merchant and user views are local. But a user's "all my orders across merchants" is hard.
  • Actual pick: hash on user_id, with order_id carrying a user_id prefix (e.g., {user_id_hash}_{snowflake}) so "look up by order_id" can also locate the shard. Merchant dimension is a separate materialized view (CDC → merchant index table, sharded by merchant_id).

Step 5: cross-shard queries

  • User dimension (order list, order detail) → single shard, no cross-shard.
  • Merchant dimension (merchant dashboard sees own orders) → maintain a merchant-view secondary table (CDC async sync, sharded by merchant_id).
  • Aggregate analytics ("last 30 days GMV") → not through OLTP; CDC to Snowflake / ClickHouse.
  • Risk, recommendations → warehouse + offline pipelines, never hit OLTP.

Cost estimate

  • 2PB actual storage (with indexes, 3× replicas) = 6PB. AWS gp3 SSD at $0.08/GB/month → 6PB × $80/TB = $480k/month in storage alone.
  • Hot/cold tiering: last 6 months hot (200TB) on SSD, the rest on object storage + ClickHouse → real SSD need 600TB, monthly cost down to ~$50k.
  • Compute: 128 r6i.8xlarge instances ≈ $150k/month.
  • Total order of magnitude $200k–$500k/month, $3M–$6M/year. That's why large e-commerce platforms must do hot/cold tiering, archival, compression.

Interview point: from "data size" to "do we shard," from "QPS + headroom" to "how many," from "dominant access pattern" to "shard key," from "secondary access patterns" to "secondary indexes / materialized views," and finally to "cost magnitude" with hot/cold tiering. The full chain is essentially a staff-level answer.