The Problem
"Add a box, take 10× the traffic" — the rookie fantasy of sharding. The reality: sharding is irreversible architectural debt. Once you shard, you lose cross-shard transactions, you lose multi-table joins, you lose automatic secondary indexes. What you gain is "scale forever" and "every schema change is a nightmare."
So why do it? Because single boxes have hard ceilings: Postgres writes ~50k QPS at ~10TB; MySQL InnoDB buffer pool tops out at a few hundred GB; vertical CPU/RAM/SSD pricing curves go nonlinear (r6i.32xlarge runs ~$8k/month, and there's no SKU above that). When the single box won't do it, sharding becomes inevitable. The question isn't whether — it's when, how, and how you'll redo it.
Counter-examples: Foursquare 2010, a MongoDB shard ran out of space; a node OOMed during migration and the whole cluster cascaded — 11 hours down. Slack's first reshard in 2017 took 9 months (sharded by team_id; hot workspaces stayed skewed). Notion's 2021 split from one Postgres to 32 shards took half a year, and in 2023 they split again from 32 to 480.
Requirements and Constraints (the interview opener)
- Current scale and growth: data size, write QPS, read QPS, growth rate. If you won't exceed single-box limits within a year, don't shard.
- Dominant access pattern: which key does 90% of your queries use? That's your shard-key candidate.
- Cross-shard query fraction: if 30% of queries fan out to every shard, sharding barely helps.
- Data distribution: are tenants / users / orders uniform in size, or long-tailed? Long tail = hot-shard risk.
- Transaction requirements: many cross-user or cross-tenant transactions? Those have to become sagas or eventual consistency after sharding.
- Future resharding cost: can you accept 1–2 resharding rounds (months of engineering each), or do you want consistent hashing to dodge them from day one?
- Operational maturity: roll your own application-side sharding? Vitess (MySQL) or Citus (Postgres)? Or jump straight to distributed SQL (TiDB / CockroachDB)?
High-Level Architecture
graph TD
APP["Application"]
PROXY["Shard Router
Vitess vtgate / app layer / Citus coordinator"]
META["Shard Metadata
ZK / etcd / config DB"]
S1[("Shard 0
user_id % 16 == 0")]
S2[("Shard 1
user_id % 16 == 1")]
S3[("...")]
SN[("Shard 15
user_id % 16 == 15")]
R1[("Shard 0
Replica")]
R2[("Shard 1
Replica")]
APP --> PROXY
PROXY -.read routing table.-> META
PROXY --> S1
PROXY --> S2
PROXY --> S3
PROXY --> SN
S1 -.async repl.-> R1
S2 -.async repl.-> R2
classDef app fill:#1a2530,stroke:#64c8ff,color:#e8eef5
classDef proxy fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
classDef meta fill:#1a1a30,stroke:#ffb450,color:#e8eef5
classDef shard fill:#0e2030,stroke:#5eead4,color:#e8eef5
class APP app
class PROXY proxy
class META meta
class S1,S2,S3,SN,R1,R2 shard
Three layers: application → router (looks up metadata to pick a shard) → physical shards (each with its own primary/replica)
Key Technical Points
1. Hash Sharding vs Range Sharding: The Triangle of Performance, Hot Spots, and Cross-Shard Queries
Principle: the two basic strategies, deciding how data maps to physical nodes.
| Hash Sharding | Range Sharding |
| Mapping | shard = hash(key) % N | by key range ([a-f] → S0, [g-m] → S1, …) |
| Distribution | ✅ near-uniform with a good hash | ❌ easy to skew (all new users land in the last shard) |
| Point lookup | ✅ O(1) hash directly | ✅ binary search the routing table |
| Range scan | ❌ must fan out across all shards | ✅ contiguous ranges stay in one shard |
| Hot spots | single hot key (superuser) = single hot shard | monotonic writes (time series) = last shard is hot |
| Resharding | naive mod rehash is painful; consistent hash mitigates | range splits are relatively easy |
| Typical | DynamoDB (partition key), Cassandra, Memcached | HBase, Bigtable, MongoDB (optional), CockroachDB |
How to choose:
- Mostly point lookups and uniform data (social, e-commerce orders by user_id) → Hash.
- Need range scans (time series, logs, fetch a window by time) → Range (but watch out for tail-write hotness).
- Mix: composite key (e.g.
(hash(user_id), timestamp range)) — cross-user hash distributes, intra-user range stays sequential. Cassandra's partition key + clustering key is this idea.
# Simple hash sharding (application layer)
import hashlib
NUM_SHARDS = 16
def shard_of(user_id: int) -> int:
h = hashlib.md5(str(user_id).encode()).digest()
return int.from_bytes(h[:4], 'big') % NUM_SHARDS
def get_conn(user_id):
return shard_pool[shard_of(user_id)]
# ⚠️ don't use Python's built-in hash() — inconsistent across processes (PYTHONHASHSEED)
# ⚠️ don't do user_id % N directly — if user_id is monotonically generated and N is small,
# new users cluster into a few shards
Real cases:
- Instagram: sharded by user_id into thousands of "logical shards," multiple logical shards multiplexed on a single Postgres instance. New user_id is allocated by a Snowflake-like generator that bakes in the shard tag.
- Notion: workspace hash sharded into 480 logical shards across 96 Postgres instances (5 per instance). All blocks live in one
block table, routed by workspace.
- Discord (ScyllaDB): messages keyed by
(channel_id, bucket) — channel_id hashes to partition, bucket is a time window for range; same channel in the same window reads sequentially.
- HBase / Bigtable: pure range — rows sorted by rowkey, regions split lexicographically; Google recommends prefixing rowkeys to spread load (avoid "all recent data hot").
- Pinterest: in-house MySQL sharding, hash on user_id. Boards and pins follow the user shard (co-located), keeping all user-dimension data in one shard.
2. Consistent Hashing: So Adding a Node Doesn't Move Everything
Principle: naive hash(key) % N moves almost every key when N changes (N: 8→9 remaps ~88% of keys). Consistent hashing visualizes the hash space as a ring (0 to 2^32); nodes and keys both map onto it; a key walks clockwise to the first node. Adding or removing a node only affects the slice between two neighbors.
graph LR
subgraph Ring["Hash Ring (2^32)"]
direction LR
N1["Node A
vnode 0..200"]
N2["Node B
vnode 201..400"]
N3["Node C
vnode 401..600"]
N4["Node D (new)
only takes over
this slice"]
end
K1["key1 hash=150"] --> N1
K2["key2 hash=350"] --> N2
K3["key3 hash=480"] --> N3
K4["key4 hash=270"] --> N4
classDef node fill:#0e2030,stroke:#5eead4,color:#e8eef5
classDef key fill:#1a1a30,stroke:#ffb450,color:#e8eef5
classDef new fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
class N1,N2,N3 node
class K1,K2,K3,K4 key
class N4 new
But naive consistent hashing skews when nodes aren't evenly distributed. Virtual nodes (vnodes) fix this — each physical node owns 100–200 virtual points on the ring, statistically smoother; removing a node spreads its load evenly to all others rather than dumping on a single neighbor.
Trade-off:
- Naive mod hash: ✅ one line of code; ❌ adding a node ≈ resharding.
- Consistent hashing + vnodes: ✅ adding/removing a node migrates only
1/N of data; ❌ routing table is more complex; need to query vnode mappings; too many vnodes makes node restart slow (each vnode rejoins).
- Jump consistent hash (Google 2014): completely stateless, no vnodes, O(log N); but only supports adding/removing at the tail (can't drop arbitrary middle nodes). Used in Vitess and parts of Discord.
- Rendezvous (HRW) hashing: each key scores every node, takes the max. Same migration rate as consistent hash, simpler implementation, but lookup is O(N) not O(log N).
# Simplified consistent hashing (with vnodes)
import bisect, hashlib
class ConsistentHash:
def __init__(self, nodes, vnodes=150):
self.ring = [] # sorted list of (hash, node)
self.vnodes = vnodes
for n in nodes:
self.add_node(n)
def _h(self, key):
return int(hashlib.md5(key.encode()).hexdigest(), 16)
def add_node(self, node):
for i in range(self.vnodes):
h = self._h(f"{node}#{i}")
bisect.insort(self.ring, (h, node))
def get(self, key):
h = self._h(key)
i = bisect.bisect(self.ring, (h, ''))
return self.ring[i % len(self.ring)][1]
# Add one node (N → N+1), remap ratio ≈ 1/(N+1)
# Naive mod: remap ratio ≈ N/(N+1)
Real cases:
- Amazon Dynamo (2007 paper): the origin of consistent hashing and vnodes; DynamoDB still uses this underneath.
- Cassandra: 256 vtokens (virtual tokens) per node by default; adding a node moves only
1/N of data. Apple runs 100k+ Cassandra nodes.
- Riak / ScyllaDB: same Dynamo model.
- Memcached (client-side): Ketama algorithm (consistent hash + vnodes) is the de facto standard; early Facebook ran 1k+ Memcached nodes on it.
- Discord: voice server assignment uses consistent hashing to map guilds to voice nodes; failure only reconnects affected guilds.
- Cloudflare: parts of edge request routing use HRW / consistent hash to land same-host requests on the same box for cache locality.
3. Hot Spots: Why "Uniform" Sharding Still Gets Crushed
Principle: "statistically uniform" ≠ "uniform in production." Three classes of hot spot:
- Key hot: one key takes too much traffic (a celebrity with 100M followers, a viral product at 100k QPS). No matter how good the hash, that key stays on one shard.
- Range hot: under range sharding, "the newest data" concentrates (timestamp shard key → all writes go to the last shard).
- Tenant hot: in multi-tenant systems, big customers (Salesforce in Slack, Adobe in Notion) take 90% of one shard.
Mitigations:
| Strategy | How | Fits | Cost |
| Salting | append a random/enumerable suffix: celebrity_42#{0..15} | extremely write-hot single key | N× read amplification (concurrent query + merge) |
| Cache in front | hot key in Redis / CDN; DB only takes misses | extremely read-hot | cache consistency / TTL jitter |
| Dedicated shard | big tenant gets its own shard / instance | tenant hot | requires detection + migration + capacity planning |
| Dynamic split | Spanner / DynamoDB / CockroachDB auto-detect and split hot regions | when the system supports it | transient throttle during split |
| Reversed timestamp | prefix rowkey with MAX_LONG - timestamp or a hash | range time-series tail | loses natural ordering, app must flip |
| Read replica fan-out | add more read replicas to the hot shard | read hot | writes still single-point, replica lag |
# Salting example: follower table for a write-heavy celebrity account
# Original schema:
# follower(celebrity_id, follower_id, ts) partition by celebrity_id
# Problem: celebrity_id=42 (Elon) writes at 50k QPS, far past the 1000 WCU per-partition cap
# Fix: append a hash suffix
shard_suffix = random.randint(0, 15)
write(table, partition=f"42#{shard_suffix}", row={...})
# On read, fan out
results = []
for i in range(16):
results += query(partition=f"42#{i}")
return merge_sort_by_ts(results)
# Trade-off: 16× read amplification; but the write problem is solved
# Fits write >> read; not the reverse
Real cases:
- Twitter: the Justin Bieber-era "superuser problem." Manhattan (Twitter's in-house KV) routed celeb accounts to "split storage + single-writer, multi-read replica." Timeline fanout went hybrid (normal users push, celebs pull).
- DynamoDB Adaptive Capacity: since 2017, auto-detects hot partitions and splits them transparently — but there are still caps, and customers often resort to salting.
- Slack: the "vitess pod" concept puts hot workspaces on dedicated MySQL instances; many cold workspaces share one.
- Bigtable / HBase: the first recommendation in the docs is "scatter rowkey prefixes" to prevent hot regions. Salesforce uses hash prefixes to handle hot time-series writes.
- Shopify Vitess: manually moves top merchants to dedicated pods before Black Friday to avoid single-shard meltdown.
4. Resharding: The Engineering Nightmare from 32 to 480 Shards
Principle: resharding is among the most painful distributed-systems operations because it must: (a) move data from the old topology to the new; (b) not interrupt the business; (c) be strongly consistent across cutover; (d) be reversible if it goes wrong. Any resharding without shadow read/write and a rollback plan is gambling.
sequenceDiagram
participant App
participant Router
participant Old as Old Shard (16)
participant New as New Shard (32)
Note over Router: Phase 1 · dual write
App->>Router: write k1
Router->>Old: write k1 (primary)
Router->>New: write k1 (shadow, async)
Note over Router: Phase 2 · backfill history
Router-->>New: background copy: snapshot + CDC
Note over Router: Phase 3 · shadow-read validation
App->>Router: read k1
Router->>Old: read (returned to user)
Router->>New: read (compare, not returned)
Note right of Router: diff > threshold → halt cutover
Note over Router: Phase 4 · cutover (gradual %)
App->>Router: write k1
Router->>New: write k1 (primary)
Router->>Old: write k1 (mirror, backup)
Note over Router: Phase 5 · teardown / drop old data
Resharding strategies compared:
- Stop-the-world migration: copy data inside a maintenance window. ✅ simple; ❌ hours of downtime, unacceptable for modern services.
- Online resharding (dual write): the five-phase pattern above. ✅ zero downtime; ❌ huge engineering effort (months); maintaining consistency during dual write is fiddly.
- Logical shards over physical shards: provision 1024 logical shards on 16 physical instances up front; resharding just moves logical shards to new instances, without changing the routing hash. Vitess, Notion, Figma all do this. The most modern approach.
- Consistent hashing: in theory adding a node moves only
1/N of data, but vnode mapping changes still require online backfill — not as free as it sounds.
- Distributed SQL (TiDB / CockroachDB / Spanner): auto region split and migration; ops is nearly free — but you accept the trade-offs (write latency, ecosystem, cost).
# Resharding key idea: idempotent dual write + version stamp
def write(key, value):
version = current_ms()
# dual write with version, so old values can't overwrite new ones
write_old(key, value, version)
if reshard_phase >= DUAL_WRITE:
try:
write_new(key, value, version) # if-newer
except Exception as e:
log_drift(key, e) # alert, do not block the main path
def read(key):
if reshard_phase == SHADOW_READ:
v_old = read_old(key)
v_new = read_new(key)
if v_old != v_new:
metrics.diff_count.inc()
log_diff(key, v_old, v_new)
return v_old # still return old until cutover
elif reshard_phase >= CUTOVER:
return read_new(key)
else:
return read_old(key)
Real cases:
- Notion 2021 → 2023: single Postgres → 32 shards → 480 shards. Each resharding took about 6 months. The second one used logical-over-physical, so future scale-out only migrates logical shards to new machines, without changing routing.
- Figma 2023: from single Postgres to horizontally sharded Postgres (sharded by file), with PgBouncer + an in-house routing layer; 9 months of work, with shadow-read validation running for a month.
- Slack 2017: resharded the messages table from a single DB to Vitess — 9 months of engineering, CDC incremental sync + shadow read.
- Shopify Vitess: periodic resharding is routine ops with mature tooling (vreplication); they proactively rebalance hot pods before Black Friday.
- Foursquare 2010 (counter-example): MongoDB shard filled up before migration; migration triggered an OOM cascade, 11-hour outage. Postmortem lesson: "plan the next reshard at 50% capacity, not at 95%."
Scaling Up: What to Do After Growth
- Picking the shard key matters more than picking the database. The wrong shard key is a debt you carry forever. Principles: high cardinality (good distribution), aligned with the dominant query pattern (90% of queries carry the key), stable over time (doesn't migrate from one value to another).
- Co-location: keep related data in the same shard. Pinterest co-locates boards and pins with the user shard; Citus uses colocation groups; Vitess uses keyspaces + sequences. User-dimension transactions and joins complete on a single shard.
- Cross-shard secondary indexes: either maintain a global index table at the app layer (synchronous dual write), or push to external search (Elasticsearch, a reverse-index service). Strongly consistent cross-shard indexes are expensive.
- Cross-shard joins / aggregation: give up real-time joins; analyze via CDC → warehouse; turn interactive APIs into fan-out + application-layer merge (cap at K shards).
- Capacity planning at 2× headroom: keep each shard under 50% utilization, otherwise resharding can't keep up. The Foursquare lesson.
- Many more logical shards than physical instances (1024 vs 32): scale by moving logical shards, never by changing the routing function, with zero application changes.
Common Pitfalls
1. Shard key on a monotonic field: hashing on auto_increment id still concentrates new data in a few shards (auto_increment in sharded settings often has prefix bias). Timestamp as range key puts every new write at the tail.
2. Using % as the shard function — adding a node = disaster: N from 16 → 32 moves ~93.75% of keys. Use consistent hashing or logical-over-physical.
3. Forcing 2PC across shards for transactions: performance collapses, the coordinator becomes a SPOF. Switch to saga / outbox / app-layer compensation.
4. Cutting over with no shadow read: you think it's consistent; users see data loss or duplicates. Run shadow read for at least 1–2 weeks of diff data before cutting.
5. Ignoring connection explosion: N shards, every app instance connecting to all shards = N × instances of connections, instantly past Postgres's 5k cap. Pool via PgBouncer / proxy.
6. Cross-shard unique constraints: unique(email) in a sharded table can't be enforced by the DB (the same email succeeds in two shards). You need a central unique service (a small KV store or one centralized table).
7. Sharding too early: single-box Postgres on r6i.32xlarge with a good schema handles 50k QPS / 5TB. Many teams shard at 1k QPS, adding complexity for no reason. Scale up, add cache, add replicas — shard last.
Sample Interview Questions
- Designing Twitter: how do you shard the tweets table — user_id or tweet_id? Why? What problem does each have?
- Explain consistent hashing and vnodes; use an "adding a node" scenario to contrast mod hash vs consistent hash migration ratios.
- A DynamoDB partition suddenly throttles (hot partition) and you can't change the schema. How many mitigations can you list?
- Designing a multi-tenant SaaS, one customer spikes to 30% of platform traffic. How do you "move them out" invisibly?
- What are the key phases of online resharding? How do you detect data drift during dual write?
- Why provision many more logical shards than physical instances? Use a concrete scale-out scenario to show the benefit.
Key Resources
- Designing Data-Intensive Applications, Ch. 6 — Partitioning (Kleppmann): the canonical discussion of hash/range, secondary indexes, and resharding.
- Notion engineering blog, "Sharding Postgres" (2021) and "The Great Re-shard" (2023): full retrospective of single DB → 32 shards → 480 shards, including the logical-over-physical design.
- Vitess docs + Slack blog "Scaling Datastores at Slack with Vitess": an industrial reference for horizontal MySQL scale.
English Summary
Sharding is irreversible architectural debt — you trade away cross-shard transactions, joins, and global secondary indexes for unbounded horizontal scale. Two basic strategies: hash sharding distributes evenly but kills range scans; range sharding preserves locality but creates hotspots on monotonically increasing keys. Consistent hashing with virtual nodes minimizes data movement when adding/removing nodes — N→N+1 moves only 1/N of keys versus ~N/(N+1) with naive mod-hash. Hot spots are inevitable: celebrity accounts, tenant whales, and time-series tails all break statistical uniformity. Mitigations include salting, dedicated shards, adaptive capacity, and caching. Modern best practice is logical-over-physical sharding: provision many more logical shards (e.g., 1024) than physical instances, so scaling out is just migrating logical shards without changing the routing hash. Online resharding requires dual writes, shadow reads, and gradual cutover; expect 3–9 months of engineering. Default to vertical scale and read replicas first — only shard when you must.
Going Deeper (click to expand)
1. Notion's logical-over-physical (1024 logical / 32 physical) means "scale-out without changing the routing function." But if even logical shards run out (need 1024 → 4096), what do they face? Have they just deferred the problem?
The essence: logical-over-physical decouples the "shard function" from the "physical topology." The former stays stable, the latter is tunable. It solves "physical capacity expansion" but not "single logical shard saturation."
- When do you need more logical shards: a single logical shard saturates (e.g., one workspace is too big to fit on a single logical shard).
- This really does require rehashing: 1024 → 4096 changes the routing hash, remapping almost all keys. Cost equals first-time sharding.
- Mitigation A: over-provision logical shards from day one. Notion picked 480, leaving headroom; Figma went straight to 4096. One logical shard's metadata is ~1MB; 4096 is ~4MB — effectively free.
- Mitigation B: split-style resharding. Split shard k into k_a and k_b, hash range half-and-half, affecting only one shard's data. CockroachDB / Spanner range split natively; app-layer sharding can mimic it, though the routing table grows.
- Mitigation C: heterogeneous sharding. Move large tenants to dedicated shards (physical isolation) outside the hash pool. Slack's vitess pods do exactly this.
Conclusion: logical-over-physical isn't a silver bullet, but it drives "routine scale-out" to near-zero cost and defers the pain of true resharding by years. That "deferral" is huge engineering value — the business has time to evolve, the team has time to mature, tooling has time to develop. Paying complexity only when necessary is core wisdom in distributed-systems design.
2. Pinterest co-locates a user's boards and pins on the same shard (by user_id). Sounds great — but if users "follow" another user's board, the follow relationship crosses shards. How do they handle the cross-shard "follow + feed generation"?
Pinterest (and Twitter / Instagram) all face the "social graph crosses shards" problem. The common idea: abandon real-time joins on OLTP, switch to precomputation + async pipelines.
- Where the follow relationship lives: a dedicated "relationship table" sharded by follower_id (same key as user shard), storing "who follows whom"; mirrored by followee_id for fanout (dual write).
- Feed generation: fanout-on-write. A posts a new pin → look up A's followers → append one row to each follower's feed table. Each follower's feed lives on their own shard; reads are O(1).
- Celebrity problem (10M followers): fanout of 10M is too expensive. Switch to hybrid: regular users continue fanout-on-write; celebrities switch to fanout-on-read (followers pull celeb's recent pins at read time and merge).
- Cross-shard strong consistency is gone: the moment A follows B, the feed doesn't immediately show B's pins. Eventually consistent in seconds to minutes — product-acceptable.
- Reverse index (who pinned this image): separate reverse-index service (sharded by pin_id), async from the main table via CDC.
The pattern: in sharded systems, cross-shard relationships are converted to shard-local queries through async fanout / materialized views. Twitter timeline, Instagram feed, Facebook News Feed all use this. The cost is write amplification (one fanout = N writes) and short-term inconsistency, in exchange for O(1) reads and horizontal scale.
Interview extension: if asked "how does a social feed work," answering with hybrid fanout-on-write/read + celebrity special-casing is already a senior answer. Day 14 (Feed Systems) covers this in depth.
3. You designed a SaaS backend, sharded by tenant_id. A year in: 80% of tenants are < 1GB, but the top 5 hold 60% of all data (classic power law). How do you evolve?
Classic "tenant skew." Almost every multi-tenant SaaS hits it. A single sharding strategy can't handle two different capacity tiers — you need a tiered architecture.
- Identify tiers: do data analysis first, classify tenants into three buckets: (a) long-tail (< 1GB, millions); (b) mid (1GB–100GB, thousands); (c) whale (> 100GB, dozens).
- Long-tail: shared shard pool. One physical instance hosts thousands of tenants by hash; this is the default.
- Mid: fixed logical shard. Each tenant pinned to one logical shard; multiple logical shards share a physical instance, migratable independently as needed.
- Whale: dedicated instance. Each whale gets its own instance (sometimes its own cluster) — separately tunable, separate SLAs, separate billing. Slack puts Salesforce / Verizon-tier customers on dedicated vitess pods.
- Encapsulate in the routing layer: app code sees no difference;
get_shard(tenant_id) queries metadata to pick the right tier.
- Transition mechanism: when a tenant crosses a tier (mid → whale), online migration is required: shadow read + dual write + cutover; hours to days.
- Pricing benefit: whales on dedicated instances → cost is traceable → you can charge enterprise tier. Snowflake, Databricks, MongoDB Atlas all use this pattern: "shared → dedicated" is the physical foundation of product tiering.
Anti-patterns:
- "Hash all tenants uniformly": a whale blows up a shard; long-tail wastes capacity.
- "Dedicated DB per tenant from day one": millions of long-tail instances explodes ops.
Core idea: physical resources should mirror the shape of the data, not be force-fit into a uniform model. That's also why "tenant_id as shard key" is a starting point, not an ending one.
4. Consistent hashing moves only 1/N of data when adding a node — beautiful. But Notion picked logical-over-physical, not consistent hashing. Why? What are the real engineering differences?
On the surface both solve "don't move much data when adding a node," but several real engineering differences make logical-over-physical friendlier for OLTP databases.
- (a) Migration granularity. Consistent hashing moves a "slice of keys on the ring" — those keys are scattered through all historical writes; migration is fundamentally scan + filter, random I/O, slow. Logical shard migration moves "an entire logical shard as a unit" — file-level / table-level copy, sequential I/O, fast, atomic.
- (b) Operational visibility. A logical shard is a "named entity" — you can see "shard #237 on instance-3, X GB, Y QPS." With consistent hash, a key range is an abstract ring segment; monitoring, debug, and capacity planning have no concrete handle.
- (c) Cross-shard operations have stable boundaries. Logical shard boundaries are stable; cross-shard operations are predictable. With consistent hash, the same set of keys may end up in different segments after adding nodes — the physical topology of cross-segment queries changes dynamically.
- (d) Fault isolation. A logical shard failure only affects its tenants. With consistent hash, a failed ring segment can affect multiple unrelated workloads.
- (e) Secondary indexes, foreign keys, constraints. OLTP unique / FK constraints typically bind to a specific table / shard. Logical shards offer stable containers; a flowing ring range doesn't.
- (f) Consistent hash fits KV / cache better. Memcached, DynamoDB — schemaless, no cross-row constraints, frequent node churn — are consistent hash heaven. Postgres is not.
Conclusion: consistent hashing's "minimum migration" optimization is huge for KV cache systems (where nodes change often); for OLTP databases (rare, planned node changes), that savings doesn't matter much, and "observability + operational simplicity" matters more. That's why DynamoDB and Cassandra (KV) use consistent hash, while Notion / Figma / Slack (OLTP) use logical-over-physical. The choice isn't about which theory is elegant; it's about which matches the rate of change in the business and the team's operational mental model.
5. Capacity estimate: design an e-commerce platform, projecting 100M users, 100 orders/year each, 2KB/order, 5 years of data. Reason through "do we shard," "how many shards," "shard key," and "cross-shard queries."
Step 1: total scale
- 5-year order count = 1e8 × 100 × 5 = 5e10 = 50 billion rows.
- Raw data = 500e9 × 2KB = 1000 TB = 1 PB.
- With indexes (2–3 secondary, ~50% of data) + MVCC bloat = real ~2 PB.
- Average write QPS: 5e10 / (5 × 365 × 86400) ≈ 317 QPS. Black-Friday peak (20×) ≈ ~6.3k QPS.
- Read QPS: each user views orders 10 times/day → 1e8 × 10 / 86400 ≈ 11.6k QPS, peak ~200k.
Step 2: do we shard?
- 2PB of data is far past single-box capacity (even r6i.32xlarge has ~10TB of practical SSD) → must shard.
- QPS could be handled by a single box, but 2PB of data makes index maintenance, backup/restore, and vacuum impractical → data volume alone forces sharding.
- Alternative: distributed SQL (TiDB / Spanner / CockroachDB), but cost may be 3–5× — trade-off territory.
Step 3: how many shards
- Per-shard target: < 1TB of data + < 5k QPS (50% headroom).
- 2PB / 1TB = 2000 shards.
- 2000 physical instances is too expensive → logical-over-physical: 2048 logical shards on 64–128 physical instances (16–32 logical shards per instance).
- Future scale-out moves logical shards to new instances without changing the routing hash.
Step 4: shard key
- Option 1: user_id. User queries their own orders (90% of traffic) on one shard ✅. Problem: merchant-dimension queries fan out across shards.
- Option 2: order_id (Snowflake-like, with timestamp). Order detail page is O(1) ✅. Problem: "user's order list" scans all shards ❌.
- Option 3: (merchant_id, user_id) composite. Both merchant and user views are local. But a user's "all my orders across merchants" is hard.
- Actual pick: hash on user_id, with order_id carrying a user_id prefix (e.g.,
{user_id_hash}_{snowflake}) so "look up by order_id" can also locate the shard. Merchant dimension is a separate materialized view (CDC → merchant index table, sharded by merchant_id).
Step 5: cross-shard queries
- User dimension (order list, order detail) → single shard, no cross-shard.
- Merchant dimension (merchant dashboard sees own orders) → maintain a merchant-view secondary table (CDC async sync, sharded by merchant_id).
- Aggregate analytics ("last 30 days GMV") → not through OLTP; CDC to Snowflake / ClickHouse.
- Risk, recommendations → warehouse + offline pipelines, never hit OLTP.
Cost estimate
- 2PB actual storage (with indexes, 3× replicas) = 6PB. AWS gp3 SSD at $0.08/GB/month → 6PB × $80/TB = $480k/month in storage alone.
- Hot/cold tiering: last 6 months hot (200TB) on SSD, the rest on object storage + ClickHouse → real SSD need 600TB, monthly cost down to ~$50k.
- Compute: 128 r6i.8xlarge instances ≈ $150k/month.
- Total order of magnitude $200k–$500k/month, $3M–$6M/year. That's why large e-commerce platforms must do hot/cold tiering, archival, compression.
Interview point: from "data size" to "do we shard," from "QPS + headroom" to "how many," from "dominant access pattern" to "shard key," from "secondary access patterns" to "secondary indexes / materialized views," and finally to "cost magnitude" with hot/cold tiering. The full chain is essentially a staff-level answer.