Problem & Constraints
Design an S3-like object store: serving hundreds of billions of objects, 100 PB, objects from 1 byte to 5 TB, with eleven 9s of durability (99.999999999% — fewer than 1 lost object per 10 million per year). This is the bedrock under cloud drives, backups, data lakes, and image/video origins — Dropbox, Netflix masters, and your app's avatars all live here.
- Read/write ratio & size distribution: write-once-read-many, but object size is extremely long-tailed — 99% of objects are small (<1MB), 99% of bytes live in big files. Optimize both ends separately.
- Consistency: a GET right after a PUT must see it (read-after-write); overwrite and list consistency can be relaxed.
- Durability ≫ availability: lost data is irreversible; a brief outage is recoverable. Durability is the first-class citizen at eleven 9s; four 9s of availability is fine.
- Throughput: a single 5 TB object takes an hour even at 10 Gbps — you need parallel chunking + resumable upload, not one TCP stream end to end.
- Cost: 100 PB at 3x replication is 300 PB of raw disk; erasure coding squeezes it to ~130 PB. Cost per PB per month (power, racks) is life or death.
High-Level Architecture
graph TD
C["Client
SDK / direct PUT via presigned URL"]
LB["API layer
S3 protocol / Auth / throttle"]
subgraph CP["Control Plane"]
META["Metadata service
key → chunk locations, sharded KV"]
end
subgraph DP["Data Plane"]
PL["Placement / EC encoder"]
S1["Storage Node 1"]
S2["Storage Node 2"]
S3["Storage Node N
raw disk + CRC"]
end
BG["Background jobs
scrub · repair · GC · tiering"]
C --> LB
LB --> META
LB --> PL
PL --> S1 & S2 & S3
META -.index.-> DP
BG -.scan/repair.-> DP
The core is control-plane / data-plane separation: the metadata service holds the "object key → which disks hold the chunks" index (small, strongly consistent, must scale to hundreds of billions of rows); the data plane just stores immutable byte chunks (large, optimized for throughput and cost). Scaling them independently is what sets object storage apart from a file system — no directory tree, no inode chains; a flat keyspace makes horizontal scaling almost unbounded. Background jobs (scrub, repair, GC) are the true source of durability, decoupled from the request path.
Key Techniques
1. Three Storage Models — Why Object Storage Scales Without Limit
The trade-off in one line: object storage gives up in-place mutation and directory semantics in exchange for a flat keyspace's unbounded horizontal scale and rock-bottom unit cost.
Principle: Block (e.g. EBS/raw disk) gives fixed-size sectors — most flexible but you manage the filesystem yourself; File (e.g. NFS/EFS) gives a POSIX directory tree and in-place rewrites, but the directory tree is a tree that must be kept consistent, so metadata becomes the scaling bottleneck. Object (e.g. S3) treats the whole file as an immutable blob keyed by a flat string (the `/` in `a/b/c.jpg` is just a character, not a directory) — no rename, no append, no inode chain. Metadata collapses into a giant KV table that can be sharded arbitrarily by key hash.
| Dimension | Block | File | Object |
| Access unit | sector/LBA | path + offset | whole object (key) |
| Mutation | in-place random write | in-place random write | full overwrite (immutable) |
| Metadata | none (raw) | dir tree/inode | flat KV |
| Scale ceiling | per-volume limited | dir-tree bottleneck | near-unbounded |
| Typical | EBS, local disk | EFS, NFS, HDFS | S3, GCS, R2 |
Real cases: AWS S3 stores over 100 trillion objects (2021 Pi Day figure), powered by a flat keyspace + sharded metadata; POSIX-semantic workloads (e.g. mmap during training) go to FSx/Lustre instead. Dropbox's Magic Pocket likewise chops files into immutable blocks indexed by a KV (key→location), not a directory tree.
2. The Durability Engine — Replication vs Erasure Coding
The trade-off in one line: 3 replicas are simple and repair fast but cost 200% space amplification; erasure coding drops to ~30-50% amplification at the cost of encoding CPU, read amplification during repair, and being unfit for small files.
Principle: for eleven 9s, a single disk's ~2-4% annual failure rate is nowhere near enough. 3-replication puts one copy in each of 3 AZs — any 2 can fail and you still read; brute-force simple. Erasure coding (Reed-Solomon) splits the object into k data shards, computes m parity shards, and spreads all k+m across distinct fault domains; any k shards reconstruct the object. Backblaze Vault uses 17+3: 17 data + 3 parity = 20 shards, lose any 3 and lose nothing, with only 20/17≈1.18x amplification — and it too reaches eleven 9s.
Why not use EC everywhere? ① Small files don't pay: split into 17 and each shard is tiny, so metadata and IOPS overhead eat the savings — hence hot/small objects often use replicas, cold/large objects use EC. ② Repair read amplification: rebuilding one shard reads k others (17 reads vs a replica's single copy), amplifying network during failure storms. ③ CPU: encode/decode burns cycles and raises write latency.
# Reed-Solomon idea (pseudo-code, not runnable)
# treat data shards as polynomial coefficients;
# parity = evaluations at extra points
data = split(obj, k=17) # 17 data shards
parity = rs_encode(data, m=3) # Vandermonde/Cauchy matrix → 3 parity
shards = data + parity # 20 shards across 20 fault domains
# rebuild after losing <=3 shards: solve a 17-unknown linear system
def repair(surviving_shards): # any 17 suffice
return rs_decode(surviving_shards) # matrix inverse x survivors
Real cases: Backblaze open-sourced its own Reed-Solomon library; Vault uses 17+3 across 20 pods. AWS S3 promises eleven 9s, redundant across ≥3 AZs with continuous CRC scrubbing of every object. Facebook f4 uses EC for warm BLOBs, cutting old-photo storage amplification from 3.6x (with cross-DC replicas) to ~2.1x. Dropbox's Magic Pocket also uses EC to drive down cross-region redundancy cost.
3. Large-Object Upload — Multipart + Direct Client Upload
The trade-off in one line: parallel chunked upload + presigned-URL direct upload buys high throughput and resumability, at the cost of client complexity and "orphan parts" needing GC.
Principle: 5 TB over a single TCP stream restarts from zero on any hiccup, and single-stream throughput is capped by RTT × window. Multipart upload splits the object into parts (S3 mandates 5MB–5GB/part, max 10,000 parts); each part uploads independently — parallel, retryable, with its own ETag — and a final CompleteMultipartUpload stitches the part list into the object. Paired with a presigned URL: the server signs a time-limited URL with its secret, and the client PUTs directly to the storage layer, bytes never touching your app servers — saving bandwidth and a network hop.
# the server only issues signatures, never touches the byte stream
def init_upload(key, size):
upload_id = meta.create_multipart(key)
n = ceil(size / PART_SIZE) # e.g. 64MB/part
urls = [presign("PUT", key, upload_id, i) # one signed URL per part
for i in range(n)]
return upload_id, urls
# client: PUT parts in parallel, retry only the failed part, record {part_i: etag}
# finish: POST complete(upload_id, [(i, etag)...]) -> atomic metadata commit
Pitfall: a client that bails mid-upload leaves orphan parts occupying disk and still billing. S3's fix is lifecycle rule: abort incomplete multipart upload after N days — always configure it, or cost quietly leaks.
Real cases: S3 multipart upload + presigned URLs are the SDK's default large-file path (auto-chunked concurrency >100MB). Netflix uploading masters, Dropbox syncing big files, and virtually every "browser-direct-to-S3" SaaS rely on presigned URLs to offload traffic from their own servers.
4. Strong Metadata Consistency & Hot Partitions
The trade-off in one line: strong consistency makes read-after-write easy, but metadata must be sharded to handle hundreds of billions of keys, and sharding creates hot spots on sequential keys.
Principle: metadata is a giant table of "key → shard locations + version + ACL" and must be sharded (by key hash). Before 2020 S3 was eventually consistent (a freshly PUT object might not appear in a list); in late 2020 S3 shipped strong read-after-write consistency — once a PUT succeeds, any GET/LIST sees it immediately, at no extra price or performance cost. The price is that the metadata layer must be strongly consistent within a shard (typically a Paxos/Raft-replicated KV). But sharding by sequential key prefix hits hot spots: if every key is a `2026-06-20/...` time prefix, writes forever slam the same shard.
Interview favorite: old S3 guidance suggested adding a random (hashed) prefix to spread hot spots; since 2018 S3 auto-partitions adaptively by prefix, each prefix reaching 3,500 PUT / 5,500 GET per second — but you still design key schemas to avoid monotonically increasing prefixes. This is the same "database sharding hot spot" (Day 4) trap, applied to storage metadata.
Real cases: Google Colossus (the GFS successor) moved metadata off a single master onto a BigTable-backed shardable metadata layer, breaking GFS's single-master scaling ceiling. S3's strong-consistency overhaul likewise rebuilt the metadata subsystem into a strongly consistent KV.
Scaling & Optimization
- Tiering: hot data on SSD/replicas, warm on HDD/EC, cold on SMR/tape (S3 Glacier). Auto-demote by access frequency to cut cost by an order of magnitude. Dropbox specifically optimized Magic Pocket for cold storage and adopted SMR drives.
- Versioning + soft delete: immutable objects are naturally multi-versioned; delete just writes a tombstone, with GC reclaiming later. This is also the basis for backup and ransomware defense (versioning + MFA delete + cross-region replication).
- Continuous scrubbing: background jobs constantly recompute and compare CRCs to catch bit rot and bad disks early, repairing before data becomes unrecoverable. This — more than the request path — is what actually buys eleven 9s.
- Cross-region replication (CRR): asynchronously copy objects to another region for DR and data-residency compliance. Note it is eventually consistent; RPO is not zero.
Pitfalls + Interview Questions
- Treating object storage as a filesystem: assuming `ls` on a "directory" is cheap — it's actually a LIST prefix scan, slow and pricey at millions of keys. There's no true rename (it's copy+delete).
- Replicating cold data: 3 replicas for cold data burns money — use EC; conversely, EC for small files is asking for trouble.
- Forgetting orphan multiparts: skip the lifecycle abort rule and cost leaks month after month.
- Follow-ups: ① How is eleven 9s actually computed? (a probability model over fault-domain independence + repair speed vs failure rate) ② How do you upload a 5 TB object? Resume it? ③ What does each of strong vs eventual consistency give up in object storage? ④ How do you handle hot keys — is it the same problem as database sharding? ⑤ When is space actually freed after deleting an object? (GC + tombstone)
Deep-Dive Resources
- "Designing Data-Intensive Applications" (Kleppmann) — the underlying principles of replication, partitioning, and consistency; the metadata layer is exactly these concepts applied.
- Werner Vogels — "Building and operating a pretty big storage system called S3" (allthingsdistributed.com, 2023) — a first-hand view of S3's durability, scrubbing, and scale.
- Dropbox Engineering — "Inside the Magic Pocket" (dropbox.tech) — EC, placement, and migration in an exabyte-scale self-built object store.
- Backblaze — "Reed-Solomon Erasure Coding" / "Vault Architecture" (backblaze.com/blog) — 17+3 EC and an open-source implementation, the durability math made concrete.
Going Deeper
How is eleven 9s of durability actually "computed", and which assumption does it lean on most?
Durability ≈ driving the probability that "all of an object's redundant shards are lost before repair completes" extremely low. Given a disk's annual failure rate (AFR) and repair time (MTTR), you build a Markov/binomial model: under k+m coding, you only lose data if ≥ m+1 independent fault domains die within the MTTR window. The two levers are fault-domain independence and repair speed — faster repair and more independent domains shrink the loss window. The most fragile assumption is failures being independent: same-batch bad disks, same-rack power loss, and same-region disasters are correlated failures that instantly break the pretty probability model. That's exactly why you place across AZs/fault domains and scrub continuously (to shrink MTTR).
Why is object storage almost always "immutable" (overwrite, not in-place edit)? What does that limitation buy you?
Immutability makes everything simpler and concurrent-safe: ① caches/CDNs can cache forever because a key's content never changes (a change mints a new version id); ② replication and EC never handle a "half-edited" intermediate state — a write either fully lands or doesn't exist; ③ no in-place writes means no read/write lock contention, so multiple copies are naturally consistent; ④ versioning and snapshots come for free. The cost is "changing one byte rewrites the whole object", so object storage is unfit for frequent small edits (that's block/file or database territory). It's the classic trade-off of restricted functionality for scalability, consistency, and cost.
S3 upgraded from eventual to strong read-after-write consistency in 2020 "at no extra price or performance cost" — what did that re-engineer?
Eventual consistency usually stems from "metadata served by caches / async indexing". For strong consistency, the moment a PUT succeeds the authoritative metadata record must already sit where subsequent reads hit — meaning the metadata subsystem itself must be a strongly consistent sharded KV (Paxos/Raft-style), with the read path hitting the authoritative shard directly rather than a possibly-stale cache. "No performance cost" means they didn't use the naive "lock on read / read-quorum" route, but pushed consistency down into the per-partition replication protocol so ordinary reads stay single-hop. Essentially they moved the consistency cost from "every request" to a one-time "metadata architecture" investment.
For 100 PB of cold data, pick 3 replicas or 17+3 EC? Do the cost math, then say when replicas actually win.
3 replicas: 100 PB logical → 300 PB raw. 17+3 EC: 20/17≈1.18x → ~118 PB raw. A ~180 PB gap, and at cold-storage rates of thousands of dollars per PB per month, the annual saving is in the tens of millions — EC for cold data is a near no-brainer. When do replicas still win? ① Many small files: split into 17 and each shard is a few KB, metadata rows explode and random-read IOPS gets hammered, so EC's space savings drown in ops cost. ② Latency-sensitive and frequently read: replicas read a single nearby copy; EC normal reads only fetch data shards (no decode), but any missing shard forces pulling k shards to rebuild, worsening tail latency. So reality is tiered: hot/small → replicas, cold/large → EC.
Direct presigned-URL upload bypasses your app servers — what control do you lose, and how do you make up for it?
You lose the interception point on the request path: you can't run virus scanning, content moderation, hard quota checks, rewriting/transcoding, or fine-grained billing at the instant of write. Two remedies: ① Constrain at signing time: embed conditions in the presigned URL (content-length-range to cap size, content-type to limit type, key prefix to confine path) — the signature itself is an authorization decision. ② Event-driven post-processing: upload completion fires an event (S3 Event → Lambda/queue) to asynchronously scan, transcode, index, and reconcile, quarantining/deleting on violation. The essence is replacing "synchronous gatekeeping" with "authorization + asynchronous governance" — the standard shape of serverless media pipelines.