Day 29 Hard Object Storage Erasure Coding Multipart Upload Durability

File & Object Storage — Trading 1.5x Space for Eleven 9s of DurabilityS3 Architecture, Erasure Coding, Multipart Upload, Durability

Problem & Constraints

Design an S3-like object store: serving hundreds of billions of objects, 100 PB, objects from 1 byte to 5 TB, with eleven 9s of durability (99.999999999% — fewer than 1 lost object per 10 million per year). This is the bedrock under cloud drives, backups, data lakes, and image/video origins — Dropbox, Netflix masters, and your app's avatars all live here.

High-Level Architecture

graph TD
    C["Client
SDK / direct PUT via presigned URL"] LB["API layer
S3 protocol / Auth / throttle"] subgraph CP["Control Plane"] META["Metadata service
key → chunk locations, sharded KV"] end subgraph DP["Data Plane"] PL["Placement / EC encoder"] S1["Storage Node 1"] S2["Storage Node 2"] S3["Storage Node N
raw disk + CRC"] end BG["Background jobs
scrub · repair · GC · tiering"] C --> LB LB --> META LB --> PL PL --> S1 & S2 & S3 META -.index.-> DP BG -.scan/repair.-> DP

The core is control-plane / data-plane separation: the metadata service holds the "object key → which disks hold the chunks" index (small, strongly consistent, must scale to hundreds of billions of rows); the data plane just stores immutable byte chunks (large, optimized for throughput and cost). Scaling them independently is what sets object storage apart from a file system — no directory tree, no inode chains; a flat keyspace makes horizontal scaling almost unbounded. Background jobs (scrub, repair, GC) are the true source of durability, decoupled from the request path.

Key Techniques

1. Three Storage Models — Why Object Storage Scales Without Limit

The trade-off in one line: object storage gives up in-place mutation and directory semantics in exchange for a flat keyspace's unbounded horizontal scale and rock-bottom unit cost.

Principle: Block (e.g. EBS/raw disk) gives fixed-size sectors — most flexible but you manage the filesystem yourself; File (e.g. NFS/EFS) gives a POSIX directory tree and in-place rewrites, but the directory tree is a tree that must be kept consistent, so metadata becomes the scaling bottleneck. Object (e.g. S3) treats the whole file as an immutable blob keyed by a flat string (the `/` in `a/b/c.jpg` is just a character, not a directory) — no rename, no append, no inode chain. Metadata collapses into a giant KV table that can be sharded arbitrarily by key hash.

DimensionBlockFileObject
Access unitsector/LBApath + offsetwhole object (key)
Mutationin-place random writein-place random writefull overwrite (immutable)
Metadatanone (raw)dir tree/inodeflat KV
Scale ceilingper-volume limiteddir-tree bottlenecknear-unbounded
TypicalEBS, local diskEFS, NFS, HDFSS3, GCS, R2
Real cases: AWS S3 stores over 100 trillion objects (2021 Pi Day figure), powered by a flat keyspace + sharded metadata; POSIX-semantic workloads (e.g. mmap during training) go to FSx/Lustre instead. Dropbox's Magic Pocket likewise chops files into immutable blocks indexed by a KV (key→location), not a directory tree.

2. The Durability Engine — Replication vs Erasure Coding

The trade-off in one line: 3 replicas are simple and repair fast but cost 200% space amplification; erasure coding drops to ~30-50% amplification at the cost of encoding CPU, read amplification during repair, and being unfit for small files.

Principle: for eleven 9s, a single disk's ~2-4% annual failure rate is nowhere near enough. 3-replication puts one copy in each of 3 AZs — any 2 can fail and you still read; brute-force simple. Erasure coding (Reed-Solomon) splits the object into k data shards, computes m parity shards, and spreads all k+m across distinct fault domains; any k shards reconstruct the object. Backblaze Vault uses 17+3: 17 data + 3 parity = 20 shards, lose any 3 and lose nothing, with only 20/17≈1.18x amplification — and it too reaches eleven 9s.

Why not use EC everywhere? ① Small files don't pay: split into 17 and each shard is tiny, so metadata and IOPS overhead eat the savings — hence hot/small objects often use replicas, cold/large objects use EC. ② Repair read amplification: rebuilding one shard reads k others (17 reads vs a replica's single copy), amplifying network during failure storms. ③ CPU: encode/decode burns cycles and raises write latency.
# Reed-Solomon idea (pseudo-code, not runnable)
# treat data shards as polynomial coefficients;
# parity = evaluations at extra points
data   = split(obj, k=17)              # 17 data shards
parity = rs_encode(data, m=3)          # Vandermonde/Cauchy matrix → 3 parity
shards = data + parity                 # 20 shards across 20 fault domains

# rebuild after losing <=3 shards: solve a 17-unknown linear system
def repair(surviving_shards):          # any 17 suffice
    return rs_decode(surviving_shards) # matrix inverse x survivors
Real cases: Backblaze open-sourced its own Reed-Solomon library; Vault uses 17+3 across 20 pods. AWS S3 promises eleven 9s, redundant across ≥3 AZs with continuous CRC scrubbing of every object. Facebook f4 uses EC for warm BLOBs, cutting old-photo storage amplification from 3.6x (with cross-DC replicas) to ~2.1x. Dropbox's Magic Pocket also uses EC to drive down cross-region redundancy cost.

3. Large-Object Upload — Multipart + Direct Client Upload

The trade-off in one line: parallel chunked upload + presigned-URL direct upload buys high throughput and resumability, at the cost of client complexity and "orphan parts" needing GC.

Principle: 5 TB over a single TCP stream restarts from zero on any hiccup, and single-stream throughput is capped by RTT × window. Multipart upload splits the object into parts (S3 mandates 5MB–5GB/part, max 10,000 parts); each part uploads independently — parallel, retryable, with its own ETag — and a final CompleteMultipartUpload stitches the part list into the object. Paired with a presigned URL: the server signs a time-limited URL with its secret, and the client PUTs directly to the storage layer, bytes never touching your app servers — saving bandwidth and a network hop.

# the server only issues signatures, never touches the byte stream
def init_upload(key, size):
    upload_id = meta.create_multipart(key)
    n = ceil(size / PART_SIZE)                 # e.g. 64MB/part
    urls = [presign("PUT", key, upload_id, i)  # one signed URL per part
            for i in range(n)]
    return upload_id, urls
# client: PUT parts in parallel, retry only the failed part, record {part_i: etag}
# finish: POST complete(upload_id, [(i, etag)...]) -> atomic metadata commit
Pitfall: a client that bails mid-upload leaves orphan parts occupying disk and still billing. S3's fix is lifecycle rule: abort incomplete multipart upload after N days — always configure it, or cost quietly leaks.
Real cases: S3 multipart upload + presigned URLs are the SDK's default large-file path (auto-chunked concurrency >100MB). Netflix uploading masters, Dropbox syncing big files, and virtually every "browser-direct-to-S3" SaaS rely on presigned URLs to offload traffic from their own servers.

4. Strong Metadata Consistency & Hot Partitions

The trade-off in one line: strong consistency makes read-after-write easy, but metadata must be sharded to handle hundreds of billions of keys, and sharding creates hot spots on sequential keys.

Principle: metadata is a giant table of "key → shard locations + version + ACL" and must be sharded (by key hash). Before 2020 S3 was eventually consistent (a freshly PUT object might not appear in a list); in late 2020 S3 shipped strong read-after-write consistency — once a PUT succeeds, any GET/LIST sees it immediately, at no extra price or performance cost. The price is that the metadata layer must be strongly consistent within a shard (typically a Paxos/Raft-replicated KV). But sharding by sequential key prefix hits hot spots: if every key is a `2026-06-20/...` time prefix, writes forever slam the same shard.

Interview favorite: old S3 guidance suggested adding a random (hashed) prefix to spread hot spots; since 2018 S3 auto-partitions adaptively by prefix, each prefix reaching 3,500 PUT / 5,500 GET per second — but you still design key schemas to avoid monotonically increasing prefixes. This is the same "database sharding hot spot" (Day 4) trap, applied to storage metadata.
Real cases: Google Colossus (the GFS successor) moved metadata off a single master onto a BigTable-backed shardable metadata layer, breaking GFS's single-master scaling ceiling. S3's strong-consistency overhaul likewise rebuilt the metadata subsystem into a strongly consistent KV.

Scaling & Optimization

Pitfalls + Interview Questions

Deep-Dive Resources

Going Deeper

How is eleven 9s of durability actually "computed", and which assumption does it lean on most?
Durability ≈ driving the probability that "all of an object's redundant shards are lost before repair completes" extremely low. Given a disk's annual failure rate (AFR) and repair time (MTTR), you build a Markov/binomial model: under k+m coding, you only lose data if ≥ m+1 independent fault domains die within the MTTR window. The two levers are fault-domain independence and repair speed — faster repair and more independent domains shrink the loss window. The most fragile assumption is failures being independent: same-batch bad disks, same-rack power loss, and same-region disasters are correlated failures that instantly break the pretty probability model. That's exactly why you place across AZs/fault domains and scrub continuously (to shrink MTTR).
Why is object storage almost always "immutable" (overwrite, not in-place edit)? What does that limitation buy you?
Immutability makes everything simpler and concurrent-safe: ① caches/CDNs can cache forever because a key's content never changes (a change mints a new version id); ② replication and EC never handle a "half-edited" intermediate state — a write either fully lands or doesn't exist; ③ no in-place writes means no read/write lock contention, so multiple copies are naturally consistent; ④ versioning and snapshots come for free. The cost is "changing one byte rewrites the whole object", so object storage is unfit for frequent small edits (that's block/file or database territory). It's the classic trade-off of restricted functionality for scalability, consistency, and cost.
S3 upgraded from eventual to strong read-after-write consistency in 2020 "at no extra price or performance cost" — what did that re-engineer?
Eventual consistency usually stems from "metadata served by caches / async indexing". For strong consistency, the moment a PUT succeeds the authoritative metadata record must already sit where subsequent reads hit — meaning the metadata subsystem itself must be a strongly consistent sharded KV (Paxos/Raft-style), with the read path hitting the authoritative shard directly rather than a possibly-stale cache. "No performance cost" means they didn't use the naive "lock on read / read-quorum" route, but pushed consistency down into the per-partition replication protocol so ordinary reads stay single-hop. Essentially they moved the consistency cost from "every request" to a one-time "metadata architecture" investment.
For 100 PB of cold data, pick 3 replicas or 17+3 EC? Do the cost math, then say when replicas actually win.
3 replicas: 100 PB logical → 300 PB raw. 17+3 EC: 20/17≈1.18x → ~118 PB raw. A ~180 PB gap, and at cold-storage rates of thousands of dollars per PB per month, the annual saving is in the tens of millions — EC for cold data is a near no-brainer. When do replicas still win? ① Many small files: split into 17 and each shard is a few KB, metadata rows explode and random-read IOPS gets hammered, so EC's space savings drown in ops cost. ② Latency-sensitive and frequently read: replicas read a single nearby copy; EC normal reads only fetch data shards (no decode), but any missing shard forces pulling k shards to rebuild, worsening tail latency. So reality is tiered: hot/small → replicas, cold/large → EC.
Direct presigned-URL upload bypasses your app servers — what control do you lose, and how do you make up for it?
You lose the interception point on the request path: you can't run virus scanning, content moderation, hard quota checks, rewriting/transcoding, or fine-grained billing at the instant of write. Two remedies: ① Constrain at signing time: embed conditions in the presigned URL (content-length-range to cap size, content-type to limit type, key prefix to confine path) — the signature itself is an authorization decision. ② Event-driven post-processing: upload completion fires an event (S3 Event → Lambda/queue) to asynchronously scan, transcode, index, and reconcile, quarantining/deleting on violation. The essence is replacing "synchronous gatekeeping" with "authorization + asynchronous governance" — the standard shape of serverless media pipelines.