Day 29 Hard Object Storage Erasure Coding Multipart Upload Durability

File & Object Storage — Trading 1.5x Space for Eleven 9s of DurabilityS3 Architecture, Erasure Coding, Multipart Upload, Durability

Problem & Constraints

Design an S3-like object store: serving hundreds of billions of objects, 100 PB, objects from 1 byte to 5 TB, with eleven 9s of durability (99.999999999% — fewer than 1 lost object per 10 million per year). This is the bedrock under cloud drives, backups, data lakes, and image/video origins — Dropbox, Netflix masters, and your app's avatars all live here.

Read/write ratio & size distribution: write-once-read-many, but object size is extremely long-tailed — 99% of objects are small (<1MB), 99% of bytes live in big files. Optimize both ends separately.
Consistency: a GET right after a PUT must see it (read-after-write); overwrite and list consistency can be relaxed.
Durability ≫ availability: lost data is irreversible; a brief outage is recoverable. Durability is the first-class citizen at eleven 9s; four 9s of availability is fine.
Throughput: a single 5 TB object takes an hour even at 10 Gbps — you need parallel chunking + resumable upload, not one TCP stream end to end.
Cost: 100 PB at 3x replication is 300 PB of raw disk; erasure coding squeezes it to ~130 PB. Cost per PB per month (power, racks) is life or death.

High-Level Architecture

graph TD
    C["Client
SDK / direct PUT via presigned URL"]
    LB["API layer
S3 protocol / Auth / throttle"]
    subgraph CP["Control Plane"]
      META["Metadata service
key → chunk locations, sharded KV"]
    end
    subgraph DP["Data Plane"]
      PL["Placement / EC encoder"]
      S1["Storage Node 1"]
      S2["Storage Node 2"]
      S3["Storage Node N
raw disk + CRC"]
    end
    BG["Background jobs
scrub · repair · GC · tiering"]
    C --> LB
    LB --> META
    LB --> PL
    PL --> S1 & S2 & S3
    META -.index.-> DP
    BG -.scan/repair.-> DP

The core is control-plane / data-plane separation: the metadata service holds the "object key → which disks hold the chunks" index (small, strongly consistent, must scale to hundreds of billions of rows); the data plane just stores immutable byte chunks (large, optimized for throughput and cost). Scaling them independently is what sets object storage apart from a file system — no directory tree, no inode chains; a flat keyspace makes horizontal scaling almost unbounded. Background jobs (scrub, repair, GC) are the true source of durability, decoupled from the request path.

Key Techniques

1. Three Storage Models — Why Object Storage Scales Without Limit

The trade-off in one line: object storage gives up in-place mutation and directory semantics in exchange for a flat keyspace's unbounded horizontal scale and rock-bottom unit cost.

Principle: Block (e.g. EBS/raw disk) gives fixed-size sectors — most flexible but you manage the filesystem yourself; File (e.g. NFS/EFS) gives a POSIX directory tree and in-place rewrites, but the directory tree is a tree that must be kept consistent, so metadata becomes the scaling bottleneck. Object (e.g. S3) treats the whole file as an immutable blob keyed by a flat string (the `/` in `a/b/c.jpg` is just a character, not a directory) — no rename, no append, no inode chain. Metadata collapses into a giant KV table that can be sharded arbitrarily by key hash.

Dimension	Block	File	Object
Access unit	sector/LBA	path + offset	whole object (key)
Mutation	in-place random write	in-place random write	full overwrite (immutable)
Metadata	none (raw)	dir tree/inode	flat KV
Scale ceiling	per-volume limited	dir-tree bottleneck	near-unbounded
Typical	EBS, local disk	EFS, NFS, HDFS	S3, GCS, R2

Real cases: AWS S3 stores over 100 trillion objects (2021 Pi Day figure), powered by a flat keyspace + sharded metadata; POSIX-semantic workloads (e.g. mmap during training) go to FSx/Lustre instead. Dropbox's Magic Pocket likewise chops files into immutable blocks indexed by a KV (key→location), not a directory tree.

2. The Durability Engine — Replication vs Erasure Coding

The trade-off in one line: 3 replicas are simple and repair fast but cost 200% space amplification; erasure coding drops to ~30-50% amplification at the cost of encoding CPU, read amplification during repair, and being unfit for small files.

Principle: for eleven 9s, a single disk's ~2-4% annual failure rate is nowhere near enough. 3-replication puts one copy in each of 3 AZs — any 2 can fail and you still read; brute-force simple. Erasure coding (Reed-Solomon) splits the object into k data shards, computes m parity shards, and spreads all k+m across distinct fault domains; any k shards reconstruct the object. Backblaze Vault uses 17+3: 17 data + 3 parity = 20 shards, lose any 3 and lose nothing, with only 20/17≈1.18x amplification — and it too reaches eleven 9s.

Why not use EC everywhere? ① Small files don't pay: split into 17 and each shard is tiny, so metadata and IOPS overhead eat the savings — hence hot/small objects often use replicas, cold/large objects use EC. ② Repair read amplification: rebuilding one shard reads k others (17 reads vs a replica's single copy), amplifying network during failure storms. ③ CPU: encode/decode burns cycles and raises write latency.

# Reed-Solomon idea (pseudo-code, not runnable)
# treat data shards as polynomial coefficients;
# parity = evaluations at extra points
data   = split(obj, k=17)              # 17 data shards
parity = rs_encode(data, m=3)          # Vandermonde/Cauchy matrix → 3 parity
shards = data + parity                 # 20 shards across 20 fault domains

# rebuild after losing <=3 shards: solve a 17-unknown linear system
def repair(surviving_shards):          # any 17 suffice
    return rs_decode(surviving_shards) # matrix inverse x survivors

Real cases: Backblaze open-sourced its own Reed-Solomon library; Vault uses 17+3 across 20 pods. AWS S3 promises eleven 9s, redundant across ≥3 AZs with continuous CRC scrubbing of every object. Facebook f4 uses EC for warm BLOBs, cutting old-photo storage amplification from 3.6x (with cross-DC replicas) to ~2.1x. Dropbox's Magic Pocket also uses EC to drive down cross-region redundancy cost.

3. Large-Object Upload — Multipart + Direct Client Upload

The trade-off in one line: parallel chunked upload + presigned-URL direct upload buys high throughput and resumability, at the cost of client complexity and "orphan parts" needing GC.

Principle: 5 TB over a single TCP stream restarts from zero on any hiccup, and single-stream throughput is capped by RTT × window. Multipart upload splits the object into parts (S3 mandates 5MB–5GB/part, max 10,000 parts); each part uploads independently — parallel, retryable, with its own ETag — and a final CompleteMultipartUpload stitches the part list into the object. Paired with a presigned URL: the server signs a time-limited URL with its secret, and the client PUTs directly to the storage layer, bytes never touching your app servers — saving bandwidth and a network hop.

# the server only issues signatures, never touches the byte stream
def init_upload(key, size):
    upload_id = meta.create_multipart(key)
    n = ceil(size / PART_SIZE)                 # e.g. 64MB/part
    urls = [presign("PUT", key, upload_id, i)  # one signed URL per part
            for i in range(n)]
    return upload_id, urls
# client: PUT parts in parallel, retry only the failed part, record {part_i: etag}
# finish: POST complete(upload_id, [(i, etag)...]) -> atomic metadata commit

Pitfall: a client that bails mid-upload leaves orphan parts occupying disk and still billing. S3's fix is lifecycle rule: abort incomplete multipart upload after N days — always configure it, or cost quietly leaks.

Real cases: S3 multipart upload + presigned URLs are the SDK's default large-file path (auto-chunked concurrency >100MB). Netflix uploading masters, Dropbox syncing big files, and virtually every "browser-direct-to-S3" SaaS rely on presigned URLs to offload traffic from their own servers.

4. Strong Metadata Consistency & Hot Partitions

The trade-off in one line: strong consistency makes read-after-write easy, but metadata must be sharded to handle hundreds of billions of keys, and sharding creates hot spots on sequential keys.

Principle: metadata is a giant table of "key → shard locations + version + ACL" and must be sharded (by key hash). Before 2020 S3 was eventually consistent (a freshly PUT object might not appear in a list); in late 2020 S3 shipped strong read-after-write consistency — once a PUT succeeds, any GET/LIST sees it immediately, at no extra price or performance cost. The price is that the metadata layer must be strongly consistent within a shard (typically a Paxos/Raft-replicated KV). But sharding by sequential key prefix hits hot spots: if every key is a `2026-06-20/...` time prefix, writes forever slam the same shard.

Interview favorite: old S3 guidance suggested adding a random (hashed) prefix to spread hot spots; since 2018 S3 auto-partitions adaptively by prefix, each prefix reaching 3,500 PUT / 5,500 GET per second — but you still design key schemas to avoid monotonically increasing prefixes. This is the same "database sharding hot spot" (Day 4) trap, applied to storage metadata.

Real cases: Google Colossus (the GFS successor) moved metadata off a single master onto a BigTable-backed shardable metadata layer, breaking GFS's single-master scaling ceiling. S3's strong-consistency overhaul likewise rebuilt the metadata subsystem into a strongly consistent KV.

Scaling & Optimization

Tiering: hot data on SSD/replicas, warm on HDD/EC, cold on SMR/tape (S3 Glacier). Auto-demote by access frequency to cut cost by an order of magnitude. Dropbox specifically optimized Magic Pocket for cold storage and adopted SMR drives.
Versioning + soft delete: immutable objects are naturally multi-versioned; delete just writes a tombstone, with GC reclaiming later. This is also the basis for backup and ransomware defense (versioning + MFA delete + cross-region replication).
Continuous scrubbing: background jobs constantly recompute and compare CRCs to catch bit rot and bad disks early, repairing before data becomes unrecoverable. This — more than the request path — is what actually buys eleven 9s.
Cross-region replication (CRR): asynchronously copy objects to another region for DR and data-residency compliance. Note it is eventually consistent; RPO is not zero.

Pitfalls + Interview Questions

Treating object storage as a filesystem: assuming `ls` on a "directory" is cheap — it's actually a LIST prefix scan, slow and pricey at millions of keys. There's no true rename (it's copy+delete).
Replicating cold data: 3 replicas for cold data burns money — use EC; conversely, EC for small files is asking for trouble.
Forgetting orphan multiparts: skip the lifecycle abort rule and cost leaks month after month.
Follow-ups: ① How is eleven 9s actually computed? (a probability model over fault-domain independence + repair speed vs failure rate) ② How do you upload a 5 TB object? Resume it? ③ What does each of strong vs eventual consistency give up in object storage? ④ How do you handle hot keys — is it the same problem as database sharding? ⑤ When is space actually freed after deleting an object? (GC + tombstone)

Deep-Dive Resources

"Designing Data-Intensive Applications" (Kleppmann) — the underlying principles of replication, partitioning, and consistency; the metadata layer is exactly these concepts applied.
Werner Vogels — "Building and operating a pretty big storage system called S3" (allthingsdistributed.com, 2023) — a first-hand view of S3's durability, scrubbing, and scale.
Dropbox Engineering — "Inside the Magic Pocket" (dropbox.tech) — EC, placement, and migration in an exabyte-scale self-built object store.
Backblaze — "Reed-Solomon Erasure Coding" / "Vault Architecture" (backblaze.com/blog) — 17+3 EC and an open-source implementation, the durability math made concrete.

Going Deeper

How is eleven 9s of durability actually "computed", and which assumption does it lean on most?

Durability ≈ driving the probability that "all of an object's redundant shards are lost before repair completes" extremely low. Given a disk's annual failure rate (AFR) and repair time (MTTR), you build a Markov/binomial model: under k+m coding, you only lose data if ≥ m+1 independent fault domains die within the MTTR window. The two levers are fault-domain independence and repair speed — faster repair and more independent domains shrink the loss window. The most fragile assumption is failures being independent: same-batch bad disks, same-rack power loss, and same-region disasters are correlated failures that instantly break the pretty probability model. That's exactly why you place across AZs/fault domains and scrub continuously (to shrink MTTR).

Why is object storage almost always "immutable" (overwrite, not in-place edit)? What does that limitation buy you?

Immutability makes everything simpler and concurrent-safe: ① caches/CDNs can cache forever because a key's content never changes (a change mints a new version id); ② replication and EC never handle a "half-edited" intermediate state — a write either fully lands or doesn't exist; ③ no in-place writes means no read/write lock contention, so multiple copies are naturally consistent; ④ versioning and snapshots come for free. The cost is "changing one byte rewrites the whole object", so object storage is unfit for frequent small edits (that's block/file or database territory). It's the classic trade-off of restricted functionality for scalability, consistency, and cost.

S3 upgraded from eventual to strong read-after-write consistency in 2020 "at no extra price or performance cost" — what did that re-engineer?

Eventual consistency usually stems from "metadata served by caches / async indexing". For strong consistency, the moment a PUT succeeds the authoritative metadata record must already sit where subsequent reads hit — meaning the metadata subsystem itself must be a strongly consistent sharded KV (Paxos/Raft-style), with the read path hitting the authoritative shard directly rather than a possibly-stale cache. "No performance cost" means they didn't use the naive "lock on read / read-quorum" route, but pushed consistency down into the per-partition replication protocol so ordinary reads stay single-hop. Essentially they moved the consistency cost from "every request" to a one-time "metadata architecture" investment.

For 100 PB of cold data, pick 3 replicas or 17+3 EC? Do the cost math, then say when replicas actually win.

3 replicas: 100 PB logical → 300 PB raw. 17+3 EC: 20/17≈1.18x → ~118 PB raw. A ~180 PB gap, and at cold-storage rates of thousands of dollars per PB per month, the annual saving is in the tens of millions — EC for cold data is a near no-brainer. When do replicas still win? ① Many small files: split into 17 and each shard is a few KB, metadata rows explode and random-read IOPS gets hammered, so EC's space savings drown in ops cost. ② Latency-sensitive and frequently read: replicas read a single nearby copy; EC normal reads only fetch data shards (no decode), but any missing shard forces pulling k shards to rebuild, worsening tail latency. So reality is tiered: hot/small → replicas, cold/large → EC.

Direct presigned-URL upload bypasses your app servers — what control do you lose, and how do you make up for it?

You lose the interception point on the request path: you can't run virus scanning, content moderation, hard quota checks, rewriting/transcoding, or fine-grained billing at the instant of write. Two remedies: ① Constrain at signing time: embed conditions in the presigned URL (content-length-range to cap size, content-type to limit type, key prefix to confine path) — the signature itself is an authorization decision. ② Event-driven post-processing: upload completion fires an event (S3 Event → Lambda/queue) to asynchronously scan, transcode, index, and reconcile, quarantining/deleting on violation. The essence is replacing "synchronous gatekeeping" with "authorization + asynchronous governance" — the standard shape of serverless media pipelines.

← Back to index