Day 3 Medium Database SQL vs NoSQL OLTP/OLAP ACID/BASE

Database Selection — Not Just "SQL vs NoSQL"Database Selection: SQL vs NoSQL, OLTP vs OLAP, ACID vs BASE

The Problem

When an interviewer asks "why MongoDB?", a senior engineer can blow it — because the right answer isn't "because it's fast," it's "because my access pattern is reads aggregated by user_id, writes outweigh reads, and the schema changes often." Database selection isn't a tech religion; it's constraint derivation. Data model, access pattern, consistency requirements, scale, and operational cost — five axes decide everything.

The counter-examples are plentiful. Uber migrated from Postgres to MySQL in 2016 (Postgres replication protocol's write amplification), then in 2020 built Schemaless on top of MySQL as a KV store. Pinterest moved from MongoDB to sharded MySQL in 2012. Discord migrated message storage from MongoDB to Cassandra in 2017 (1B messages/day), then to ScyllaDB in 2023. Stripe still runs MongoDB for the core ledger — backed by a meticulous application-layer consistency layer. There is no best database, only the closest fit.

Requirements and Constraints (the interview opener)

Access pattern: point lookup by primary key, secondary-index scan, range query, multi-table join, or full-table aggregation? This dictates the database type.
Read/write ratio and QPS: 100k QPS of writes won't fit on a single Postgres (ceiling around 50k); you need Cassandra, DynamoDB, or sharded MySQL.
Data volume and growth: 100GB? 10TB? 1PB? Above 10TB, single-box OLTP databases are out — you must shard or move to distributed SQL (CockroachDB, TiDB, Spanner).
Consistency requirements: bank balance (strong) vs social like (eventual).
Latency requirements: p99 1ms (Redis), 10ms (DynamoDB), 100ms (Postgres + index), seconds (OLAP).
Schema stability: still iterating fast (NoSQL flexible) or stable (SQL's structure is safer)?
Operational capacity: does the team know how to tune Postgres or Cassandra? Managed (Aurora, DynamoDB) or self-hosted?

High-Level Architecture (Typical Data Stack)

graph LR
    APP["Application
API / Service"]
    CACHE["Redis
KV cache · sessions · leaderboards"]
    OLTP["OLTP
Postgres / MySQL · DynamoDB"]
    SEARCH["Search
Elasticsearch · OpenSearch"]
    QUEUE["CDC / Kafka
binlog → downstream"]
    DWH["Data Warehouse
Snowflake / BigQuery · OLAP"]
    LAKE["Data Lake
S3 + Iceberg · cold data"]

    APP --> CACHE
    APP --> OLTP
    APP --> SEARCH
    OLTP -->|binlog| QUEUE
    QUEUE --> DWH
    QUEUE --> SEARCH
    QUEUE --> LAKE

    classDef app fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef cache fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef oltp fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef olap fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class APP app
    class CACHE cache
    class OLTP,SEARCH oltp
    class QUEUE,DWH,LAKE olap

A modern backend is usually polyglot: OLTP for transactions, OLAP for analytics, search for search, cache for reads

Key Technical Points

1. SQL vs NoSQL: Not a Binary — Pick a Data Model

Principle: "NoSQL" is a poor label. It's an umbrella for four genuinely different things:

Type	Examples	Data model	Typical use
Relational (SQL)	Postgres, MySQL, Oracle	rows + schema + joins	transactions, orders, relational data
Key-Value	Redis, DynamoDB, Riak	K → V (V may be JSON)	sessions, caching, counters
Document	MongoDB, Couchbase	nested JSON documents	user profiles, CMS, evolving schemas
Wide-column	Cassandra, HBase, ScyllaDB	row key + column families	time series, messaging, heavy writes
Graph	Neo4j, Neptune, TigerGraph	nodes + edges	social graphs, fraud, knowledge graphs
Search	Elasticsearch, OpenSearch	inverted index	full-text search, logs, analytics
Time-series	InfluxDB, TimescaleDB, Prometheus	(time, tags, value)	monitoring, IoT, financial ticks

SQL strengths:

ACID across rows and tables: transfers, inventory deductions, order + payment.
Multi-table joins: user → orders → products in a single query.
Strong schema: bad data can't get in; "flexibility" in NoSQL is often a euphemism for dirty data.
Declarative SQL: the DB optimizes execution plans — usually faster than what you'd hand-code.

NoSQL strengths:

Native horizontal scale: add a Cassandra node and you're done; Postgres adds replicas easily, shards painfully.
Schema-less: fast iteration, no migration to add a field (an advantage while access patterns are unknown; a liability once they stabilize).
Extreme latency / throughput: DynamoDB sustains 3000 RCU + 1000 WCU per partition, p99 in single-digit ms.

Real cases:

Stripe: core ledger still on sharded MongoDB, with application-layer strong consistency and dual-write validation filling the gaps. The original choice was for schema flexibility; migrating is no longer practical.
Notion: Postgres + sharding (480 logical databases), all blocks in one table, sharded by hash(block_id).
Discord: message storage on Cassandra from 2017 (1B msg/day), switched to ScyllaDB in 2023 (same protocol, C++ implementation, p99 latency down 95%).
Facebook: user data on sharded MySQL (thousands of shards) plus TAO (a graph cache layer); ads on MySQL; messaging on HBase.
Uber: Postgres → MySQL → Schemaless (in-house KV on MySQL) → Docstore (in-house distributed SQL).

2. OLTP vs OLAP: Row Storage vs Columnar, the Real Difference

Principle: transactional and analytical systems have inverted access patterns:

OLTP: each request reads/writes a few rows by primary key, write-heavy, low-latency (ms), ACID. "Get user 123's orders," "decrement inventory."
OLAP: scans huge row sets, read-heavy, long-running (seconds to minutes), aggregations. "GMV by region for the last 30 days."

This determines the storage layout. OLTP uses row storage (all fields of a row together, optimized for fetching a whole row). OLAP uses columnar storage (all values of one column together, optimized for aggregation, high compression, and SIMD).

graph LR
    subgraph Row["Row Store (Postgres / MySQL)"]
        R1["[1, Alice, 30, US]"]
        R2["[2, Bob, 25, UK]"]
        R3["[3, Carol, 28, JP]"]
    end
    subgraph Col["Columnar Store (BigQuery / Snowflake)"]
        C1["id: [1, 2, 3]"]
        C2["name: [Alice, Bob, Carol]"]
        C3["age: [30, 25, 28]"]
        C4["country: [US, UK, JP]"]
    end
    Row -->|"SELECT * WHERE id=2 ✅"| Q1["fast point lookup"]
    Col -->|"AVG(age) GROUP BY country ✅"| Q2["fast aggregation · 10x compression"]
    classDef row fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef col fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class Row,R1,R2,R3 row
    class Col,C1,C2,C3,C4 col

Trade-off:

Row-store OLTP: ✅ fast point lookups and single-row updates, transactions are easy; ❌ slow aggregations (must scan the whole row), poor compression.
Columnar OLAP: ✅ aggregations are extremely fast (read only the column you need), 5–20× compression (same-column values are similar), SIMD-friendly; ❌ single-row updates are expensive (you touch N columns), wrong for OLTP.
HTAP hybrid: TiDB and SingleStore support both, with row + columnar replicas, but operational and consistency complexity is high. The more common cloud-era pattern is "OLTP + CDC → OLAP" as two independent systems.

# Why columnar aggregations are fast: only read the column you need
# Table with 100 columns, 1B rows
# OLTP row store running SELECT AVG(salary):
#   scans 1B × 100 columns of data = massive I/O
# OLAP columnar running SELECT AVG(salary):
#   reads only the salary column (1B × 1 column), compressed to ~1GB
#   + SIMD vectorized execution → sub-second response

Real cases:

Snowflake: columnar + multi-tenant + storage/compute separation (S3 stores data, warehouses spin up on demand) — the cloud-native OLAP textbook.
Google BigQuery: columnar + Dremel execution, sub-second on PB-scale scans. Underlying format is Capacitor.
ClickHouse: the open-source columnar OLAP dark horse out of Yandex; Cloudflare uses it for logs (30PB+, 11M rows/s inserts).
Airbnb: OLTP Postgres + CDC → Druid / Presto for real-time analytics; nightly Spark + S3 for the warehouse.
Databricks Lakehouse: S3 + Delta Lake (columnar Parquet + transaction log), an attempt at "one data set, two workloads."

3. ACID vs BASE: Two Philosophies of Consistency

ACID (Atomicity, Consistency, Isolation, Durability): the traditional RDBMS promise. BASE (Basically Available, Soft state, Eventual consistency): the NoSQL compromise.

	ACID	BASE
Atomicity	transactions are all-or-nothing	none; rely on idempotent design
Consistency	everyone sees a consistent view post-transaction	eventual; reads may see old values within the window
Isolation	SI / RR / Serializable	even read-committed is hard to guarantee
Durability	commit fsyncs to disk	async multi-replica; rare edge cases lose data
Typical	Postgres, Spanner, CockroachDB	DynamoDB (classic), Cassandra, Riak

But since the 2010s, the line has been blurring:

Spanner / CockroachDB / TiDB: distributed SQL using Paxos/Raft + TrueTime to give ACID with horizontal scale.
DynamoDB 2018+: supports transactions (TransactWriteItems / TransactGetItems), up to 100 items / 4MB.
MongoDB 4.0+: multi-document transactions (10× slower than single-document, but they exist).

# Postgres SERIALIZABLE transaction (highest isolation)
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
-- if balance >= 100, debit
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

COMMIT;
-- ⚠️ Serializable raises SerializationFailure on concurrent conflict
-- the application must catch and retry (backoff + jitter)

Trade-off:

Strong-consistency ACID: ✅ application code is simple (don't think about concurrency), hard to get wrong; ❌ single-box ceiling (Postgres writes ~50k QPS), cross-region latency is high (Spanner cross-region write ~100ms).
BASE eventual consistency: ✅ extreme availability, no horizontal-scaling ceiling; ❌ the application must handle stale reads, duplicate writes — development complexity explodes.
"Correctness vs speed": financial transfers must be ACID; social likes, counters, recommendation rankings can be BASE.

Real cases:

Google Spanner: TrueTime (GPS + atomic clocks, bounded clock skew) enables global ACID. Powers Google Ads, exposed as Cloud Spanner.
Amazon DynamoDB: strict BASE originally (2007 paper); transactions added in 2018, but ACID mode halves throughput.
CockroachDB: open-source Spanner-style design using HLC (Hybrid Logical Clock) for cross-region ACID. Used by DoorDash and Comcast.
TiDB: PingCAP's MySQL-compatible distributed SQL, heavily used in Chinese e-commerce for OLTP+OLAP.
Cassandra: BASE textbook — tunable consistency via quorum reads/writes (R + W > N). Used by Discord, Apple, Netflix.

4. A Decision Tree: Postgres / DynamoDB / Cassandra / Redis

Principle: four mainstream options cover 90% of scenarios. Here's a practical decision path.

graph TD
    Start{"Need transactions and joins?"}
    Start -->|"Yes"| SQL_q{"Data > 1TB?"}
    Start -->|"No"| KV_q{"Latency < 1ms?"}

    SQL_q -->|"No"| PG["✅ Postgres
transactions · complex queries · JSONB"]
    SQL_q -->|"Yes"| Dist_q{"Global multi-region?"}

    Dist_q -->|"Yes"| Span["✅ Spanner / CockroachDB
cross-region ACID"]
    Dist_q -->|"No"| TiDB["✅ TiDB / Vitess
distributed MySQL"]

    KV_q -->|"Yes"| Redis["✅ Redis
in-memory KV · cache · leaderboards"]
    KV_q -->|"No"| Scale_q{"> 50k write QPS & massive data?"}

    Scale_q -->|"Yes"| Wide_q{"Managed or self-hosted?"}
    Scale_q -->|"No"| PG2["✅ Postgres
usually enough"]

    Wide_q -->|"Managed"| DDB["✅ DynamoDB
zero-ops · pay per access"]
    Wide_q -->|"Self-hosted"| Cass["✅ Cassandra / Scylla
multi-region write · BASE"]

    classDef sql fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef kv fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef wide fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    classDef q fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    class PG,PG2,Span,TiDB sql
    class Redis kv
    class DDB,Cass wide
    class Start,SQL_q,KV_q,Dist_q,Scale_q,Wide_q q

The senior-engineer default: unless you can articulate why not Postgres (specific QPS, data volume, or access pattern), use Postgres. It has transactions, JSONB (semi-structured), full-text search (tsvector), geo (PostGIS), time series (TimescaleDB extension), and columnar (Citus) — nearly a one-stop shop. Notion, Instagram, Robinhood, Reddit, GitLab, Heroku all run Postgres heavily.

Real cases:

Instagram: single Postgres early on; at scale, "multiple logical shards in the same database instance" sharded by hash(user_id). Main store today is still Postgres + Cassandra (for feed cache).
Lyft: DynamoDB for driver location (write-heavy, point lookup) plus Postgres for orders.
Netflix: Cassandra (primary) + DynamoDB (some) + EVCache (Memcached fork) + Elasticsearch.
Shopify: MySQL on Vitess (4PB+), sharded by shop_id — one shop per pod.
OpenAI: Postgres for user / billing + Redis; API logs in ClickHouse.

Scaling Up: What to Do After Growth

Single-box Postgres can't take it: first add read replicas (read/write split) → vertical sharding (split by domain: user / order / inventory) → horizontal sharding (Citus / Vitess / application-layer).
Cross-domain joins: CDC (Debezium pulling binlog) → Kafka → materialize to search/warehouse; application-layer joins become downstream consumers.
OLAP slowing OLTP down: never run analytical SQL on the primary. Use a read replica or CDC to an independent OLAP system (Snowflake, ClickHouse).
Multi-region: single-region OLTP (avoiding distributed transactions) with async replication to other regions; or move to Spanner / CockroachDB (global ACID with 100ms+ write latency).
Cold data archival: hot data in OLTP (last 90 days), cold data on S3 + Iceberg / Parquet, queried via Athena / Trino.

Common Pitfalls (Even Senior Engineers Slip Here)

1. Using MongoDB like a relational DB: once the business stabilizes, "no joins" becomes a curse. Either you do N+1 queries in the app (slow) or you denormalize (write amplification, consistency nightmare). Stripe still pays the price.

2. Using Redis as the primary store: Redis persistence (RDB/AOF) was never the design goal; crash RPO is at least a few seconds. For account balances, Redis is a cache, not the source of truth.

3. Letting the ORM generate SQL: ActiveRecord / Django ORM slips into N+1 or full-table scans without warning. EXPLAIN before production.

4. Designing schema before access pattern: NoSQL (especially DynamoDB) demands "decide the queries first, then design the table." Pick the wrong partition key and you're rebuilding. AWS pushes single-table design, but the learning curve is steep.

5. Schema migration is rarely free: ALTER on a large table to add a column or change a type can lock for hours. In production: online DDL (pt-online-schema-change, gh-ost), batched backfill, dual-write old + new columns.

6. Cross-database distributed transactions: "write Postgres + write Redis" with consistency requirements needs the outbox pattern (Day 7). Don't reach for 2PC.

Sample Interview Questions

Designing Twitter — what stores the timeline? Why not all MySQL?
Postgres or MySQL? Name three real differences (don't say "about the same").
When would you pick Cassandra over DynamoDB?
What does the "S" (soft state) in BASE mean? Give an example.
From 100k to 100M users, how does the database evolve? List phases and each phase's bottleneck.
HTAP databases (TiDB) vs the "OLTP + CDC to OLAP" pattern — which do you pick?

Key Resources

Designing Data-Intensive Applications, Ch. 2–3 (Kleppmann): the canonical treatment of data models and storage engines.
"Dynamo: Amazon's Highly Available Key-value Store" (SOSP 2007) and "Spanner: Google's Globally Distributed Database" (OSDI 2012): the two milestones of NoSQL and distributed SQL.
Discord engineering blog, "How Discord Stores Trillions of Messages" (2023): the MongoDB → Cassandra → ScyllaDB migration story, with latency data.

English Summary

Database selection is constraint derivation, not technology religion. Five axes determine the choice: data model (relational / KV / document / wide-column / graph / search / time-series), access pattern (point lookup vs range scan vs aggregation vs join), consistency (ACID vs BASE), scale (QPS, data size, geo distribution), and operational cost. Default to Postgres unless you have a specific reason — its versatility (JSONB, full-text, geo, time-series extensions) covers 80% of needs. Reach for DynamoDB / Cassandra when write throughput exceeds ~50k QPS or you need horizontal scale-out; Redis when sub-ms latency is required; columnar warehouses (Snowflake / BigQuery / ClickHouse) when analytical aggregation dominates. Modern stacks combine multiple stores via CDC pipelines: OLTP handles transactions, search handles full-text, warehouse handles analytics.

Going Deeper (click to expand)

1. The same business entity (e.g. user) lives in both OLTP (Postgres) and OLAP (Snowflake); OLAP always lags by hours. The PM wants "real-time analytics." Give at least three approaches and the cost of each.

The crux: real-time analytics needs minute- or second-level freshness, which traditional daily-batch ETL can't deliver.

A. CDC + stream processing → OLAP. Debezium tails the Postgres binlog → Kafka → Flink / Spark Streaming → writes to ClickHouse / Druid. Freshness in seconds. Cost: stream processing is complex, state management is hard, exactly-once is not easy; the architecture grows by 4–5 components.
B. HTAP single database. TiDB / SingleStore — row + columnar replicas coexist, auto-synced. Same SQL, low migration cost. Cost: columnar replica still has lag (sub-second to a few seconds); write performance worse than pure OLTP; complex ops; smaller community than Postgres.
C. Analytics on the OLTP read replica. Bring up a dedicated read replica (resource-isolated), route heavy queries there. Cost: row store is slow at aggregations (GROUP BY over 100M rows can take minutes); replica is heavy on resources; poorly written queries can drag replication.
D. Real-time materialized views (Materialize, RisingWave). Subscribe to CDC, maintain aggregates automatically, query in milliseconds. Cost: limited SQL expressiveness for materialized views — complex queries can't be expressed; state bloat.
E. Lambda architecture. Batch (daily) + real-time (incremental), merge at query time. Cost: two codebases, semantic drift between them — most companies have abandoned this.

Practical pick for 2024: CDC + ClickHouse is the mainstream (Cloudflare, Uber, ByteDance). Materialize is great when the analytics can be expressed in SQL.

2. DynamoDB partition key is "user_id"; a superuser takes 30% of traffic and one partition is throttled (3000 RCU cap). Schema can't change. What do you do?

Classic hot-partition problem. DynamoDB's single-partition hard caps: 3000 RCU + 1000 WCU + 10GB.

① Write sharding (suffix hash): spread superuser writes across virtual partitions. Use a partition key of user_id#{0..15}, pick random on write; on read, fan out 16 concurrent queries and merge. Cost: 16× read amplification; only addresses hot writes.
② Application-layer cache: cache the superuser's data in DAX (DynamoDB Accelerator) or ElastiCache. Most reads are absorbed by the cache; only misses hit DDB. Cost: staleness window, cache consistency.
③ Adaptive Capacity (AWS automatic): DynamoDB already auto-splits hot partitions and grants more capacity. There is a ceiling, and throttling persists for a few minutes during splits. Passive measure.
④ On-Demand mode: switch to pay-per-request — AWS internally allocates capacity more aggressively. Cost: 2–7× more expensive.
⑤ Redundancy / secondary table: replicate the superuser's data into a separate small table (different partition key) and read it from there. Cost: dual-write, consistency. Fits "a few extremely hot" cases.

Real lesson: DynamoDB's mantra "design the access pattern before the partition key" isn't a suggestion. Once a hot key surfaces in production, the migration cost is enormous — which is why many teams move back to Postgres + sharding after a few years.

3. Senior engineers say "start with Postgres, switch later if needed." Give a concrete scenario where Postgres really won't do — what's the threshold (QPS, data volume, feature)?

Single-box Postgres ceilings (on AWS r6i.32xlarge / 128 vCPU / 1TB RAM):

Write QPS ceiling around 50k–80k (including WAL fsync). Higher than that, single-box WAL writing becomes the bottleneck and you must shard. Aurora's storage layer can push this to ~200k, at several times the cost.
Data volume ceiling around 5–10TB. Beyond this, maintenance windows (vacuum, index rebuild, backup/restore) are too long; autovacuum can't keep up with dead tuples.
Connection count ceiling around 5k (each connection ~10MB RAM + a forked process). High-concurrency services must pool with PgBouncer / RDS Proxy.
p99 write latency sensitivity (need < 5ms): fsync jitter can spike p99 to tens of ms; vacuum and checkpoint windows are worse.
Cross-region strong consistency: Postgres synchronous_commit across regions pushes writes to 100ms+. That's Spanner / CockroachDB territory.

Concrete scenarios:

IoT time-series ingest at 1M QPS → Cassandra / ScyllaDB / TimescaleDB with sharding, required.
Global multi-region writes (users in any region need low-latency writes) → CockroachDB / Spanner.
Columnar aggregation over 10B rows → ClickHouse / BigQuery.
Millisecond leaderboards + real-time counters → Redis (ZSET / HINCRBY).
5-hop graph traversal for friends-of-friends → Neo4j / TigerGraph (Postgres recursive CTE works but is orders of magnitude slower).

Real proof: Instagram rode Postgres to IPO (1B DAU) on sharding plus excellent access-pattern design. So "Postgres can't take it" is often code for "the design isn't good enough."

4. Cassandra is eventually consistent. How does the application handle stale reads? Give two real designs.

Cassandra defaults to R=1, W=1 (fastest, least consistent). Production usually runs R=QUORUM, W=QUORUM (R+W > N for strong consistency); with N=3 that's R=2, W=2. Even with quorum there's a stale-read window (write hit 2 replicas, the third hasn't received it yet, you read from the third).

Scenario 1: Discord message list

Messages partitioned by channel_id, sorted by msg_id. New messages may not be visible for a moment.
Design: clients subscribe to WebSocket pushes (the gateway pushes directly), not relying on DB reads. DB reads only serve history fetches; 100ms of staleness is invisible to the user.
Key trick: "push immediately on write," bypassing the DB consistency window.

Scenario 2: E-commerce order status

User pays, immediately navigates to the order page, may see an old status (pending instead of paid).
Design: use read-your-writes consistency — the client sends "last write timestamp or version" as a hint; the app guarantees reads at or above that version (poll, retry, or route to the replica that took the write).
Alternative: on payment callback, synchronously update a Redis cache; the user sees the cache (strongly consistent), background async reconciles Cassandra.

General patterns:

Read-repair: built into Cassandra — on read, if replicas disagree, trigger background repair.
Hinted handoff: on write, if a replica is down, peers stash a "hint" and replay it on recovery.
Lightweight Transaction (LWT / Paxos): Cassandra offers CAS writes (IF NOT EXISTS); 4–10× slower.
Application-layer version control: writes carry a version; reads verify version-awareness on the client.

Core idea: BASE consistency problems are solved in the product layer, not the DB layer. Discord can't get new messages from a DB read? Then push them.

5. Capacity estimation: 100M orders/day, 1KB each, kept for 5 years. How much OLTP (Postgres) storage? SSD or HDD? What's the budget?

Raw data:

100M × 1KB/day = 100 GB/day
5 years ≈ 1825 days × 100 GB = 182.5 TB

Actual footprint (× 3–5 amplification):

Indexes: primary key + typically 3–5 secondary indexes; each is 20–40% of the original data. Indexes ≈ 1.5× data = 270TB.
WAL (write-ahead log): typically retained 7–30 days, ~10TB+.
Vacuum bloat: Postgres MVCC leaves dead tuples; without good vacuum hygiene, this can balloon 20–50%.
Backups: usually multiple snapshots, local + remote, another ×3.
Total estimate: 180TB of raw data → real production storage need of 600TB to 1PB.

SSD vs HDD:

OLTP must be SSD (NVMe). HDD seeks at 5–10ms; p99 SLAs are impossible. NVMe p99 < 100μs.
AWS gp3 SSD is roughly $0.08/GB/month, io2 around $0.125/GB/month.
1PB of SSD ≈ $80k–$125k/month. A year is $1–1.5M in storage alone.

Conclusion: a single box can't do this

Sharding is required: split into 32–128 logical shards by user_id / shard_key.
Hot/cold tiering: last 90 days hot on Postgres SSD (30TB), 90 days–1 year warm on Aurora I/O Optimized (150TB), 1 year+ cold archived to S3 + Iceberg (HDD tier at $0.004/GB/month vs SSD $0.08/GB/month — a 20× savings).
After tiering, SSD need drops to 100–200TB, cost down to $10k–$20k/month.

What the interviewer really wants: whether you compute amplification factors (indexes / WAL / backup / bloat), know about sharding, and know hot/cold tiering saves real money. Putting 180TB of "raw data" on SSD as a single decision is the rookie answer.

← Back to index