Day 3 Medium Database SQL vs NoSQL OLTP/OLAP ACID/BASE

Database Selection — Not Just "SQL vs NoSQL"Database Selection: SQL vs NoSQL, OLTP vs OLAP, ACID vs BASE

The Problem

When an interviewer asks "why MongoDB?", a senior engineer can blow it — because the right answer isn't "because it's fast," it's "because my access pattern is reads aggregated by user_id, writes outweigh reads, and the schema changes often." Database selection isn't a tech religion; it's constraint derivation. Data model, access pattern, consistency requirements, scale, and operational cost — five axes decide everything.

The counter-examples are plentiful. Uber migrated from Postgres to MySQL in 2016 (Postgres replication protocol's write amplification), then in 2020 built Schemaless on top of MySQL as a KV store. Pinterest moved from MongoDB to sharded MySQL in 2012. Discord migrated message storage from MongoDB to Cassandra in 2017 (1B messages/day), then to ScyllaDB in 2023. Stripe still runs MongoDB for the core ledger — backed by a meticulous application-layer consistency layer. There is no best database, only the closest fit.

Requirements and Constraints (the interview opener)

High-Level Architecture (Typical Data Stack)

graph LR
    APP["Application
API / Service"] CACHE["Redis
KV cache · sessions · leaderboards"] OLTP["OLTP
Postgres / MySQL · DynamoDB"] SEARCH["Search
Elasticsearch · OpenSearch"] QUEUE["CDC / Kafka
binlog → downstream"] DWH["Data Warehouse
Snowflake / BigQuery · OLAP"] LAKE["Data Lake
S3 + Iceberg · cold data"] APP --> CACHE APP --> OLTP APP --> SEARCH OLTP -->|binlog| QUEUE QUEUE --> DWH QUEUE --> SEARCH QUEUE --> LAKE classDef app fill:#1a2530,stroke:#64c8ff,color:#e8eef5 classDef cache fill:#1a1a30,stroke:#ffb450,color:#e8eef5 classDef oltp fill:#0e2030,stroke:#5eead4,color:#e8eef5 classDef olap fill:#2a1530,stroke:#ff7ab6,color:#e8eef5 class APP app class CACHE cache class OLTP,SEARCH oltp class QUEUE,DWH,LAKE olap

A modern backend is usually polyglot: OLTP for transactions, OLAP for analytics, search for search, cache for reads

Key Technical Points

1. SQL vs NoSQL: Not a Binary — Pick a Data Model

Principle: "NoSQL" is a poor label. It's an umbrella for four genuinely different things:

TypeExamplesData modelTypical use
Relational (SQL)Postgres, MySQL, Oraclerows + schema + joinstransactions, orders, relational data
Key-ValueRedis, DynamoDB, RiakK → V (V may be JSON)sessions, caching, counters
DocumentMongoDB, Couchbasenested JSON documentsuser profiles, CMS, evolving schemas
Wide-columnCassandra, HBase, ScyllaDBrow key + column familiestime series, messaging, heavy writes
GraphNeo4j, Neptune, TigerGraphnodes + edgessocial graphs, fraud, knowledge graphs
SearchElasticsearch, OpenSearchinverted indexfull-text search, logs, analytics
Time-seriesInfluxDB, TimescaleDB, Prometheus(time, tags, value)monitoring, IoT, financial ticks
SQL strengths: NoSQL strengths:
Real cases:

2. OLTP vs OLAP: Row Storage vs Columnar, the Real Difference

Principle: transactional and analytical systems have inverted access patterns:

This determines the storage layout. OLTP uses row storage (all fields of a row together, optimized for fetching a whole row). OLAP uses columnar storage (all values of one column together, optimized for aggregation, high compression, and SIMD).

graph LR
    subgraph Row["Row Store (Postgres / MySQL)"]
        R1["[1, Alice, 30, US]"]
        R2["[2, Bob, 25, UK]"]
        R3["[3, Carol, 28, JP]"]
    end
    subgraph Col["Columnar Store (BigQuery / Snowflake)"]
        C1["id: [1, 2, 3]"]
        C2["name: [Alice, Bob, Carol]"]
        C3["age: [30, 25, 28]"]
        C4["country: [US, UK, JP]"]
    end
    Row -->|"SELECT * WHERE id=2 ✅"| Q1["fast point lookup"]
    Col -->|"AVG(age) GROUP BY country ✅"| Q2["fast aggregation · 10x compression"]
    classDef row fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef col fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class Row,R1,R2,R3 row
    class Col,C1,C2,C3,C4 col
Trade-off:
# Why columnar aggregations are fast: only read the column you need
# Table with 100 columns, 1B rows
# OLTP row store running SELECT AVG(salary):
#   scans 1B × 100 columns of data = massive I/O
# OLAP columnar running SELECT AVG(salary):
#   reads only the salary column (1B × 1 column), compressed to ~1GB
#   + SIMD vectorized execution → sub-second response
Real cases:

3. ACID vs BASE: Two Philosophies of Consistency

ACID (Atomicity, Consistency, Isolation, Durability): the traditional RDBMS promise. BASE (Basically Available, Soft state, Eventual consistency): the NoSQL compromise.

ACIDBASE
Atomicitytransactions are all-or-nothingnone; rely on idempotent design
Consistencyeveryone sees a consistent view post-transactioneventual; reads may see old values within the window
IsolationSI / RR / Serializableeven read-committed is hard to guarantee
Durabilitycommit fsyncs to diskasync multi-replica; rare edge cases lose data
TypicalPostgres, Spanner, CockroachDBDynamoDB (classic), Cassandra, Riak

But since the 2010s, the line has been blurring:

# Postgres SERIALIZABLE transaction (highest isolation)
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
-- if balance >= 100, debit
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

COMMIT;
-- ⚠️ Serializable raises SerializationFailure on concurrent conflict
-- the application must catch and retry (backoff + jitter)
Trade-off:
Real cases:

4. A Decision Tree: Postgres / DynamoDB / Cassandra / Redis

Principle: four mainstream options cover 90% of scenarios. Here's a practical decision path.

graph TD
    Start{"Need transactions and joins?"}
    Start -->|"Yes"| SQL_q{"Data > 1TB?"}
    Start -->|"No"| KV_q{"Latency < 1ms?"}

    SQL_q -->|"No"| PG["✅ Postgres
transactions · complex queries · JSONB"] SQL_q -->|"Yes"| Dist_q{"Global multi-region?"} Dist_q -->|"Yes"| Span["✅ Spanner / CockroachDB
cross-region ACID"] Dist_q -->|"No"| TiDB["✅ TiDB / Vitess
distributed MySQL"] KV_q -->|"Yes"| Redis["✅ Redis
in-memory KV · cache · leaderboards"] KV_q -->|"No"| Scale_q{"> 50k write QPS & massive data?"} Scale_q -->|"Yes"| Wide_q{"Managed or self-hosted?"} Scale_q -->|"No"| PG2["✅ Postgres
usually enough"] Wide_q -->|"Managed"| DDB["✅ DynamoDB
zero-ops · pay per access"] Wide_q -->|"Self-hosted"| Cass["✅ Cassandra / Scylla
multi-region write · BASE"] classDef sql fill:#0e2030,stroke:#5eead4,color:#e8eef5 classDef kv fill:#1a1a30,stroke:#ffb450,color:#e8eef5 classDef wide fill:#2a1530,stroke:#ff7ab6,color:#e8eef5 classDef q fill:#1a2530,stroke:#64c8ff,color:#e8eef5 class PG,PG2,Span,TiDB sql class Redis kv class DDB,Cass wide class Start,SQL_q,KV_q,Dist_q,Scale_q,Wide_q q
The senior-engineer default: unless you can articulate why not Postgres (specific QPS, data volume, or access pattern), use Postgres. It has transactions, JSONB (semi-structured), full-text search (tsvector), geo (PostGIS), time series (TimescaleDB extension), and columnar (Citus) — nearly a one-stop shop. Notion, Instagram, Robinhood, Reddit, GitLab, Heroku all run Postgres heavily.
Real cases:

Scaling Up: What to Do After Growth

Common Pitfalls (Even Senior Engineers Slip Here)

1. Using MongoDB like a relational DB: once the business stabilizes, "no joins" becomes a curse. Either you do N+1 queries in the app (slow) or you denormalize (write amplification, consistency nightmare). Stripe still pays the price.
2. Using Redis as the primary store: Redis persistence (RDB/AOF) was never the design goal; crash RPO is at least a few seconds. For account balances, Redis is a cache, not the source of truth.
3. Letting the ORM generate SQL: ActiveRecord / Django ORM slips into N+1 or full-table scans without warning. EXPLAIN before production.
4. Designing schema before access pattern: NoSQL (especially DynamoDB) demands "decide the queries first, then design the table." Pick the wrong partition key and you're rebuilding. AWS pushes single-table design, but the learning curve is steep.
5. Schema migration is rarely free: ALTER on a large table to add a column or change a type can lock for hours. In production: online DDL (pt-online-schema-change, gh-ost), batched backfill, dual-write old + new columns.
6. Cross-database distributed transactions: "write Postgres + write Redis" with consistency requirements needs the outbox pattern (Day 7). Don't reach for 2PC.

Sample Interview Questions

  1. Designing Twitter — what stores the timeline? Why not all MySQL?
  2. Postgres or MySQL? Name three real differences (don't say "about the same").
  3. When would you pick Cassandra over DynamoDB?
  4. What does the "S" (soft state) in BASE mean? Give an example.
  5. From 100k to 100M users, how does the database evolve? List phases and each phase's bottleneck.
  6. HTAP databases (TiDB) vs the "OLTP + CDC to OLAP" pattern — which do you pick?

Key Resources

English Summary

Database selection is constraint derivation, not technology religion. Five axes determine the choice: data model (relational / KV / document / wide-column / graph / search / time-series), access pattern (point lookup vs range scan vs aggregation vs join), consistency (ACID vs BASE), scale (QPS, data size, geo distribution), and operational cost. Default to Postgres unless you have a specific reason — its versatility (JSONB, full-text, geo, time-series extensions) covers 80% of needs. Reach for DynamoDB / Cassandra when write throughput exceeds ~50k QPS or you need horizontal scale-out; Redis when sub-ms latency is required; columnar warehouses (Snowflake / BigQuery / ClickHouse) when analytical aggregation dominates. Modern stacks combine multiple stores via CDC pipelines: OLTP handles transactions, search handles full-text, warehouse handles analytics.

Going Deeper (click to expand)

1. The same business entity (e.g. user) lives in both OLTP (Postgres) and OLAP (Snowflake); OLAP always lags by hours. The PM wants "real-time analytics." Give at least three approaches and the cost of each.

The crux: real-time analytics needs minute- or second-level freshness, which traditional daily-batch ETL can't deliver.

  • A. CDC + stream processing → OLAP. Debezium tails the Postgres binlog → Kafka → Flink / Spark Streaming → writes to ClickHouse / Druid. Freshness in seconds. Cost: stream processing is complex, state management is hard, exactly-once is not easy; the architecture grows by 4–5 components.
  • B. HTAP single database. TiDB / SingleStore — row + columnar replicas coexist, auto-synced. Same SQL, low migration cost. Cost: columnar replica still has lag (sub-second to a few seconds); write performance worse than pure OLTP; complex ops; smaller community than Postgres.
  • C. Analytics on the OLTP read replica. Bring up a dedicated read replica (resource-isolated), route heavy queries there. Cost: row store is slow at aggregations (GROUP BY over 100M rows can take minutes); replica is heavy on resources; poorly written queries can drag replication.
  • D. Real-time materialized views (Materialize, RisingWave). Subscribe to CDC, maintain aggregates automatically, query in milliseconds. Cost: limited SQL expressiveness for materialized views — complex queries can't be expressed; state bloat.
  • E. Lambda architecture. Batch (daily) + real-time (incremental), merge at query time. Cost: two codebases, semantic drift between them — most companies have abandoned this.

Practical pick for 2024: CDC + ClickHouse is the mainstream (Cloudflare, Uber, ByteDance). Materialize is great when the analytics can be expressed in SQL.

2. DynamoDB partition key is "user_id"; a superuser takes 30% of traffic and one partition is throttled (3000 RCU cap). Schema can't change. What do you do?

Classic hot-partition problem. DynamoDB's single-partition hard caps: 3000 RCU + 1000 WCU + 10GB.

  • ① Write sharding (suffix hash): spread superuser writes across virtual partitions. Use a partition key of user_id#{0..15}, pick random on write; on read, fan out 16 concurrent queries and merge. Cost: 16× read amplification; only addresses hot writes.
  • ② Application-layer cache: cache the superuser's data in DAX (DynamoDB Accelerator) or ElastiCache. Most reads are absorbed by the cache; only misses hit DDB. Cost: staleness window, cache consistency.
  • ③ Adaptive Capacity (AWS automatic): DynamoDB already auto-splits hot partitions and grants more capacity. There is a ceiling, and throttling persists for a few minutes during splits. Passive measure.
  • ④ On-Demand mode: switch to pay-per-request — AWS internally allocates capacity more aggressively. Cost: 2–7× more expensive.
  • ⑤ Redundancy / secondary table: replicate the superuser's data into a separate small table (different partition key) and read it from there. Cost: dual-write, consistency. Fits "a few extremely hot" cases.

Real lesson: DynamoDB's mantra "design the access pattern before the partition key" isn't a suggestion. Once a hot key surfaces in production, the migration cost is enormous — which is why many teams move back to Postgres + sharding after a few years.

3. Senior engineers say "start with Postgres, switch later if needed." Give a concrete scenario where Postgres really won't do — what's the threshold (QPS, data volume, feature)?

Single-box Postgres ceilings (on AWS r6i.32xlarge / 128 vCPU / 1TB RAM):

  • Write QPS ceiling around 50k–80k (including WAL fsync). Higher than that, single-box WAL writing becomes the bottleneck and you must shard. Aurora's storage layer can push this to ~200k, at several times the cost.
  • Data volume ceiling around 5–10TB. Beyond this, maintenance windows (vacuum, index rebuild, backup/restore) are too long; autovacuum can't keep up with dead tuples.
  • Connection count ceiling around 5k (each connection ~10MB RAM + a forked process). High-concurrency services must pool with PgBouncer / RDS Proxy.
  • p99 write latency sensitivity (need < 5ms): fsync jitter can spike p99 to tens of ms; vacuum and checkpoint windows are worse.
  • Cross-region strong consistency: Postgres synchronous_commit across regions pushes writes to 100ms+. That's Spanner / CockroachDB territory.

Concrete scenarios:

  • IoT time-series ingest at 1M QPS → Cassandra / ScyllaDB / TimescaleDB with sharding, required.
  • Global multi-region writes (users in any region need low-latency writes) → CockroachDB / Spanner.
  • Columnar aggregation over 10B rows → ClickHouse / BigQuery.
  • Millisecond leaderboards + real-time counters → Redis (ZSET / HINCRBY).
  • 5-hop graph traversal for friends-of-friends → Neo4j / TigerGraph (Postgres recursive CTE works but is orders of magnitude slower).

Real proof: Instagram rode Postgres to IPO (1B DAU) on sharding plus excellent access-pattern design. So "Postgres can't take it" is often code for "the design isn't good enough."

4. Cassandra is eventually consistent. How does the application handle stale reads? Give two real designs.

Cassandra defaults to R=1, W=1 (fastest, least consistent). Production usually runs R=QUORUM, W=QUORUM (R+W > N for strong consistency); with N=3 that's R=2, W=2. Even with quorum there's a stale-read window (write hit 2 replicas, the third hasn't received it yet, you read from the third).

Scenario 1: Discord message list

  • Messages partitioned by channel_id, sorted by msg_id. New messages may not be visible for a moment.
  • Design: clients subscribe to WebSocket pushes (the gateway pushes directly), not relying on DB reads. DB reads only serve history fetches; 100ms of staleness is invisible to the user.
  • Key trick: "push immediately on write," bypassing the DB consistency window.

Scenario 2: E-commerce order status

  • User pays, immediately navigates to the order page, may see an old status (pending instead of paid).
  • Design: use read-your-writes consistency — the client sends "last write timestamp or version" as a hint; the app guarantees reads at or above that version (poll, retry, or route to the replica that took the write).
  • Alternative: on payment callback, synchronously update a Redis cache; the user sees the cache (strongly consistent), background async reconciles Cassandra.

General patterns:

  • Read-repair: built into Cassandra — on read, if replicas disagree, trigger background repair.
  • Hinted handoff: on write, if a replica is down, peers stash a "hint" and replay it on recovery.
  • Lightweight Transaction (LWT / Paxos): Cassandra offers CAS writes (IF NOT EXISTS); 4–10× slower.
  • Application-layer version control: writes carry a version; reads verify version-awareness on the client.

Core idea: BASE consistency problems are solved in the product layer, not the DB layer. Discord can't get new messages from a DB read? Then push them.

5. Capacity estimation: 100M orders/day, 1KB each, kept for 5 years. How much OLTP (Postgres) storage? SSD or HDD? What's the budget?

Raw data:

  • 100M × 1KB/day = 100 GB/day
  • 5 years ≈ 1825 days × 100 GB = 182.5 TB

Actual footprint (× 3–5 amplification):

  • Indexes: primary key + typically 3–5 secondary indexes; each is 20–40% of the original data. Indexes ≈ 1.5× data = 270TB.
  • WAL (write-ahead log): typically retained 7–30 days, ~10TB+.
  • Vacuum bloat: Postgres MVCC leaves dead tuples; without good vacuum hygiene, this can balloon 20–50%.
  • Backups: usually multiple snapshots, local + remote, another ×3.
  • Total estimate: 180TB of raw data → real production storage need of 600TB to 1PB.

SSD vs HDD:

  • OLTP must be SSD (NVMe). HDD seeks at 5–10ms; p99 SLAs are impossible. NVMe p99 < 100μs.
  • AWS gp3 SSD is roughly $0.08/GB/month, io2 around $0.125/GB/month.
  • 1PB of SSD ≈ $80k–$125k/month. A year is $1–1.5M in storage alone.

Conclusion: a single box can't do this

  • Sharding is required: split into 32–128 logical shards by user_id / shard_key.
  • Hot/cold tiering: last 90 days hot on Postgres SSD (30TB), 90 days–1 year warm on Aurora I/O Optimized (150TB), 1 year+ cold archived to S3 + Iceberg (HDD tier at $0.004/GB/month vs SSD $0.08/GB/month — a 20× savings).
  • After tiering, SSD need drops to 100–200TB, cost down to $10k–$20k/month.

What the interviewer really wants: whether you compute amplification factors (indexes / WAL / backup / bloat), know about sharding, and know hot/cold tiering saves real money. Putting 180TB of "raw data" on SSD as a single decision is the rookie answer.