Day 37 | Hard | Multi-tenant | Tenant Isolation | Data Partitioning | Quotas | Metered Billing

Multi-tenant SaaS: One System Serving 10,000 Mutually Distrustful Tenants
Tenant Isolation, Data Partitioning, Quotas & Metered Billing

Scenario + Requirements

Design a B2B collaboration SaaS (Notion-like / Figma-like) with 10,000 enterprise tenants, 5M users total, too big for a single database. Tenants range from 5-person teams to 50,000-seat customers — three orders of magnitude apart. The central tension: one shared infrastructure serves everyone for cost efficiency, yet no two tenants trust each other — tenant A must never read B's data, and A's bulk import must never starve B's online requests.

Isolation: strong data isolation (compliance: GDPR, SOC2, some customers demand data residency in a specific region); performance isolation (noisy neighbors must not bleed across).
Scale: peak write ~50k QPS, per-tenant data 1MB~2TB, extremely skewed distribution (top 1% of tenants hold ~50% of data).
SLO: read p99 < 100ms; large customers want 99.95% availability; onboarding a new tenant < 1 minute.
Billing: hybrid per-seat + usage-based (API calls, storage), zero tolerance for billing errors.

This issue covers four things: which isolation model to choose, how to partition data, how to tame noisy neighbors, and how to meter usage into money.

High-level Architecture

graph TD
    C[Tenant request] --> GW[API Gateway<br/>resolve tenant_id]
    GW --> AUTH[AuthN/AuthZ<br/>inject tenant context]
    AUTH --> RL[Per-tenant rate limit/quota]
    RL --> APP[Shared app tier<br/>stateless]
    APP --> ROUTER[Tenant Router<br/>tenant_id to shard map]
    ROUTER --> S1[(Shard 1<br/>pool, small tenants)]
    ROUTER --> S2[(Shard 2<br/>pool, small tenants)]
    ROUTER --> SILO[(Silo DB<br/>large customer)]
    APP --> MET[Usage events<br/>metering pipeline]
    MET --> BILL[Billing/invoicing<br/>Stripe]
    ROUTER -.lookup.-> CAT[(Tenant Catalog<br/>routing + config metadata)]

Component roles: the gateway resolves tenant_id from subdomain / JWT and makes it a first-class citizen across the whole call path; the Tenant Catalog is the control plane storing each tenant's shard placement, isolation tier, quota, and billing plan; the data plane routes by catalog — small tenants mix into shared shards (pool), large customers move to a dedicated database (silo). Usage events flow asynchronously off to the side, into billing.
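To make "tenant_id as a first-class citizen" concrete, here is a minimal sketch of resolving the tenant at the edge and carrying it through the call path with Python's `contextvars`. The base domain, the `tenant_id` claim name, and the choice to cross-check subdomain against token are all assumptions for illustration; the JWT is assumed to be verified already.

```python
import contextvars
from typing import Optional

# Hypothetical context variable holding the tenant for the current request.
_current_tenant: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "current_tenant", default=None
)

BASE_DOMAIN = "example.com"  # assumption: tenants live at <tenant>.example.com


def resolve_tenant(host: str, jwt_claims: dict) -> str:
    """Resolve tenant_id from the subdomain and cross-check it against the JWT."""
    host = host.split(":")[0].lower()
    suffix = "." + BASE_DOMAIN
    if not host.endswith(suffix):
        raise PermissionError("unknown host")
    from_host = host[: -len(suffix)]
    from_token = jwt_claims.get("tenant_id")  # assumed claim name, already verified
    if not from_host or from_host != from_token:
        raise PermissionError("tenant mismatch between host and token")
    return from_host


def handle_request(host: str, jwt_claims: dict, handler):
    """Bind the tenant for the duration of one request, then reset it."""
    token = _current_tenant.set(resolve_tenant(host, jwt_claims))
    try:
        return handler()
    finally:
        _current_tenant.reset(token)


def current_tenant() -> str:
    """Fail closed: code that runs without a tenant context raises."""
    tenant = _current_tenant.get()
    if tenant is None:
        raise RuntimeError("no tenant context")
    return tenant
```

Downstream code (router, rate limiter, metering) would call `current_tenant()` rather than accept a tenant argument, so a code path that never went through the gateway fails loudly instead of running unscoped.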

Key Technical Points

1. Isolation Model: Silo vs Pool vs Bridge — a continuum of security vs cost

Core trade-off: each notch of isolation buys security/compliance but costs efficiency and operational scalability.

[Principle] Multi-tenant isolation is not a boolean but a spectrum. AWS's vocabulary splits it three ways: Silo (each tenant a dedicated stack/DB, physical isolation), Pool (all tenants share resources, isolated by application logic), Bridge (mixed: silo what must be isolated, pool the rest). The key insight is that the isolation boundary can be chosen per layer — compute pooled, storage siloed for large customers — and that is entirely valid.

Silo: per-tenant DB/cluster. ✅ small blast radius, simple compliance, noisy neighbors vanish naturally, billing = just read the resource bill. ❌ cost grows linearly with tenants, schema migration across 10,000 DBs is a nightmare, low utilization (a 5-person tenant still occupies a full stack).
Pool: shared tables, isolated by a tenant_id column + row-level enforcement. ✅ extreme cost efficiency, one migration covers everyone, high utilization. ❌ a single missing WHERE is a cross-tenant data leak, noisy neighbors, blast radius = everyone.
Bridge: pool by default, silo for large/high-compliance customers. ✅ 80% of the cost efficiency + strong-isolation selling point for the 20% that matter. ❌ two code paths, routing complexity.
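The bridge rule ("pool by default, silo for large/high-compliance customers") can be sketched as a placement function. The thresholds and field names below are invented for illustration; real cut-offs would come from cost and load data.

```python
from dataclasses import dataclass

# Assumed thresholds, purely illustrative.
SILO_SEAT_THRESHOLD = 5_000
SILO_STORAGE_GB_THRESHOLD = 500


@dataclass
class TenantProfile:
    tenant_id: str
    seats: int
    storage_gb: float
    requires_dedicated_db: bool = False  # e.g. a contractual or audit requirement


def choose_tier(t: TenantProfile) -> str:
    """Bridge model: pool by default, silo only when a signal justifies the cost."""
    if t.requires_dedicated_db:
        return "silo"
    if t.seats >= SILO_SEAT_THRESHOLD or t.storage_gb >= SILO_STORAGE_GB_THRESHOLD:
        return "silo"
    return "pool"
```

The result would be stored in the Tenant Catalog, so moving a tenant between tiers is a catalog update plus a data migration rather than a code change.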

Real-world: the AWS SaaS Tenant Isolation Strategies whitepaper systematized the silo/pool/bridge models and is now the de-facto industry vocabulary. Salesforce is the classic pool (shared, metadata-driven schema serving massive tenant counts); most mature SaaS eventually evolve into bridge — free/small customers pooled, enterprise tier siloed.

2. Data Partitioning: shard by tenant_id + row-level isolation — skew is enemy #1

Core trade-off: shared schema saves cost but isolation rides entirely on code discipline; schema/DB-per-tenant isolates hard but doesn't scale operationally.

[Principle] Under pool, three granularities: ① shared table + tenant_id column (cheapest, backed by Postgres RLS to catch missing-WHERE leaks); ② schema-per-tenant; ③ database-per-tenant (= silo). SaaS data naturally shards on tenant_id — same-tenant data lands on the same physical shard, the vast majority of queries carry tenant_id and produce no cross-shard fan-out. That is SaaS's huge advantage over generic sharding.

The hard part is skew: hash(tenant_id) → shard spreads tenants evenly across shards, but tenants are not equal in size, so one giant tenant can overload whichever shard it lands on. The fix is to isolate hot tenants onto their own dedicated shard.

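-- Postgres RLS: defense-in-depth, backstops even a missing WHERE tenant_id
ALTER TABLE docs ENABLE ROW LEVEL SECURITY;
ALTER TABLE docs FORCE ROW LEVEL SECURITY;   -- without FORCE, the table owner bypasses policies
CREATE POLICY tenant_iso ON docs
  USING (tenant_id = current_setting('app.tenant_id')::uuid);
-- Superusers and BYPASSRLS roles always skip RLS, so the app must connect as an ordinary role.

-- Inject tenant context at request start (with pooled conns, set within a txn)
SET LOCAL app.tenant_id = '...';   -- all subsequent queries in this txn are filtered by RLS

# Routing: stable hash by default, hot tenants overridden via catalog
import zlib

def route(tenant_id):
    t = catalog.get(tenant_id)
    if t.dedicated_shard:        # large customer isolated
        return t.dedicated_shard
    # Python's built-in hash() of a str differs between processes, so use a stable hash
    return shards[zlib.crc32(str(tenant_id).encode()) % len(shards)]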

Real-world: Notion's The Great Re-shard — partitioned by workspace ID, all data for a workspace on one shard, resharded from 32 to 96 databases with zero downtime. Figma decoupled "logical sharding" (app layer + views) from "physical sharding" (Postgres layer), rolling out the low-risk logical layer before the physical failover, and built DBProxy for routing. AWS details the Postgres RLS pattern for tenant isolation.

3. Noisy Neighbors & Resource Quotas — fairness is pool's Achilles heel

Core trade-off: hard isolation (dedicated resources) prevents contagion but wastes capacity; soft quotas (shared + fair scheduling) save money but require active governance.

[Principle] Under pool, one tenant's million-row bulk import can devour the connection pool / CPU and starve another tenant's online requests: the noisy neighbor. Three lines of defense: ① ingress rate limiting (per-tenant token bucket, tiered by billing plan); ② resource quotas (max connections, query timeout, storage cap per tenant); ③ fair scheduling (per-tenant queues served round-robin, not a shared FIFO that one big tenant can flood so that everyone else waits behind its backlog). The key: quotas must be per-tenant, not global, because a global limit cannot stop a single tenant from consuming the whole budget.

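# Per-tenant token bucket + fair queue (no single tenant monopolizes workers)
# buckets, RateLimited and run are assumed to be defined elsewhere.
from collections import defaultdict, deque

tenant_queues = defaultdict(deque)          # one FIFO queue per tenant

def admit(tenant_id, job, cost=1):
    bucket = buckets[tenant_id]             # capacity set by billing plan
    if not bucket.try_consume(cost):
        raise RateLimited(tenant_id)        # surfaced as 429 with Retry-After
    tenant_queues[tenant_id].append(job)    # enqueue on the tenant's own queue

def dispatch():                             # one round-robin pass, no preemption
    for q in list(tenant_queues.values()):  # even a big tenant gets only a fair share
        if q:
            run(q.popleft())                # at most one job per tenant per pass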
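The snippet above relies on a bucket object with a `try_consume` method without defining it. A minimal token bucket could look like the following sketch; the plan names and limits in `PLAN_LIMITS` are made up, and the class is single-process and not thread-safe (a real deployment would likely keep bucket state in a shared store).

```python
import time


class TokenBucket:
    """Refills `rate` tokens per second up to `capacity`; not thread-safe."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity          # start full so a fresh tenant can burst
        self.updated = clock()

    def try_consume(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Lazy refill: credit tokens for the time elapsed since the last call.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if cost > self.tokens:
            return False
        self.tokens -= cost
        return True


# Hypothetical plan tiers: (refill rate per second, burst capacity).
PLAN_LIMITS = {"free": (5, 10), "enterprise": (500, 1000)}


def bucket_for(plan: str, clock=time.monotonic) -> TokenBucket:
    rate, capacity = PLAN_LIMITS[plan]
    return TokenBucket(rate, capacity, clock)
```

Passing `cost` lets an expensive operation (a bulk import batch, say) draw more tokens than a cheap read, so the limit tracks load rather than raw request count.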

Real-world: Stripe published its multi-tier rate limiters (per-user limits + load shedding); the core idea is to isolate bursty tenants to protect the whole. AWS SaaS Lens lists "noisy neighbor" as the #1 risk of the pool model and recommends per-tenant throttling + usage-based quotas.

4. Usage Metering & Billing Integration — a one-cent billing error is an incident

Core trade-off: real-time accurate metering is expensive; batched approximation is cheaper but hard to reconcile. Billing demands exactly-once semantics.

[Principle] In hybrid billing (seat + usage), the hard part is usage: every API call / storage delta is a usage event that must be aggregated into the bill without loss or duplication. The pipeline: app instrumentation → event queue → aggregation (by tenant×metric×time window) → push to Stripe metered billing. Key designs: each usage event carries an idempotency key (prevents retries from double-charging); aggregation uses at-least-once delivery + downstream idempotent dedup; usage data is the financial source of truth — persist it so it can be recomputed (replayable for billing disputes).

# Idempotent usage event: idempotency_key prevents retries from double-billing
event = {
  "idempotency_key": f"{tenant}:{request_id}",   # unique per billable request
  "tenant_id": tenant, "metric": "api_call",
  "qty": 1, "ts": now,
}
emit(usage_topic, event)        # at-least-once delivery

-- Aggregate: sum by (tenant, metric, hour), dedup on idempotency_key
-- (column is named window_start because WINDOW is a reserved word in Postgres)
INSERT INTO usage_agg(tenant_id, metric, window_start, qty)
SELECT tenant_id, metric, date_trunc('hour', ts), sum(qty)
FROM (SELECT DISTINCT ON (idempotency_key) * FROM raw_events) d
GROUP BY 1,2,3;
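The same dedup-then-aggregate idea can be exercised end to end in a few lines. This sketch uses SQLite from the Python standard library as a stand-in for the real event store; the table layout and the ISO-8601 string timestamps are assumptions. Here duplicates are rejected at write time by a primary key on `idempotency_key`, which is an alternative to deduplicating at read time with DISTINCT ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_events (
    idempotency_key TEXT PRIMARY KEY,   -- duplicates are rejected here
    tenant_id TEXT NOT NULL,
    metric TEXT NOT NULL,
    qty INTEGER NOT NULL,
    ts TEXT NOT NULL                    -- ISO-8601 UTC, e.g. 2024-01-01T10:15:00
);
""")


def record(event: dict) -> bool:
    """Insert one usage event; returns False if it was a redelivered duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO raw_events VALUES (?, ?, ?, ?, ?)",
        (event["idempotency_key"], event["tenant_id"], event["metric"],
         event["qty"], event["ts"]),
    )
    conn.commit()
    return cur.rowcount == 1


def hourly_usage():
    """Recompute aggregates from raw events, so the result is replayable."""
    return conn.execute(
        "SELECT tenant_id, metric, substr(ts, 1, 13) AS hour, SUM(qty) "
        "FROM raw_events "
        "GROUP BY tenant_id, metric, substr(ts, 1, 13) "
        "ORDER BY tenant_id, metric, substr(ts, 1, 13)"
    ).fetchall()
```

Because `hourly_usage` always derives totals from `raw_events`, rerunning it after a redelivery or an aggregation bug fix yields the same bill.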

Real-world: Stripe usage-based / metered billing exposes a usage-reporting API with idempotency keys precisely to prevent double-charging; seat + usage hybrids are the SaaS standard (Notion/Figma bill by editor seat; API products bill by call volume). AWS SaaS guidance stresses that metering must be bound to tenant context, or you cannot attribute usage to a tenant for invoicing.

Evolution & Optimization

Start pool, evolve to bridge: all-pool early for speed; when enterprise customers arrive, isolate them to silo (lift-and-shift the single tenant's data to a dedicated DB).
Skew governance: monitor per-shard tenant size, auto-isolate hot tenants; reshard via "dual-write + verify + cutover" (cf. Notion/Figma).
Data residency: compliance customers demand EU/US data not leaving region — add a region dimension to the catalog and route per-tenant to the corresponding regional cluster (multi-region control plane).
Per-tenant observability: tag all metrics/logs/traces with tenant_id, or you cannot answer "which tenant is causing this" during an incident. Cardinality explodes, so be careful with high-cardinality labels.
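One common way to keep per-tenant metrics without unbounded label cardinality is to give only the busiest tenants their own label and fold the long tail into a single bucket. The sketch below assumes that approach; the `top_n` cut-off is arbitrary, and logs and traces would still carry the full tenant_id.

```python
from collections import Counter


def tenant_label_map(request_counts: dict, top_n: int = 100) -> dict:
    """Give the top_n busiest tenants their own label; bucket the rest as 'other'."""
    top = {tenant for tenant, _ in Counter(request_counts).most_common(top_n)}
    return {tenant: (tenant if tenant in top else "other") for tenant in request_counts}
```

With 10,000 tenants and `top_n=100`, each metric carries at most 101 distinct tenant label values, while the tenants most likely to cause an incident remain individually visible.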

Pitfalls + Interview Questions

1. Missing WHERE tenant_id = cross-tenant data leak: the deadliest pool bug, and hard to cover with tests. Defense-in-depth: RLS backstop + ORM that forces tenant scope + automated tests simulating cross-tenant access.

2. Connection pool exhausted by one tenant: with a shared pool, one tenant's slow query / large transaction starves everyone. Need per-tenant connection quotas or statement timeouts.

3. Global unique constraints break after sharding: unique email, auto-increment IDs no longer hold globally across shards. Either scope uniqueness to within a tenant, or use a global ID service (see Day 11, Snowflake).

4. Lost/duplicated usage events: at-most-once undercounts (lost revenue); at-least-once without idempotency overcharges (complaints). You need idempotency keys + recomputability.
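For pitfall 1, the application-layer half of the defense ("ORM that forces tenant scope" plus a cross-tenant test) can be sketched as a data-access wrapper that binds the tenant once and adds the filter itself. The class and table names are hypothetical, and SQLite stands in for Postgres here, so the RLS backstop is not shown.

```python
import sqlite3


class TenantScopedDocs:
    """Hypothetical wrapper: tenant_id is bound once, never passed per query."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str):
        if not tenant_id:
            raise ValueError("tenant_id is required")
        self._conn = conn
        self._tenant_id = tenant_id

    def insert(self, doc_id: str, body: str) -> None:
        self._conn.execute(
            "INSERT INTO docs (tenant_id, doc_id, body) VALUES (?, ?, ?)",
            (self._tenant_id, doc_id, body),
        )

    def get(self, doc_id: str):
        # The tenant filter is added here, so callers cannot forget it.
        row = self._conn.execute(
            "SELECT body FROM docs WHERE tenant_id = ? AND doc_id = ?",
            (self._tenant_id, doc_id),
        ).fetchone()
        return row[0] if row else None


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs (tenant_id TEXT, doc_id TEXT, body TEXT, "
        "PRIMARY KEY (tenant_id, doc_id))"
    )
    return conn
```

An automated cross-tenant test then writes a document as tenant A and asserts that tenant B, asking for the same doc_id, gets nothing back.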

Likely follow-ups:

How do you choose silo / pool / bridge? What signal triggers moving a tenant from pool to silo?
How does the pool model prevent cross-tenant data leaks at both the code layer and the DB layer?
A giant tenant occupies 80% of a single shard — what do you do? How do you reshard with zero downtime?
Design per-tenant rate limiting where free vs enterprise have different quotas, and a hot tenant's burst doesn't hurt others.
How does a usage-based billing pipeline guarantee exactly-once bills and stay replayable for disputes?

Deep-dive Resources

AWS SaaS Tenant Isolation Strategies whitepaper: authoritative definitions and implementation patterns for silo/pool/bridge.
Notion's The Great Re-shard: sharding by workspace + zero-downtime capacity expansion, in practice.
Figma Databases Team: decoupling logical/physical sharding + DBProxy routing.
AWS: Multi-tenant data isolation with Postgres RLS: the row-level security pattern for tenant isolation.
Designing Data-Intensive Applications (Kleppmann), partitioning & consistency chapters: the fundamentals of partition-key choice, skew, and rebalancing.

Going Deeper

1. Pool saves cost, so why do large customers so often demand silo? Is this technical or commercial?

Both — and often a technical demand wrapped in a commercial one.

Compliance/audit: financial and healthcare SOC2/HIPAA audits demand "provable isolation." Pool's logical isolation is safe, but proving to an auditor that one missing WHERE won't leak is far harder than pointing at a standalone database. Silo is a shortcut to auditability.
Data residency: "my data must stay in the EU" requires region-aware routing and a separate pool per region under pool; silo just puts the whole DB in the required region.
Blast radius: a large customer doesn't want to share risk — another tenant getting breached or hitting a bug shouldn't affect them.
Performance SLA: dedicated resources = a committable performance floor; pool always carries noisy-neighbor uncertainty.
Commercial leverage: silo is a premium upsell. Vendors love it — it satisfies the customer and recovers the extra cost. That's why bridge is the endgame for nearly every mature SaaS.

2. You shard evenly with hash(tenant_id), yet one shard's p99 is persistently 3x the rest. Possible causes?

Hash balances tenant count, not load — that is the essence of SaaS skew.

Size skew: one 50k-seat tenant and a few hundred small ones hash to the same shard; its data volume and QPS crush it. The most common cause.
Behavioral skew: some tenant just ran a bulk import/export, temporarily saturating that shard.
Large keys/large transactions: a tenant with a giant document, or a long lock-holding transaction, slows everyone on the shard.
Fix: monitor per-shard + per-tenant load, isolate hot tenants to a dedicated shard (catalog override routing); for huge tenants consider intra-tenant secondary sharding. Note: blindly adding shards doesn't help — the hotspot is caused by a single tenant, so after rehashing it still crowds someone else.
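A small sketch of the monitoring step: given per-tenant load, find tenants that dominate their shard. The load unit, the CRC32-based shard function, and the 50% threshold are assumptions for illustration.

```python
import zlib
from collections import defaultdict


def shard_of(tenant_id: str, n_shards: int) -> int:
    """Stable hash placement (stand-in for the default routing rule)."""
    return zlib.crc32(tenant_id.encode()) % n_shards


def hot_tenants(tenant_load: dict, n_shards: int, share_threshold: float = 0.5):
    """Return tenants that alone exceed share_threshold of their shard's load."""
    shard_load = defaultdict(float)
    for tenant, load in tenant_load.items():
        shard_load[shard_of(tenant, n_shards)] += load
    return sorted(
        tenant for tenant, load in tenant_load.items()
        if load > share_threshold * shard_load[shard_of(tenant, n_shards)]
    )
```

Running this with one giant tenant among a thousand small ones flags the giant whether there are 8 shards or 16, which is the point made above: adding shards moves the hotspot, and only a dedicated shard (catalog override) removes it.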

3. A global rate limit (say 100k QPS total) already exists — why still mandate per-tenant limiting? How do they relate?

Global limiting protects the system from falling over; per-tenant limiting protects fairness between tenants — two orthogonal goals.

Global-only: a single tenant can legitimately consume 99% of the 100k QPS, and every other tenant gets 429'd. The system isn't down, but for everyone else it's unavailable. That's the noisy neighbor at the rate-limit layer.
Per-tenant caps give each tenant a ceiling (tiered by billing plan), ensuring no single tenant can monopolize global capacity.
Relationship: stacked. Per-tenant is the first line (fairness); global is the backstop (survival: load shedding for the extreme case where all tenants burst at once). Ideally add dynamic fair shares: when the system is idle, let tenants burst above base quota; when busy, fall back to guaranteed shares (weighted fair queuing).
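The "dynamic fair shares" idea can be illustrated with weighted max-min fair allocation (progressive filling). This is a static allocation sketch, not a weighted fair queuing scheduler itself; the weights stand in for billing-plan tiers and are assumptions.

```python
def fair_shares(capacity: float, demands: dict, weights: dict) -> dict:
    """Weighted max-min fairness via progressive filling."""
    alloc = {t: 0.0 for t in demands}
    active = {t for t, d in demands.items() if d > 0}
    remaining = capacity
    while active and remaining > 1e-9:
        total_w = sum(weights[t] for t in active)
        # Tenants whose demand fits inside their weighted share are fully served.
        satisfied = {
            t for t in active
            if demands[t] <= remaining * weights[t] / total_w
        }
        if not satisfied:
            # Everyone left wants more than their share: split by weight and stop.
            for t in active:
                alloc[t] = remaining * weights[t] / total_w
            break
        for t in satisfied:
            alloc[t] = demands[t]
            remaining -= demands[t]
        active -= satisfied
    return alloc
```

When demand is low, a tenant can take far more than an equal split because unused share is redistributed; under contention each tenant falls back to its weighted share, which is the behaviour described above.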

4. Usage billing wants exactly-once, but distributed systems can't truly deliver exactly-once. What now?

Combine "at-least-once delivery + downstream idempotent dedup" to achieve an effectively-once result — the standard pattern for distributed billing (same root as Day 8 message queues and Day 17 payment idempotency).

At-least-once delivery: better to duplicate than to lose (loss = undercharge, books don't balance). Producer retries, durable queue.
Idempotent dedup: each usage event carries a globally unique idempotency_key (e.g. tenant:request_id); aggregation dedups on it (DISTINCT / upsert), absorbing duplicates.
Recomputability (source of truth): persist raw usage events (not just the aggregate). On disputes, aggregation bugs, or late events, replay from the raw stream to recompute the bill. Aggregates are derived; raw events are the financial truth.
Time windows + late arrivals: use a watermark to handle events arriving across hour boundaries (cf. Day 20 stream processing), so yesterday's usage isn't counted into today.
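The watermark point can be sketched as follows: events are bucketed by their own timestamp, and an hourly window is finalized only once the watermark (here, the maximum event time seen minus an allowed lateness) passes the window's end. The 10-minute lateness and the class name are assumptions; events that arrive after their window closed are set aside for a billing adjustment rather than dropped.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

ALLOWED_LATENESS = timedelta(minutes=10)   # assumption: tolerate 10 minutes of lateness
HOUR = timedelta(hours=1)


class HourlyWindows:
    """Event-time hourly windows closed by a watermark."""

    def __init__(self):
        self.open = defaultdict(int)   # (tenant, metric, window_start) -> qty
        self.closed = {}               # finalized windows, ready to bill
        self.late = []                 # events whose window had already closed
        self.watermark = datetime.min.replace(tzinfo=timezone.utc)

    def add(self, tenant: str, metric: str, qty: int, ts: datetime) -> None:
        start = ts.replace(minute=0, second=0, microsecond=0)
        if start + HOUR <= self.watermark:
            self.late.append((tenant, metric, qty, ts))   # handle via adjustment/replay
            return
        self.open[(tenant, metric, start)] += qty         # keyed by event time, not arrival
        self.watermark = max(self.watermark, ts - ALLOWED_LATENESS)
        for key in [k for k in self.open if k[2] + HOUR <= self.watermark]:
            self.closed[key] = self.open.pop(key)
```

An event stamped 10:59 that arrives after an 11:02 event still lands in the 10:00 window, as long as the watermark has not yet passed 11:00.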

5. A pool shard has 5,000 tenants sharing one table, and you must migrate the schema for just one tenant (add a column + backfill). Why is this harder than in a single-tenant system?

The hard part isn't "add a column" — it's that operations on a shared table ripple to unrelated tenants.

Backfill hits everyone: a naive backfill UPDATE after adding the column scans the whole shared table (all 5,000 tenants), not just the target's rows, so locks, IO, and replication lag are paid by all, violating isolation. Batch and throttle the backfill, and make the WHERE carry tenant_id so it touches only the target's rows (the ADD COLUMN itself is table-level DDL you can't avoid).
DDL is table-level: in Postgres, even a metadata-only ADD COLUMN takes a brief ACCESS EXCLUSIVE lock on the table, which blocks reads and writes for all 5,000 tenants; if the DDL queues behind a long-running transaction, every later query on the table queues behind it too. Use online DDL tooling, or pick off-peak with a short lock_timeout + retry.
You can't change schema for just one tenant: pool shares the schema, so the new column is visible to everyone — a fundamental pool constraint. If one tenant needs a custom field, use a JSONB extension column, or treat it as the signal to move them to silo.
Vs single-tenant: a single-tenant migration affects only itself and can be scheduled into that tenant's own downtime window; under pool, any DDL is a shared change across all tenants, with far higher risk and coordination cost. This is precisely silo's hidden advantage in customization/migration flexibility.
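The "batch + throttle, filtered by tenant_id" backfill can be sketched like this. SQLite stands in for Postgres, and the table name `docs`, the column `new_col`, the batch size, and the pause are all placeholders; `value` must be non-NULL, since the loop uses `new_col IS NULL` to find remaining rows.

```python
import sqlite3
import time


def backfill_tenant(conn: sqlite3.Connection, tenant_id: str, value: str,
                    batch_size: int = 1000, pause_s: float = 0.05) -> int:
    """Backfill docs.new_col for one tenant in small batches; returns rows updated."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE docs SET new_col = ? WHERE rowid IN ("
            "  SELECT rowid FROM docs"
            "  WHERE tenant_id = ? AND new_col IS NULL LIMIT ?)",
            (value, tenant_id, batch_size),
        )
        conn.commit()                 # short transactions keep lock hold time low
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
        time.sleep(pause_s)           # throttle so other tenants keep their IO budget
```

Each batch is its own transaction, so a failure midway leaves completed batches in place and the function can simply be rerun to finish the rest.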