You inherit a platform with a $2M/month bill and ~3000 microservices on Kubernetes. The CFO wants unit cost (cost per request, cost per active user) down 30% within 12 months — but no service's p99 SLO may break. This isn't "turn off some boxes"; it's a constrained optimization: minimize resource cost subject to SLO. The headroom is huge — online clusters in the industry typically run at only 10–20% average CPU. Uber publicly raised fleet-wide utilization from ~20% to 31%, evidence that wasted headroom is the norm.
graph TD
subgraph OBS["Telemetry"]
M["Runtime metrics<br/>CPU · QPS · p99"]
BILL["Cloud bill<br/>Billing / CUR"]
end
M --> PLAN["Capacity planner<br/>forecast + queueing headroom"]
PLAN --> AS["Autoscaler<br/>reactive + predictive"]
AS --> SCHED["Scheduler / bin-packer<br/>Karpenter · Borg"]
SCHED --> FLEET[("Compute fleet<br/>RI · Savings Plan · Spot")]
FLEET --> M
BILL --> ATTR["Cost attribution<br/>tags · allocation"]
ATTR --> FIN["FinOps dashboard<br/>cost / request"]
FIN -.unit-cost feedback.-> PLAN
classDef obs fill:#0e2030,stroke:#5eead4,color:#e8eef5
classDef ctrl fill:#1a2530,stroke:#64c8ff,color:#e8eef5
classDef fleet fill:#1a1a30,stroke:#ffb450,color:#e8eef5
classDef fin fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
class M,BILL obs
class PLAN,AS,SCHED ctrl
class FLEET fleet
class ATTR,FIN fin
Left is the "capacity control loop" (metrics→plan→scale→schedule→fleet→back to metrics); right is the "cost loop" (bill→attribution→dashboard). The two couple through the cost/request unit metric.
Principle: Capacity planning computes "how much" under an SLO. Three inputs: peak load (avg_qps × peak_factor), per-instance capacity (measured by load test, not guessed from core count), and a target utilization. The counterintuitive part: target utilization must not be 100%. Queueing theory says that as utilization ρ approaches 1, queue wait grows ∝ 1/(1-ρ), which is non-linear (this is the exact M/M/1 form; M/M/c shows the same blow-up near saturation). Going from 70% to 90% takes that factor from about 3.3 to 10, roughly tripling wait time. Hence a knee: most online services target 50–70%, leaving headroom to absorb bursts and a single-AZ failure (N+1/N+2 redundancy).
# Capacity planning: instances needed at a target utilization (pseudo)
peak_qps = avg_qps * peak_factor # daily peak (e.g. 4x)
per_inst = load_test_throughput # QPS one instance serves
target_util = 0.65 # NOT 1.0 -- before the knee
need = ceil(peak_qps / (per_inst * target_util))
need += az_redundancy # N+1: extra instances sized to survive one AZ loss
# Verify: is p99 still within SLO at this util? Use the load-test curve,
# do NOT extrapolate linearly.
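The sizing rule and the queueing argument above can be tried out in a few lines of Python. This is a minimal sketch, not a production planner: the function names `instances_needed` and `erlang_c_mean_wait` are illustrative, the queue is assumed to be an idealized M/M/c (Poisson arrivals, exponential service times), and real services should still be validated against a measured load-test curve.

```python
import math


def instances_needed(avg_qps, peak_factor, per_instance_qps,
                     target_util=0.65, redundancy=1):
    """Instances required to serve peak load at the target utilization."""
    peak_qps = avg_qps * peak_factor
    base = math.ceil(peak_qps / (per_instance_qps * target_util))
    return base + redundancy


def erlang_c_mean_wait(arrival_rate, service_rate, servers):
    """Mean time spent waiting in queue for an M/M/c system (Erlang C)."""
    offered = arrival_rate / service_rate   # offered load a = lambda / mu
    rho = offered / servers                 # per-server utilization
    if rho >= 1.0:
        return math.inf                     # unstable: the queue grows forever
    term = 1.0                              # a^k / k!, starting at k = 0
    partial_sum = 0.0
    for k in range(servers):
        partial_sum += term
        term *= offered / (k + 1)
    # After the loop, term equals a^c / c!
    tail = term / (1.0 - rho)
    p_wait = tail / (partial_sum + tail)    # probability a request must queue
    return p_wait / (servers * service_rate - arrival_rate)


if __name__ == "__main__":
    print(instances_needed(avg_qps=1000, peak_factor=4, per_instance_qps=200))
    for util in (0.5, 0.7, 0.9, 0.95):
        wait = erlang_c_mean_wait(arrival_rate=util, service_rate=1.0, servers=1)
        print(f"rho={util:.2f}  mean wait={wait:.2f} service times")
```

With a single server the formula reduces to the M/M/1 result ρ/(1-ρ), which makes the knee easy to see; raising `servers` shows that pooling softens the curve but does not remove the blow-up near saturation.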
Principle: Autoscaling turns "static capacity" into "follow the load." Three paradigms: Reactive (react to live metrics, e.g. K8s HPA target-tracking adjusts replicas when a metric drifts from target), Predictive (forecast future load and scale ahead, e.g. Netflix Scryer), and Scheduled (known pattern, scale on a timer). Reactive's Achilles heel is lag: metric collection + decision + instance boot + warmup can take minutes, which is too slow for flash-sale-grade spikes.
# HPA target-tracking core formula
desired = ceil(current_replicas * (current_metric / target_metric))
# e.g. 10 replicas, CPU at 90%, target 60% -> ceil(10 * 1.5) = 15
# Anti-flap: scale-in uses max over last 5 min (stabilization window) + cooldown
# Key asymmetry: scale OUT fast, scale IN slow -- shrink too fast and the
# next burst breaks the SLO
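The target-tracking formula and the "scale out fast, scale in slow" asymmetry can be sketched in Python as follows. This is an illustration of the idea rather than the Kubernetes implementation: `desired_replicas` and `ScaleInStabilizer` are hypothetical names, and the 10% tolerance and 300-second window are assumed values chosen to mirror commonly documented HPA defaults.

```python
import math
from collections import deque


def desired_replicas(current_replicas, current_metric, target_metric,
                     tolerance=0.1):
    """Target-tracking replica count; no change inside the tolerance band."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


class ScaleInStabilizer:
    """Applies the highest recommendation seen within a sliding window.

    A higher recommendation takes effect immediately (fast scale-out),
    while a lower one only takes effect once every higher recommendation
    has aged out of the window (slow scale-in).
    """

    def __init__(self, window_seconds=300.0):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, recommendation) pairs

    def stabilize(self, now, recommendation):
        self._history.append((now, recommendation))
        cutoff = now - self.window_seconds
        while self._history and self._history[0][0] < cutoff:
            self._history.popleft()
        return max(rec for _, rec in self._history)


if __name__ == "__main__":
    stabilizer = ScaleInStabilizer(window_seconds=300.0)
    samples = [(0, 90.0), (60, 40.0), (400, 40.0)]  # (seconds, CPU %)
    replicas = 10
    for now, cpu in samples:
        raw = desired_replicas(replicas, cpu, target_metric=60.0)
        replicas = stabilizer.stabilize(now, raw)
        print(f"t={now:>3}s cpu={cpu:.0f}% raw={raw} applied={replicas}")
```

Passing metrics as percentages (90.0 and 60.0) rather than fractions keeps the ratio exact in floating point; with 0.9 / 0.6 the rounding error can push `ceil` one replica too high.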
Principle: With 3000 services and 500 teams, a single $2M bill is un-optimizable — you must attribute cost to team/product/request. FinOps centers on showback/chargeback: use resource tags (team, service, env) to slice the bill, then allocate shared cost (common LBs, control plane, cross-service DBs, Spot discounts) by some key. The destination is unit cost: cost per request, cost per active user — the only metric you can compare over time and put into a decision.
# Unit economics: turn the bill into a decision-grade metric
team_cost = direct_cost[team] + shared_cost * (team_usage / total_usage)
unit_cost = team_cost / (team_requests / 1000) # $ per 1k requests, comparable QoQ
# Trend > absolute: a rising cost/req means some service is getting less
# efficient -- alert on it
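A runnable version of the allocation above might look like the following sketch. The team names, dollar amounts, and the choice of a single proportional usage key are assumptions made for illustration; real allocation keys (CPU-hours, requests, bytes) vary by shared resource.

```python
def allocate_costs(direct_cost, shared_cost, usage):
    """Return each team's direct cost plus a usage-proportional share."""
    total_usage = sum(usage.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive")
    return {
        team: direct_cost.get(team, 0.0) + shared_cost * usage[team] / total_usage
        for team in usage
    }


def cost_per_1k_requests(team_cost, team_requests):
    """Unit cost in dollars per thousand requests."""
    if team_requests <= 0:
        raise ValueError("request count must be positive")
    return team_cost / (team_requests / 1000)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    direct = {"search": 100.0, "checkout": 300.0}
    usage = {"search": 1.0, "checkout": 3.0}   # e.g. relative CPU-hours
    costs = allocate_costs(direct, shared_cost=200.0, usage=usage)
    for team, cost in costs.items():
        print(team, cost, cost_per_1k_requests(cost, 1_000_000))
```

Because the shares sum to the full shared cost, the per-team totals always reconcile with the bill, which is what makes showback numbers defensible.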
Principle: Lifting utilization from 15% to 45% serves the same load on roughly one third of the capacity, a saving of about two thirds. Four levers: ① Rightsizing: requests/limits fit real usage (VPA, Autopilot auto-tune); ② Bin-packing: the scheduler packs pods tightly to cut fragmentation; ③ Consolidation: periodically repack and retire empty nodes; ④ Purchasing mix: baseline on Reserved/Savings Plans (~40–60% off but locked 1–3 yrs), elastic peaks on On-demand, fault-tolerant batch on Spot (~70–90% off but reclaimable anytime).
| Purchase type | Discount vs. on-demand | Trade-off / best fit |
|---|---|---|
| Reserved / Savings Plan | ~40–60% | Locked 1–3 yrs; mismatches when architecture changes; for the stable baseline |
| On-demand | 0 (list price) | Priciest but never reclaimed; for elastic peak backstop |
| Spot | ~70–90% | Reclaimed anytime (~2 min notice); only for interruptible/retryable work |
# Node purchasing: commit-discount baseline, Spot for elastic, On-demand backstop
if workload.interruptible and spot_available:
    launch(SPOT)       # 70-90% off; on reclaim notice -> drain gracefully
elif load <= committed_baseline:
    use(RESERVED)      # baseline locked, 40-60% off
else:
    launch(ON_DEMAND)  # peak backstop, priciest but never reclaimed
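The arithmetic behind the utilization and purchasing levers can be checked with a short Python sketch. The discount rates below are assumed midpoints of the ranges quoted above, the node counts are invented, and `fleet_fraction` and `blended_hourly_cost` are illustrative names rather than any cloud provider's API.

```python
def fleet_fraction(old_util, new_util):
    """Fraction of the original fleet still needed to serve the same load."""
    return old_util / new_util


def blended_hourly_cost(on_demand_rate, reserved_nodes, on_demand_nodes,
                        spot_nodes, reserved_discount=0.5, spot_discount=0.8):
    """Hourly fleet cost for a mix of Reserved, On-demand, and Spot nodes."""
    reserved = reserved_nodes * on_demand_rate * (1 - reserved_discount)
    on_demand = on_demand_nodes * on_demand_rate
    spot = spot_nodes * on_demand_rate * (1 - spot_discount)
    return reserved + on_demand + spot


if __name__ == "__main__":
    print(f"fleet needed after 15% -> 45%: {fleet_fraction(0.15, 0.45):.0%}")
    mixed = blended_hourly_cost(1.0, reserved_nodes=100, on_demand_nodes=20,
                                spot_nodes=50)
    all_on_demand = blended_hourly_cost(1.0, 0, 170, 0)
    print(f"mixed: ${mixed:.2f}/h vs all on-demand: ${all_on_demand:.2f}/h")
```

The two levers multiply: a smaller fleet from higher utilization is then priced at the blended rate, which is why rightsizing is usually done before committing to Reserved capacity.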
Likely interview follow-ups:
Raising target utilization from 70% to 90% looks like free savings, but the 1/(1-ρ) denominator says otherwise. At 70% the queue factor ∝ 1/0.3 ≈ 3.3; at 90% it is ∝ 1/0.1 = 10, so wait time roughly triples, and the curve steepens the closer you get to 1. The box savings are linear (~22% fewer nodes), but p99 degradation is non-linear and easily breaks the SLO. The real math: saved instance cost vs (SLO penalties + client retries amplified by timeouts + those retries raising ρ further into a positive feedback loop). Headroom's value is absorbing bursts and buffering the queue; cutting it as "waste" trades tail-latency stability for an accounting number. Bottom line: derive the utilization target from the latency curve under the SLO, not from "higher is better."
When a predictive autoscaler under-forecasts, the fleet is under-provisioned: scale-out can't catch up and the SLO breaks during the lag. This is worse than pure reactive scaling, because you "thought you didn't need to scale" and lowered the baseline. Fail-safe design: ① Asymmetric penalty: over-predicting only costs a bit, under-predicting violates the SLO, so build in a safety margin and bias toward slight over-estimation. ② Layered backstop: predictive sets only the capacity floor (raise the baseline ahead), reactive stays on top to catch spikes the forecast missed. ③ Confidence intervals: emit uncertainty; volatile windows auto-add headroom. ④ Fallback: on anomalous signals (missing data, model drift) degrade to pure reactive; never let one bad forecast pin capacity to the floor. The essence: treat prediction as the optimizer and "reactive + headroom" as the safety net.
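The layered design above (predictive floor, reactive on top, fallback to pure reactive) can be sketched as a single decision function. This is a minimal illustration under stated assumptions: `capacity_target` and its parameters are hypothetical, the forecast is assumed to arrive as a mean and standard deviation in QPS, and `z=2.0` is an arbitrary safety multiplier standing in for a confidence interval.

```python
import math


def capacity_target(reactive_replicas, forecast_mean_qps, forecast_std_qps,
                    per_replica_qps, target_util=0.65, z=2.0,
                    forecast_healthy=True):
    """Replica count from a predictive floor with a reactive backstop."""
    if not forecast_healthy or forecast_mean_qps is None:
        # Fallback: an untrusted forecast must never set capacity.
        return reactive_replicas
    # Bias toward over-estimation: plan for the upper end of the forecast,
    # so more volatile windows automatically receive more headroom.
    upper_qps = forecast_mean_qps + z * forecast_std_qps
    predictive_floor = math.ceil(upper_qps / (per_replica_qps * target_util))
    # The forecast can only raise capacity; reactive still wins on spikes.
    return max(predictive_floor, reactive_replicas)


if __name__ == "__main__":
    print(capacity_target(10, 2000.0, 250.0, 100.0, target_util=0.5))
    print(capacity_target(60, 2000.0, 250.0, 100.0, target_util=0.5))
    print(capacity_target(10, None, 0.0, 100.0, forecast_healthy=False))
```

Taking the maximum of the two signals is the key design choice: a bad forecast can waste some money by over-provisioning, but it cannot push capacity below what the live metrics demand.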
Pushing utilization and Spot share to the extreme is dangerous because efficiency and resilience are opposed. Squeezing out headroom and maxing Spot systematically removes every buffer: ① Zero fault-absorption: a single AZ failure hits already-full nodes; shifting that traffic triggers a cascade (no N+1 to catch it). ② Spot is a shared pool: AWS reclaims tend to be batched by instance type/AZ, you compete with everyone for the same pool, and a mass reclaim collapses capacity instantly. ③ Autoscale lag is deadlier at high utilization: there is no spare capacity to cover the boot-up minutes. So "extreme efficiency" suits interruptible batch, never the SLO-bearing online path. The right posture is tiering: core online services keep headroom + On-demand/RI floor, while elastic and offline workloads ride Spot and high utilization. Save where you can tolerate volatility.
Cost per request is a ratio, and both ends can be gamed. ① Denominator dilution: pump in cheap requests (health checks, polls, retries) and cost/req drops on paper while total cost holds or rises. ② Non-homogeneous requests: a search vs a static page differ 100x; the average hides "expensive requests," so look at the P50/P99 per-request cost distribution, not one mean. ③ Fixed cost dominates at low traffic: control plane and minimum replicas are fixed, so cost/req spikes in the trough without any inefficiency. ④ Cross-service shifting: A offloads compute to B; A's cost/req improves, the global total doesn't move. ⑤ Use marginal cost: "is this extra traffic worth taking" needs marginal, not amortized average, cost. In short: cost/req is great for trends and alerts, but decisions need to split by request type, separate fixed/marginal, and beware the denominator game.
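Three of these pitfalls (denominator dilution, fixed cost in the trough, and average vs marginal cost) are easy to see with numbers. The sketch below uses invented figures and a deliberately simple cost model of one fixed component plus a linear variable component; the function names are illustrative.

```python
def average_cost_per_1k(fixed_cost, variable_cost_per_1k, requests):
    """Amortized cost per thousand requests, fixed cost included."""
    thousands = requests / 1000
    return (fixed_cost + variable_cost_per_1k * thousands) / thousands


def diluted_cost_per_1k(total_cost, real_requests, padding_requests):
    """Reported unit cost when cheap padding traffic inflates the denominator."""
    return total_cost / ((real_requests + padding_requests) / 1000)


if __name__ == "__main__":
    fixed, variable = 1000.0, 0.10   # hypothetical: $1000 fixed, $0.10 per 1k
    for requests in (1_000_000, 100_000_000):
        avg = average_cost_per_1k(fixed, variable, requests)
        print(f"{requests:>11,} req: average ${avg:.2f}/1k, "
              f"marginal ${variable:.2f}/1k")
    # Same bill, twice the counted requests: the metric halves on paper.
    print(diluted_cost_per_1k(1100.0, 1_000_000, 0))
    print(diluted_cost_per_1k(1100.0, 1_000_000, 1_000_000))
```

In this model the marginal cost stays at the variable rate regardless of volume, while the average swings by an order of magnitude between trough and peak, which is why the two answer different questions.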
When every team independently moves its batch work to Spot and off-peak windows, the result is a classic tragedy of the commons plus a second-order effect. For the team alone: Spot + off-peak does lower the bill and the showback looks great. At platform scale: ① Spot is a shared capacity pool: everyone is incentivized to grab it, the night window quickly becomes a "peak" too, reclaim rates rise, available capacity falls, and nobody saves as expected. ② Nightly batch piles up, filling the original trough, erasing the peak-shaving dividend and possibly squeezing backup/maintenance windows. ③ It's an incentive-design fault: if the allocation key gives Spot discounts only to the user without reflecting the externality their crowding imposes, it rewards over-concentration. The fix is to bake a global view into attribution and scheduling: differentiated price signals for off-peak (busier windows cost more), platform-level Spot diversification and time-spreading of workloads, so local and global optima align. This is exactly how the choice of allocation key shapes behavior.