Day 27 · Hard · Cost Engineering · Capacity Planning · Autoscaling · FinOps

Cost & Capacity Engineering: Driving Unit Cost Down Without Breaking the SLO
Cost & Capacity Engineering: Planning, Attribution, Autoscaling, Efficiency

Scenario + Constraints

You inherit a platform with a $2M/month bill and ~3000 microservices on Kubernetes. The CFO wants unit cost (cost per request, cost per active user) down 30% within 12 months — but no service's p99 SLO may break. This isn't "turn off some boxes"; it's a constrained optimization: minimize resource cost subject to SLO. The headroom is huge — online clusters in the industry typically run at only 10–20% average CPU. Uber publicly raised fleet-wide utilization from ~20% to 31%, evidence that wasted headroom is the norm.

High-Level Architecture (Cost–Capacity Loop)

graph TD
    subgraph OBS["Telemetry"]
      M["Runtime metrics<br/>CPU · QPS · p99"]
      BILL["Cloud bill<br/>Billing / CUR"]
    end
    M --> PLAN["Capacity planner<br/>forecast + queueing headroom"]
    PLAN --> AS["Autoscaler<br/>reactive + predictive"]
    AS --> SCHED["Scheduler / bin-packer<br/>Karpenter · Borg"]
    SCHED --> FLEET[("Compute fleet<br/>RI · Savings Plan · Spot")]
    FLEET --> M
    BILL --> ATTR["Cost attribution<br/>tags · allocation"]
    ATTR --> FIN["FinOps dashboard<br/>cost / request"]
    FIN -.unit-cost feedback.-> PLAN
    classDef obs fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef ctrl fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef fleet fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef fin fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class M,BILL obs
    class PLAN,AS,SCHED ctrl
    class FLEET fleet
    class ATTR,FIN fin

Left is the "capacity control loop" (metrics→plan→scale→schedule→fleet→back to metrics); right is the "cost loop" (bill→attribution→dashboard). The two couple through the cost/request unit metric.

Key Technical Points

1. Capacity Planning: higher utilization isn't always cheaper

Principle: Capacity planning computes "how much" under an SLO. Three inputs: peak load (avg_qps × peak_factor), per-instance capacity (measured by load test, not guessed from core count), and a target utilization. The counterintuitive part: target utilization must not be 100%. Queueing theory says that as utilization ρ approaches 1, queue wait grows roughly ∝ 1/(1-ρ), which is non-linear (this is the M/M/1 scaling; the M/M/c wait formula keeps the same 1/(1-ρ) term). Going from 70% to 90% moves that factor from 1/0.3 ≈ 3.3 to 1/0.1 = 10, about a 3x increase in wait time. Hence a knee: most online services target 50–70%, leaving headroom to absorb bursts and a single-AZ failure (N+1/N+2 redundancy).
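To make the knee concrete, here is a minimal Python sketch of mean queue wait in an M/M/c system via the Erlang C formula. It assumes an idealized model (Poisson arrivals, exponential service times); the function names `erlang_c` and `mean_queue_wait` and the sample fleet of 10 instances at 100 req/s each are illustrative assumptions, not part of any library or of the scenario's measurements.

```python
def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arriving request has to queue in an M/M/c system.

    offered_load is a = arrival_rate / service_rate (in Erlangs).
    Uses the numerically stable Erlang B recursion, then converts B to C.
    """
    if offered_load >= servers:
        raise ValueError("unstable: utilization must be below 1")
    blocking = 1.0  # Erlang B with zero servers
    for k in range(1, servers + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)
    rho = offered_load / servers
    return blocking / (1.0 - rho * (1.0 - blocking))


def mean_queue_wait(servers: int, arrival_rate: float, service_rate: float) -> float:
    """Mean time a request spends waiting in queue (service time excluded)."""
    offered_load = arrival_rate / service_rate
    p_wait = erlang_c(servers, offered_load)
    return p_wait / (servers * service_rate - arrival_rate)


if __name__ == "__main__":
    servers, service_rate = 10, 100.0  # hypothetical fleet: 10 instances, 100 req/s each
    for util in (0.5, 0.7, 0.8, 0.9, 0.95):
        arrival_rate = util * servers * service_rate
        wait_ms = 1000 * mean_queue_wait(servers, arrival_rate, service_rate)
        print(f"util={util:.0%}  mean queue wait={wait_ms:.2f} ms")
```

The model only shows the shape of the curve. The utilization target itself should still come from the measured load-test latency curve, since real traffic is rarely Poisson.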

Trade-off:
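# Capacity planning: instances needed at a target utilization
# (input values are illustrative placeholders, not measurements)
from math import ceil

avg_qps, peak_factor = 5_000, 4                # daily peak (e.g. 4x)
load_test_throughput = 400                     # QPS one instance serves, from a load test
az_redundancy = 39                             # extra instances; here need / 2 for 3 AZs

peak_qps    = avg_qps * peak_factor
per_inst    = load_test_throughput
target_util = 0.65                             # NOT 1.0 -- before the knee
need = ceil(peak_qps / (per_inst * target_util))
need += az_redundancy                          # N+1: survive one AZ loss
# Verify: is p99 still within SLO at this util? Use the load-test curve,
#         do NOT extrapolate linearly.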
Real-world:

2. Autoscaling: Reactive, Predictive, Scheduled

Principle: Autoscaling turns "static capacity" into "follow the load." Three paradigms: Reactive (react to live metrics, e.g. K8s HPA target-tracking adjusts replicas when a metric drifts from target), Predictive (forecast future load and scale ahead, e.g. Netflix Scryer), and Scheduled (known pattern, scale on a timer). Reactive's Achilles heel is lag: metric collection + decision + instance boot + warmup can take minutes — too slow for flash-sale-grade spikes.

Trade-off:
# HPA target-tracking core formula
desired = ceil(current_replicas * (current_metric / target_metric))
# e.g. 10 replicas, CPU at 90%, target 60% -> ceil(10 * 1.5) = 15
# Anti-flap: scale-in uses max over last 5 min (stabilization window) + cooldown
# Key asymmetry: scale OUT fast, scale IN slow -- shrink too fast and the
#                next burst breaks the SLO
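A runnable Python sketch of the formula above combined with scale-in stabilization. The 10% tolerance and 300-second scale-in window follow the documented Kubernetes HPA defaults; the class `ReplicaRecommender` and its method names are hypothetical, and this is a simplification of the idea, not the controller's actual code.

```python
import math
from collections import deque


class ReplicaRecommender:
    """Target tracking with immediate scale-out and stabilized scale-in."""

    def __init__(self, target_metric: float, tolerance: float = 0.10,
                 scale_in_window_s: float = 300.0):
        self.target_metric = target_metric
        self.tolerance = tolerance
        self.scale_in_window_s = scale_in_window_s
        self._history = deque()  # (timestamp_s, raw recommendation)

    def raw_desired(self, current_replicas: int, current_metric: float) -> int:
        ratio = current_metric / self.target_metric
        if abs(ratio - 1.0) <= self.tolerance:
            return current_replicas  # inside the dead band: no change
        return math.ceil(current_replicas * ratio)

    def recommend(self, now_s: float, current_replicas: int,
                  current_metric: float) -> int:
        raw = self.raw_desired(current_replicas, current_metric)
        self._history.append((now_s, raw))
        # Drop recommendations that fell out of the stabilization window.
        while self._history[0][0] < now_s - self.scale_in_window_s:
            self._history.popleft()
        # Max over the window: a higher recommendation applies at once,
        # a lower one only after every recent recommendation agrees.
        return max(r for _, r in self._history)
```

Calling `recommend` once per control-loop tick reproduces the asymmetry: a spike raises the replica count on the same tick, while a dip has to persist for the whole window before capacity is released.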
Real-world:

3. Cost Attribution: no unit economics, no optimization

Principle: With 3000 services and 500 teams, a single $2M bill is un-optimizable — you must attribute cost to team/product/request. FinOps centers on showback/chargeback: use resource tags (team, service, env) to slice the bill, then allocate shared cost (common LBs, control plane, cross-service DBs, Spot discounts) by some key. The destination is unit cost: cost per request, cost per active user — the only metric you can compare over time and put into a decision.

Trade-off (allocating shared cost):
# Unit economics: turn the bill into a decision-grade metric
team_cost = direct_cost[team] + shared_cost * (team_usage / total_usage)
unit_cost = team_cost / (team_requests / 1000)   # $ / 1k req -- comparable QoQ
# Trend > absolute: a rising cost/req means some service is getting less
#                    efficient -- alert on it
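The allocation rule above can be sketched as runnable Python. The team names, dollar amounts, and the choice of CPU-hours as the allocation key are made-up illustrations; `allocate_costs` and `cost_per_1k_requests` are hypothetical helper names.

```python
def allocate_costs(direct_cost: dict, usage: dict, shared_cost: float) -> dict:
    """Add each team's share of shared_cost, proportional to its usage key."""
    total_usage = sum(usage.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive")
    return {
        team: direct_cost.get(team, 0.0) + shared_cost * usage[team] / total_usage
        for team in usage
    }


def cost_per_1k_requests(team_cost: float, team_requests: int) -> float:
    """Unit cost in dollars per 1,000 requests."""
    if team_requests <= 0:
        raise ValueError("team_requests must be positive")
    return team_cost / (team_requests / 1000)


if __name__ == "__main__":
    # Hypothetical monthly figures for two teams.
    direct = {"search": 120_000.0, "checkout": 60_000.0}
    cpu_hours = {"search": 300_000, "checkout": 100_000}  # allocation key
    requests = {"search": 9_000_000_000, "checkout": 1_500_000_000}
    totals = allocate_costs(direct, cpu_hours, shared_cost=40_000.0)
    for team, cost in totals.items():
        unit = cost_per_1k_requests(cost, requests[team])
        print(f"{team}: total ${cost:,.0f}, ${unit:.5f} per 1k req")
```

Swapping the allocation key (CPU-hours, requests, headcount) changes each team's share without changing the total, which is why the key is a policy decision and not just arithmetic.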
Real-world:

4. Resource Efficiency: rightsizing, bin-packing, three purchasing tiers

Principle: Lifting utilization from 15% to 45% means the same work fits on one third of the machines, roughly a 67% compute saving. Four levers: ① Rightsizing: requests/limits fit real usage (VPA, Autopilot auto-tune); ② Bin-packing: the scheduler packs pods tightly to cut fragmentation; ③ Consolidation: periodically repack and retire empty nodes; ④ Purchasing mix: baseline on Reserved/Savings Plans (~40–60% off but locked 1–3 yrs), elastic peaks on On-demand, fault-tolerant batch on Spot (~70–90% off but reclaimable anytime).

Purchase type | Savings | Cost / fit
Reserved / Savings Plan | ~40–60% | Locked 1–3 yrs; mismatches when architecture changes; for the stable baseline
On-demand | 0 (list price) | Priciest but never reclaimed; for elastic peak backstop
Spot | ~70–90% | Reclaimed anytime (~2 min notice); only for interruptible/retryable work
# Node purchasing: commit-discount baseline, Spot for interruptible work, On-demand backstop
if workload.interruptible and spot_available:
    launch(SPOT)            # 70-90% off; on reclaim notice -> drain gracefully
elif load <= committed_baseline:
    use(RESERVED)           # baseline locked, 40-60% off
else:
    launch(ON_DEMAND)       # peak backstop, priciest but never reclaimed
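To see how the purchasing mix moves the bill, the sketch below prices a fleet split across the three tiers. The default discounts (50% committed, 80% Spot) are assumptions picked from inside the ranges quoted above, and `PurchaseMix`, `blended_hourly_cost`, and `savings_vs_all_on_demand` are illustrative names, not a cloud provider API.

```python
from dataclasses import dataclass


@dataclass
class PurchaseMix:
    committed_nodes: int  # Reserved / Savings Plan baseline
    on_demand_nodes: int  # elastic peak backstop
    spot_nodes: int       # interruptible batch


def blended_hourly_cost(mix: PurchaseMix, list_price: float,
                        committed_discount: float = 0.50,
                        spot_discount: float = 0.80) -> float:
    """Hourly cost of the mix, given a per-node On-demand list price."""
    committed = mix.committed_nodes * list_price * (1 - committed_discount)
    on_demand = mix.on_demand_nodes * list_price
    spot = mix.spot_nodes * list_price * (1 - spot_discount)
    return committed + on_demand + spot


def savings_vs_all_on_demand(mix: PurchaseMix, list_price: float) -> float:
    """Fraction saved compared with buying every node at list price."""
    total_nodes = mix.committed_nodes + mix.on_demand_nodes + mix.spot_nodes
    baseline = total_nodes * list_price
    return 1 - blended_hourly_cost(mix, list_price) / baseline
```

One caveat the sketch leaves out: committed capacity is billed whether or not it is used, so the committed tier only pays off while the baseline really stays that high.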
Real-world:

Scaling & Evolution

Common Pitfalls + Interview Questions

Pitfall 1: pushing utilization toward 100%. This ignores the queueing knee: p99 degrades non-linearly past ~80%, and SLO penalties claw back the box savings.
Pitfall 2: too-aggressive scale-in. Capacity shrinks, a burst hits, and the service flaps repeatedly, which is worse than not shrinking at all; the rule is scale out fast, scale in slow.
Pitfall 3: watching total cost, not unit cost. If the business grows 2x and cost grows 2x, unit cost is flat and that is healthy; track the cost/req trend, not the absolute bill.
Pitfall 4: using Spot like On-demand. Put stateful/non-interruptible services on Spot and a mass reclaim drops data/capacity.
Pitfall 5: incomplete tags / over-buying reservations. 20–30% of cost in the unallocated bucket breaks attribution; a 3-yr RI can cost more than On-demand once architecture shifts.

Likely interview follow-ups:

  1. Average CPU is 15% and the boss wants to cut machines in half — how do you decide how much you can cut? (headroom, peak factor, queueing knee, N+1)
  2. When does reactive vs predictive autoscaling fit? How do you combine them? What about cold-start lag?
  3. How do you attribute a $2M bill to 500 teams? How do you fairly allocate shared cost (LBs, control plane, Spot discounts)?
  4. What workloads suit Spot? How do you handle reclaim gracefully and bound reclaim risk?
  5. Cost per request suddenly jumps 20%: how do you localize which service and why?

Deeper Resources

Going Deeper

1. Pushing utilization from 70% to 90% saves machines — why might p99 make you lose more money?

It's the 1/(1-ρ) denominator. At 70% the queue factor ∝ 1/0.3 ≈ 3.3; at 90% ∝ 1/0.1 = 10 — wait time roughly triples, and it steepens the closer you get to 1. The box savings are linear (~22% fewer nodes), but p99 degradation is non-linear and easily breaks the SLO. The real math: saved instance cost vs (SLO penalties + client retries amplified by timeouts + those retries raising ρ further into a positive feedback loop). Headroom's value is absorbing bursts and buffering the queue; cutting it as "waste" trades tail-latency stability for an accounting number. Bottom line: derive the utilization target from the latency curve under the SLO, not from "higher is better."
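The arithmetic in this answer can be reproduced with a short Python sketch. It uses only the simplified 1/(1-ρ) factor quoted above, so treat the output as the shape of the trade-off and not a latency prediction; `node_savings` and `queue_factor` are illustrative names.

```python
def node_savings(util_from: float, util_to: float) -> float:
    """Fraction of nodes removed when the same load runs at a higher utilization."""
    if not 0 < util_from <= util_to <= 1:
        raise ValueError("expected 0 < util_from <= util_to <= 1")
    return 1 - util_from / util_to


def queue_factor(util: float) -> float:
    """The 1/(1 - rho) term that scales queue wait as utilization rises."""
    if not 0 <= util < 1:
        raise ValueError("utilization must be in [0, 1)")
    return 1 / (1 - util)


if __name__ == "__main__":
    base = 0.70
    for target in (0.75, 0.80, 0.85, 0.90, 0.95):
        saved = node_savings(base, target)
        wait_multiple = queue_factor(target) / queue_factor(base)
        print(f"{base:.0%} -> {target:.0%}: nodes saved {saved:.1%}, "
              f"queue wait x{wait_multiple:.1f}")
```

Node savings flatten out as the target rises while the wait multiple keeps accelerating, which is the asymmetry the answer describes.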

2. A predictive autoscaler under-predicts a spike — what happens, and how do you make it "safe when wrong"?

Under-provisioning means scale-out can't catch up and the SLO breaks during the lag — worse than pure reactive, because you "thought you didn't need to scale" and lowered the baseline. Fail-safe design: ① asymmetric penalty — over-predict only costs a bit, under-predict violates the SLO, so build in safety margin and bias toward slight over-estimation. ② Layered backstop — predictive sets only the capacity floor (raise the baseline ahead), reactive stays on top to catch spikes the forecast missed. ③ Confidence intervals — emit uncertainty; volatile windows auto-add headroom. ④ Fallback — on anomalous signals (missing data, model drift) degrade to pure reactive; never let one bad forecast pin capacity to the floor. The essence: treat prediction as the optimizer and "reactive + headroom" as the safety net.
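A minimal Python sketch of the layered backstop, assuming the forecaster emits an upper-quantile QPS estimate. The function names, the 15% safety margin, and the convention that an unusable forecast yields a floor of zero are all assumptions made for illustration.

```python
import math
from typing import Optional


def predictive_floor(forecast_qps_upper: Optional[float], per_instance_qps: float,
                     target_util: float, safety_margin: float = 0.15) -> int:
    """Replica floor from an upper-quantile forecast; 0 if the forecast is unusable."""
    if (forecast_qps_upper is None or math.isnan(forecast_qps_upper)
            or forecast_qps_upper < 0):
        return 0  # fallback: defer to the reactive layer and the static minimum
    padded = forecast_qps_upper * (1 + safety_margin)  # bias toward over-estimation
    return math.ceil(padded / (per_instance_qps * target_util))


def desired_replicas(reactive_replicas: int, floor_from_forecast: int,
                     static_min: int) -> int:
    """Prediction may only raise capacity; it never pulls reactive scaling down."""
    return max(reactive_replicas, floor_from_forecast, static_min)
```

Because the final value is a `max`, a forecast that is too low costs nothing beyond what pure reactive scaling would have done, and a forecast that is too high only costs some extra capacity. That is the asymmetric penalty expressed in code.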

3. Very high utilization and a very high Spot share look thrifty. When does "extreme efficiency" bite back? (echoes Day 23 Reliability)

Efficiency and resilience are opposed. Squeezing out headroom and maxing Spot systematically removes every buffer: ① zero fault-absorption — a single AZ failure hits already-full nodes; shifting that traffic triggers a cascade (no N+1 to catch it). ② Spot is a shared pool — AWS reclaims tend to be batched by instance type/AZ, you compete with everyone for the same pool, and a mass reclaim collapses capacity instantly. ③ Autoscale lag is deadlier at high utilization — no spare to cover the boot-up minutes. So "extreme efficiency" suits interruptible batch, never the SLO-bearing online path. The right posture is tiering: core online services keep headroom + On-demand/RI floor, while elastic and offline workloads ride Spot and high utilization. Save where you can tolerate volatility.
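The first point can be quantified with a small Python sketch. It assumes equally sized zones and traffic that redistributes evenly across the survivors; both function names are hypothetical.

```python
def utilization_after_az_loss(current_util: float, az_count: int) -> float:
    """Utilization of the surviving zones after one zone fails."""
    if az_count < 2:
        raise ValueError("need at least two zones to survive a zone loss")
    return current_util * az_count / (az_count - 1)


def max_steady_state_util(post_failure_ceiling: float, az_count: int) -> float:
    """Highest normal-operation utilization that stays under the ceiling
    once a single zone is lost."""
    if az_count < 2:
        raise ValueError("need at least two zones to survive a zone loss")
    return post_failure_ceiling * (az_count - 1) / az_count


if __name__ == "__main__":
    for util in (0.50, 0.65, 0.80):
        after = utilization_after_az_loss(util, az_count=3)
        print(f"3 AZs at {util:.0%} -> {after:.1%} on the 2 survivors")
```

A result above 100% means the survivors cannot carry the load at all, which is the cascade scenario described above.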

4. Cost per request is a good metric. When does it lie to you?

It's a ratio, and both ends can be gamed. ① Denominator dilution — pump in cheap requests (health checks, polls, retries) and cost/req drops on paper while total cost holds or rises. ② Non-homogeneous requests — a search vs a static page differ 100x; the average hides "expensive requests," so look at the P50/P99 per-request cost distribution, not one mean. ③ Fixed cost dominates at low traffic — control plane and minimum replicas are fixed, so cost/req spikes in the trough without any inefficiency. ④ Cross-service shifting — A offloads compute to B; A's cost/req improves, the global total doesn't move. ⑤ Use marginal cost — "is this extra traffic worth taking" needs marginal, not amortized average, cost. In short: cost/req is great for trends and alerts, but decisions need to split by request type, separate fixed/marginal, and beware the denominator game.
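Two of these failure modes, fixed cost at low traffic and denominator dilution, are easy to show numerically. The sketch below is illustrative only: the function names and the cost model (one fixed term plus a constant marginal cost per request) are assumptions, not a billing formula.

```python
def cost_per_request(fixed_cost: float, marginal_cost_per_req: float,
                     requests: int) -> float:
    """Amortized average: fixed cost is spread over however many requests arrived."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return fixed_cost / requests + marginal_cost_per_req


def cost_per_real_request(total_cost: float, total_requests: int,
                          synthetic_requests: int) -> float:
    """Unit cost after removing health checks, polls, and retries from the denominator."""
    real_requests = total_requests - synthetic_requests
    if real_requests <= 0:
        raise ValueError("no real requests left in the denominator")
    return total_cost / real_requests


if __name__ == "__main__":
    # Hypothetical service: $1,000 fixed, $0.001 marginal per request.
    for requests in (100_000, 1_000_000, 10_000_000):
        avg = cost_per_request(1_000.0, 0.001, requests)
        print(f"{requests:>10,} req: average ${avg:.4f}/req, marginal $0.0010/req")
```

The marginal cost never changes across the three rows while the average moves by roughly an order of magnitude, which is why trough-hour cost/req should not be read as inefficiency.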

5. A team dumps all its work onto Spot and nightly batch to cut its own bill. Locally optimal — good or bad for the whole platform?

A classic tragedy of the commons + second-order effect. For the team alone: Spot + off-peak does lower the bill and the showback looks great. At platform scale: ① Spot is a shared capacity pool — everyone is incentivized to grab it, the night window quickly becomes a "peak" too, reclaim rates rise, available capacity falls, and nobody saves as expected. ② Nightly batch piles up, filling the original trough, erasing the peak-shaving dividend and possibly squeezing backup/maintenance windows. ③ It's an incentive-design fault — if the allocation key gives Spot discounts only to the user without reflecting the externality their crowding imposes, it rewards over-concentration. The fix is to bake a global view into attribution and scheduling: differentiated price signals for off-peak (busier windows cost more), platform-level Spot diversification and time-spreading of workloads, so local and global optima align. This is exactly how the choice of allocation key shapes behavior.