Day 27 · Hard · Cost Engineering · Capacity Planning · Autoscaling · FinOps

Cost & Capacity Engineering: Driving Unit Cost Down Without Breaking the SLO

Cost & Capacity Engineering: Planning, Attribution, Autoscaling, Efficiency

Scenario + Constraints

You inherit a platform with a $2M/month bill and ~3000 microservices on Kubernetes. The CFO wants unit cost (cost per request, cost per active user) down 30% within 12 months — but no service's p99 SLO may break. This isn't "turn off some boxes"; it's a constrained optimization: minimize resource cost subject to SLO. The headroom is huge — online clusters in the industry typically run at only 10–20% average CPU. Uber publicly raised fleet-wide utilization from ~20% to 31%, evidence that wasted headroom is the norm.

Goal: unit cost ↓30% with SLOs intact (cost and latency are opposing constraints).
Load shape: 4:1 daily peak-to-trough, up to 10:1 on promos — buying statically for the peak burns money year-round.
Status quo: ~15% average CPU; heavy over-requesting (requests far above real usage).
Attribution: 3000 services, 500 teams — cost must slice to team/product/request granularity.
Timeliness: scale-out must finish before the SLO breaks (minutes); scale-in must be stable, not jittery.

High-Level Architecture (Cost–Capacity Loop)

graph TD
    subgraph OBS["Telemetry"]
      M["Runtime metrics<br/>CPU · QPS · p99"]
      BILL["Cloud bill<br/>Billing / CUR"]
    end
    M --> PLAN["Capacity planner<br/>forecast + queueing headroom"]
    PLAN --> AS["Autoscaler<br/>reactive + predictive"]
    AS --> SCHED["Scheduler / bin-packer<br/>Karpenter · Borg"]
    SCHED --> FLEET[("Compute fleet<br/>RI · Savings Plan · Spot")]
    FLEET --> M
    BILL --> ATTR["Cost attribution<br/>tags · allocation"]
    ATTR --> FIN["FinOps dashboard<br/>cost / request"]
    FIN -.unit-cost feedback.-> PLAN

    classDef obs fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef ctrl fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef fleet fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef fin fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class M,BILL obs
    class PLAN,AS,SCHED ctrl
    class FLEET fleet
    class ATTR,FIN fin

Left is the "capacity control loop" (metrics→plan→scale→schedule→fleet→back to metrics); right is the "cost loop" (bill→attribution→dashboard). The two couple through the cost/request unit metric.

Key Technical Points

1. Capacity Planning: higher utilization isn't always cheaper

Principle: Capacity planning computes "how much" under an SLO. Three inputs: peak load (avg_qps × peak_factor), per-instance capacity (measured by load test, not guessed from core count), and a target utilization. The counterintuitive part: target utilization must not be 100%. Queueing theory says that as utilization ρ approaches 1, queue wait grows non-linearly: roughly ∝ 1/(1-ρ) in the single-server M/M/1 case, and multi-server M/M/c queues show the same blow-up near saturation. Going from 70% to 90% multiplies wait time roughly threefold or more. Hence a knee: most online services target 50–70%, leaving headroom to absorb bursts and a single-AZ failure (N+1/N+2 redundancy).
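To put numbers on the knee, the sketch below evaluates the textbook Erlang C formula for an M/M/c queue. It assumes Poisson arrivals and exponentially distributed service times, which real traffic rarely satisfies, so read it as an illustration of the shape of the curve rather than a sizing tool. The function names are illustrative.

```python
def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arriving request must queue in an M/M/c system.

    offered_load is a = arrival_rate / service_rate (in Erlangs).
    """
    if offered_load >= servers:
        raise ValueError("unstable: utilization must stay below 1")
    # Erlang B via the numerically stable recurrence, then convert to Erlang C.
    b = 1.0
    for k in range(1, servers + 1):
        b = offered_load * b / (k + offered_load * b)
    rho = offered_load / servers
    return b / (1.0 - rho * (1.0 - b))


def mean_wait_in_service_times(servers: int, utilization: float) -> float:
    """Mean time spent queueing, as a multiple of the mean service time."""
    offered_load = servers * utilization
    return erlang_c(servers, offered_load) / (servers * (1.0 - utilization))


if __name__ == "__main__":
    for c in (1, 16):
        for rho in (0.5, 0.7, 0.9):
            print(c, rho, round(mean_wait_in_service_times(c, rho), 3))
```

With a single server the queue wait is ρ/(1-ρ) service times, about 2.3 at 70% and 9 at 90%. Adding servers lowers the absolute wait, but the ratio between 90% and 70% stays above 3x for any server count, because the probability of queueing also rises with utilization.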

Trade-off:

Static over-provisioning: ✅ simple, burst-proof; ❌ low utilization, burns money year-round.
Hug the peak + elasticity: ✅ cheap; ❌ scale-out has lag, steep bursts break the SLO.
Queueing-model headroom: ✅ theoretically optimal; ❌ needs an accurate service-time distribution; heavy-tailed traffic breaks the model.

# Capacity planning: instances needed at a target utilization (pseudo)
peak_qps    = avg_qps * peak_factor            # daily peak (e.g. 4x)
per_inst    = load_test_throughput             # QPS one instance serves
target_util = 0.65                             # NOT 1.0 -- before the knee
need = ceil(peak_qps / (per_inst * target_util))
need += az_redundancy                          # N+1: survive one AZ loss
# Verify: is p99 still within SLO at this util? Use the load-test curve,
#         do NOT extrapolate linearly.
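A runnable version of the same calculation, as a sketch. One assumption goes beyond the pseudocode: the redundancy term is interpreted as "the zones that remain after losing one must still carry the full peak", which is only one way to read N+1. The parameter names and example numbers are hypothetical.

```python
import math


def instances_needed(avg_qps: float, peak_factor: float,
                     per_instance_qps: float, target_util: float,
                     zones: int) -> int:
    """Instances to provision so the peak fits at target_util after one zone is lost."""
    if zones < 2:
        raise ValueError("zone-level N+1 needs at least two zones")
    if not 0.0 < target_util < 1.0:
        raise ValueError("target_util must be strictly between 0 and 1")
    peak_qps = avg_qps * peak_factor
    # Instances required to serve the peak while staying below the knee.
    base = math.ceil(peak_qps / (per_instance_qps * target_util))
    # Assumption: spread evenly so that (zones - 1) zones still hold `base`.
    per_zone = math.ceil(base / (zones - 1))
    return per_zone * zones


if __name__ == "__main__":
    # 5,000 avg QPS, 4x daily peak, 400 QPS per instance from a load test.
    print(instances_needed(5000, 4, 400, 0.65, zones=3))
```

The result still needs the check from the comment above: confirm against the load-test latency curve that p99 holds at the chosen utilization.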

Real-world:

Uber: the Capacity Recommendation Engine (CRE) uses throughput- and utilization-based ML to recommend capacity; a separate post reports saving ~70K cores across 30 core services via Go GOGC tuning, a measure of how much waste was sitting in the headroom.
Google: the Autopilot paper states that users over-request out of fear of OOM kills, the main driver of cluster waste.

2. Autoscaling: Reactive, Predictive, Scheduled

Principle: Autoscaling turns "static capacity" into "follow the load." Three paradigms: Reactive (react to live metrics, e.g. K8s HPA target-tracking adjusts replicas when a metric drifts from target), Predictive (forecast future load and scale ahead, e.g. Netflix Scryer), and Scheduled (known pattern, scale on a timer). Reactive's Achilles heel is lag: metric collection + decision + instance boot + warmup can take minutes — too slow for flash-sale-grade spikes.

Trade-off:

Reactive (HPA): ✅ simple, general, follows real load; ❌ lagging, prone to flapping, SLO may break during cold start.
Predictive (Scryer): ✅ scales ahead, kills the lag, smooth curve; ❌ a wrong forecast over/under-provisions; needs regular traffic.
Scheduled: ✅ zero dependencies, reliable for predictable peaks; ❌ useless for irregular events.

# HPA target-tracking core formula
desired = ceil(current_replicas * (current_metric / target_metric))
# e.g. 10 replicas, CPU at 90%, target 60% -> ceil(10 * 1.5) = 15
# Anti-flap: scale-in uses max over last 5 min (stabilization window) + cooldown
# Key asymmetry: scale OUT fast, scale IN slow -- shrink too fast and the
#                next burst breaks the SLO
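The formula and the anti-flap rule can be sketched as a small simulation. This is a simplified model, not the Kubernetes controller itself: the 10% tolerance and the 300-second scale-in window mirror the documented HPA defaults, while the class and function names are made up for illustration.

```python
import math
from collections import deque


def desired_replicas(current: int, metric: float, target: float,
                     tolerance: float = 0.1) -> int:
    """Target-tracking step: scale replicas by the metric/target ratio."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: do nothing, avoids jitter
    return math.ceil(current * ratio)


class ScaleInStabilizer:
    """Apply the highest recommendation seen in the window.

    Scale-out takes effect immediately (a new maximum wins at once);
    scale-in only happens after every higher recommendation has aged out.
    """

    def __init__(self, window_s: int = 300):
        self.window_s = window_s
        self.history = deque()  # (timestamp_s, recommendation)

    def decide(self, now_s: int, recommendation: int) -> int:
        self.history.append((now_s, recommendation))
        while self.history[0][0] < now_s - self.window_s:
            self.history.popleft()
        return max(rec for _, rec in self.history)


if __name__ == "__main__":
    stab = ScaleInStabilizer()
    print(stab.decide(0, desired_replicas(10, 90, 60)))   # scale out at once
    print(stab.decide(60, desired_replicas(15, 30, 60)))  # scale-in held back
```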

Real-world:

Netflix: Scryer adds predictive scaling on top of Amazon Auto Scaling because reactive was too slow for its predictable morning/evening peaks; in production it's stacked as predictive baseline + reactive backstop.
Kubernetes: HPA is the standard target-tracking implementation; KEDA extends it to event metrics like queue depth (closer to real backlog).

3. Cost Attribution: no unit economics, no optimization

Principle: With 3000 services and 500 teams, a single $2M bill is un-optimizable — you must attribute cost to team/product/request. FinOps centers on showback/chargeback: use resource tags (team, service, env) to slice the bill, then allocate shared cost (common LBs, control plane, cross-service DBs, Spot discounts) by some key. The destination is unit cost: cost per request, cost per active user — the only metric you can compare over time and put into a decision.

Trade-off (allocating shared cost):

By usage (CPU·sec / request count): ✅ fair, aligns incentives; ❌ needs fine-grained metering; shared resources are hard to meter.
Even split (per team): ✅ simple; ❌ unfair, penalizes small teams, zero optimization incentive.
No allocation (an unallocated bucket): ✅ free; ❌ shared cost is often 20–30% — invisible means unmanaged.

# Unit economics: turn the bill into a decision-grade metric
team_cost = direct_cost[team] + shared_cost * (team_usage / total_usage)
unit_cost = team_cost / team_requests * 1000 # $ / 1k req -- comparable QoQ
# Trend > absolute: a rising cost/req means some service is getting less
#                    efficient -- alert on it
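A runnable sketch of usage-based allocation, assuming CPU-seconds as the allocation key. The team names and dollar figures are invented for the example; any metered quantity could stand in for the key.

```python
def allocate(direct_cost: dict, usage: dict, shared_cost: float) -> dict:
    """Add each team's usage-proportional share of shared cost to its direct cost."""
    total_usage = sum(usage.values())
    if total_usage <= 0:
        raise ValueError("no metered usage to allocate against")
    return {
        team: direct_cost.get(team, 0.0) + shared_cost * used / total_usage
        for team, used in usage.items()
    }


def unit_cost_per_1k(team_cost: float, requests: int) -> float:
    """Dollars per 1,000 requests."""
    return team_cost / requests * 1000


if __name__ == "__main__":
    direct = {"search": 60_000.0, "checkout": 30_000.0}  # tagged spend, $
    cpu_seconds = {"search": 750.0, "checkout": 250.0}   # allocation key
    costs = allocate(direct, cpu_seconds, shared_cost=20_000.0)
    print(costs)
    print(unit_cost_per_1k(costs["search"], 1_500_000_000))
```

A useful sanity check on any allocation scheme is that the allocated totals sum back to the bill, so nothing leaks into an unallocated bucket.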

Real-world:

Spotify: the Cost Insights plugin (open-sourced in Backstage) puts cloud cost right into engineers' daily service catalog, attributed per pipeline/service, so engineers — not finance — drive optimization; Spotify reports saving millions/year by "shifting cost left."
FinOps Foundation: systematizes showback/chargeback and unit economics into a cross-company common language.

4. Resource Efficiency: rightsizing, bin-packing, three purchasing tiers

Principle: Lifting utilization from 15% to 45% serves the same load on roughly one third of the machines (about a 67% saving on compute). Four levers: ① Rightsizing: requests/limits fit real usage (VPA, Autopilot auto-tune); ② Bin-packing: the scheduler packs pods tightly to cut fragmentation; ③ Consolidation: periodically repack and retire empty nodes; ④ Purchasing mix: baseline on Reserved/Savings Plans (~40–60% off but locked 1–3 yrs), elastic peaks on On-demand, fault-tolerant batch on Spot (~70–90% off but reclaimable anytime).

Purchase type	Savings vs On-demand	Constraints / best fit
Reserved / Savings Plan	~40–60%	Locked 1–3 yrs; mismatches when architecture changes; for the stable baseline
On-demand	0 (list price)	Priciest but never reclaimed; for elastic peak backstop
Spot	~70–90%	Reclaimed anytime (~2 min notice); only for interruptible/retryable work

# Node purchasing: commit-discount baseline, Spot for elastic, On-demand backstop
if workload.interruptible and spot_available:
    launch(SPOT)            # 70-90% off; on reclaim notice -> drain gracefully
elif load <= committed_baseline:
    use(RESERVED)           # baseline locked, 40-60% off
else:
    launch(ON_DEMAND)       # peak backstop, priciest but never reclaimed
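How large the committed baseline should be is an arithmetic question. The sketch below prices a daily load profile under different commitment sizes, assuming a flat 50% commitment discount, a normalized On-demand rate of 1.0 per node-hour, and no Spot. All of these numbers are illustrative.

```python
def daily_cost(hourly_nodes: list, committed: int,
               on_demand_rate: float = 1.0,
               commit_discount: float = 0.5) -> float:
    """Cost of one day: committed nodes are paid every hour, used or not."""
    committed_rate = on_demand_rate * (1.0 - commit_discount)
    cost = 0.0
    for nodes in hourly_nodes:
        cost += committed * committed_rate                   # sunk commitment
        cost += max(0, nodes - committed) * on_demand_rate   # overflow
    return cost


if __name__ == "__main__":
    # 4:1 peak-to-trough: 18 hours at 10 nodes, 6 hours at 40 nodes.
    profile = [10] * 18 + [40] * 6
    for committed in (0, 10, 40):
        print(committed, daily_cost(profile, committed))
```

With this profile, all On-demand costs 420, committing to the trough (10 nodes) costs 300, and committing to the peak (40 nodes) costs 480. Reserving for the peak ends up more expensive than reserving nothing, which is why the commitment covers only the stable baseline.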

Real-world:

AWS Karpenter: consolidation continuously repacks load onto cheaper instances (including Spot-to-Spot) and retires empty nodes; AWS advertises Spot discounts of up to 90% off On-demand.
Google Autopilot: a triple closed-loop controller auto-tunes per-task CPU/memory, eliminating the "over-request to avoid OOM" waste.
Dropbox: Magic Pocket moved ~90% of data off S3 to its own data centers; per its 2018 S-1 it saved ~$74.6M over two years — at sufficient scale, the build-vs-cloud efficiency math flips.

Scaling & Evolution

Reactive → Predictive: with daily/weekly periodicity, add prediction to absorb cold-start lag ahead of time.
Raise Spot share: move stateless/retryable services to Spot; use multiple instance types + multiple AZs to diversify reclaim risk and avoid draining one capacity pool.
Cost anomaly detection: alert on cost/req so a unit-cost jump after a deploy is caught (often memory leaks, N+1 queries, a forgotten debug log).
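A cost/req alert can start as a simple statistical check before anything more elaborate. The sketch below flags a reading only when it is both statistically unusual and materially higher than the recent baseline; the thresholds (3 standard deviations, 10% relative jump) are arbitrary starting points, not recommended values.

```python
from statistics import mean, stdev


def cost_per_req_anomaly(history: list, latest: float,
                         z_threshold: float = 3.0,
                         min_rel_jump: float = 0.10) -> bool:
    """True if `latest` is far above the recent cost/req baseline.

    history: recent cost-per-request samples (at least two), e.g. daily values.
    """
    baseline = mean(history)
    spread = stdev(history)
    rel_jump = (latest - baseline) / baseline
    if spread == 0:
        z = float("inf") if latest > baseline else 0.0
    else:
        z = (latest - baseline) / spread
    # Require both conditions so tiny-but-significant wobbles do not page anyone.
    return z >= z_threshold and rel_jump >= min_rel_jump


if __name__ == "__main__":
    week = [0.050, 0.051, 0.049, 0.050, 0.052, 0.050, 0.049]  # $ per 1k req
    print(cost_per_req_anomaly(week, 0.060))  # jump after a deploy
    print(cost_per_req_anomaly(week, 0.051))  # normal noise
```

Running the check per service and comparing the window before and after each deploy narrows the jump down to a specific change.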
Scale-to-zero: low-frequency/internal services scale to 0 via KEDA, woken by events.
Carbon-aware: schedule deferrable batch to cleaner/cheaper grid windows (even cross-region) — saving money and carbon.

Common Pitfalls + Interview Questions

Pitfall 1: pushing utilization toward 100%. This ignores the queueing knee: p99 degrades non-linearly past ~80%, and SLO penalties claw back the savings on machines.

Pitfall 2: too-aggressive scale-in. Capacity shrinks, then a burst hits, and the service flaps repeatedly, which is worse than not shrinking at all; the rule is scale out fast, scale in slow.

Pitfall 3: watching total cost, not unit cost. If the business doubles and cost doubles with it, that is healthy; track the cost/req trend, not the absolute bill.

Pitfall 4: using Spot like On-demand. Put stateful/non-interruptible services on Spot and a mass reclaim drops data/capacity.

Pitfall 5: incomplete tags / over-buying reservations. 20–30% of cost in the unallocated bucket breaks attribution; a 3-yr RI can cost more than On-demand once architecture shifts.

Likely interview follow-ups:

Average CPU is 15% and the boss wants to cut machines in half — how do you decide how much you can cut? (headroom, peak factor, queueing knee, N+1)
When does reactive vs predictive autoscaling fit? How do you combine them? What about cold-start lag?
How do you attribute a $2M bill to 500 teams? How do you fairly allocate shared cost (LBs, control plane, Spot discounts)?
What workloads suit Spot? How do you handle reclaim gracefully and bound reclaim risk?
Cost per request suddenly jumps 20%: how do you localize which service and why?

Deeper Resources

"Designing Data-Intensive Applications" (Kleppmann): Ch. 1 on load parameters and scalability fundamentals.
"Autopilot: Workload Autoscaling at Google" (EuroSys 2020): vertical autoscaling and the root cause of over-requesting.
Netflix TechBlog "Scryer: Netflix's Predictive Auto Scaling Engine": predictive scaling design.
Spotify / Backstage "Cost Insights": shifting cost left and team-level attribution.
AWS "Optimizing Kubernetes compute costs with Karpenter consolidation": bin-packing and Spot.
FinOps Foundation: a systematized framework for showback/chargeback and unit economics.

Going Deeper

1. Pushing utilization from 70% to 90% saves machines — why might p99 make you lose more money?

It's the 1/(1-ρ) denominator. At 70% the queue factor ∝ 1/0.3 ≈ 3.3; at 90% ∝ 1/0.1 = 10 — wait time roughly triples, and it steepens the closer you get to 1. The box savings are linear (~22% fewer nodes), but p99 degradation is non-linear and easily breaks the SLO. The real math: saved instance cost vs (SLO penalties + client retries amplified by timeouts + those retries raising ρ further into a positive feedback loop). Headroom's value is absorbing bursts and buffering the queue; cutting it as "waste" trades tail-latency stability for an accounting number. Bottom line: derive the utilization target from the latency curve under the SLO, not from "higher is better."

2. A predictive autoscaler under-predicts a spike — what happens, and how do you make it "safe when wrong"?

Under-provisioning means scale-out can't catch up and the SLO breaks during the lag — worse than pure reactive, because you "thought you didn't need to scale" and lowered the baseline. Fail-safe design: ① asymmetric penalty — over-predict only costs a bit, under-predict violates the SLO, so build in safety margin and bias toward slight over-estimation. ② Layered backstop — predictive sets only the capacity floor (raise the baseline ahead), reactive stays on top to catch spikes the forecast missed. ③ Confidence intervals — emit uncertainty; volatile windows auto-add headroom. ④ Fallback — on anomalous signals (missing data, model drift) degrade to pure reactive; never let one bad forecast pin capacity to the floor. The essence: treat prediction as the optimizer and "reactive + headroom" as the safety net.
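The layered design can be sketched in a few lines: the forecast only ever raises the floor, the reactive recommendation can always push above it, and an unhealthy forecast is ignored. The parameter names, the 2-sigma margin, and the health flag are all assumptions made for the example.

```python
import math


def target_replicas(forecast_qps: float, forecast_stderr_qps: float,
                    reactive_replicas: int, per_replica_qps: float,
                    forecast_healthy: bool, k_sigma: float = 2.0,
                    min_replicas: int = 2) -> int:
    """Combine a predictive floor with a reactive backstop."""
    if not forecast_healthy:
        # Fallback: missing data or model drift, so degrade to pure reactive.
        return max(reactive_replicas, min_replicas)
    # Asymmetric penalty: size for the upper end of the forecast interval.
    upper_qps = forecast_qps + k_sigma * forecast_stderr_qps
    floor = math.ceil(upper_qps / per_replica_qps)
    # The forecast sets only a floor; reactive may always exceed it.
    return max(floor, reactive_replicas, min_replicas)


if __name__ == "__main__":
    print(target_replicas(10_000, 1_000, reactive_replicas=30,
                          per_replica_qps=260, forecast_healthy=True))
    print(target_replicas(10_000, 1_000, reactive_replicas=60,
                          per_replica_qps=260, forecast_healthy=True))
```

Because the result is a maximum over the inputs, an under-prediction can never pull capacity below what the reactive loop is asking for.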

3. Very high utilization and a very high Spot share look thrifty. When does "extreme efficiency" bite back? (echoes Day 23 Reliability)

Efficiency and resilience are opposed. Squeezing out headroom and maxing Spot systematically removes every buffer: ① zero fault-absorption — a single AZ failure hits already-full nodes; shifting that traffic triggers a cascade (no N+1 to catch it). ② Spot is a shared pool — AWS reclaims tend to be batched by instance type/AZ, you compete with everyone for the same pool, and a mass reclaim collapses capacity instantly. ③ Autoscale lag is deadlier at high utilization — no spare to cover the boot-up minutes. So "extreme efficiency" suits interruptible batch, never the SLO-bearing online path. The right posture is tiering: core online services keep headroom + On-demand/RI floor, while elastic and offline workloads ride Spot and high utilization. Save where you can tolerate volatility.

4. Cost per request is a good metric. When does it lie to you?

It's a ratio, and both ends can be gamed. ① Denominator dilution — pump in cheap requests (health checks, polls, retries) and cost/req drops on paper while total cost holds or rises. ② Non-homogeneous requests — a search vs a static page differ 100x; the average hides "expensive requests," so look at the P50/P99 per-request cost distribution, not one mean. ③ Fixed cost dominates at low traffic — control plane and minimum replicas are fixed, so cost/req spikes in the trough without any inefficiency. ④ Cross-service shifting — A offloads compute to B; A's cost/req improves, the global total doesn't move. ⑤ Use marginal cost — "is this extra traffic worth taking" needs marginal, not amortized average, cost. In short: cost/req is great for trends and alerts, but decisions need to split by request type, separate fixed/marginal, and beware the denominator game.
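Two of these effects are easy to show with arithmetic. The sketch below uses a toy cost model (a fixed hourly cost plus a constant marginal cost per request) with invented numbers to show the trough spike and the denominator dilution.

```python
def cost_per_request(fixed_cost: float, marginal_cost_per_req: float,
                     requests: int) -> float:
    """Amortized cost/req under a fixed-plus-marginal cost model."""
    return fixed_cost / requests + marginal_cost_per_req


def diluted_cost_per_request(total_cost: float, real_requests: int,
                             cheap_requests: int) -> float:
    """Reported cost/req when cheap traffic is counted in the denominator."""
    return total_cost / (real_requests + cheap_requests)


if __name__ == "__main__":
    fixed, marginal = 100.0, 0.00002  # $ per hour, $ per request
    print(cost_per_request(fixed, marginal, 4_000_000))  # peak hour
    print(cost_per_request(fixed, marginal, 1_000_000))  # trough hour
    # Same bill, same real work, but health checks inflate the request count.
    print(diluted_cost_per_request(180.0, 4_000_000, 0))
    print(diluted_cost_per_request(180.0, 4_000_000, 2_000_000))
```

In the trough the amortized figure more than doubles even though the marginal cost, the number that answers "is this extra traffic worth taking", has not changed. In the dilution case the reported figure falls by a third while the bill stays the same.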

5. A team dumps all its work onto Spot and nightly batch to cut its own bill. Locally optimal — good or bad for the whole platform?

A classic tragedy of the commons + second-order effect. For the team alone: Spot + off-peak does lower the bill and the showback looks great. At platform scale: ① Spot is a shared capacity pool — everyone is incentivized to grab it, the night window quickly becomes a "peak" too, reclaim rates rise, available capacity falls, and nobody saves as expected. ② Nightly batch piles up, filling the original trough, erasing the peak-shaving dividend and possibly squeezing backup/maintenance windows. ③ It's an incentive-design fault — if the allocation key gives Spot discounts only to the user without reflecting the externality their crowding imposes, it rewards over-concentration. The fix is to bake a global view into attribution and scheduling: differentiated price signals for off-peak (busier windows cost more), platform-level Spot diversification and time-spreading of workloads, so local and global optima align. This is exactly how the choice of allocation key shapes behavior.