Day 22 · Medium · Deployment · Canary · Feature Flags · DB Migration

Deployment & Release: Ship 100s/day, auto-roll-back bad versions in 5 min
Blue-Green, Canary, Feature Flags & Schema Migration

Problem & Constraints

Design a release platform for 500+ microservices, thousands of deploys per day. The goal isn't "be able to ship" — it's to minimize the blast radius of a bad version under high deploy frequency: when a buggy build reaches production, it must be detected and rolled back before users notice, not after an alert wakes the on-call.

The cautionary tale is Knight Capital, 2012: a deploy pushed new code to only 7 of 8 servers. The new code repurposed a retired feature flag, and on the 8th server, still running the old build, that flag reactivated dead code. The firm lost ~$440M in 45 minutes and did not survive as an independent company (SEC 8-K). The release system isn't a tail of CI; it's the core of reliability engineering.

Deploy frequency: thousands/day (Amazon is per-second scale). Human review of every build is infeasible → automated gates are a precondition.
Zero downtime: old and new coexist during rollout; requests must be seamless, connections unbroken.
Rollback SLO: from "anomaly detected" to "traffic shifted back" < 5 minutes (the main component of MTTR).
State constraint: stateless services are easy; stateful services with schema changes are the hard part — databases don't roll back.
Decoupling: deploy (put code out there) must be separate from release (let users see the feature), or you can't do gradual rollout.

High-Level Architecture

The core is a pipeline with automated canary analysis (ACA): a new version first takes a sliver of traffic, its golden signals (error rate, latency, saturation) are statistically compared against a baseline, significant regression auto-rolls-back, and only a passing canary ramps up. A feature-flag platform sits orthogonally, letting "release" be a config toggle without redeploying.

graph LR
    DEV["commit / CI<br/>build + unit test"]
    ART["image registry<br/>immutable artifact"]
    CD["release orchestrator<br/>Spinnaker / Argo"]
    LB["traffic routing<br/>service mesh / LB"]
    BASE["Baseline v1<br/>old version"]
    CAN["Canary v2<br/>new version · 5%"]
    ACA{"canary analysis<br/>statistical compare"}
    OBS[("Metrics / Traces")]

    DEV --> ART --> CD --> LB
    LB -->|95%| BASE
    LB -->|5%| CAN
    BASE --> OBS
    CAN --> OBS
    OBS --> ACA
    ACA -->|pass| CD
    ACA -.->|regress→rollback| LB

    classDef ci fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef route fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef ver fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef judge fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class DEV,ART,CD ci
    class LB,OBS route
    class BASE,CAN ver
    class ACA judge

The canary takes a sliver of traffic; analysis decides pass/fail automatically: promote on pass, roll back on regression. Clear verdicts need no human in the loop; only marginal results escalate to a person.

Key Techniques

1. Deployment Strategy: Rolling vs Blue-Green vs Canary

Trade-off in one line: spend more resources + go slower in exchange for a smaller blast radius + faster rollback.

Principle: all three answer the same question: how do old and new versions coexist and switch? Rolling replaces instances batch by batch; it needs only 1× resources, but rollback means rolling again, slowly. Blue-Green runs two full environments; once green is ready, traffic cuts over at once, and rollback is just routing back to blue, which takes seconds, at the cost of 2× resources. Canary sends only 1%–5% of traffic to the new version, observes signals, then ramps or rolls back; it has the smallest blast radius but is the most complex and the slowest to fully roll out.

Criterion	Rolling	Blue-Green	Canary
Resource cost	1× (+1 batch)	2×	1× + a bit
Rollback speed	slow (roll again)	seconds (swap route)	fast (pull canary)
Blast radius	medium (grows per batch)	large (all at once)	tiny (1–5%)
Old/new coexist	yes (needs compat)	brief	yes (needs compat)
Best for	default, stateless	need fast rollback, can afford 2×	high-risk change, strong observability

Key trap: Both Rolling and Canary keep old and new versions live at the same time, so the two versions must be mutually compatible (forward and backward) — this shifts difficulty from "deployment" to "compatibility design" (see techniques 3 & 4). Blue-Green seems to avoid coexistence, but when both share one database, the schema must still be compatible with both blue and green.
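A minimal sketch of what mutual compatibility means at the message level, assuming JSON payloads and hypothetical field names (`order_id`, `amount`, `currency`) that are not from any real service. The old reader ignores fields it does not know, and the new reader supplies a default for fields an old writer never sent.

```python
import json

def read_order_v1(payload: str) -> dict:
    """Old reader: picks only the fields it knows and ignores anything newer."""
    data = json.loads(payload)
    return {"order_id": data["order_id"], "amount": data["amount"]}

def read_order_v2(payload: str) -> dict:
    """New reader: tolerates payloads from old writers by defaulting new fields."""
    data = json.loads(payload)
    return {
        "order_id": data["order_id"],
        "amount": data["amount"],
        "currency": data.get("currency", "USD"),  # assumed default for old data
    }

old_payload = json.dumps({"order_id": 1, "amount": 500})
new_payload = json.dumps({"order_id": 2, "amount": 700, "currency": "EUR"})

# Forward compatibility: old code reads data written by new code.
print(read_order_v1(new_payload))
# Backward compatibility: new code reads data written by old code.
print(read_order_v2(old_payload))
```

If either reader rejected unknown or missing fields, the coexistence window of a rolling or canary deploy would surface as deserialization errors.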

Real-world:

Netflix: orchestrates with Spinnaker — red/black (a blue-green variant) + canary, across AWS multi-region.
Kubernetes: native Deployment is a rolling update (maxSurge/maxUnavailable control batching); blue-green/canary via Argo Rollouts, Flagger.
Google: canary + progressive ramp is standard SRE practice; the Site Reliability Engineering book has a chapter on release engineering.
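To make the batching knobs concrete, here is a rough sketch of the capacity bounds implied by percentage values of maxSurge and maxUnavailable. It assumes the rounding Kubernetes documents for percentages (maxSurge rounds up, maxUnavailable rounds down); the helper name `rolling_bounds` is made up for illustration.

```python
import math

def rolling_bounds(replicas: int, max_surge_pct: int, max_unavailable_pct: int):
    """Return (min_available, max_total) pod counts during a rolling update."""
    surge = math.ceil(replicas * max_surge_pct / 100)                # rounds up
    unavailable = math.floor(replicas * max_unavailable_pct / 100)   # rounds down
    return replicas - unavailable, replicas + surge

# 10 replicas with 25% / 25%: at least 8 pods serving, at most 13 pods total.
print(rolling_bounds(10, 25, 25))
```

The second number is the "1× (+1 batch)" resource cost from the table: the surge is the only extra capacity a rolling update needs.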

2. Automated Canary Analysis: Let Statistics Make the Release Decision

Trade-off in one line: spend statistical rigor + engineering complexity so no human has to eyeball dashboards.

Principle: a canary's value is in comparison, not in viewing the new version's metrics in isolation. The right approach runs a baseline simultaneously — the same old code, taking the same small traffic — to cancel environmental noise like "happened to hit a traffic spike." Then you run a statistical hypothesis test on error rate, latency percentiles, CPU, etc., to judge whether the difference is significant. Netflix Kayenta uses the Mann-Whitney U test (non-parametric, no normality assumption) to score each metric, aggregating into a "pass / marginal / fail" verdict.

# Core canary judgement logic (sketch; assumes SciPy and that a higher value is worse)
from scipy.stats import mannwhitneyu

ALPHA = 0.05  # significance level (an assumed value; tune per service)

def judge_canary(canary, baseline, metrics):
    """canary / baseline map a metric name to samples from the same time window."""
    failed = 0
    for m in metrics:                       # error rate, p99, CPU...
        # one-sided non-parametric test: is canary significantly worse than baseline?
        _, p = mannwhitneyu(canary[m], baseline[m], alternative="greater")
        if p < ALPHA:
            failed += 1
    fail_ratio = failed / len(metrics)
    if fail_ratio > 0.5:
        return "ROLLBACK"                   # auto rollback
    if fail_ratio > 0:
        return "MANUAL"                     # human in the loop
    return "PROMOTE"                        # auto ramp

Trade-off:

No baseline, absolute thresholds only: ✅ simple; ❌ thresholds need hand-tuning, false alarms (a traffic spike pushes latency up and triggers rollback).
Canary vs baseline comparison: ✅ cancels environmental noise, self-adapting thresholds; ❌ extra baseline resources, needs enough samples (low-traffic services aren't statistically significant).
Canary at only 5% traffic: the observation window must be long enough to accumulate samples — in tension with "ramp fast."
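The tension between sample size and ramp speed is simple arithmetic. The sketch below estimates how long a canary must run to collect a target number of requests; the target of 1000 samples is an arbitrary assumption for illustration, not a statistical recommendation.

```python
def window_minutes(total_rps: float, canary_fraction: float, samples_needed: int) -> float:
    """Minutes needed for the canary to see `samples_needed` requests."""
    canary_rps = total_rps * canary_fraction
    return samples_needed / canary_rps / 60

# Same 5% canary and same sample target at three traffic levels.
for rps in (1000, 10, 0.5):
    print(f"{rps:>6} rps -> {window_minutes(rps, 0.05, 1000):.1f} min")
```

A busy service clears the bar in well under a minute, while a service at 10 rps needs roughly half an hour per ramp step, which is why low-traffic services need a larger canary fraction or a longer window.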

Real-world:

Netflix + Google: open-sourced Kayenta, using the Mann-Whitney U test; at launch it already ran ~30% of Netflix's canary judgements, ~200/day.
Argo Rollouts / Flagger: on K8s, wire to Prometheus for metric analysis, auto-promote or roll back by threshold/comparison.

3. Deploy ≠ Release: Decouple with Feature Flags

Trade-off in one line: take on flag config debt + branching complexity to gain gradual rollout, instant kill switch, and release without redeploy.

Principle: deploying code to production isn't exposing the feature to users. Wrap new functionality in a feature flag, off by default at deploy (dark launch), then turn it on dynamically by user/percentage/region via a config service. This buys three things: ① gradual rollout (1% → 50% → 100%, kill instantly on trouble, no redeploy); ② decoupling release timing from deploy timing (merge to main anytime, flip on at marketing's schedule); ③ A/B experiments. Knight Capital's disaster stemmed exactly from flag governance failure — it reused an old flag that should have been deleted.

# Feature flag call (pseudo-code; `flags` is a hypothetical flag-service client)
def checkout(ctx):
    if flags.enabled("new_checkout_v2", user=ctx.user,
                     rollout_pct=5, allow=["beta_team"]):
        return checkout_v2(ctx)     # new path: only 5% + beta group
    return checkout_v1(ctx)         # old path: fallback
# kill switch: set rollout_pct to 0 for a full shutoff with no deploy

Trade-off:

No flags (control via deploy/rollback): ✅ clean code; ❌ every rollout needs a redeploy, slow to kill, can't segment finely.
With flags: ✅ kill switch that acts in seconds, fine-grained rollout, experiments; ❌ flag debt: uncleaned old flags are time bombs, if/else branch combinations explode, test coverage suffers.
Discipline: temporary flags must have a "sunset date"; after 100% ramp, go back and delete the code (many teams trip here).
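One common way to implement percentage rollout is deterministic hash bucketing, sketched here under assumptions: the function names, the SHA-256 choice, and the rule that `rollout_pct=0` overrides the allow list are all illustrative, not the behavior of any particular flag product.

```python
import hashlib

def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def enabled(flag: str, user_id: str, rollout_pct: int, allow=()) -> bool:
    if rollout_pct <= 0:          # kill switch: 0 means off for everyone
        return False
    if user_id in allow:          # allow-listed users always get the feature
        return True
    return bucket(flag, user_id) < rollout_pct

users = [f"user-{i}" for i in range(10_000)]
at_5 = {u for u in users if enabled("new_checkout_v2", u, 5)}
at_50 = {u for u in users if enabled("new_checkout_v2", u, 50)}

print(f"share at 5%: {len(at_5) / len(users):.3f}")
print("ramp keeps earlier users:", at_5 <= at_50)
```

Hashing the flag name together with the user ID keeps each user's experience stable across requests, and ramping from 5% to 50% only adds users rather than reshuffling them.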

Real-world:

Flickr: the 2009 classic "Flipping Out" established the feature-flag + dark-launch paradigm.
Facebook: Gatekeeper controls feature exposure by region/cohort, fully decoupling deploy from release.
LaunchDarkly: turned feature flags into a standalone SaaS category, the face of progressive delivery.

4. Backward Compatibility & DB Migration: The One Thing That Can't Roll Back

Trade-off in one line: spend multi-step effort + temporary dual-writes for zero-downtime, rollback-at-any-moment schema change.

Principle: code rolls back, data does not. Once you drop a column or change a type, old code that reads it crashes. While old and new versions coexist, the schema must work for both. The fix is Expand-Contract (a.k.a. Parallel Change), which splits a breaking change into three phases. Expand only adds (a new column or table, leaving the old one alone). Migrate dual-writes, backfills history, and switches reads to the new structure. Contract drops the old column only after all old versions are gone. Every step is backward-compatible, so rollback just means stopping at the current step. The API layer follows the same idea: Stripe's date-based versioning pins each account to the API version of its first call; internally the code implements only the latest behavior, and a response compatibility layer transforms results back to the pinned version's format, so the API has not truly broken since 2011.

graph LR
    E["① Expand<br/>add new_col<br/>don't drop old"]
    M["② Migrate<br/>dual-write old+new<br/>backfill + read new"]
    C["③ Contract<br/>drop old_col<br/>old versions gone"]
    E --> M --> C
    E -.rollback ok.-> X1["old code still works"]
    M -.rollback ok.-> X2["old column still there"]
    classDef step fill:#1a2530,stroke:#5eead4,color:#e8eef5
    classDef safe fill:#0e2030,stroke:#64c8ff,color:#7a8590
    class E,M,C step
    class X1,X2 safe

Trade-off:

Direct ALTER (one shot): ✅ simple; ❌ old code crashes during coexistence, big-table ALTER may lock, no rollback.
Three-phase Expand-Contract: ✅ compatible throughout, stoppable anytime; ❌ more steps, maintain dual-writes during Migrate, forgetting Contract leaves tech debt.
Discipline: Expand and Contract must be two separate releases, confirming all old instances are gone in between — otherwise you get Knight-Capital-style version mismatch.
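The first two phases can be sketched against an in-memory SQLite database. The table and the rename from `name` to `full_name` are hypothetical; the point is only that the old version's query keeps working after Expand and during Migrate.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

def old_read(conn):
    """Old version: only knows the `name` column."""
    return [row[0] for row in conn.execute("SELECT name FROM users ORDER BY id")]

def new_write(conn, full_name):
    """New version during Migrate: dual-writes the old and new columns."""
    conn.execute("INSERT INTO users (name, full_name) VALUES (?, ?)",
                 (full_name, full_name))

def new_read(conn):
    """New version after the read switch: only reads `full_name`."""
    return [row[0] for row in conn.execute("SELECT full_name FROM users ORDER BY id")]

# Expand: add only. Old code is unaffected.
db.execute("ALTER TABLE users ADD COLUMN full_name TEXT")

# Migrate: dual-write new rows, backfill history, then switch reads.
new_write(db, "Grace Hopper")
db.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")

print(old_read(db))   # old instances still see every row
print(new_read(db))   # new instances see the same rows via the new column

# Contract (a separate, later release): drop `name` once no old instance remains.
```

Rolling back at any point before Contract is safe because `name` is still populated for every row, including the ones written by the new version.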

Real-world:

Stripe: date-based rolling versions + response compatibility layer, 13 years with zero breaking changes — the industry benchmark for API versioning.
Martin Fowler: formally named the pattern Parallel Change (expand / migrate / contract).
GitHub / Shopify: big-table migrations use online schema-change tools (gh-ost / lhm) with expand-contract to avoid lock-induced downtime.
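The response compatibility layer can be pictured as a chain of downgrade steps applied from the latest version back to the caller's pinned version. This is a loose sketch of that idea, not Stripe's implementation; the version dates, field names, and functions are invented for illustration.

```python
def downgrade_2024_06_01(resp: dict) -> dict:
    """2024-06-01 split `name` into two fields; older clients expect `name`."""
    resp = dict(resp)
    resp["name"] = f"{resp.pop('first_name')} {resp.pop('last_name')}"
    return resp

def downgrade_2023_01_15(resp: dict) -> dict:
    """2023-01-15 renamed `amount` to `amount_cents`; older clients expect `amount`."""
    resp = dict(resp)
    resp["amount"] = resp.pop("amount_cents")
    return resp

# Oldest first. Each entry undoes the change that version introduced.
VERSIONS = [
    ("2022-03-01", None),
    ("2023-01-15", downgrade_2023_01_15),
    ("2024-06-01", downgrade_2024_06_01),
]

def render(latest_response: dict, pinned_version: str) -> dict:
    """Walk back from the newest version until reaching the pinned one.

    Assumes `pinned_version` is one of the dates in VERSIONS; ISO dates
    compare correctly as plain strings.
    """
    resp = latest_response
    for version, downgrade in reversed(VERSIONS):
        if version <= pinned_version:
            break
        resp = downgrade(resp)
    return resp

latest = {"first_name": "Ada", "last_name": "Lovelace", "amount_cents": 500}
print(render(latest, "2024-06-01"))
print(render(latest, "2023-01-15"))
print(render(latest, "2022-03-01"))
```

The core logic only ever produces the latest shape; each new breaking change adds one small downgrade function instead of another branch in the business code.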

Scaling & Optimization

Automated progressive rollout: after the canary passes, ramp by 5%→25%→50%→100%, running analysis at each step, fully automatic (Argo Rollouts steps).
Staggered multi-region: ship one small region first, bake for 24h (catch bugs that only surface in certain timezones/traffic patterns), then go global.
Rollback vs roll-forward: with schema changes, "roll forward to fix" is often safer than rollback — data is already written in the new shape. This demands an extremely fast hotfix pipeline.
Flag governance: build a flag lifecycle board, auto-remind to clean up stale flags, treat "flag debt" as managed tech debt.
Deploy as code: version the release process (pipeline as code), audit every change, make it replayable.
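The progressive rollout item above boils down to a small control loop. The sketch below uses the 5% → 25% → 50% → 100% steps from the text; `analyze` and `set_traffic` are hypothetical callbacks standing in for the canary analysis and the traffic router, not the API of Argo Rollouts or any other tool.

```python
from typing import Callable

STEPS = [5, 25, 50, 100]  # traffic percentages per ramp step

def progressive_rollout(analyze: Callable[[int], bool],
                        set_traffic: Callable[[int], None]) -> str:
    """Ramp step by step; roll back to 0% on the first failed analysis."""
    for pct in STEPS:
        set_traffic(pct)
        if not analyze(pct):
            set_traffic(0)            # shift all traffic back to the old version
            return f"rolled back at {pct}%"
    return "promoted"

# Simulate a regression that only shows up once the canary reaches 50%.
history = []
result = progressive_rollout(analyze=lambda pct: pct < 50,
                             set_traffic=history.append)
print(result, history)
```

Running analysis at every step matters because some regressions (lock contention, cache pressure) only appear once the new version carries a meaningful share of load.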

Common Pitfalls + Interview Probes

1. Treating "can deploy" as "can release": without decoupling deploy from release, gradual rollout means repeated redeploys. Interviewers expect you to raise feature flags proactively.

2. Forgetting old/new must be bidirectionally compatible: during rolling/canary, old and new coexist — data written by the new version must be readable by the old, and vice versa. The classic trip-up is the ordering mistake of "drop the API field before the client ships."

3. Treating DB migration like code deployment: assuming a deploy rollback fixes everything, while the schema is already changed and old code crashes reading it. Always expand-contract, and ship Expand/Contract as two releases.

4. Canary with no baseline: judging on the new version's absolute metrics alone, misled by traffic fluctuation. The right way is canary vs same-traffic baseline statistical comparison.

5. Not cleaning up flags: Knight Capital reused a retired flag, which triggered dead code on a stale server and cost ~$440M in 45 minutes. Temporary flags must have an expiry date and a cleanup step.
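Flag expiry is easy to enforce mechanically, for example as a CI check. A minimal sketch, assuming a hypothetical flag registry in which every temporary flag records an owner and a sunset date:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Flag:
    name: str
    owner: str
    sunset: date  # after this date the flag and its dead branch must be removed

def stale_flags(flags: list[Flag], today: date) -> list[str]:
    """Return the names of flags whose sunset date has passed."""
    return [f.name for f in flags if today > f.sunset]

registry = [
    Flag("new_checkout_v2", "payments", date(2024, 3, 1)),
    Flag("dark_mode", "web", date(2099, 1, 1)),
]

for name in stale_flags(registry, today=date(2024, 6, 1)):
    print(f"stale flag, remove it or extend its sunset date: {name}")
```

Failing the build on a non-empty result turns flag cleanup from a good intention into a gate.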

Deep Resources

Designing Data-Intensive Applications, Ch 4 (Kleppmann): evolutionary schema, forward/backward compatibility from first principles.
Netflix TechBlog: Automated Canary Analysis with Kayenta — ACA's statistical method and engineering.
Stripe Blog: APIs as infrastructure: future-proofing Stripe with versioning.
Martin Fowler bliki: ParallelChange, BlueGreenDeployment, CanaryRelease, FeatureToggle.
Flickr code blog: Flipping Out — the origin of the feature-flag + dark-launch paradigm.

Deep Thinking

1. Blue-green claims rollback in seconds, but if both environments share one database, when does that promise break?

Blue-green's instant rollback only holds for stateless versions with no schema change. Once the green version runs a breaking schema change (drop column, change type) at launch, switching back to blue means blue's code reads the already-changed database and crashes anyway — the database is shared by blue and green; it didn't roll back.

More insidious is data contamination: in the few minutes green ran, it wrote data only legal in the new format (new enum values, new JSON shape). Back on blue, blue reads this "future data" and fails to deserialize.

The right move: walk the schema through expand-contract to a "compatible with both blue and green" intermediate state first, then do the blue-green switch. In other words, data migration must lead code release in tempo; blue-green only solves rollback for the stateless part. This is also why "roll forward is often safer than rollback when schema changes are involved."

2. A low-QPS internal service (tens of requests/minute) wants automated canary analysis. What fundamental obstacle does it hit, and what do you do?

Obstacle: insufficient sample size makes statistical testing meaningless. The canary only takes 5% traffic; QPS is already low, so 5% might be a handful of requests per minute. Tests like Mann-Whitney U on small samples either never reach significance (letting real bugs through) or let a single outlier swing the verdict (false alarms). Statistics needs enough n.

What to do: ① raise the canary fraction (give low-traffic services 50% — the absolute volume is small so blast radius stays controlled); ② lengthen the window (look at hours of accumulation, not per-minute); ③ downgrade to blue-green + threshold alerts, drop statistical comparison, rely on a simple error-rate threshold + human confirmation; ④ synthetic traffic, use replay/load tests to feed the canary controlled load for samples. The essence: ACA is designed for high traffic; low-traffic services should switch to plainer strategies rather than do statistics for statistics' sake.

3. Expand-Contract requires "ship Expand and Contract as two releases." If a team cuts corners and ships it as one, what's the worst that happens?

Shipping as one means "add new column + dual-write + drop old column" all in the same version. The problem is the coexistence window of rolling deployment: while the new version is replacing the old batch by batch, the cluster simultaneously has new instances that "believe the old column is gone" and old instances "still reading/writing the old column."

The new instances have already DROPped the old column → the old instances' queries immediately fail with "column not found". This is the same structural failure as Knight Capital: different instances disagreeing on the contract at the same moment.

Even without rolling — cutting over with blue-green in one shot — rollback can't go back: the old column is physically deleted. The essence of two releases is to insert a sync point between them that "confirms all old instances are gone," reducing the distributed problem of "version coexistence" into two serial, individually-compatible steps. The step you save is safety itself.

4. Feature flags give an instant kill switch — can they replace canary and rollback? What's the relationship among the three?

They can't replace each other; the three act at different layers and compose:

Canary answers "should this binary go to production" — a risk assessment of the whole deployment unit, granularity "version."
Feature flag answers "code already in production — should users see this feature" — granularity "feature / cohort," and toggleable without redeploy.
Rollback is the backstop — when the first two failed to stop it and the bad version is already causing impact, swap the binary back.

The typical pipeline stacks all three: new feature hidden behind a flag (off) ships with the version → canary analysis confirms the version itself is healthy → full deploy → then independently ramp the feature with the flag, killing it instantly on trouble. Problems flags can't turn off (framework upgrades, dependency changes, memory leaks) are what canary interception and rollback are for. Treating the flag as an omni-switch and skipping canary means giving up protection against "version-level regression."

5. Estimate: a version has 0.1% chance of introducing a crippling bug. With 1000 deploys/day shipped full-blast, vs "canary catches 95%," how many production incidents per year differ?

Full-blast: each deploy has 0.1% crippling chance, 1000/day → expected 1000 × 0.001 = 1 crippling incident/day → ~365/year. The system is basically unusable.

Canary catches 95%: 95% of bad versions are auto-rolled-back during the canary stage, having touched only ~5% of traffic for a short window. The remaining 5% slip through to full rollout → true full-blast incidents drop to 365 × 5% ≈ 18/year. The caught versions still cost something, but each one's user impact is shrunk by the canary traffic fraction (impact = probability × blast radius, both factors suppressed).
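The estimate can be checked in a few lines. The inputs are the question's own numbers; the "exposure-weighted" figure is a crude extra assumption that counts a caught canary as 5% of a full incident and ignores how short the canary window is.

```python
deploys_per_day = 1000
p_bad = 0.001            # chance a deploy carries a crippling bug
catch_rate = 0.95        # share of bad versions the canary stops
canary_fraction = 0.05   # traffic exposed while the canary runs

bad_per_year = deploys_per_day * p_bad * 365
full_blast = bad_per_year                          # no canary: every bad deploy hits everyone
slipped = bad_per_year * (1 - catch_rate)          # bad deploys that reach 100% anyway
caught = bad_per_year * catch_rate                 # bad deploys stopped at the canary

# Crude exposure weighting: a caught canary counts as `canary_fraction` of an incident.
exposure_weighted = slipped + caught * canary_fraction

print(f"without canary: {full_blast:.0f} full incidents/year")
print(f"with canary:    {slipped:.1f} full incidents/year")
print(f"exposure-weighted total: {exposure_weighted:.1f} incident-equivalents/year")
```

Even when the partial exposure of caught canaries is counted, the total stays roughly an order of magnitude below the full-blast case.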

Second-order insight: reliability comes not from "making bugs not happen" (the 0.1% base rate is hard to lower further) but from lowering the cost of each failure. High deploy frequency is actually safer — the more frequent the deploys, the smaller each change, the easier the canary identifies "which change broke it" from the signals. That's the math behind the counterintuitive "deploy more often to be safer": high frequency + small batches + automatic interception yields far lower overall risk than "batch up a big load and cautiously ship it all at once."
