Day 22 · Medium · Deployment · Canary · Feature Flags · DB Migration

Deployment & Release: Ship 100s/day, auto-roll-back bad versions in 5 min
Blue-Green, Canary, Feature Flags & Schema Migration

Problem & Constraints

Design a release platform for 500+ microservices, thousands of deploys per day. The goal isn't "be able to ship" — it's to minimize the blast radius of a bad version under high deploy frequency: when a buggy build reaches production, it must be detected and rolled back before users notice, not after an alert wakes the on-call.

The cautionary tale is Knight Capital, 2012: a deploy pushed new code to 7 of 8 servers. The new code reused a retired feature flag, and on the 8th server, which still ran the old code, that flag reactivated dead code. The firm lost ~$440M in 45 minutes and survived only through an emergency rescue before being acquired (SEC 8-K). The release system isn't a tail of CI; it's the core of reliability engineering.

High-Level Architecture

The core is a pipeline with automated canary analysis (ACA): a new version first takes a sliver of traffic, its golden signals (error rate, latency, saturation) are statistically compared against a baseline, significant regression auto-rolls-back, and only a passing canary ramps up. A feature-flag platform sits orthogonally, letting "release" be a config toggle without redeploying.

graph LR
    DEV["commit / CI<br/>build + unit test"]
    ART["image registry<br/>immutable artifact"]
    CD["release orchestrator<br/>Spinnaker / Argo"]
    LB["traffic routing<br/>service mesh / LB"]
    BASE["Baseline v1<br/>old version"]
    CAN["Canary v2<br/>new version · 5%"]
    ACA{"canary analysis<br/>statistical compare"}
    OBS[("Metrics / Traces")]
    DEV --> ART --> CD --> LB
    LB -->|95%| BASE
    LB -->|5%| CAN
    BASE --> OBS
    CAN --> OBS
    OBS --> ACA
    ACA -->|pass| CD
    ACA -.->|regress→rollback| LB
    classDef ci fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef route fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef ver fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef judge fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class DEV,ART,CD ci
    class LB,OBS route
    class BASE,CAN ver
    class ACA judge

The canary takes a sliver of traffic; analysis decides pass/fail automatically: promote on pass, roll back on regression. A human steps in only when the verdict is marginal.
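The promote-or-roll-back loop in the diagram can be sketched as below. This is a minimal illustration, not the orchestrator's actual logic: the stage schedule `RAMP_STAGES` and the two callbacks `set_weight` and `analyze` are hypothetical stand-ins for the traffic router and the canary analysis step.

```python
from typing import Callable, List

RAMP_STAGES = [5, 25, 50, 100]  # percent of traffic per stage (assumed schedule)

def run_canary(set_weight: Callable[[int], None],
               analyze: Callable[[int], str],
               stages: List[int] = RAMP_STAGES) -> str:
    """Ramp the canary stage by stage; roll back on the first non-passing verdict."""
    for pct in stages:
        set_weight(pct)          # shift pct% of traffic to the new version
        verdict = analyze(pct)   # compare canary against baseline at this stage
        if verdict != "pass":
            set_weight(0)        # rollback: all traffic returns to the old version
            return "rolled_back"
    return "promoted"

# Usage: a fake router that records weights, and an analysis that regresses at 50%.
history: List[int] = []
result = run_canary(history.append, lambda pct: "pass" if pct < 50 else "regress")
print(result, history)
```

In this run the weights applied are 5, 25, 50 and then 0: the regression at the 50% stage pulls the canary before it ever reaches full traffic.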

Key Techniques

1. Deployment Strategy: Rolling vs Blue-Green vs Canary

Trade-off in one line: spend more resources + go slower in exchange for a smaller blast radius + faster rollback.

Principle: all three are about "how do old and new versions coexist and switch." Rolling replaces instances batch by batch: only 1× resources, but rollback means rolling again, slowly. Blue-Green runs two full environments; once green is ready, traffic cuts over at once, and rollback is just routing back to blue, which takes seconds, at the cost of 2× resources. Canary sends only 1%–5% of traffic to the new version, observes signals, then ramps or rolls back: smallest blast radius, but the most complex and the slowest to fully roll out.

|  | Rolling | Blue-Green | Canary |
| --- | --- | --- | --- |
| Resource cost | 1× (+1 batch) | 2× | 1× + a bit |
| Rollback speed | slow (roll again) | seconds (swap route) | fast (pull canary) |
| Blast radius | medium (grows per batch) | large (all at once) | tiny (1–5%) |
| Old/new coexist | yes (needs compat) | brief | yes (needs compat) |
| Best for | default, stateless | need fast rollback, can afford 2× | high-risk change, strong observability |

Key trap: Both Rolling and Canary keep old and new versions live at the same time, so the two versions must be mutually compatible (forward and backward) — this shifts difficulty from "deployment" to "compatibility design" (see techniques 3 & 4). Blue-Green seems to avoid coexistence, but when both share one database, the schema must still be compatible with both blue and green.
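Mutual compatibility can be illustrated with a "tolerant reader" sketch: the old reader ignores fields it does not know, and the new reader supplies a default when a field is missing. The record shape, the `currency` field, and its `"USD"` default are assumptions made up for this example.

```python
import json

def old_reader(raw: str) -> dict:
    """v1 reader: picks only the fields it knows and ignores anything newer."""
    doc = json.loads(raw)
    return {"id": doc["id"], "amount": doc["amount"]}

def new_reader(raw: str) -> dict:
    """v2 reader: uses the new field when present, defaults it when absent."""
    doc = json.loads(raw)
    return {"id": doc["id"], "amount": doc["amount"],
            "currency": doc.get("currency", "USD")}  # default value is an assumption

v1_record = json.dumps({"id": 1, "amount": 500})                     # written by old code
v2_record = json.dumps({"id": 2, "amount": 700, "currency": "EUR"})  # written by new code

print(old_reader(v2_record))  # old code survives data from the future
print(new_reader(v1_record))  # new code survives data from the past
```

Both directions have to hold for the whole coexistence window; a reader that rejects unknown fields, or one that requires the new field, breaks it.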

2. Automated Canary Analysis: Let Statistics Make the Release Decision

Trade-off in one line: spend statistical rigor + engineering complexity so no human has to eyeball dashboards.

Principle: a canary's value is in comparison, not in viewing the new version's metrics in isolation. The right approach runs a baseline simultaneously — the same old code, taking the same small traffic — to cancel environmental noise like "happened to hit a traffic spike." Then you run a statistical hypothesis test on error rate, latency percentiles, CPU, etc., to judge whether the difference is significant. Netflix Kayenta uses the Mann-Whitney U test (non-parametric, no normality assumption) to score each metric, aggregating into a "pass / marginal / fail" verdict.

# Core canary judgement logic (runnable sketch; assumes SciPy is installed
# and that a higher value is worse for every metric: error rate, p99, CPU...)
from scipy.stats import mannwhitneyu

ALPHA = 0.05  # significance level (assumed; tune per service)

def judge_canary(canary, baseline, metrics):
    """canary / baseline map a metric name to a list of samples."""
    failed = 0
    for m in metrics:
        # one-sided non-parametric test: is canary significantly worse (larger) than baseline?
        _, p = mannwhitneyu(canary[m], baseline[m], alternative="greater")
        if p < ALPHA:
            failed += 1
    fail_ratio = failed / len(metrics)
    if fail_ratio > 0.5:  return "ROLLBACK"   # auto rollback
    if fail_ratio > 0:    return "MANUAL"      # human in the loop
    return "PROMOTE"                           # auto ramp

3. Deploy ≠ Release: Decouple with Feature Flags

Trade-off in one line: take on flag config debt + branching complexity to gain gradual rollout, instant kill switch, and release without redeploy.

Principle: deploying code to production isn't exposing the feature to users. Wrap new functionality in a feature flag, off by default at deploy (dark launch), then turn it on dynamically by user/percentage/region via a config service. This buys three things: ① gradual rollout (1% → 50% → 100%, kill instantly on trouble, no redeploy); ② decoupling release timing from deploy timing (merge to main anytime, flip on at marketing's schedule); ③ A/B experiments. Knight Capital's disaster stemmed exactly from flag governance failure — it reused an old flag that should have been deleted.

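# Feature flag call (pseudo-code; `flags` is a client for the flag config service)
# Rollout rules live in the config service, not in the code, e.g.
#   new_checkout_v2: rollout_pct=5, allow=["beta_team"]
def checkout(ctx):
    if flags.enabled("new_checkout_v2", user=ctx.user):
        return checkout_v2(ctx)     # new path: only 5% + beta group
    return checkout_v1(ctx)         # old path: fallback
# kill switch: set rollout_pct to 0 in the config service; full shutoff with no deploy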
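How a percentage rollout decides which users see the feature can be sketched with a stable hash bucket. This is one common approach, not the API of any particular flag product; `bucket`, `enabled`, and the `killed` parameter are hypothetical names, and here the kill switch is modeled as an explicit override that beats the allow list.

```python
import hashlib

def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100

def enabled(flag: str, user_id: str, rollout_pct: int,
            allow=frozenset(), killed: bool = False) -> bool:
    """Decide whether this user gets the new path."""
    if killed:                  # kill switch overrides everything, allow list included
        return False
    if user_id in allow:        # explicitly allowed users always get the feature
        return True
    return bucket(flag, user_id) < rollout_pct

users = [f"user-{i}" for i in range(10_000)]
on_at_5 = {u for u in users if enabled("new_checkout_v2", u, 5)}
on_at_50 = {u for u in users if enabled("new_checkout_v2", u, 50)}
print(len(on_at_5), on_at_5 <= on_at_50)
```

Because the bucket depends only on the flag name and the user id, a user keeps the same answer across requests, and raising the percentage only adds users; nobody who already had the feature loses it mid-ramp.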

4. Backward Compatibility & DB Migration: The One Thing That Can't Roll Back

Trade-off in one line: spend multi-step effort + temporary dual-writes for zero-downtime, rollback-at-any-moment schema change.

Principle: code rolls back, data does not. Once you drop a column or change a type, old code reading it crashes. While old and new coexist, the schema must work for both. The fix is Expand-Contract (a.k.a. Parallel Change), which splits the breaking change into three phases: Expand adds only (new column/table, leave the old alone); Migrate dual-writes, backfills history, and switches reads to the new structure; Contract drops the old column only after all old versions are gone. Every step is backward-compatible, so rollback just stops at the current step.

The API layer works the same way: Stripe's date-based versions "pin" an account to the version of its first call. Internally only the latest logic is implemented, and a response compatibility layer transforms results back to the pinned format, never truly breaking the API since 2011.
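The response compatibility layer idea can be sketched as a chain of "undo" transforms applied to a latest-format response until it matches the caller's pinned version. This is an illustration of the pattern only, not Stripe's implementation; the version dates, field names, and the `render` function are all invented for the example.

```python
from typing import Callable, Dict, List, Tuple

Transform = Callable[[Dict], Dict]

def _cents_to_amount(resp: Dict) -> Dict:
    # Hypothetical change in 2025-01-15: "amount" was renamed to "amount_cents".
    out = dict(resp)
    out["amount"] = out.pop("amount_cents")
    return out

def _merge_name(resp: Dict) -> Dict:
    # Hypothetical change in 2024-06-01: "name" was split into first/last name.
    out = dict(resp)
    out["name"] = f"{out.pop('first_name')} {out.pop('last_name')}"
    return out

# Newest first: (version that introduced the change, transform that undoes it).
CHANGES: List[Tuple[str, Transform]] = [
    ("2025-01-15", _cents_to_amount),
    ("2024-06-01", _merge_name),
]

def render(latest: Dict, pinned_version: str) -> Dict:
    """Downgrade a latest-format response to the account's pinned version."""
    resp = latest
    for introduced, undo in CHANGES:
        if pinned_version < introduced:   # ISO dates compare correctly as strings
            resp = undo(resp)
    return resp

latest = {"first_name": "Ada", "last_name": "Lovelace", "amount_cents": 500}
print(render(latest, "2023-01-01"))
```

The core logic only ever produces the latest shape; each breaking change costs one small transform, and an account pinned to an old version never sees the change.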

graph LR
    E["① Expand<br/>add new_col<br/>don't drop old"]
    M["② Migrate<br/>dual-write old+new<br/>backfill + read new"]
    C["③ Contract<br/>drop old_col<br/>old versions gone"]
    E --> M --> C
    E -.rollback ok.-> X1["old code still works"]
    M -.rollback ok.-> X2["old column still there"]
    classDef step fill:#1a2530,stroke:#5eead4,color:#e8eef5
    classDef safe fill:#0e2030,stroke:#64c8ff,color:#7a8590
    class E,M,C step
    class X1,X2 safe
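The Expand and Migrate phases can be tried out against an in-memory SQLite database. This is a sketch under assumptions: the `users` table and its columns are invented, and the Contract step is left as a comment because it only runs once no old instance remains.

```python
import sqlite3

def expand(db):
    # Phase 1: additive only. Old code never references full_name, so it keeps working.
    db.execute("ALTER TABLE users ADD COLUMN full_name TEXT")

def old_write(db, first, last):
    # Old version: knows only the old columns.
    db.execute("INSERT INTO users (first_name, last_name) VALUES (?, ?)", (first, last))

def new_write(db, first, last):
    # Phase 2: the new version dual-writes the old and new columns.
    db.execute("INSERT INTO users (first_name, last_name, full_name) VALUES (?, ?, ?)",
               (first, last, f"{first} {last}"))

def backfill(db):
    # Phase 2: fill in rows that were written by old code.
    db.execute("UPDATE users SET full_name = first_name || ' ' || last_name "
               "WHERE full_name IS NULL")

def old_read(db):
    return db.execute("SELECT first_name, last_name FROM users ORDER BY id").fetchall()

def new_read(db):
    return [row[0] for row in db.execute("SELECT full_name FROM users ORDER BY id")]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)")
old_write(db, "Ada", "Lovelace")     # before the migration starts
expand(db)
new_write(db, "Alan", "Turing")      # new instance
old_write(db, "Grace", "Hopper")     # old instance still live during the rollout
backfill(db)
print(new_read(db))
print(old_read(db))                  # old code still works: rollback is safe here
# Phase 3 (Contract), a separate release once no old instance remains:
#   drop first_name / last_name
```

Note that old instances keep producing rows without `full_name` for as long as they run, so the backfill has to be repeated (or the read path must tolerate NULL) until the old version is fully gone.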

Scaling & Optimization

Common Pitfalls + Interview Probes

1. Treating "can deploy" as "can release": without decoupling deploy from release, gradual rollout means repeated redeploys. Interviewers expect you to raise feature flags proactively.
2. Forgetting old/new must be bidirectionally compatible: during rolling/canary, old and new coexist — data written by the new version must be readable by the old, and vice versa. The classic trip-up is the ordering mistake of "drop the API field before the client ships."
3. Treating DB migration like code deployment: assuming a deploy rollback fixes everything, while the schema is already changed and old code crashes reading it. Always expand-contract, and ship Expand/Contract as two releases.
4. Canary with no baseline: judging on the new version's absolute metrics alone, misled by traffic fluctuation. The right way is canary vs same-traffic baseline statistical comparison.
5. Not cleaning up flags: Knight Capital reused a retired flag, triggering dead code, $440M in 45 minutes. Temporary flags must have expiry and cleanup.
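For pitfall 5, one way to enforce "expiry and cleanup" is to give every temporary flag an owner and a removal deadline, then fail a scheduled check when any deadline has passed. The `Flag` record and `expired_flags` helper are hypothetical; real flag platforms expose this metadata in their own ways.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class Flag:
    name: str
    owner: str
    expires: date   # every temporary flag gets a removal deadline at creation

def expired_flags(flags: List[Flag], today: date) -> List[str]:
    """Names of flags whose removal deadline has passed, sorted for stable output."""
    return sorted(f.name for f in flags if f.expires < today)

registry = [
    Flag("new_checkout_v2", "payments", date(2025, 3, 1)),
    Flag("dark_mode_beta", "web", date(2025, 9, 1)),
]
print(expired_flags(registry, today=date(2025, 6, 1)))  # ['new_checkout_v2']
```

Run in CI or on a schedule, a non-empty result blocks the build or files a ticket for the owner, so a stale flag is deleted rather than left around to be reused.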

Deep Resources

Deep Thinking

1. Blue-green claims rollback in seconds, but if both environments share one database, when does that promise break?

Blue-green's instant rollback only holds for stateless versions with no schema change. Once the green version runs a breaking schema change (drop column, change type) at launch, switching back to blue means blue's code reads the already-changed database and crashes anyway — the database is shared by blue and green; it didn't roll back.

More insidious is data contamination: in the few minutes green ran, it wrote data only legal in the new format (new enum values, new JSON shape). Back on blue, blue reads this "future data" and fails to deserialize.

The right move: walk the schema through expand-contract to a "compatible with both blue and green" intermediate state first, then do the blue-green switch. In other words, data migration must lead code release in tempo; blue-green only solves rollback for the stateless part. This is also why "roll forward is often safer than rollback when schema changes are involved."

2. A low-QPS internal service (tens of requests/minute) wants automated canary analysis. What fundamental obstacle does it hit, and what do you do?

Obstacle: insufficient sample size makes statistical testing meaningless. The canary only takes 5% traffic; QPS is already low, so 5% might be a handful of requests per minute. Tests like Mann-Whitney U on small samples either never reach significance (letting real bugs through) or let a single outlier swing the verdict (false alarms). Statistics needs enough n.

What to do: ① raise the canary fraction (give low-traffic services 50% — the absolute volume is small so blast radius stays controlled); ② lengthen the window (look at hours of accumulation, not per-minute); ③ downgrade to blue-green + threshold alerts, drop statistical comparison, rely on a simple error-rate threshold + human confirmation; ④ synthetic traffic, use replay/load tests to feed the canary controlled load for samples. The essence: ACA is designed for high traffic; low-traffic services should switch to plainer strategies rather than do statistics for statistics' sake.
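The sample-size obstacle can be made concrete with two small calculations. For an exact one-sided Mann-Whitney U test without ties, the most extreme ordering has probability 1 / C(n1 + n2, n1), so that is the smallest p-value the test can ever report. The helper names below are hypothetical, and the target of 100 samples is an arbitrary assumption.

```python
import math

def min_one_sided_p(n_canary: int, n_baseline: int) -> float:
    """Smallest p-value an exact one-sided Mann-Whitney U test can return (no ties)."""
    return 1 / math.comb(n_canary + n_baseline, n_canary)

def minutes_to_collect(samples_needed: int, requests_per_min: float,
                       canary_fraction: float) -> int:
    """Minutes until the canary has received samples_needed requests."""
    return math.ceil(samples_needed / (requests_per_min * canary_fraction))

print(min_one_sided_p(3, 3))             # 0.05
print(minutes_to_collect(100, 40, 0.5))  # 5
```

With three samples per side the test can never get below 0.05, however bad the canary is, which is the "never reaches significance" failure mode. The second function shows why options ① and ② help: at 40 requests/minute, a 50% canary collects 100 samples in 5 minutes, while a 5% canary needs ten times as long.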

3. Expand-Contract requires "ship Expand and Contract as two releases." If a team cuts corners and ships it as one, what's the worst that happens?

Shipping as one means "add new column + dual-write + drop old column" all in the same version. The problem is the coexistence window of rolling deployment: while the new version is replacing the old batch by batch, the cluster simultaneously has new instances that "believe the old column is gone" and old instances "still reading/writing the old column."

The new instances have already DROPped the old column → the old instances' queries immediately fail with "column not found". Structurally this is the same failure as Knight Capital: different instances disagreeing on a shared contract at the same moment (there it was the meaning of a flag, here it is the schema).

Even without rolling — cutting over with blue-green in one shot — rollback can't go back: the old column is physically deleted. The essence of two releases is to insert a sync point between them that "confirms all old instances are gone," reducing the distributed problem of "version coexistence" into two serial, individually-compatible steps. The step you save is safety itself.

4. Feature flags give an instant kill switch — can they replace canary and rollback? What's the relationship among the three?

They can't replace each other; the three act at different layers and compose:

  • Canary answers "should this binary go to production" — a risk assessment of the whole deployment unit, granularity "version."
  • Feature flag answers "code already in production — should users see this feature" — granularity "feature / cohort," and toggleable without redeploy.
  • Rollback is the backstop — when the first two failed to stop it and the bad version is already causing impact, swap the binary back.

The typical pipeline stacks all three: new feature hidden behind a flag (off) ships with the version → canary analysis confirms the version itself is healthy → full deploy → then independently ramp the feature with the flag, killing it instantly on trouble. Problems flags can't turn off (framework upgrades, dependency changes, memory leaks) are what canary interception and rollback are for. Treating the flag as an omni-switch and skipping canary means giving up protection against "version-level regression."

5. Estimate: a version has 0.1% chance of introducing a crippling bug. With 1000 deploys/day shipped full-blast, vs "canary catches 95%," how many production incidents per year differ?

Full-blast: each deploy has 0.1% crippling chance, 1000/day → expected 1000 × 0.001 = 1 crippling incident/day → ~365/year. The system is basically unusable.

Canary catches 95%: 95% of bad versions are detected and auto-rolled-back while they hold only the canary slice (about 5% of traffic) for a short window, so they never ramp. Only the 5% that slip past analysis reach full rollout → true full-blast incidents drop to 365 × 5% ≈ 18/year, a difference of roughly 347 incidents per year. The caught versions still cause brief, contained incidents, but each one's user impact is shrunk by the canary traffic fraction (impact = probability × blast radius, both factors suppressed).
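The arithmetic above can be checked with a few lines. The last figure, a traffic-weighted "exposure", is an extra simplification introduced here: it counts a contained canary incident as 5% of a full one and assumes equal duration, which overstates canary impact since rollbacks are quick.

```python
def yearly_incidents(deploys_per_day: int, p_bad: float,
                     catch_rate: float, canary_fraction: float):
    """Return (bad versions, full-blast incidents, contained incidents, exposure) per year."""
    bad = deploys_per_day * p_bad * 365        # bad versions shipped per year
    slipped = bad * (1 - catch_rate)           # promoted despite the canary
    contained = bad * catch_rate               # rolled back at canary stage
    exposure = slipped + contained * canary_fraction  # in full-incident equivalents
    return bad, slipped, contained, exposure

bad, slipped, contained, exposure = yearly_incidents(1000, 0.001, 0.95, 0.05)
print(round(bad), round(slipped, 2), round(contained, 2), round(exposure, 2))
```

That is about 365 full-blast incidents per year without a canary versus about 18 with one, and roughly a tenfold drop in traffic-weighted impact even when the contained incidents are counted.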

Second-order insight: reliability comes not from "making bugs not happen" (the 0.1% base rate is hard to lower further) but from lowering the cost of each failure. High deploy frequency is actually safer — the more frequent the deploys, the smaller each change, the easier the canary identifies "which change broke it" from the signals. That's the math behind the counterintuitive "deploy more often to be safer": high frequency + small batches + automatic interception yields far lower overall risk than "batch up a big load and cautiously ship it all at once."