Ship changes without downtime or big-bang risk. Recreate, rolling, blue-green, canary, and progressive delivery with automated analysis and rollback, plus feature flags to decouple deploy from release.
Someone clicked deploy. The pipeline did exactly what it was told: it tore down every old container and started the new ones, all of them, at once. The new build had a bad config value that only showed up under real traffic. Within ninety seconds, 100% of users were hitting 500s. There was no old version still running to fall back to. The only fix was to roll forward: rebuild and redeploy. That meant twenty-five minutes of hard outage on a Friday evening, because the change went to everyone simultaneously with nothing left to catch the fall.
The bug was trivial. The blast radius was not. That gap, between how small the mistake was and how total the outage was, is the entire subject of this article. The deploy strategy, not the bug, decided the size of the crater.
Who this is for
Engineers and SREs who ship to production and want releases to stop being scary. You should be comfortable with containers and a basic CI/CD pipeline. By the end you'll know when to reach for recreate vs rolling vs blue-green vs canary, how to gate a canary on real metrics with automatic rollback, and how feature flags let you deploy code without releasing it.
The one principle that fixes Friday
Separate deploy from release. Roll out to a few before you roll out to all.
Deploy means new code is running on a server. Release means real users are being served by it. Big-bang deploys fuse those two events into one irreversible moment. Every strategy below is a way to pry them apart, to get new code running while you still control who sees it, and to keep the old version alive long enough to retreat to it.
| Analogy | Deployment equivalent |
| --- | --- |
| Dipping one toe in before you jump | Routing 5% of traffic to the new version (canary) |
| It's freezing, pull your toe back out | Metrics breach a threshold, automatic rollback |
| Feels fine, wade in fully | Promote the canary to 100% of traffic |
| Two identical pools, swim in one at a time | Blue-green: full standby environment, flip the switch |
You don't dive into water you haven't tested.
The picture: how a canary actually flows
Traffic shifts gradually from v1 to v2. An analysis gate decides: promote to 100%, or roll back to v1.
1. Deploy v2 alongside v1. The canary version starts up next to the stable one. v1 still serves everyone; v2 gets zero traffic until it's healthy.
2. Shift a small slice. The router sends ~5% of traffic to v2. The other 95% stays on v1, so most users are untouched if v2 is broken.
3. Analyze against the baseline. Compare v2's error rate, latency, and saturation to v1 over a fixed window. v1 is the live control group; that's the point of running both.
4. Gate the decision. If metrics stay within threshold, increase the weight (5 → 25 → 50 → 100). If anything breaches, halt.
5. Promote or roll back. Healthy: v2 takes 100% and v1 is retired. Unhealthy: the weight on v2 snaps back to 0%. Rollback is just a routing change, with no rebuild.
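The five steps above can be sketched as a small control loop. This is an illustrative model, not a real controller: `CanaryDeps`, `setCanaryWeight`, `measure`, and both threshold constants are hypothetical names and values chosen for the example, and a production system would read metrics from its monitoring stack and wait out an analysis window at each step.

```typescript
// Hypothetical types: a real controller would query your metrics system
// and change weights on your router or service mesh.
type Metrics = { errorRate: number; p99LatencyMs: number };

interface CanaryDeps {
  setCanaryWeight(percent: number): void;  // a routing change only
  measure(version: "v1" | "v2"): Metrics;  // metrics over one analysis window
}

// Assumed thresholds: how far v2 may drift from the live v1 baseline.
const MAX_ERROR_RATE_DELTA = 0.01;
const MAX_LATENCY_RATIO = 1.2;

function withinThreshold(baseline: Metrics, canary: Metrics): boolean {
  return (
    canary.errorRate - baseline.errorRate <= MAX_ERROR_RATE_DELTA &&
    canary.p99LatencyMs <= baseline.p99LatencyMs * MAX_LATENCY_RATIO
  );
}

function runCanary(
  deps: CanaryDeps,
  analysisWeights: number[] = [5, 25, 50],
): "promoted" | "rolled-back" {
  for (const weight of analysisWeights) {
    deps.setCanaryWeight(weight);
    // v1 is the control group: compare v2 against it, not against a fixed number.
    if (!withinThreshold(deps.measure("v1"), deps.measure("v2"))) {
      deps.setCanaryWeight(0); // rollback: no rebuild, just routing
      return "rolled-back";
    }
  }
  deps.setCanaryWeight(100); // every gate passed, so promote
  return "promoted";
}
```

Analysis runs only at the partial weights, because once v2 holds 100% of traffic there is no live v1 left to compare against.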
Four strategies, side by side
These are the four you'll actually pick between. Each trades cost and complexity for a smaller blast radius and a faster escape hatch. Read the table as a ladder: the further down you go, the safer the release and the more infrastructure it asks for.
| Strategy | Downtime | Blast radius / risk | Cost | Rollback speed |
| --- | --- | --- | --- | --- |
| Recreate | Yes, full gap | Total: everyone, instantly | Lowest (1 env) | Slow, redeploy old build |
| Rolling | None | Medium: grows pod by pod | Low (1 env) | Medium, roll back pod by pod |
| Blue-green | None | Total on flip, but instant revert | High (2 full envs) | Instant, flip router back |
| Canary | None | Tiny: a small % first | Medium (partial v2) | Fast, drop canary weight to 0 |

Downtime, risk, cost, and rollback speed across the four core strategies.
Recreate is the Friday story: stop all, start all. Use it only where a moment of downtime is acceptable and two versions truly cannot coexist (e.g. a breaking schema lock). Rolling replaces instances a few at a time. It's the Kubernetes default and a fine baseline, but it has no metrics gate: a bad version still reaches everyone, just more slowly. Blue-green keeps a full standby (green) and flips traffic in one router change. Reverting is just flipping back, but you pay for two full environments. Canary is the most surgical: a small slice sees v2 first, judged against live v1, before anyone else is exposed.
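To make the blue-green flip concrete, here is a minimal in-memory model. `BlueGreenRouter` and its methods are hypothetical names for illustration. In practice the "pointer" is usually a load balancer target, a DNS record, or a service selector.

```typescript
type Env = "blue" | "green";

// Toy model of a router in front of two full environments.
class BlueGreenRouter {
  private live: Env = "blue";
  private versions: Record<Env, string>;

  constructor(initialVersion: string) {
    this.versions = { blue: initialVersion, green: initialVersion };
  }

  get idle(): Env {
    return this.live === "blue" ? "green" : "blue";
  }

  get liveVersion(): string {
    return this.versions[this.live];
  }

  // Deploy: new code runs on the idle environment; no user sees it yet.
  deployToIdle(version: string): void {
    this.versions[this.idle] = version;
  }

  // Release or revert: one pointer change moves all traffic.
  flip(): void {
    this.live = this.idle;
  }
}

const router = new BlueGreenRouter("v1");
router.deployToIdle("v2"); // deployed, not released
router.flip();             // released: all traffic now on v2
router.flip();             // reverted: back on v1, no rebuild needed
```

The second `flip()` is the "instant, flip router back" cell in the table: the old environment is still running, so reverting costs one routing change.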
A real canary config (Argo Rollouts)
Here's a canary expressed declaratively with Argo Rollouts. It's a drop-in replacement for a Kubernetes Deployment that adds weighted steps, pauses, and, critically, an automated analysis gate that aborts the rollout if metrics go bad.
Without the `analysis` step, this is just a slow rolling deploy. The `successCondition` is what makes the rollout self-driving: it watches Prometheus, and if the 5xx ratio crosses 2% three times, Argo automatically aborts and shifts traffic back to v1. No human, no pager, no Friday outage.
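The gate logic described above can be modeled in a few lines. This sketch mirrors the behavior stated in the prose (a 2% ceiling, abort on the third breach). It is not Argo Rollouts code, and Argo's own counting rules may differ in detail. `evaluateGate` and both constants are hypothetical names.

```typescript
// Values taken from the prose above: a 2% ceiling on the 5xx ratio,
// and an abort once the ceiling has been breached three times.
const MAX_5XX_RATIO = 0.02;
const FAILURE_LIMIT = 3;

type Verdict = "continue" | "abort";

// Each sample is the 5xx ratio measured over one analysis interval.
function evaluateGate(samples: number[]): Verdict {
  let failures = 0;
  for (const ratio of samples) {
    if (ratio > MAX_5XX_RATIO) failures += 1;
    if (failures >= FAILURE_LIMIT) return "abort";
  }
  return "continue";
}
```

Allowing a couple of breaches before aborting trades a little exposure for fewer false alarms from a single noisy interval.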
Feature flags: deploy without releasing
Canary controls which servers see new code by routing traffic. Feature flags control which code path runs, per request, regardless of which server you hit. They're the cleanest way to fully separate deploy from release: ship the code to 100% of servers with the flag off, then turn it on for 1% of users, then 50%, then everyone, all without another deploy.
checkout.ts

```typescript
// Code is deployed everywhere, but dark until the flag flips.
if (flags.isEnabled("new-pricing-engine", { userId })) {
  return computePriceV2(cart); // new path
}
return computePriceV1(cart); // old path, still the default
```
Kill switch. A bad feature is turned off in seconds from a dashboard, no rebuild, no redeploy, no rollback pipeline.
Targeted release. Enable for internal users, then a 1% cohort, then a region, then everyone, decoupled from the deploy cadence.
Trunk-based development. Merge half-finished work to main behind an off flag, avoiding long-lived branches and merge hell.
The cost: flag debt. Every flag is a branch in your code. Delete flags once a feature is fully rolled out, or the codebase rots into a maze of dead conditionals.
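One common way to implement the percentage rollout and kill switch described above is deterministic bucketing: hash the user ID into a stable bucket from 0 to 99 and compare it to the rollout percentage. The sketch below assumes a hypothetical `FlagRule` shape and a standalone `isEnabled` function. It is not the API of any particular flag service.

```typescript
// FNV-1a over UTF-16 code units: a simple non-cryptographic 32-bit hash,
// used here only to spread users evenly across buckets.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

// Hypothetical rule shape for one flag.
interface FlagRule {
  enabled: boolean;       // kill switch: false turns the feature off for everyone
  rolloutPercent: number; // 0 to 100
  allowList?: string[];   // e.g. internal users who always get the feature
}

function isEnabled(flagName: string, rule: FlagRule, userId: string): boolean {
  if (!rule.enabled) return false;
  if ((rule.allowList ?? []).includes(userId)) return true;
  // Salting with the flag name gives each flag its own independent cohort.
  const bucket = fnv1a(`${flagName}:${userId}`) % 100;
  return bucket < rule.rolloutPercent;
}
```

Because the bucket depends only on the flag name and user ID, a user who is in the 1% cohort stays in it as the percentage grows to 50% and then 100%, so nobody flips back and forth between code paths during the rollout.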
Common mistakes that cost hours
No automated rollback. A canary with manual promotion only works while a human is watching. At 3am nobody is. If the gate can't roll back on its own, you've just added latency to the same outage.
No metrics gate. Shifting weight on a timer alone (5%, wait, 50%, wait) without checking error rate or latency is theater. You'll dutifully promote a broken version on schedule. The gate, not the pause, is what makes it safe.
Canary without enough traffic. A 5% slice of a low-traffic service might be three requests an hour. Your statistics are noise and the gate can't tell signal from chance. Low-traffic services need a longer bake window, a bigger initial weight, or synthetic traffic.
Forgetting database migrations. Blue-green and canary assume v1 and v2 run at once, so the schema must be compatible with both. Use expand-and-contract migrations (add columns, never rename/drop in the same release).
Leaving flags on forever. Stale feature flags become permanent untested branches. Treat flag cleanup as part of finishing the feature.
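The low-traffic problem above is easy to quantify before you pick a weight and a bake window. The helpers below are a rough sizing sketch. The function names are hypothetical, and the sample target you pass in is an assumption you should set from your own tolerance for noise.

```typescript
// Requests the canary is expected to receive during the bake window.
function expectedCanaryRequests(
  requestsPerHour: number,
  canaryPercent: number,
  windowMinutes: number,
): number {
  return (requestsPerHour * canaryPercent * windowMinutes) / (100 * 60);
}

// Minutes the canary needs to collect at least `minSamples` requests.
function requiredWindowMinutes(
  requestsPerHour: number,
  canaryPercent: number,
  minSamples: number,
): number {
  const denominator = requestsPerHour * canaryPercent;
  if (denominator <= 0) return Infinity; // no canary traffic, no signal
  return Math.ceil((minSamples * 60 * 100) / denominator);
}

// A service at 60 requests/hour with a 5% canary sees 3 canary requests/hour.
const perHour = expectedCanaryRequests(60, 5, 60);
// Assuming a target of 500 samples, compare a 5% slice with a 25% slice.
const minutesAt5 = requiredWindowMinutes(60, 5, 500);
const minutesAt25 = requiredWindowMinutes(60, 25, 500);
```

With these assumed numbers, the 5% slice needs 10,000 minutes (about a week) to reach 500 requests, while a 25% slice needs 2,000 minutes. That is the trade the text describes: a longer bake, a bigger initial weight, or synthetic traffic to raise the request rate.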
Takeaways
The whole article in six lines
The deploy strategy, not the bug, decides the blast radius.
Separate **deploy** (code is running) from **release** (users are served).
A canary is only as good as its **analysis gate**; automate the rollback.
Compare the canary to live v1 as the control group; that's why you run both.
Feature flags decouple release from deploy entirely and give you a kill switch that acts in seconds.
Keep migrations backward-compatible, because both versions run at once, and delete stale flags once a feature is fully rolled out.
Where to go next
Deployment strategy is one pillar of safe delivery. Pair it with declarative delivery and with resilience patterns so a bad release degrades gracefully instead of cascading.