Deployment Strategies: Blue-Green, Canary & Progressive Delivery


It was Friday at 5pm

Someone clicked deploy. The pipeline did exactly what it was told: it tore down every old container and started the new ones, all at once. The new build had a bad config value that only showed up under real traffic. Within ninety seconds, 100% of users were hitting 500s. There was no old version still running to fall back to. The only fix was to roll forward: rebuild and redeploy. That meant twenty-five minutes of hard outage on a Friday evening, because the change went to everyone simultaneously with nothing left to catch the fall.

The bug was trivial. The blast radius was not. That gap, between how small the mistake was and how total the outage was, is the entire subject of this article. The deploy strategy, not the bug, decided the size of the crater.

Who this is for

Engineers and SREs who ship to production and want releases to stop being scary. You should be comfortable with containers and a basic CI/CD pipeline. By the end you'll know when to reach for recreate vs rolling vs blue-green vs canary, how to gate a canary on real metrics with automatic rollback, and how feature flags let you deploy code without releasing it.

The one principle that fixes Friday

Separate deploy from release. Roll out to a few before you roll out to all.
The whole article in one line

Deploy means new code is running on a server. Release means real users are being served by it. Big-bang deploys fuse those two events into one irreversible moment. Every strategy below is a way to pry them apart: to get new code running while you still control who sees it, and to keep the old version alive long enough to retreat to it.

Dipping one toe in before you jump: routing 5% of traffic to the new version (canary).

It's freezing, so you pull your toe back out: metrics breach a threshold and trigger an automatic rollback.

Feels fine, so you wade in fully: promote the canary to 100% of traffic.

Two identical pools, swim in one at a time: blue-green, a full standby environment where you flip the switch.

You don't dive into water you haven't tested.

The picture: how a canary actually flows

Traffic shifts gradually from v1 to v2. An analysis gate decides: promote to 100%, or roll back to v1.

1. Deploy v2 alongside v1. The canary version starts up next to the stable one. v1 still serves everyone; v2 gets zero traffic until it's healthy.
2. Shift a small slice. The router sends ~5% of traffic to v2. The other 95% stays on v1, so most users are untouched if v2 is broken.
3. Analyze against the baseline. Compare v2's error rate, latency, and saturation to v1 over a fixed window. v1 is the live control group; that's the point of running both.
4. Gate the decision. If metrics stay within threshold, increase the weight (5 → 25 → 50 → 100). If anything breaches, halt.
5. Promote or roll back. Healthy: v2 takes 100% and v1 is retired. Unhealthy: weight snaps back to 0% on v2. Rollback is just a routing change, with no rebuild.
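
The five steps above can be sketched as a small control loop. This is an illustration under stated assumptions, not how any particular tool implements it: `CanaryDeps`, `setCanaryWeight`, `measure`, and both threshold constants are hypothetical names, and the bake-time wait between shifting traffic and measuring is left out to keep the sketch short.

```typescript
// Hypothetical shapes: a real router and metrics backend would sit behind these.
type Metrics = { errorRate: number; p99LatencyMs: number };

interface CanaryDeps {
  setCanaryWeight(percent: number): void;   // e.g. update a router or mesh rule
  measure(version: "v1" | "v2"): Metrics;   // e.g. query a metrics store
}

type Outcome =
  | { result: "promoted" }
  | { result: "rolled-back"; atWeight: number };

// Assumed thresholds: how far v2 may drift from the live v1 baseline.
const MAX_ERROR_RATE_DELTA = 0.01;
const MAX_LATENCY_RATIO = 1.2;

function withinThreshold(baseline: Metrics, canary: Metrics): boolean {
  return (
    canary.errorRate - baseline.errorRate <= MAX_ERROR_RATE_DELTA &&
    canary.p99LatencyMs <= baseline.p99LatencyMs * MAX_LATENCY_RATIO
  );
}

function runCanary(deps: CanaryDeps, gatedWeights: number[] = [5, 25, 50]): Outcome {
  for (const weight of gatedWeights) {
    deps.setCanaryWeight(weight);              // steps 2 and 4: shift a slice
    const baseline = deps.measure("v1");       // step 3: v1 is the control group
    const canary = deps.measure("v2");
    if (!withinThreshold(baseline, canary)) {  // step 4: gate the decision
      deps.setCanaryWeight(0);                 // step 5: rollback is a routing change
      return { result: "rolled-back", atWeight: weight };
    }
  }
  deps.setCanaryWeight(100);                   // step 5: promote, v2 serves everyone
  return { result: "promoted" };
}
```

Note that the gate only runs while v1 still takes traffic. Once v2 is at 100% there is no live baseline left to compare against, which is why the final promotion is not gated in this sketch.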

Four strategies, side by side

These are the four you'll actually pick between. Each trades cost and complexity for a smaller blast radius and a faster escape hatch. Read the table as a ladder: the further down you go, the safer the release and the more infrastructure it asks for.

Strategy	Downtime	Blast radius / risk	Cost	Rollback speed
Recreate	Yes, full gap	Total: everyone, instantly	Lowest (1 env)	Slow, redeploy old build
Rolling	None	Medium: grows pod by pod	Low (1 env)	Medium, roll back pod by pod
Blue-green	None	Total on flip, but instant revert	High (2 full envs)	Instant, flip router back
Canary	None	Tiny: a small % first	Medium (partial v2)	Fast, drop canary weight to 0

Downtime, risk, cost, and rollback speed across the four core strategies.

Recreate is the Friday story: stop all, start all. Use it only where a moment of downtime is acceptable and two versions truly cannot coexist (e.g. a schema change that breaks the old version). Rolling replaces instances a few at a time; it's the Kubernetes default and a fine baseline, but it has no metrics gate, so a bad version that passes its health checks still reaches everyone, just more slowly. Blue-green keeps a full standby (green) and flips traffic in one router change; reverting is just flipping back, but you pay for two full environments. Canary is the most surgical: a small slice sees v2 first, judged against live v1, before anyone else is exposed.
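
To make the blue-green "flip" concrete, here is a minimal in-memory sketch. `BlueGreenRouter` is a hypothetical class invented for illustration; in a real system the same role is played by a load balancer rule, a DNS record, or a service selector.

```typescript
type Color = "blue" | "green";

// Hypothetical router: exactly one environment is live at any time.
class BlueGreenRouter {
  private live: Color = "blue";

  // The environment currently receiving all production traffic.
  active(): Color {
    return this.live;
  }

  // The idle environment, where the next version is deployed and smoke-tested.
  standby(): Color {
    return this.live === "blue" ? "green" : "blue";
  }

  // Cut over in one step. Calling flip() again reverts to the previous environment.
  flip(): Color {
    this.live = this.standby();
    return this.live;
  }
}
```

Release and revert are the same operation: one `flip()` sends all traffic to green, and a second `flip()` sends it back to blue without a rebuild. That symmetry is what the table means by "instant" rollback, and it is also why the blast radius on the flip is total.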

A real canary config (Argo Rollouts)

Here's a canary expressed declaratively with Argo Rollouts. It's a drop-in replacement for a Kubernetes Deployment that adds weighted steps, pauses, and, critically, an automated analysis gate that aborts the rollout if metrics go bad.

rollout.yaml

yaml

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  replicas: 10
  selector:
    matchLabels: { app: checkout }
  template:
    metadata:
      labels: { app: checkout }   # must match spec.selector above
    spec:
      containers:
        - name: checkout
          image: registry/checkout:v2   # the new version
  strategy:
    canary:
      steps:
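        # Exact percentages need a traffic router (strategy.canary.trafficRouting);
        # without one, weights are approximated by the ratio of canary to stable pods.
        - setWeight: 5           # 5% of traffic to v2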
        - pause: { duration: 5m } # bake, then analyze
        - analysis:
            templates:
              - templateName: error-rate-gate
        - setWeight: 25
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100         # full promotion
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-gate
spec:
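  metrics:
    - name: error-rate
      interval: 1m
      # take 5 measurements, then finish; with an interval and no count the
      # analysis keeps running and the rollout never advances past this step
      count: 5
      # tolerate up to 3 failed measurements; one more fails the run
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          # note: this ratio covers the whole service, stable and canary pods together
          query: |
            sum(rate(http_requests_total{app="checkout",status=~"5.."}[2m]))
              /
            sum(rate(http_requests_total{app="checkout"}[2m]))
      # a measurement passes only while the 5xx ratio is below 2%
      successCondition: result[0] < 0.02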

The gate is the whole point

Without the `analysis` step, this is just a slow rolling deploy. The `successCondition` is what makes the rollout self-driving: Argo queries Prometheus every minute, and if the 5xx ratio breaches 2% more often than `failureLimit` allows, it aborts the rollout and shifts traffic back to v1. No human, no pager, no Friday outage.
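
The gate's decision rule can be modeled in a few lines. This is a simplified sketch, not Argo's implementation: `evaluateGate` is a hypothetical function, and it assumes `failureLimit` is the number of failed measurements the run tolerates, so the run fails on the next one. Verify that counting rule against the Argo Rollouts documentation for your version.

```typescript
type Verdict = "successful" | "failed";

// Walk through the measured 5xx ratios in order. A measurement passes when the
// ratio is below the threshold; the run fails once failures exceed failureLimit.
function evaluateGate(
  ratios: number[],
  threshold: number = 0.02,
  failureLimit: number = 3
): Verdict {
  let failures = 0;
  for (const ratio of ratios) {
    if (!(ratio < threshold)) {
      failures += 1;               // mirrors successCondition: result[0] < 0.02
    }
    if (failures > failureLimit) {
      return "failed";             // the rollout would abort here
    }
  }
  return "successful";
}
```

Under that assumption, a single noisy minute does not abort a healthy rollout, but a sustained breach does. Setting `failureLimit` to 0 makes the gate fail on the first bad measurement.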

Feature flags: deploy without releasing

A canary controls how much traffic reaches the servers running new code. Feature flags control which code path runs, per request, regardless of which server handles it. They're the cleanest way to fully separate deploy from release: ship the code to 100% of servers turned off, then turn it on for 1% of users, then 50%, then everyone, all without another deploy.

checkout.ts

typescript

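// Code is deployed everywhere, but dark until the flag flips.
// `flags`, `Cart`, and the two pricing functions are defined elsewhere in the app.
function computePrice(cart: Cart, userId: string): number {
  if (flags.isEnabled("new-pricing-engine", { userId })) {
    return computePriceV2(cart);   // new path
  }
  return computePriceV1(cart);     // old path, still the default
}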

Kill switch. A bad feature is turned off in seconds from a dashboard: no rebuild, no redeploy, no rollback pipeline.
Targeted release. Enable for internal users, then a 1% cohort, then a region, then everyone, decoupled from the deploy cadence.
Trunk-based development. Merge half-finished work to main behind an off flag, avoiding long-lived branches and merge hell.
The cost: flag debt. Every flag is a branch in your code. Delete flags once a feature is fully rolled out, or the codebase rots into a maze of dead conditionals.
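
A percentage rollout needs each user to get a stable answer, so the same person does not flip between old and new pricing on every request. One common approach is to hash the flag name and user ID into a bucket. The sketch below is an illustration, not the API of any specific flag service: `stableHash`, `bucketFor`, and `isEnabledFor` are hypothetical names, and the hash is a simple polynomial one chosen for brevity.

```typescript
// Simple deterministic string hash (polynomial, kept small by a modulus).
// Good enough to illustrate bucketing; a production system would pick a stronger hash.
function stableHash(input: string): number {
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    hash = (hash * 31 + input.charCodeAt(i)) % 1000003;
  }
  return hash;
}

// Map a user to a fixed bucket in [0, 100) for a given flag.
// Including the flag name means different flags pick different cohorts.
function bucketFor(flagName: string, userId: string): number {
  return stableHash(flagName + ":" + userId) % 100;
}

// A user is in the rollout when their bucket is below the rollout percentage.
function isEnabledFor(flagName: string, userId: string, rolloutPercent: number): boolean {
  return bucketFor(flagName, userId) < rolloutPercent;
}
```

Because the bucket never changes, raising the percentage from 1 to 50 only adds users: everyone who already had the feature keeps it. Setting the percentage to 0 is the kill switch.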

Common mistakes that cost hours

No automated rollback. A canary with manual promotion only works while a human is watching. At 3am nobody is. If the gate can't roll back on its own, you've just added latency to the same outage.
No metrics gate. Shifting weight on a timer alone (5%, wait, 50%, wait) without checking error rate or latency is theater. You'll dutifully promote a broken version on schedule. The gate, not the pause, is what makes it safe.
Canary without enough traffic. A 5% slice of a low-traffic service might be three requests an hour. Your statistics are noise and the gate can't tell signal from chance. Low-traffic services need a longer bake window, a bigger initial weight, or synthetic traffic.
Forgetting database migrations. Blue-green and canary assume v1 and v2 run at once, so the schema must be compatible with both. Use expand-and-contract migrations: add the new columns first, and rename or drop the old ones only in a later release, once no running version depends on them.
Leaving flags on forever. Stale feature flags become permanent untested branches. Treat flag cleanup as part of finishing the feature.
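
The low-traffic problem is plain arithmetic, and it is worth running before you pick a weight and a bake window. The helpers below are a rough sketch with hypothetical names. The minimum sample count you pass in is your own judgment call; nothing here derives it statistically.

```typescript
// Expected number of requests the canary serves during one bake window.
function expectedCanaryRequests(
  requestsPerHour: number,
  canaryWeightPercent: number,
  bakeMinutes: number
): number {
  return requestsPerHour * (canaryWeightPercent / 100) * (bakeMinutes / 60);
}

// Shortest bake window, in whole minutes, that yields at least minSamples
// canary requests. Returns Infinity when the canary receives no traffic.
function minBakeMinutes(
  requestsPerHour: number,
  canaryWeightPercent: number,
  minSamples: number
): number {
  const canaryPerHourTimes100 = requestsPerHour * canaryWeightPercent;
  if (canaryPerHourTimes100 <= 0) {
    return Infinity;
  }
  return Math.ceil((minSamples * 60 * 100) / canaryPerHourTimes100);
}
```

For the example in the list above, a service at 60 requests per hour with a 5% canary sees about three canary requests an hour. If you decide you need at least 100 requests before trusting the gate, `minBakeMinutes(60, 5, 100)` returns 2000 minutes, which is more than a day. That is the signal to raise the initial weight or add synthetic traffic rather than keep a 5-minute pause.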

Takeaways

The whole article in seven lines

The deploy strategy, not the bug, decides the blast radius.
Separate **deploy** (code is running) from **release** (users are served).
Recreate = downtime; rolling = safer default; blue-green = instant revert; canary = smallest blast radius.
A canary is only as good as its **analysis gate**; automate the rollback.
Compare the canary to live v1 as the control group; that's why you run both.
Feature flags decouple release from deploy entirely and give you a kill switch that works in seconds.
Keep migrations backward-compatible, because both versions run at once, and delete stale flags.

Where to go next

Deployment strategy is one pillar of safe delivery. Pair it with declarative delivery and with resilience patterns so a bad release degrades gracefully instead of cascading.

GitOps: Declarative Delivery with ArgoCD & Flux: make the desired state the source of truth, so promotions and rollbacks are git operations.
Graceful Degradation & Load Shedding: what to do when a release does slip through and the system is under stress.
Practice in the kubectl lab: drive a rolling update and watch pods cut over.
Wire the pipeline in the CI/CD lab: gate a deploy on tests and a metrics check.
Go end to end on the DevOps Engineer path.
