Zero-downtime deployment: blue-green, canary & rolling explained
The short version: Zero-downtime deployment means shipping a new version without users noticing. Three strategies achieve it: blue-green (two environments, swap), canary (send a small percentage of traffic first), and rolling (replace instances a few at a time). All three depend on the same foundation: a health-gated cutover, graceful draining, and backward-compatible changes, so that no in-flight request is dropped.
What zero-downtime actually requires
The strategy names get the attention, but they all rest on three things. Without these, any of them still drops requests:
- Health-gated cutover: traffic only moves to an instance once it passes a readiness check.
- Graceful draining: on SIGTERM, the old instance stops accepting new requests and finishes the in-flight ones before it exits.
- Backward-compatible changes: old and new versions can run at the same time, including against the same database schema.
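The first two requirements can be sketched in a few lines of Python. This is a minimal illustration, not a production router: `Instance`, `cut_over`, and the `readiness_check` callable are hypothetical names, and the in-memory counter stands in for whatever your load balancer or process manager actually tracks.

```python
import threading
import time


class Instance:
    """A hypothetical app instance as seen by a router."""

    def __init__(self, name):
        self.name = name
        self.ready = False      # set once the readiness check passes
        self.draining = False   # set when the instance has been told to stop
        self.in_flight = 0      # requests currently being handled
        self._lock = threading.Lock()

    def begin_request(self):
        """Accept a request only if the instance is ready and not draining."""
        with self._lock:
            if not self.ready or self.draining:
                return False
            self.in_flight += 1
            return True

    def end_request(self):
        with self._lock:
            self.in_flight -= 1

    def drain(self, timeout=5.0, poll=0.01):
        """Refuse new requests, then wait for in-flight ones to finish."""
        with self._lock:
            self.draining = True
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            with self._lock:
                if self.in_flight == 0:
                    return True
            time.sleep(poll)
        return False  # timed out with requests still running


def cut_over(old, new, readiness_check):
    """Move traffic to `new` only if it passes the readiness check."""
    if not readiness_check(new):
        return old          # gate failed: keep serving from the old instance
    new.ready = True        # new instance may now receive traffic
    old.drain()             # old instance finishes its work before stopping
    return new
```

In a real service, `drain` would typically be called from a SIGTERM handler, and the readiness check would be an HTTP probe against the new instance rather than a Python callable.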
The three strategies
Blue-green
Two identical environments. Deploy to the idle one, test it, then flip all traffic. Instant rollback: flip back.
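As a rough sketch of the flip, assuming a router that holds a single pointer to the live environment (the `BlueGreenRouter` class and the `smoke_test` callable are illustrative names, not part of any specific tool):

```python
class BlueGreenRouter:
    """Tracks which of two environments receives all traffic."""

    def __init__(self):
        self.live = "blue"
        self.versions = {"blue": None, "green": None}

    @property
    def idle(self):
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version):
        """Install the new version on the environment that serves no traffic."""
        self.versions[self.idle] = version
        return self.idle

    def flip(self, smoke_test):
        """Switch all traffic to the idle environment if it passes its test."""
        target = self.idle
        if not smoke_test(self.versions[target]):
            return False    # leave traffic where it is
        self.live = target
        return True

    def rollback(self):
        """Point traffic back at the previous environment, which still runs the old version."""
        self.live = self.idle
```

Rollback is cheap here because the old environment is left running and untouched after the flip; that is also why the approach costs a second environment.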
Canary
Release to a small slice of users first. If metrics stay healthy, ramp up; if not, pull it back before it spreads.
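One way to picture a canary is as two small functions: one that decides which users see the new version, and one that decides whether to ramp up or pull back. The sketch below assumes hash-based bucketing by user ID, a fixed ramp schedule, and a single error-rate metric; `RAMP_STEPS` and the 1% threshold are illustrative values, not recommendations.

```python
import hashlib

# Hypothetical ramp schedule, in percent of users on the new version.
RAMP_STEPS = [1, 5, 25, 50, 100]


def bucket(user_id):
    """Map a user ID to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def routes_to_canary(user_id, canary_percent):
    """A user stays in the canary group as the percentage grows."""
    return bucket(user_id) < canary_percent


def next_percent(current, error_rate, max_error_rate=0.01):
    """Ramp up while metrics stay healthy; drop to 0 if they do not."""
    if error_rate > max_error_rate:
        return 0            # pull the canary back before it spreads
    for step in RAMP_STEPS:
        if step > current:
            return step
    return 100              # already fully rolled out
```

Hashing the user ID rather than picking randomly per request keeps each user on one version, which makes errors easier to attribute to the canary.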
Rolling
Replace instances a few at a time. No second environment needed, but old and new run together mid-rollout.
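A rolling update is essentially a loop over batches with a health check between them. The sketch below makes several simplifying assumptions: instances are plain dictionaries, `is_healthy` is a hypothetical callable, and a failed batch is reverted while earlier batches are left on the new version for an operator to deal with.

```python
def rolling_update(instances, new_version, batch_size, is_healthy):
    """Upgrade `instances` in batches, stopping at the first unhealthy batch.

    Each instance is a dict with "name" and "version" keys.
    Returns True if every batch was upgraded and passed its health check.
    """
    for start in range(0, len(instances), batch_size):
        batch = instances[start:start + batch_size]
        previous = [inst["version"] for inst in batch]

        for inst in batch:
            inst["version"] = new_version

        if not all(is_healthy(inst) for inst in batch):
            # Revert only the failing batch and halt the rollout.
            for inst, old_version in zip(batch, previous):
                inst["version"] = old_version
            return False
    return True
```

A halted rollout leaves the fleet on mixed versions, which is the same state it passes through during every successful rollout, so both versions have to be able to serve traffic side by side.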
Which one should you use?
| Strategy | Best for | Cost |
|---|---|---|
| Blue-green | Instant rollback, simple mental model | 2× infrastructure during deploy |
| Canary | Risky changes, catching bugs early | Needs traffic-splitting + metrics |
| Rolling | Default for most apps, no spare env | Old + new run together briefly |
The part everyone forgets: the database
During any of these, two versions of your app run at once — so the database must work for both. Use expand-then-contract migrations: add the new column/table first (old code ignores it), deploy the new code, then remove the old column in a later release. A migration that the running version can't handle will cause errors no deployment strategy can hide.
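Here is a small worked example of expand-then-contract using Python's built-in `sqlite3` module. The `users` table and the rename from `fullname` to `display_name` are made up for illustration, and the backfill step is an assumption about how existing rows get their new value. The contract step rebuilds the table because older SQLite versions cannot drop a column directly; many databases offer a simpler statement for that.

```python
import sqlite3


def columns(conn):
    """Return the column names of the users table."""
    return [row[1] for row in conn.execute("PRAGMA table_info(users)")]


def expand(conn):
    # Release 1: add the new column. Old code never mentions it, so it keeps working.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def backfill(conn):
    # Copy existing values so new code can rely on the new column.
    conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")


def contract(conn):
    # Later release, once no running code reads fullname: remove the old column.
    conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, display_name TEXT)")
    conn.execute("INSERT INTO users_new (id, display_name) SELECT id, display_name FROM users")
    conn.execute("DROP TABLE users")
    conn.execute("ALTER TABLE users_new RENAME TO users")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

expand(conn)
# Old code is still running and writes only the old column.
conn.execute("INSERT INTO users (fullname) VALUES ('Alan Turing')")
backfill(conn)
# New code writes both columns while the two versions overlap.
conn.execute(
    "INSERT INTO users (fullname, display_name) VALUES (?, ?)",
    ("Grace Hopper", "Grace Hopper"),
)
names_during_overlap = [
    row[0] for row in conn.execute("SELECT display_name FROM users ORDER BY id")
]

contract(conn)
```

At every point before `contract` runs, both the old query shape (`fullname`) and the new one (`display_name`) succeed, which is what lets either deployment strategy proceed safely.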
Errors you might hit
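Most zero-downtime failures show up during the cutover and trace back to a missing piece of the foundation: a cutover that isn't gated on a health check, an instance that is stopped before it has drained, or a schema or API change the old version can't handle.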
Health-gated, approval-gated deploys
Infraveil bakes the zero-downtime foundation in. On your own servers, its agent only marks a new version live once it passes a health check, drains the old instance before stopping it, and records the whole rollout behind your approval — with one-click rollback. You get safe cutovers without assembling the health-check, draining, and routing machinery yourself.
Frequently asked questions
What is zero-downtime deployment?
Releasing a new version without any interruption to users. It's achieved by running the new version alongside the old, shifting traffic only when the new one is healthy, and letting the old one finish its in-flight requests before stopping.
What's the difference between blue-green and canary?
Blue-green swaps all traffic from the old environment to the new one at once (with instant rollback). Canary sends a small percentage of traffic to the new version first, then ramps up if it stays healthy.
Which deployment strategy is best?
Rolling is a sensible default for most apps. Use blue-green when you want instant rollback and can afford a second environment; use canary for risky changes where you want to limit the blast radius.
Why do I still get errors during a zero-downtime deploy?
Usually the database or API isn't backward-compatible, so the old and new versions can't coexist — or the cutover isn't gated on a health check. Use expand-then-contract migrations and a readiness gate.