Zero-downtime deployment: blue-green, canary & rolling explained

The short version: Zero-downtime deployment means shipping a new version without users noticing. Three strategies achieve it: blue-green (two environments, swap), canary (send a small percentage of traffic first), and rolling (replace instances a few at a time). All three depend on the same foundation: a health-gated cutover, graceful draining, and backward-compatible changes, so in-flight requests are not dropped while old and new versions overlap.

What zero-downtime actually requires

The strategy names get the attention, but they all rest on three things. Without these, any of the strategies can still drop requests or return errors:

Health-gated cutover: traffic only moves to an instance once it passes a readiness check.
Graceful draining: on SIGTERM, the old instance stops accepting new requests and finishes its in-flight requests before it exits.
Backward-compatible changes: old and new versions can run at the same time, including against the same database schema.
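
To make the first requirement concrete, here is a minimal sketch of a health-gated cutover. The Router and Instance classes and the ready_check callable are hypothetical names invented for illustration; a real load balancer or orchestrator would run the probe over HTTP or TCP with timeouts and delays between attempts.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Instance:
    name: str
    ready_check: Callable[[], bool]  # hypothetical readiness probe


@dataclass
class Router:
    """Toy router: only instances that passed readiness receive traffic."""

    live: List[Instance] = field(default_factory=list)

    def admit(self, instance: Instance, attempts: int = 3) -> bool:
        # Probe up to `attempts` times; join the rotation only on success.
        for _ in range(attempts):
            if instance.ready_check():
                self.live.append(instance)
                return True
        # Never ready: the instance stays out of rotation.
        return False
```

An instance whose probe never succeeds is simply never added to live, so it cannot receive traffic before it is ready.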

The three strategies

Blue-green

Two identical environments. Deploy to the idle one, test it, then flip all traffic. Instant rollback: flip back.
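
As a rough illustration of the flip, the sketch below models the two environments as slots and the router as a single pointer. The BlueGreen class and the smoke_test callable are assumptions made for this example, not part of any particular tool.

```python
from typing import Callable, Dict, Optional


class BlueGreen:
    """Toy model: two environment slots and one pointer to the active one."""

    def __init__(self, initial_version: str) -> None:
        self.versions: Dict[str, Optional[str]] = {
            "blue": initial_version,
            "green": None,
        }
        self.active = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def deploy(self, version: str, smoke_test: Callable[[str], bool]) -> bool:
        target = self.idle
        self.versions[target] = version  # deploy to the idle environment
        if not smoke_test(version):      # test before any traffic moves
            return False
        self.active = target             # flip all traffic at once
        return True

    def rollback(self) -> None:
        # The previous environment is still running, so rollback is a flip.
        self.active = self.idle

    def serving(self) -> Optional[str]:
        return self.versions[self.active]
```

Rollback is cheap here only because the previous environment is left running untouched after the flip.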

Canary

Release to a small slice of users first. If metrics stay healthy, ramp up; if not, pull it back before it spreads.
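
One way to picture the traffic split is sketched below, assuming users are assigned to stable buckets by hashing their ID. The function names, the doubling ramp, and the 1% error threshold are illustrative choices, not recommendations.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(user_id: str, canary_percent: int) -> str:
    # Users in the lowest buckets see the canary, so raising the
    # percentage only adds users; nobody flips back and forth.
    return "canary" if bucket(user_id) < canary_percent else "stable"


def next_percent(current: int, error_rate: float,
                 max_error_rate: float = 0.01) -> int:
    """Double the canary share while healthy; pull it to 0 otherwise."""
    if error_rate > max_error_rate:
        return 0
    return min(100, max(1, current * 2))
```

Hashing the user ID instead of picking randomly per request keeps each user on one version for the duration of the ramp.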

Rolling

Replace instances a few at a time. No second environment needed, but old and new run together mid-rollout.
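
The batch-by-batch replacement can be sketched as follows. The fleet is modeled as a list of version strings, and the healthy callable (taking an instance index and a version) is a hypothetical stand-in for a real readiness check.

```python
from typing import Callable, List


def rolling_update(
    fleet: List[str],
    new_version: str,
    batch_size: int,
    healthy: Callable[[int, str], bool],
) -> List[str]:
    """Replace versions batch by batch; halt at the first unhealthy batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    fleet = list(fleet)  # work on a copy, leave the caller's list alone
    for start in range(0, len(fleet), batch_size):
        batch = list(range(start, min(start + batch_size, len(fleet))))
        previous = [fleet[i] for i in batch]
        for i in batch:
            fleet[i] = new_version
        if not all(healthy(i, fleet[i]) for i in batch):
            # Revert only this batch and stop. Earlier batches stay on the
            # new version, which is why old and new must be able to coexist.
            for i, old in zip(batch, previous):
                fleet[i] = old
            break
    return fleet
```

A halted rollout leaves a mixed fleet, for example two instances on the new version and two on the old, which is the state the backward-compatibility requirement exists for.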

Which one should you use?

Strategy	Best for	Cost
Blue-green	Instant rollback, simple mental model	2× infrastructure during deploy
Canary	Risky changes, catching bugs early	Needs traffic-splitting + metrics
Rolling	Default for most apps, no spare env	Old + new run together briefly

The part everyone forgets: the database

During any of these, two versions of your app run at once, so the database schema must work for both. Use expand-then-contract migrations: add the new column or table first (old code ignores it), deploy the new code that uses it, backfill existing rows if needed, then remove the old column in a later release, once no running version still reads it. A migration that the running version can't handle will cause errors no deployment strategy can hide.
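
Here is a small sketch of the expand-then-contract sequence using an in-memory SQLite database. The users table and the name and display_name columns are invented for the example. The contract step rebuilds the table so the sketch runs on older SQLite versions; on many databases a plain DROP COLUMN would do the same job.

```python
import sqlite3
from typing import List


def create_v1(conn: sqlite3.Connection) -> None:
    # Schema as the old release knows it.
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")


def expand(conn: sqlite3.Connection) -> None:
    # Additive and nullable: old code that only knows `name` keeps working.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def backfill(conn: sqlite3.Connection) -> None:
    # Copy existing values so new code can rely on the new column.
    conn.execute(
        "UPDATE users SET display_name = name WHERE display_name IS NULL"
    )


def contract(conn: sqlite3.Connection) -> None:
    # Run in a LATER release, once no running version reads `name`.
    conn.executescript(
        """
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, display_name TEXT);
        INSERT INTO users_new (id, display_name)
            SELECT id, display_name FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
        """
    )


def columns(conn: sqlite3.Connection) -> List[str]:
    # Column names in declaration order.
    return [row[1] for row in conn.execute("PRAGMA table_info(users)")]
```

Between expand and contract, both the old query (SELECT name) and the new one (SELECT display_name) succeed, which is the window in which old and new app versions can safely overlap.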

Errors you might hit

Most zero-downtime failures show up as one of these during the cutover:

502 after deploy

Traffic hit an instance that wasn't ready.

504 Gateway Timeout

The new version is reachable but too slow.

SIGTERM / draining

The old instance dropped requests on stop.

ECONNRESET

Connections reset when the upstream swapped.
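
The draining failure in particular tends to come down to how the process reacts to SIGTERM. The sketch below shows one possible shape for it: stop accepting new work, then wait for in-flight requests to finish. The Drainer class and its method names are hypothetical, and a real server would wire start_request and finish_request into its request loop.

```python
import signal
import threading


class Drainer:
    """Tracks in-flight requests so shutdown can wait for them."""

    def __init__(self) -> None:
        self.accepting = True
        self._in_flight = 0
        self._lock = threading.Lock()
        self._idle = threading.Event()
        self._idle.set()  # nothing in flight at start

    def start_request(self) -> bool:
        with self._lock:
            if not self.accepting:
                return False  # draining: refuse new work
            self._in_flight += 1
            self._idle.clear()
            return True

    def finish_request(self) -> None:
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    def begin_drain(self, signum=None, frame=None) -> None:
        # Signal-handler friendly: only flips a flag, takes no lock.
        self.accepting = False

    def wait_until_drained(self, timeout: float) -> bool:
        # True if all in-flight requests finished within the timeout.
        return self._idle.wait(timeout)


def install_sigterm_handler(drainer: Drainer) -> None:
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, drainer.begin_drain)
```

After the handler fires, the process would call wait_until_drained with a timeout shorter than the platform's kill grace period and exit once it returns.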

How Infraveil handles this

Health-gated, approval-gated deploys

Infraveil bakes the zero-downtime foundation in. On your own servers, its agent marks a new version live only after it passes a health check, drains the old instance before stopping it, and gates every rollout on your approval while recording it, with one-click rollback. You get safe cutovers without assembling the health-check, draining, and routing machinery yourself.

Traffic shifts only to instances that pass a health check

Old instances drain before they're stopped — no dropped requests

Every rollout approval-gated and recorded, with one-click rollback

Frequently asked questions

What is zero-downtime deployment?

Releasing a new version without any interruption to users. It's achieved by running the new version alongside the old, shifting traffic only when the new one is healthy, and letting the old one finish its in-flight requests before stopping.

What's the difference between blue-green and canary?

Blue-green swaps all traffic from the old environment to the new one at once (with instant rollback). Canary sends a small percentage of traffic to the new version first, then ramps up if it stays healthy.

Which deployment strategy is best?

Rolling is a sensible default for most apps. Use blue-green when you want instant rollback and can afford a second environment; use canary for risky changes where you want to limit the blast radius.

Why do I still get errors during a zero-downtime deploy?

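Usually the database or API isn't backward-compatible, so the old and new versions can't coexist. The other common causes are a cutover that isn't gated on a health check and an old instance that is stopped before it has drained. Use expand-then-contract migrations, a readiness gate, and graceful shutdown on SIGTERM.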