
What to do when production is down

Q: What is the first thing to do in an outage?

Confirm the scope: is it actually down, and for everyone, or just for you or one region? Check from outside (an uptime probe, a different network) so you are not chasing a local problem. Then stop the bleeding - if a deploy or change went out recently, rolling it back is usually the fastest way to restore service, even before you know the root cause.

Q: Should I roll back or fix forward?

Roll back by default when a recent change caused it - it is the fastest path back to a known-good state, and you can diagnose calmly afterward. Fix forward only when a rollback is impossible (a migration already changed data) or clearly riskier than a small, well-understood fix. The goal in the moment is restored service, not a perfect fix.

Q: How should I communicate during an incident?

Early and honestly. Post that you are aware and investigating before you have answers, give a rough scope and a next-update time, and update on that cadence even if there is no change. Internally, name one person to coordinate so responders are not duplicating work. Silence during an outage erodes trust faster than the outage itself.

Q: Do I really need a postmortem?

Yes, for anything user-facing - and make it blameless. The value is not in assigning blame; it is in the timeline, the root cause, and the concrete follow-ups that stop a repeat. Capture what happened, why, how it was detected and resolved, and the specific changes (alerting, a guardrail, a rollback button) that would have prevented or shortened it. A postmortem you never write is an incident you get to have again.

The short version: work the sequence, do not panic-poke. Confirm it is really down (and for everyone), stop the bleeding - the fastest fix is usually rolling back the last change - then communicate, capture evidence, diagnose, fix, and verify. Restore service first and find root cause second. The single best preparation is a runbook and a rollback path you set up before the incident, not during it.

The first five minutes

Confirm the scope

Is it down for everyone, or just you or one region? Check from outside before you chase a local problem.
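A minimal sketch of the outside check, assuming curl is installed and that you run it from somewhere outside your own network (a laptop on mobile data, a small VM in another region). The function names and the example.com URL are placeholders for illustration, not part of any tool.

```shell
# Map an HTTP status code to a rough verdict.
# curl reports "000" when it got no HTTP response at all (DNS failure, timeout, refused).
classify_status() {
  case "$1" in
    2??|3??) echo "responding" ;;
    5??)     echo "server-error" ;;
    000)     echo "unreachable" ;;
    *)       echo "unexpected:$1" ;;
  esac
}

# Fetch a URL and print the verdict.
# -s silent, -o /dev/null discards the body, -w prints only the status code,
# --max-time caps the wait so a hung server does not hang the check.
probe() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1")
  echo "$1 -> $(classify_status "$code")"
}

# Usage (hypothetical URL):
#   probe https://example.com/healthz
```

If the probe reports "responding" from outside while your own browser fails, the problem is more likely local (DNS, VPN, office network) than a real outage.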

Stop the bleeding

If a change just went out, roll it back - the fastest restore is usually undoing the last deploy.

Communicate early

Post that you are aware and investigating, with a rough scope and a next-update time - before you have answers.

Preserve evidence

Capture the logs and metrics before you restart - a restart that fixes it can erase why it broke.
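One way to make evidence capture a single command is sketched here, assuming a Linux host with systemd. The function name, the default directory under /tmp, and the one-hour window are illustrative choices; each capture is allowed to fail (missing tool, no permission) without stopping the others.

```shell
# Save a point-in-time snapshot of logs and resource state BEFORE restarting anything.
# Usage: snapshot_evidence <systemd-unit> [output-dir]
snapshot_evidence() {
  unit="$1"
  dir="${2:-/tmp/incident-$(date +%Y%m%dT%H%M%S)}"
  mkdir -p "$dir" || return 1

  # "|| true" keeps going if a tool is missing or needs root; the error text lands in the file.
  journalctl -u "$unit" --since "1 hour ago" --no-pager > "$dir/journal.log" 2>&1 || true
  dmesg   > "$dir/dmesg.log"     2>&1 || true   # kernel messages (OOM kills, disk errors)
  df -h   > "$dir/disk.txt"      2>&1 || true   # disk usage
  free -m > "$dir/memory.txt"    2>&1 || true   # memory usage
  ps aux  > "$dir/processes.txt" 2>&1 || true   # what was running, and how hungry it was

  echo "$dir"   # print where the snapshot was written
}

# Usage:
#   snapshot_evidence myapp
```

Running this first costs a few seconds and leaves you with files to read after the service is back, instead of a clean process with no history.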

The sequence

Mitigate - restore service

Roll back the last change, fail over, or shed load. Get back to a known-good state; do not wait until you fully understand the cause.
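What a rollback looks like depends entirely on how you deploy. As one hedged illustration, the sketch below assumes a hypothetical layout where each release lives in its own directory under releases/, named so that it sorts chronologically (for example by timestamp), and a current symlink points at the live one. If you deploy container images or packages instead, the equivalent is redeploying the previous tag or version.

```shell
# Point <base>/current at the release just before the live one.
# Assumed layout (hypothetical): <base>/releases/<sortable-name>/ and <base>/current -> live release
rollback_release() {
  base="$1"
  current=$(basename "$(readlink "$base/current")")

  # Walk the sorted release names and remember the one seen just before the live release.
  previous=$(ls -1 "$base/releases" | sort |
    awk -v cur="$current" '$0 == cur { print prev; exit } { prev = $0 }')

  if [ -z "$previous" ]; then
    echo "no release older than '$current' to roll back to" >&2
    return 1
  fi

  # -n replaces the symlink itself rather than creating a link inside the old target.
  ln -sfn "$base/releases/$previous" "$base/current"
  echo "rolled back: $current -> $previous"
}

# Usage (hypothetical path; restart the service afterwards so it picks up the old code):
#   rollback_release /srv/myapp && systemctl restart myapp
```

The point of scripting it ahead of time is that the rollback becomes one rehearsed command rather than something you assemble under pressure.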

Diagnose - with the evidence you kept

# correlate the outage time with what changed and what the logs say
journalctl -u myapp --since "15 min ago" --no-pager | tail -n 100
# OOM? look for kernel OOM-killer messages (may need root)
dmesg | grep -i "killed process" | tail
# disk full?
df -h
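To line the logs up against the time of a deploy, it can help to count error lines per minute. The sketch below assumes each log line starts with an ISO-style timestamp (which journalctl produces with -o short-iso) and that errors contain the words error, fatal, or panic; both the function name and the keyword list are assumptions to adapt to your own logs.

```shell
# Read log lines on stdin, print "<minute> <count>" for lines that look like errors.
errors_per_minute() {
  grep -i -E 'error|fatal|panic' |   # keep only error-looking lines (case-insensitive)
    cut -c1-16 |                     # "2024-01-01T12:00:05..." -> "2024-01-01T12:00"
    sort | uniq -c |                 # count lines per minute
    awk '{ print $2, $1 }'           # normalise to "minute count"
}

# Usage:
#   journalctl -u myapp --since "1 hour ago" -o short-iso --no-pager | errors_per_minute
```

A minute where the count jumps from near zero is a good candidate for "when it started", which you can then compare with your deploy or change history.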

Fix, verify, then write it up

Apply the real fix, confirm the service is healthy from outside, update your status, then write a blameless postmortem while it is fresh.
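A single passing check right after a fix can be misleading, so one option is to require several passes in a row before calling the incident resolved. This is a sketch under that assumption; the function name, the thresholds, and the example.com URL are placeholders.

```shell
# Succeed once the given command has passed <needed> times in a row;
# give up after <max> attempts. Waits <pause> seconds between attempts.
# Usage: verify_stable <needed> <max> <pause> <command> [args...]
verify_stable() {
  needed="$1"; max="$2"; pause="$3"
  shift 3
  streak=0
  attempt=0
  while [ "$attempt" -lt "$max" ]; do
    attempt=$((attempt + 1))
    if "$@" >/dev/null 2>&1; then
      streak=$((streak + 1))
      if [ "$streak" -ge "$needed" ]; then
        return 0
      fi
    else
      streak=0   # any failure resets the run of successes
    fi
    sleep "$pause"
  done
  return 1
}

# Usage (hypothetical URL; curl -f exits non-zero on HTTP 400 and above):
#   verify_stable 3 10 5 curl -fsS --max-time 10 https://example.com/healthz \
#     && echo "healthy from outside" || echo "still flapping"
```

As with the initial scope check, run it from outside your own network so it measures what users see.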

The discipline

Mitigate before you diagnose

The instinct under pressure is to find the root cause first. Resist it. Every minute spent diagnosing while the site is down is a minute of outage - and the most common cause is the last change, so rolling back often restores service before you even know why. Stabilize, communicate, then investigate calmly. The other half is preparation: a written runbook, a one-click rollback, and alerts that fire on symptoms mean the incident is a procedure, not an improvisation.

Put numbers on the stakes with the downtime cost calculator, and capture the writeup with the incident postmortem template.

How Infraveil handles this

See it, roll back, and have the record

An incident is faster when you can see status in one place, undo the last change safely, and have a record of what happened. On your own servers, Infraveil surfaces health and failures in one view, ties recovery to approval-gated, recorded actions, and keeps an audit trail - so “what changed right before this” and “who rolled it back” are answerable, and the postmortem writes half of itself.

Status and failures surfaced in one view, on infra you own

Recovery and rollback approval-gated and recorded

An audit trail that makes the postmortem timeline factual, not reconstructed


Frequently asked questions

What is the first thing to do in an outage?

Confirm the scope - is it down for everyone or just you? Check from outside. Then stop the bleeding: if a change went out recently, rolling it back is the fastest restore, before you know the root cause.

Should I roll back or fix forward?

Roll back by default when a recent change caused it - fastest path to known-good. Fix forward only when rollback is impossible (data already changed) or clearly riskier than a small, understood fix. Restored service is the goal in the moment.

How should I communicate during an incident?

Early and honestly: post that you are aware and investigating, give rough scope and a next-update time, and keep that cadence. Name one coordinator internally. Silence erodes trust faster than the outage.

Do I really need a postmortem?

Yes, and a blameless one. Capture the timeline, root cause, detection and resolution, and concrete follow-ups that prevent a repeat. A postmortem you never write is an incident you get to have again.

