What to do when production is down
The short version: work the sequence, do not panic-poke. Confirm it is really down (and for everyone), stop the bleeding - the fastest fix is usually rolling back the last change - then communicate, capture evidence, diagnose, fix, and verify. Restore service first and find root cause second. The single best preparation is a runbook and a rollback path you set up before the incident, not during it.
The first five minutes
Confirm the scope
Is it down for everyone, or just you or one region? Check from outside before you chase a local problem.
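A minimal sketch of an outside-in check, assuming the service exposes an HTTP health endpoint. The URL below is a placeholder and the `classify_status` and `probe` helper names are made up for illustration. Run it from a machine that is not on the same network as the service, and ideally from a second region as well.

```shell
# Placeholder URL (an assumption); replace with your own public health endpoint.
URL="${URL:-https://example.com/healthz}"

# Map an HTTP status code to a coarse verdict.
classify_status() {
  case "$1" in
    2??|3??) echo "up" ;;
    000)     echo "unreachable" ;;   # curl prints 000 when it got no HTTP response
    *)       echo "failing" ;;
  esac
}

# Probe one URL and print the status code with its verdict.
probe() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1")
  echo "$1 -> $code ($(classify_status "$code"))"
}

# Usage: probe "$URL"
```

If the probe says "up" from outside while your own browser fails, the problem is more likely local (DNS, VPN, one region) than a full outage.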
Stop the bleeding
If a change just went out, roll it back - the fastest restore is usually undoing the last deploy.
Communicate early
Post that you are aware and investigating, with a rough scope and a next-update time - before you have answers.
Preserve evidence
Capture the logs and metrics before you restart - a restart that fixes it can erase why it broke.
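One way to make this a single step is a small snapshot function you run before any restart. This is a sketch under assumptions: the `capture_evidence` name and the `/tmp/incident-...` output path are hypothetical, and `myapp` is the unit name used in the commands later in this article. Each command is allowed to fail so that a missing tool or permission does not stop the rest of the capture.

```shell
# Snapshot volatile state into a directory before restarting anything.
# Arguments: systemd unit name, optional output directory (default path is an assumption).
capture_evidence() {
  unit="$1"
  dir="${2:-/tmp/incident-$(date -u +%Y%m%dT%H%M%SZ)}"
  mkdir -p "$dir" || return 1
  # "|| true" keeps the capture going if a tool is missing or needs more privileges.
  journalctl -u "$unit" --since "1 hour ago" --no-pager > "$dir/journal.txt" 2>&1 || true
  dmesg   > "$dir/dmesg.txt" 2>&1 || true
  df -h   > "$dir/df.txt"    2>&1 || true
  ps aux  > "$dir/ps.txt"    2>&1 || true
  echo "$dir"   # print where the snapshot was written
}

# Usage: capture_evidence myapp
```

The snapshot is deliberately crude. Its job is to preserve what a restart would erase, not to diagnose anything yet.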
The sequence
Mitigate - restore service
Roll back the last change, fail over, or shed load. Get back to a known-good state; do not wait until you fully understand the cause.
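What a rollback looks like depends entirely on how you deploy. As one illustration only, the sketch below assumes a hypothetical layout in which each release lives in `releases/<timestamp>/` and a `current` symlink points at the live one. The function name and the layout are assumptions, not a description of any particular tool.

```shell
# Hypothetical layout: $root/releases/<timestamp>/ with $root/current symlinked to the live release.
rollback_to_previous() {
  root="$1"
  live=$(basename "$(readlink "$root/current")")
  # Release names are assumed to sort chronologically (for example, timestamps).
  # awk prints the entry that sorts immediately before the live release.
  prev=$(ls -1 "$root/releases" | sort | awk -v live="$live" '$0 == live { print last } { last = $0 }')
  [ -n "$prev" ] || { echo "no earlier release than $live" >&2; return 1; }
  # -n replaces the symlink itself instead of creating a link inside the target directory.
  ln -sfn "$root/releases/$prev" "$root/current"
  echo "rolled back: $live -> $prev"
}

# Usage: rollback_to_previous /srv/myapp   (then restart or reload the service)
```

Whatever your mechanism is, the property that matters is the same: the previous known-good version stays on disk, and switching back to it is one command you have already tested.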
Diagnose - with the evidence you kept
# correlate the outage time with what changed and what the logs say
journalctl -u myapp --since "15 min ago" | tail -100
dmesg | grep -i "killed process" | tail   # OOM?
df -h                                     # disk?
Fix, verify, then write it up
Apply the real fix, confirm the service is healthy from outside, update your status, then write a blameless postmortem while it is fresh.
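A single green check right after a fix can be misleading, so one option is to require several consecutive successes before declaring recovery. The sketch below takes any probe command as a string. The `verify_stable` name, the `PROBE_INTERVAL` variable, the 30-attempt cap, and the example URL are all assumptions chosen for illustration.

```shell
# Succeed only after $2 consecutive successful runs of the probe command in $1.
# Gives up (returns 1) after 30 attempts; PROBE_INTERVAL is seconds between attempts.
verify_stable() {
  probe_cmd="$1"; need="$2"; streak=0; attempts=0
  while [ "$attempts" -lt 30 ]; do
    if sh -c "$probe_cmd" >/dev/null 2>&1; then
      streak=$((streak + 1))
    else
      streak=0            # any failure resets the streak
    fi
    [ "$streak" -ge "$need" ] && return 0
    attempts=$((attempts + 1))
    sleep "${PROBE_INTERVAL:-5}"
  done
  return 1
}

# Usage (placeholder URL):
#   verify_stable 'curl -fsS --max-time 10 https://example.com/healthz' 3 && echo "stable"
```

Run it from outside your network, for the same reason you confirmed the outage from outside.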
Mitigate before you diagnose
The instinct under pressure is to find the root cause first. Resist it. Every minute spent diagnosing while the site is down is a minute of outage - and the most common cause is the last change, so rolling back often restores service before you even know why. Stabilize, communicate, then investigate calmly. The other half is preparation: a written runbook, a one-click rollback, and alerts that fire on symptoms mean the incident is a procedure, not an improvisation.
Put numbers on the stakes with the downtime cost calculator, and capture the writeup with the incident postmortem template.
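For a back-of-the-envelope figure, one simple estimate is lost revenue plus the cost of staff who are blocked or firefighting. This is an assumed formula with illustrative inputs, not necessarily how the calculator above computes its result.

```shell
# Rough direct cost of an outage: (revenue/hour + staff * staff cost/hour) * hours down.
# All inputs are illustrative assumptions, in whole currency units; shell arithmetic is integer-only.
downtime_cost() {
  minutes="$1"; revenue_per_hour="$2"; staff="$3"; staff_cost_per_hour="$4"
  echo $(( minutes * (revenue_per_hour + staff * staff_cost_per_hour) / 60 ))
}

# 45 minutes down, 2000/hour revenue, 6 people affected at 80/hour each
downtime_cost 45 2000 6 80   # prints 1860
```

The estimate leaves out harder-to-measure costs such as churn and lost trust, so treat it as a floor rather than a total.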
See it, roll back, and have the record
An incident is faster when you can see status in one place, undo the last change safely, and have a record of what happened. On your own servers, Infraveil surfaces health and failures in one view, ties recovery to approval-gated, recorded actions, and keeps an audit trail - so “what changed right before this” and “who rolled it back” are answerable, and the postmortem writes half of itself.
Frequently asked questions
What is the first thing to do in an outage?
Confirm the scope - is it down for everyone or just you? Check from outside. Then stop the bleeding: if a change went out recently, rolling it back is usually the fastest way to restore service, even before you know the root cause.
Should I roll back or fix forward?
Roll back by default when a recent change caused it - that is the fastest path to a known-good state. Fix forward only when rollback is impossible (for example, data has already changed) or clearly riskier than a small, understood fix. Restored service is the goal in the moment.
How should I communicate during an incident?
Early and honestly: post that you are aware and investigating, give rough scope and a next-update time, and keep that cadence. Name one coordinator internally. Silence erodes trust faster than the outage.
Do I really need a postmortem?
Yes, and a blameless one. Capture the timeline, root cause, detection and resolution, and concrete follow-ups that prevent a repeat. A postmortem you never write is an incident you get to have again.