What to do when production is down

The short version: work the sequence, do not panic-poke. Confirm it is really down (and for everyone), stop the bleeding - the fastest fix is usually rolling back the last change - then communicate, capture evidence, diagnose, fix, and verify. Restore service first and find root cause second. The single best preparation is a runbook and a rollback path you set up before the incident, not during it.

The first five minutes

Confirm the scope

Is it down for everyone, or just you or one region? Check from outside before you chase a local problem.
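One way to check from outside is to probe the public endpoint from a machine that is not on your own network and classify what comes back. The sketch below is a minimal illustration, not a monitoring tool: the function names `classify_status` and `check_url`, the 5-second timeout, and the verdict labels are all assumptions made for the example.

```shell
#!/bin/sh
# Sketch: probe a public endpoint and turn the HTTP status into a rough verdict.
# Function names, verdict labels, and the 5-second timeout are illustrative assumptions.

# Map an HTTP status code to a rough verdict.
classify_status() {
  case "$1" in
    2??|3??) echo "up" ;;
    5??)     echo "server-error" ;;
    000)     echo "unreachable" ;;   # curl reports 000 when no HTTP response arrived
    *)       echo "unexpected" ;;
  esac
}

# Print "<url> <code> <verdict>" for one URL.
check_url() {
  code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$1")
  echo "$1 $code $(classify_status "$code")"
}
```

Run `check_url` against your own health endpoint (for example a hypothetical `https://example.com/healthz`) from a laptop on mobile data or a host in another region. If the outside probe reports `up` while your own browser fails, the problem is more likely local than a full outage.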

Stop the bleeding

If a change just went out, roll it back - the fastest restore is usually undoing the last deploy.
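What a rollback looks like depends entirely on how you deploy. As one hedged illustration, assume a layout where each deploy lands in its own directory under `releases/`, directory names sort chronologically, and a `current` symlink points at the active one. That layout and the function name `rollback_release` are assumptions for the sketch, not something the steps above prescribe.

```shell
#!/bin/sh
# Sketch: point the "current" symlink at the release just before the active one.
# Assumed layout: <app-root>/releases/<name> and <app-root>/current -> releases/<name>.
# Usage: rollback_release <app-root>; prints the release it switched to.
rollback_release() {
  root=$1
  active=$(basename "$(readlink "$root/current")")
  # Release names are assumed to sort chronologically (for example, timestamps).
  previous=$(ls -1 "$root/releases" | sort |
    awk -v a="$active" '$0 == a { print prev; exit } { prev = $0 }')
  if [ -z "$previous" ]; then
    echo "no release found before $active" >&2
    return 1
  fi
  ln -sfn "$root/releases/$previous" "$root/current"
  echo "$previous"
}
```

The sketch only switches the symlink. Restarting or reloading the service afterwards is left out because it depends on your setup. The point is that the rollback is a single rehearsed command, written before the incident.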

Communicate early

Post that you are aware and investigating, with a rough scope and a next-update time - before you have answers.

Preserve evidence

Capture the logs and metrics before you restart - a restart that fixes it can erase why it broke.
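A small snapshot script makes this a ten-second step rather than something skipped under pressure. The sketch below assumes a systemd host and a unit called `myapp`. The function name `capture_evidence`, the one-hour window, and the output file names are illustrative choices.

```shell
#!/bin/sh
# Sketch: save logs and system state into a timestamped directory before restarting.
# Usage: capture_evidence <systemd-unit> <base-dir>; prints the directory it wrote.
capture_evidence() {
  unit=$1
  dir="$2/incident-$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "$dir" || return 1
  # Each command may fail (missing tool, missing permission) without losing the rest;
  # any error message ends up in the corresponding file.
  journalctl -u "$unit" --since "1 hour ago" --no-pager > "$dir/journal.log" 2>&1
  dmesg > "$dir/dmesg.log" 2>&1        # may need root on some systems
  df -h > "$dir/df.txt" 2>&1
  free -m > "$dir/memory.txt" 2>&1
  ps aux > "$dir/processes.txt" 2>&1
  echo "$dir"
}
```

For example, `capture_evidence myapp /var/tmp` writes a directory such as `/var/tmp/incident-<UTC timestamp>`, which you can attach to the postmortem later.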

The sequence

1

Mitigate - restore service

Roll back the last change, fail over, or shed load. Get back to a known-good state; do not wait until you fully understand the cause.

2

Diagnose - with the evidence you kept

# correlate the outage time with what changed and what the logs say
journalctl -u myapp --since "15 min ago" | tail -n 100
dmesg | grep -i "killed process" | tail   # OOM kill?
df -h                                     # disk full?

3

Fix, verify, then write it up

Apply the real fix, confirm the service is healthy from outside, update your status, then write a blameless postmortem while it is fresh.
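"Healthy from outside" is easier to trust when it means several consecutive successful probes, not one lucky request. The sketch below is one way to express that. The function name `wait_healthy` and its argument order are assumptions for the example, and the probe command is whatever check suits your service.

```shell
#!/bin/sh
# Sketch: succeed once a probe passes N times in a row; give up after MAX attempts.
# Usage: wait_healthy <n> <max> <delay-seconds> <probe command...>
wait_healthy() {
  needed=$1; max=$2; delay=$3; shift 3
  streak=0; attempt=0
  while [ "$attempt" -lt "$max" ]; do
    attempt=$((attempt + 1))
    if "$@" >/dev/null 2>&1; then
      streak=$((streak + 1))
      [ "$streak" -ge "$needed" ] && return 0
    else
      streak=0   # a single failure resets the count
    fi
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_healthy 3 20 5 curl -fsS -m 5 https://example.com/healthz` (a hypothetical URL) returns success only after three consecutive passing probes, five seconds apart. Run it from outside your network before you post the all-clear.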

The discipline

Mitigate before you diagnose

The instinct under pressure is to find the root cause first. Resist it. Every minute spent diagnosing while the site is down is a minute of outage, and a recent change is the most common cause, so rolling back often restores service before you even know why. Stabilize, communicate, then investigate calmly. The other half is preparation: with a written runbook, a one-click rollback, and alerts that fire on user-visible symptoms (error rate, latency) rather than on internal causes, the incident becomes a procedure, not an improvisation.

Put numbers on the stakes with the downtime cost calculator, and capture the writeup with the incident postmortem template.

How Infraveil handles this

See it, roll back, and have the record

An incident is faster when you can see status in one place, undo the last change safely, and have a record of what happened. On your own servers, Infraveil surfaces health and failures in one view, ties recovery to approval-gated, recorded actions, and keeps an audit trail - so “what changed right before this” and “who rolled it back” are answerable, and the postmortem writes half of itself.

Status and failures surfaced in one view, on infra you own
Recovery and rollback approval-gated and recorded
An audit trail that makes the postmortem timeline factual, not reconstructed

Frequently asked questions

What is the first thing to do in an outage?

Confirm the scope - is it down for everyone or just you? Check from outside. Then stop the bleeding: if a change went out recently, rolling it back is the fastest restore, before you know the root cause.

Should I roll back or fix forward?

Roll back by default when a recent change caused it: that is the fastest path to a known-good state. Fix forward only when rollback is impossible (for example, the change has already altered data) or clearly riskier than a small, well-understood fix. Restored service is the goal in the moment.

How should I communicate during an incident?

Early and honestly: post that you are aware and investigating, give rough scope and a next-update time, and keep that cadence. Name one coordinator internally. Silence erodes trust faster than the outage.

Do I really need a postmortem?

Yes, and a blameless one. Capture the timeline, root cause, detection and resolution, and concrete follow-ups that prevent a repeat. A postmortem you never write is an incident you get to have again.