Monitor a backend you run yourself

The short version: monitoring is three layers - health checks (is it up and ready), metrics (CPU, memory, disk, latency, error rate), and alerting (tell me when it breaks). You do not need a heavyweight SaaS for most apps: a health endpoint, an external uptime watcher, and alerts to Slack or email cover the essentials, on infrastructure you own. Alert on user-facing symptoms, not on every metric.

The three layers

Health checks - a cheap, frequent yes/no: is the service up, and is it ready to serve (dependencies reachable)?
Metrics - numbers over time: CPU, memory, disk, request latency, error rate. These tell you something is wrong and where it is trending.
Alerting - the part that wakes you up. It should fire on user-facing problems, route somewhere you will see it, and not cry wolf.

What to actually watch

Is it up and ready?

An external uptime check plus /healthz and /readyz - down or not-ready is the most important signal.

Resource saturation

Disk filling, sustained high per-core load, repeated OOM kills - the things that take a healthy box down.

Latency (p95 / p99)

Tail latency, not the average - the slowest 1-5% is what users feel and what an SLO is written against.
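To see why the average hides what users feel, here is a minimal sketch using a nearest-rank percentile. The `percentile` helper and the sample latencies are made up for illustration; a real setup would read request durations from your own logs or metrics.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of the data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical window: 100 requests, 94 fast and 6 slow (milliseconds).
latencies_ms = [50] * 94 + [2000] * 6

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)
```

With this sample the mean is 167 ms and the median is 50 ms, while p95 and p99 are both 2000 ms: the slow requests only show up in the tail.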

Error rate

A rising rate of 5xx or failed requests is an early, direct signal of a user-facing problem.
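One way to turn "a rising rate of 5xx" into a check is to compute the 5xx share over a recent window of responses. This is a sketch only: the 5% threshold and the 20-request minimum are assumptions chosen for illustration, not recommended values.

```python
def error_rate(status_codes):
    """Fraction of responses in the window that were 5xx."""
    if not status_codes:
        return 0.0
    failures = sum(1 for code in status_codes if 500 <= code <= 599)
    return failures / len(status_codes)


def is_spiking(status_codes, threshold=0.05, min_requests=20):
    """True when the window has enough traffic and the 5xx share exceeds the threshold."""
    return len(status_codes) >= min_requests and error_rate(status_codes) > threshold
```

The minimum request count keeps a single failed request on a quiet service from reading as a 100% error rate.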

Set it up in three steps

Expose health endpoints

# /healthz = liveness (no dependency check), /readyz = readiness (pings the DB)
# generate the right ones for your stack, then point checks at them.
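As a rough illustration of the split between the two endpoints, here is a standard-library Python sketch. `database_is_reachable` is a hypothetical placeholder for your own dependency check, and port 8080 is an arbitrary choice; adapt both to your stack.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def database_is_reachable():
    """Hypothetical dependency check; replace with a cheap query such as SELECT 1."""
    return True


def health_response(path, ready_check=database_is_reachable):
    """Map a request path to (HTTP status, JSON-serializable body)."""
    if path == "/healthz":
        # Liveness: the process is running and can answer. No dependency calls.
        return 200, {"status": "ok"}
    if path == "/readyz":
        # Readiness: only report ready when dependencies respond.
        try:
            ready = bool(ready_check())
        except Exception:
            ready = False
        if ready:
            return 200, {"status": "ready"}
        return 503, {"status": "not ready"}
    return 404, {"error": "not found"}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = health_response(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Run the health server until interrupted (port is an assumption)."""
    ThreadingHTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the server. Keeping the dependency check out of `/healthz` means a database outage marks the service not-ready without making it look dead.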

Watch from outside, alert on failure

# an external watcher hits /healthz on an interval and alerts after N
# consecutive failures (avoids flapping) - to Slack, Discord, or email.
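The "alert after N consecutive failures" rule can be sketched as a small state machine. Everything here is illustrative: the `FailureTracker` name, the threshold of 3, the 30-second interval, and `notify=print` are assumptions, and a real watcher would post to Slack, Discord, or email instead of printing.

```python
import http.client
import time
import urllib.request


def probe(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


class FailureTracker:
    """Fire once after `threshold` consecutive failures, and once on recovery."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.alerting = False

    def record(self, ok):
        """Record one check result; return "down", "recovered", or None."""
        if ok:
            self.consecutive_failures = 0
            if self.alerting:
                self.alerting = False
                return "recovered"
            return None
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.alerting:
            self.alerting = True
            return "down"
        return None


def watch(url, interval_seconds=30, threshold=3, notify=print):
    """Probe forever; `notify` stands in for a Slack, Discord, or email sender."""
    tracker = FailureTracker(threshold)
    while True:
        event = tracker.record(probe(url))
        if event == "down":
            notify(f"{url} failed {threshold} checks in a row")
        elif event == "recovered":
            notify(f"{url} is healthy again")
        time.sleep(interval_seconds)
```

A single failed probe resets on the next success and never alerts, which is what prevents flapping. Run the watcher from a different machine or network than the service, otherwise it goes down together with the thing it watches.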

Add resource + error alerts

# disk > 85%, sustained per-core load > 1.0, repeated OOM, 5xx rate spike.
df -h; uptime; dmesg | grep -i "killed process" | tail
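The disk and load thresholds above can also be checked from a script. This sketch covers only those two; OOM kills and 5xx rate are left out. The 85% and 1.0 limits come from the comment above, while the function names and the choice of the 15-minute load average as "sustained" are assumptions.

```python
import os
import shutil


def disk_used_percent(path="/"):
    """Percent of the filesystem at `path` in use (close to, not identical to, df's Use%)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def per_core_load(load_average, cores):
    """Normalize a load average by core count; above 1.0 roughly means work is queuing."""
    return load_average / max(cores, 1)


def resource_alerts(disk_pct, load_per_core, disk_limit=85.0, load_limit=1.0):
    """Return a message for each reading that is over its limit."""
    alerts = []
    if disk_pct > disk_limit:
        alerts.append(f"disk at {disk_pct:.0f}% (limit {disk_limit:.0f}%)")
    if load_per_core > load_limit:
        alerts.append(f"per-core load {load_per_core:.2f} (limit {load_limit:.2f})")
    return alerts


def current_alerts(path="/"):
    """Read live values. os.getloadavg() is Unix-only; index 2 is the 15-minute average."""
    load_15m = os.getloadavg()[2]
    cores = os.cpu_count() or 1
    return resource_alerts(disk_used_percent(path), per_core_load(load_15m, cores))
```

Dividing by core count matters because a load of 4.0 is saturation on a 2-core box and idle headroom on a 16-core one.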

The discipline

Alert on symptoms, not noise

The fastest way to make monitoring useless is to alert on everything - you learn to ignore it, and the one real alert gets lost. Alert on what a user would feel: down, slow, erroring, or about to run out of a resource. Make every page actionable, require a couple of consecutive failures before firing, and route lower-severity signals to a dashboard or daily digest rather than a 3am page.
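One possible way to encode that discipline is a small routing rule. The `Signal` fields and the three destinations below are a hypothetical policy for illustration; your own severity levels may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    name: str
    user_facing: bool  # would a user notice: down, slow, erroring?
    actionable: bool   # is there something a person can do about it right now?


def route(signal):
    """Pick a destination: "page", "digest", or "dashboard"."""
    if signal.user_facing and signal.actionable:
        return "page"
    if signal.actionable:
        return "digest"
    return "dashboard"
```

Under this rule only signals that are both user-facing and actionable reach a pager; everything else waits for working hours.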

Turn your uptime target and acceptable latency into concrete numbers with the SLA calculator and latency percentile calculator - they tell you how much downtime and tail latency your SLO actually allows.
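The downtime side of that arithmetic is short enough to sketch by hand. The function name and the 30-day default window are assumptions for illustration.

```python
def allowed_downtime_minutes(slo_percent, days=30):
    """Minutes of downtime an availability target permits over a window of `days`."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100
```

For example, a 99.9% target over 30 days allows about 43.2 minutes of downtime, and 99.99% allows about 4.3 minutes, which is less time than most manual recoveries take.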

How Infraveil handles this

Status, health, and recovery in one view

Monitoring is only useful if it leads to action. On your own servers, Infraveil health-checks your services, surfaces status and failures in one view, and ties them to gated, recorded recovery - so a down or saturated service is not just a chart you happen to be watching, but something the system catches, shows, and helps you fix, on infrastructure you own.

Health checks and status for every service, in one view, on infra you own

Failures surfaced with context, not buried in a dashboard you forgot

Recovery approval-gated and recorded

Frequently asked questions

Do I need Datadog or a big monitoring SaaS?

Not for most apps. A health endpoint, an external uptime check, basic resource and log alerts, and latency percentiles cover the essentials and run on your own servers. Reach for a full platform when you have many services, high-cardinality metrics, or deep tracing needs - not as a default.

What should I actually alert on?

Symptoms users feel: the site is down or slow (failed health checks, high p95/p99 latency), errors are spiking, or a resource is about to run out (disk nearly full, sustained high per-core load, repeated OOM kills). Avoid alerting on raw metrics with no user impact - that trains you to ignore alerts.

How often should health checks run?

Every 15-60 seconds for an external uptime check, with a couple of consecutive failures required before alerting to avoid flapping. Orchestrator liveness/readiness probes run more frequently and are separate from external uptime monitoring.

Logs or metrics - which do I need?

Both. Metrics tell you something is wrong and trend over time; logs tell you why, for a specific request. Start with health checks and a few key metrics for alerting, and keep structured logs you can search when an alert fires.
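As an example of logs you can search by request, here is a minimal sketch using Python's standard `logging` module to emit one JSON object per line. The field names (`request_id`, `path`, `status`, `duration_ms`) are assumptions; use whatever your app already tracks.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed through `extra=` land as attributes on the record.
        for key in ("request_id", "path", "status", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_logger(name="app"):
    """Create a logger that writes JSON lines to stderr."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `build_logger().error("upstream timeout", extra={"request_id": "abc123", "status": 502})` then produces a line you can filter by `request_id` when an alert fires.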

Related guides

Readiness vs liveness checks

Exit code 137 (OOMKilled)

Zero-downtime deployments

Free tools for this

Client-side, no signup - they run in your browser.

Uptime monitor generator - self-hosted watcher + Slack/email alerts
Healthcheck generator - /healthz + /readyz done right
Latency percentile calculator - p50 / p90 / p95 / p99