
Readiness vs liveness health checks, explained

The short version: A readiness check answers "can I serve traffic right now?" Failing it pulls the instance out of the load balancer without killing it. A liveness check answers "am I hung and unrecoverable?" Failing it restarts the process. Confusing the two is a common cause of outages and crash loops: an overly aggressive liveness check kills healthy-but-busy apps, and a missing readiness check sends traffic to instances that aren't ready.

Two checks, two completely different jobs

They sound similar and are constantly mixed up, but they trigger opposite actions:

Readiness — gates traffic

"Can I serve right now?" Fail it and the load balancer stops sending requests, but the process keeps running. Used during startup, draining, or temporary overload.

Liveness — restarts the process

"Am I hung beyond recovery?" Fail it and the orchestrator kills and restarts the container. Used only for deadlocks a restart can fix.

The mistake that causes crash loops

The classic outage: a liveness probe with too short an initial delay. The app is simply slow to boot, the probe fails, and the orchestrator kills the container before it has finished starting. It restarts, fails the probe again, and you get a CrashLoopBackOff on a perfectly healthy app.

The fix is a startup probe (or a generous initial delay) so liveness doesn't even begin until the app has had time to boot:

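startupProbe:                 # gives the app time to start
  httpGet: { path: /healthz, port: 3000 }
  failureThreshold: 30
  periodSeconds: 5            # 30 x 5s = up to 150s before liveness kicks in
livenessProbe:                # restarts the container, only after startup passes
  httpGet: { path: /healthz, port: 3000 }
  periodSeconds: 10           # Kubernetes default, shown explicitly
  failureThreshold: 3         # Kubernetes default: 3 misses before a restart
readinessProbe:               # gates traffic, separately
  httpGet: { path: /ready, port: 3000 }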
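
The 150s figure is failureThreshold multiplied by periodSeconds. A small Python helper makes it easy to check whether a given boot time fits inside that window. The function name and the 90-second boot time are illustrative assumptions, and the result is an approximation, since probe timeouts and scheduling add some slack in practice.

```python
def startup_window_seconds(failure_threshold, period_seconds, initial_delay_seconds=0):
    """Approximate time a container has to pass its startup probe before it is restarted."""
    return initial_delay_seconds + failure_threshold * period_seconds


window = startup_window_seconds(failure_threshold=30, period_seconds=5)
print(window)        # 150, matching the config above
print(window >= 90)  # True: an app that needs about 90s to boot fits
```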

Designing a good health endpoint

Keep /healthz cheap and fast — it should not run heavy queries or call external services, or a slow dependency will make a healthy app look dead. For readiness specifically, it's fine to check that critical dependencies (like the database) are reachable, since you genuinely shouldn't take traffic without them. Never let a third-party outage fail your liveness check, or you'll restart-loop over something a restart can't fix.
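
As a rough illustration of that split, here is a minimal Python sketch using only the standard library. The /healthz and /ready paths match the probe config in this guide; everything else (the `readiness_checks` registry, `tcp_reachable`, `respond`, `serve`) is a hypothetical name chosen for the example, not part of any framework.

```python
import socket
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def tcp_reachable(host, port, timeout=1.0):
    """Cheap dependency check: can a TCP connection be opened within the timeout?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Illustrative registry: name -> zero-argument callable returning True when healthy.
readiness_checks = {}


def liveness():
    # Deliberately checks nothing external: a dependency outage must not trigger restarts.
    return 200, "alive"


def readiness(checks):
    failed = [name for name, check in checks.items() if not check()]
    if failed:
        # 503 takes the instance out of rotation; the process keeps running.
        return 503, "not ready: " + ", ".join(failed)
    return 200, "ready"


def respond(path):
    """Map a probe path to (status code, body)."""
    if path == "/healthz":
        return liveness()
    if path == "/ready":
        return readiness(readiness_checks)
    return 404, "not found"


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, text = respond(self.path)
        body = text.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep frequent probe requests out of the access log


def serve(port=3000):
    """Blocks forever, answering probes on the given port."""
    ThreadingHTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

A service might register something like `readiness_checks["database"] = lambda: tcp_reachable("db.internal", 5432)` (host and port are placeholders) and run `serve()` in a background thread. The short timeout matters: a readiness check that hangs on a slow dependency is itself a slow health endpoint.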

Errors you might hit

Health-check mistakes usually surface as these:

CrashLoopBackOff

Liveness killing an app that's just slow to start.

502 after deploy

Traffic routed before readiness passed.

504 Gateway Timeout

The health endpoint, or the app behind it, is responding more slowly than the proxy's timeout allows.

Draining on deploy

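Not an error, but closely related: fail readiness first so the load balancer stops sending new requests, then let in-flight requests finish before the process exits.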
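
One way to drain with a readiness flip is to fail readiness on SIGTERM, wait for in-flight requests to finish, then exit. The Python sketch below assumes the app exposes a `ready` flag to its /ready endpoint and can report its in-flight request count; `handle_sigterm`, `wait_for_drain`, and the 10-second grace period are illustrative choices, not a prescribed setup.

```python
import signal
import threading
import time

# Illustrative flag read by the /ready endpoint: set means "send me traffic".
ready = threading.Event()
ready.set()


def handle_sigterm(signum, frame):
    # Step 1: fail readiness so the load balancer stops routing new requests here.
    ready.clear()


def wait_for_drain(in_flight_count, grace_seconds=10.0, poll_seconds=0.1):
    """Step 2: wait until in-flight requests reach zero or the grace period ends.

    in_flight_count is a zero-argument callable returning the current count.
    Returns True if the instance drained fully, False if time ran out.
    """
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if in_flight_count() == 0:
            return True
        time.sleep(poll_seconds)
    return in_flight_count() == 0


def install_handler():
    # Must be called from the main thread, which is where Python delivers signals.
    signal.signal(signal.SIGTERM, handle_sigterm)
```

After `wait_for_drain` returns, the process can exit. The grace period should be shorter than whatever termination deadline the orchestrator enforces, or the process is killed mid-drain anyway.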

How Infraveil handles this

Health checks done right, by default

Infraveil runs health checks against your services on your own servers and marks a service live only once it genuinely passes, with a sensible startup window so a slow boot isn't mistaken for a failure. It restarts what truly hangs and gates traffic on readiness, so you get the benefit of both checks without tuning probe thresholds by trial and error.

Traffic gated on real readiness; genuine hangs restarted automatically

A startup window so slow boots aren't killed into a crash loop

Service health in one view, on servers you control


Frequently asked questions

What's the difference between readiness and liveness?

Readiness gates traffic — failing it removes the instance from the load balancer without killing it. Liveness restarts the process — failing it kills and restarts the container. One controls routing, the other controls restarts.

Can a health check cause CrashLoopBackOff?

Yes. A liveness probe that starts too early or is too aggressive will kill an app that's merely slow to boot, causing it to restart-loop. Use a startup probe or a longer initial delay.

Should my health check call the database?

For readiness, checking critical dependencies is reasonable — you shouldn't take traffic without them. For liveness, avoid it: a dependency outage would restart-loop your app over something a restart can't fix.

What is a startup probe?

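A probe that runs only while the app is starting. Liveness and readiness checks are held off until it succeeds, which gives slow-booting apps time to start without being killed by an impatient liveness check. If it never succeeds within failureThreshold × periodSeconds, the container is restarted.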