Backend Guide

Readiness vs liveness health checks, explained

The short version: A readiness check answers "can I serve traffic right now?" Failing it pulls the instance out of the load balancer without killing it. A liveness check answers "am I hung in a way only a restart can fix?" Failing it restarts the container. Confusing the two is a common cause of outages and crash loops: a too-aggressive liveness check kills healthy-but-busy apps, and a missing readiness check sends traffic to instances that aren't ready.

Two checks, two completely different jobs

They sound similar and are constantly mixed up, but they trigger opposite actions:

Readiness — gates traffic

"Can I serve right now?" Fail it and the load balancer stops sending requests, but the process keeps running. Used during startup, draining, or temporary overload.

Liveness — restarts the process

"Am I hung beyond recovery?" Fail it and the orchestrator kills and restarts the container. Used only for deadlocks a restart can fix.

The mistake that causes crash loops

The classic outage: a liveness probe with too short an initial delay. The app is simply slow to boot, so the probe fails and the orchestrator kills the container before it has finished starting. It restarts, fails the probe again, and ends up in CrashLoopBackOff even though nothing is wrong with the app.
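To see how quickly this happens, here is a rough simulation of a liveness probe racing a slow boot. It is a simplified model, not the orchestrator's actual scheduling logic: it assumes the first probe fires at the initial delay, one more follows every period, and the container is killed after a set number of consecutive failures. The function name and all the numbers are illustrative assumptions.

```python
from typing import Optional


def liveness_kill_time(boot_seconds: float, initial_delay: float,
                       period: float, failure_threshold: int,
                       horizon: float = 600.0) -> Optional[float]:
    """Return when the container would be killed, or None if it survives.

    Simplified model (an assumption, not real orchestrator code): the first
    probe fires at `initial_delay`, then one every `period` seconds. A probe
    fails while the app is still booting. The container is killed after
    `failure_threshold` consecutive failures.
    """
    failures = 0
    t = initial_delay
    while t <= horizon:
        if t >= boot_seconds:
            # The app finished booting before the threshold was reached,
            # so this probe and every later one passes.
            return None
        failures += 1
        if failures >= failure_threshold:
            return t
        t += period
    return None


# App needs 45s to boot, liveness starts after 5s: killed at 25s, mid-boot.
print(liveness_kill_time(boot_seconds=45, initial_delay=5, period=10, failure_threshold=3))   # 25

# Same app, but liveness waits 60s before its first probe: it survives.
print(liveness_kill_time(boot_seconds=45, initial_delay=60, period=10, failure_threshold=3))  # None
```

In the first case every restart hits the same wall at 25 seconds, which is exactly the loop described above.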

The fix is a startup probe (or a generous initial delay) so liveness doesn't even begin until the app has had time to boot:

startupProbe:                 # gives the app time to start; other probes wait until it passes
  httpGet: { path: /healthz, port: 3000 }
  failureThreshold: 30
  periodSeconds: 5            # 30 x 5s = up to 150s to boot before the container is restarted
livenessProbe:                # restarts the container if it stops responding
  httpGet: { path: /healthz, port: 3000 }
readinessProbe:               # gates traffic, separately
  httpGet: { path: /ready, port: 3000 }
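The startup window in that config is just failureThreshold multiplied by periodSeconds. A small sketch of the arithmetic follows, including the reverse calculation for picking a threshold from a known worst-case boot time. Both helper names are hypothetical, and the result is approximate: it ignores any initial delay and per-probe timeouts.

```python
import math


def startup_budget_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate time a startup probe allows before the container is restarted."""
    return failure_threshold * period_seconds


def threshold_for_boot(worst_case_boot_seconds: float, period_seconds: int) -> int:
    """Smallest failureThreshold whose budget covers the given boot time."""
    return math.ceil(worst_case_boot_seconds / period_seconds)


print(startup_budget_seconds(30, 5))  # 150, the window from the config above
print(threshold_for_boot(200, 5))     # 40, for an app that can take 200s to boot
```

In practice it is worth sizing the window from the slowest boot you have actually observed, plus some headroom, rather than from a typical one.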

Designing a good health endpoint

Keep /healthz cheap and fast — it should not run heavy queries or call external services, or a slow dependency will make a healthy app look dead. For readiness specifically, it's fine to check that critical dependencies (like the database) are reachable, since you genuinely shouldn't take traffic without them. Never let a third-party outage fail your liveness check, or you'll restart-loop over something a restart can't fix.
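As one possible shape for these two endpoints, here is a minimal sketch using only Python's standard library. The paths and port follow the config earlier in this guide. `READINESS_CHECKS`, `liveness_response`, `readiness_response`, and `make_server` are hypothetical names chosen for illustration, and a real service would register its own dependency checks.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical registry of readiness checks: name -> callable returning True when healthy.
READINESS_CHECKS: Dict[str, Callable[[], bool]] = {}


def liveness_response() -> Tuple[int, dict]:
    """Liveness: no I/O and no dependencies, so it stays cheap and fast.

    Note: answering only proves the HTTP handler thread is responsive.
    """
    return 200, {"status": "ok"}


def readiness_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Readiness: 200 only if every critical dependency check passes, else 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a crashing check counts as "not ready"
    status = 200 if all(results.values()) else 503
    label = "ready" if status == 200 else "not ready"
    return status, {"status": label, "checks": results}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/healthz":
            status, body = liveness_response()
        elif self.path == "/ready":
            status, body = readiness_response(READINESS_CHECKS)
        else:
            status, body = 404, {"error": "not found"}
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port: int = 3000) -> ThreadingHTTPServer:
    """Build the server; call .serve_forever() on the result to start it."""
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
```

The key design choice is that a dependency failure can only ever turn /ready into a 503. Nothing a third party does can change the answer from /healthz.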

How Infraveil handles this

Health checks done right, by default

Infraveil runs health checks against the services on your own servers and marks a service live only once it genuinely passes, with a sensible startup window so a slow boot isn't mistaken for a failure. It restarts what truly hangs and gates traffic on readiness, so you get the benefit of both checks without tuning probe thresholds by trial and error.

Traffic gated on real readiness; genuine hangs restarted automatically
A startup window so slow boots aren't killed into a crash loop
Service health in one view, on servers you control

Frequently asked questions

What's the difference between readiness and liveness?

Readiness gates traffic — failing it removes the instance from the load balancer without killing it. Liveness restarts the process — failing it kills and restarts the container. One controls routing, the other controls restarts.
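The split between routing and restarts can be sketched as a toy model of what an orchestrator does with each round of probe results. This is illustrative only and not any orchestrator's real code. `Instance` and `apply_probe_results` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for one running copy of a service."""
    name: str
    in_rotation: bool = False  # True while the load balancer may send it traffic
    restarts: int = 0


def apply_probe_results(inst: Instance, ready: bool, alive: bool) -> str:
    """Apply one round of probe results and return the action taken."""
    if not alive:
        # Liveness failed: kill and restart. The fresh process has to pass
        # readiness again before it gets traffic.
        inst.restarts += 1
        inst.in_rotation = False
        return "restart"
    # Liveness passed: readiness alone decides routing; the process keeps running.
    inst.in_rotation = ready
    return "route traffic" if ready else "hold traffic"


api = Instance("api-1")
print(apply_probe_results(api, ready=False, alive=True))  # hold traffic (still booting)
print(apply_probe_results(api, ready=True, alive=True))   # route traffic
print(apply_probe_results(api, ready=False, alive=True))  # hold traffic (overloaded or draining)
print(apply_probe_results(api, ready=True, alive=False))  # restart (hung)
```

Only the last call increments the restart counter. A readiness failure never does.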

Can a health check cause CrashLoopBackOff?

Yes. A liveness probe that starts too early or is too aggressive will kill an app that's merely slow to boot, causing it to restart-loop. Use a startup probe or a longer initial delay.

Should my health check call the database?

For readiness, checking critical dependencies is reasonable — you shouldn't take traffic without them. For liveness, avoid it: a dependency outage would restart-loop your app over something a restart can't fix.
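If readiness does check a dependency, it helps to bound the check with a short timeout so a slow database cannot stall the probe itself. One minimal way to do that is a plain TCP reachability test, sketched here with the standard library. The function name, the default timeout, and the host in the usage line are assumptions for illustration.

```python
import socket


def tcp_reachable(host: str, port: int, timeout_seconds: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False


# Hypothetical usage inside a readiness handler (host and port are placeholders):
# ready = tcp_reachable("db.internal", 5432)
```

This only shows that the port accepts connections, not that queries succeed. It is a reasonable floor for readiness and should stay out of the liveness path entirely.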

What is a startup probe?

A probe that runs only while the app is starting. Liveness and readiness probes are held off until it passes, which gives slow-booting apps time to start without being killed by an impatient liveness check.