Readiness vs liveness health checks, explained
The short version: A readiness check answers "can I serve traffic right now?" Failing it pulls the instance out of the load balancer without killing it. A liveness check answers "am I hung and unrecoverable?" Failing it restarts the process. Confusing the two is a common cause of outages and crash loops: an overly aggressive liveness check kills healthy-but-busy apps, and a missing readiness check sends traffic to instances that aren't ready.
Two checks, two completely different jobs
They sound similar and are constantly mixed up, but they trigger opposite actions:
Readiness — gates traffic
"Can I serve right now?" Fail it and the load balancer stops sending requests, but the process keeps running. Used during startup, draining, or temporary overload.
Liveness — restarts the process
"Am I hung beyond recovery?" Fail it and the orchestrator kills and restarts the container. Used only for deadlocks a restart can fix.
The mistake that causes crash loops
The classic outage: a liveness probe with too short an initial delay. The app is simply slow to boot, so the probe fails and the orchestrator kills it before it has finished starting. It restarts, fails the probe again, and you get a CrashLoopBackOff on a perfectly healthy app.
The fix is a startup probe (or a generous initial delay) so liveness doesn't even begin until the app has had time to boot:
startupProbe:            # gives the app time to start
  httpGet: { path: /healthz, port: 3000 }
  failureThreshold: 30
  periodSeconds: 5       # up to 150s before liveness kicks in
livenessProbe:
  httpGet: { path: /healthz, port: 3000 }
readinessProbe:          # gates traffic, separately
  httpGet: { path: /ready, port: 3000 }
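The 150-second figure in the config comment is simply failureThreshold multiplied by periodSeconds. As a rough sanity check, that arithmetic can be sketched in Python. The function names below are illustrative and not part of any Kubernetes tooling, and the sketch deliberately ignores initialDelaySeconds and per-probe timeouts, so treat the result as an approximation.

```python
def startup_budget_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate time the startup probe allows before the container is killed.

    Simplification (assumption): ignores initialDelaySeconds and probe timeouts.
    """
    return failure_threshold * period_seconds


def boots_within_budget(boot_seconds: float,
                        failure_threshold: int,
                        period_seconds: int) -> bool:
    """True if an app needing `boot_seconds` to start fits inside the startup window."""
    return boot_seconds <= startup_budget_seconds(failure_threshold, period_seconds)


# Values taken from the probe config above: failureThreshold 30, periodSeconds 5.
budget = startup_budget_seconds(failure_threshold=30, period_seconds=5)
print(f"startup budget: {budget}s")
```

If your slowest realistic boot does not fit comfortably inside that budget, raise failureThreshold rather than loosening the liveness probe itself.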
Designing a good health endpoint
Keep /healthz cheap and fast — it should not run heavy queries or call external services, or a slow dependency will make a healthy app look dead. For readiness specifically, it's fine to check that critical dependencies (like the database) are reachable, since you genuinely shouldn't take traffic without them. Never let a third-party outage fail your liveness check, or you'll restart-loop over something a restart can't fix.
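To make the split concrete, here is a minimal sketch of the two endpoints using only Python's standard library. The /healthz and /ready paths and port 3000 mirror the probe config above; everything else, including the `checks` dictionary of dependency callables, is a hypothetical placeholder you would replace with your own checks. It is one possible shape under those assumptions, not a reference implementation.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical: dependency name -> zero-argument callable returning True when reachable.
DependencyChecks = Dict[str, Callable[[], bool]]


def liveness_status() -> Tuple[int, str]:
    """Cheap by design: no queries, no network calls.

    If this code runs at all, the process is responsive enough to answer.
    """
    return 200, "ok"


def readiness_status(checks: DependencyChecks) -> Tuple[int, str]:
    """Return 503 when any critical dependency is unreachable, so traffic is gated."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that raises counts as "dependency not reachable".
            ok = False
        if not ok:
            failed.append(name)
    if failed:
        return 503, "not ready: " + ", ".join(sorted(failed))
    return 200, "ready"


def make_handler(checks: DependencyChecks):
    """Build a request handler class bound to the given dependency checks."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/healthz":
                status, body = liveness_status()
            elif self.path == "/ready":
                status, body = readiness_status(checks)
            else:
                status, body = 404, "not found"
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return HealthHandler


def serve(checks: DependencyChecks, port: int = 3000) -> None:
    """Blocking helper: serve both endpoints on the given port."""
    HTTPServer(("0.0.0.0", port), make_handler(checks)).serve_forever()
```

Calling `serve({"database": my_db_ping})`, where `my_db_ping` is your own short-timeout connectivity check, would expose both endpoints. Note that the liveness path never touches `checks`, so a database outage can take the instance out of rotation without ever triggering a restart.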
Errors you might hit
Health-check mistakes usually surface as these:
Health checks done right, by default
Infraveil runs health checks on your services on your own servers and only marks a service live once it genuinely passes — with a sensible startup window so a slow boot isn't mistaken for a failure. It restarts what truly hangs and gates traffic on readiness, so you get the benefit of both checks without tuning probe thresholds by trial and error.
Frequently asked questions
What's the difference between readiness and liveness?
Readiness gates traffic — failing it removes the instance from the load balancer without killing it. Liveness restarts the process — failing it kills and restarts the container. One controls routing, the other controls restarts.
Can a health check cause CrashLoopBackOff?
Yes. A liveness probe that starts too early or is too aggressive will kill an app that's merely slow to boot, causing it to restart-loop. Use a startup probe or a longer initial delay.
Should my health check call the database?
For readiness, checking critical dependencies is reasonable — you shouldn't take traffic without them. For liveness, avoid it: a dependency outage would restart-loop your app over something a restart can't fix.
What is a startup probe?
A probe that runs only while the app is starting and holds off the liveness and readiness checks until it succeeds. It gives slow-booting apps time to start without being killed by an impatient liveness check.