Deploy Error Decoder

Exit code 137 (OOMKilled) — what it means and how to fix it

Quick answer: Exit code 137 means the process was killed by SIGKILL (128 + 9). In containers it's almost always OOMKilled — the container exceeded its memory limit and the kernel killed it. Confirm with kubectl describe pod (look for Reason: OOMKilled) or docker inspect, then either raise the memory limit or fix whatever is leaking.

What the error looks like

The container stops with status 137, and the reason is recorded by the container runtime, not in the app logs:

$ kubectl describe pod api-xxx
    Last State:  Terminated
      Reason:    OOMKilled
      Exit Code: 137

$ docker ps -a
STATUS
Exited (137) 12 seconds ago

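137 is the generic "killed by SIGKILL" code. Inside a container the kernel's OOM killer is the usual sender, but not the only one: a manual docker kill sends SIGKILL by default, and a failed liveness probe ends in SIGKILL if the process does not exit on SIGTERM within the termination grace period.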
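To see the 128 + 9 arithmetic without a container, the sketch below has a throwaway child shell send SIGKILL to itself and then reads its exit status. It assumes bash or dash as the shell; the variable name status is only illustrative.

```shell
# Run a child shell that sends SIGKILL (signal 9) to itself.
# "|| status=$?" captures the exit status even when "set -e" is active.
status=0
sh -c 'kill -9 $$' || status=$?

echo "exit status: $status"            # 128 + signal number
echo "signal:      $((status - 128))"  # recovers the signal number
```

In bash and dash this reports an exit status of 137 and a signal of 9, which is the same number the container runtime shows regardless of who sent the signal.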

Why it happens

The memory limit is too low

The container's limit is below what the app genuinely needs at peak, so normal load trips it.

A memory leak

Usage climbs steadily over hours until it hits the ceiling — restarts "fix" it only until the next climb.

A load spike

A big request, batch job, or traffic burst briefly balloons memory past the limit.

The runtime ignores the cgroup limit

Older Node and JVM versions size their heap from the host's RAM rather than the container limit, so they overshoot.

Diagnose it in three steps

1. Confirm it's really OOM

kubectl describe pod <pod>     # Reason: OOMKilled?
docker inspect <id> --format '{{.State.OOMKilled}}'

2. See actual usage vs the limit

kubectl top pod <pod>
docker stats <id>
# Is it near the limit at idle, or only under load?
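From a shell inside the container, the same comparison can be read directly from the cgroup filesystem. This is a minimal sketch that assumes cgroup v2 mounted at /sys/fs/cgroup (cgroup v1 uses different file names); mem_pct is a hypothetical helper, not part of any tool.

```shell
# Print usage as a whole-number percentage of the limit.
# Both arguments must be in the same unit (bytes here).
mem_pct() {
  echo $(( $1 * 100 / $2 ))
}

cg=/sys/fs/cgroup   # assumption: cgroup v2 layout
if [ -r "$cg/memory.current" ] && [ -r "$cg/memory.max" ]; then
  used=$(cat "$cg/memory.current")
  limit=$(cat "$cg/memory.max")     # the literal string "max" means no limit
  if [ "$limit" != "max" ]; then
    echo "using $(mem_pct "$used" "$limit")% of the limit"
  fi
  # oom_kill counts how often the OOM killer fired in this cgroup
  grep '^oom_kill ' "$cg/memory.events" 2>/dev/null || true
fi
```

If the files are missing, the block prints nothing, which usually means the host uses cgroup v1 or the shell is not running inside the container's cgroup.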

3. Decide: too-small limit or leak?

# Flat-but-high usage  -> raise the limit / requests.
# Slowly climbing usage -> it's a leak; raising the limit only delays it.
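The flat-versus-climbing call can be roughed out from a handful of usage samples, for example readings in MiB noted from kubectl top pod every few minutes. The sketch below is a crude heuristic, not a leak detector: trend and its threshold argument are hypothetical, and it only compares the first and last sample.

```shell
# trend THRESHOLD_PCT SAMPLE...
# Prints "climbing" if the last sample exceeds the first by more than
# THRESHOLD_PCT percent, otherwise "flat". Samples are whole numbers (MiB)
# and the first sample must be non-zero.
trend() {
  threshold=$1; shift
  first=$1
  for last in "$@"; do :; done        # leaves $last set to the final sample
  growth=$(( ($last - $first) * 100 / $first ))
  if [ "$growth" -gt "$threshold" ]; then
    echo climbing
  else
    echo flat
  fi
}

trend 20 310 312 309 315   # flat
trend 20 300 340 390 450   # climbing
```

A longer sampling window makes the verdict more trustworthy, since a slow leak can look flat over a few minutes.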

The real fix

Right-size the limit and make the runtime cgroup-aware

If usage is legitimately high, raise the limit (and set requests so the scheduler reserves it). If a runtime overshoots its container, tell it the ceiling:

# Node: cap the heap to the container, not the host
NODE_OPTIONS=--max-old-space-size=460   # ~90% of a 512Mi limit

# JVM: size to the cgroup
JAVA_TOOL_OPTIONS=-XX:MaxRAMPercentage=75
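The 460 in the Node example is 90% of 512, rounded down. As a sketch of deriving that value from whatever limit you settle on, the helper below (heap_mib is a hypothetical name) does the integer arithmetic; the deployment name and sizes in the commented kubectl line are placeholders.

```shell
# heap_mib LIMIT_MIB PERCENT
# Heap cap in MiB as a percentage of the container limit (rounds down).
heap_mib() {
  echo $(( $1 * $2 / 100 ))
}

limit_mib=512
echo "NODE_OPTIONS=--max-old-space-size=$(heap_mib "$limit_mib" 90)"
# NODE_OPTIONS=--max-old-space-size=460

# Raising the limit itself, with requests set so the scheduler reserves it
# (placeholder deployment name and sizes):
# kubectl set resources deployment api --requests=memory=768Mi --limits=memory=1Gi
```

The percentage is a judgment call: the heap is not the process's only memory, so leaving headroom below the limit is the point of the exercise.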

If it's a leak, raising the limit just postpones the crash — capture a heap profile under load and fix what grows without bound.

How Infraveil handles this

Restarts you can see, on servers you own

Infraveil won't fix a memory leak for you — that's your code — but it makes OOM kills visible and survivable. Running on your own servers, its agent restarts a killed service, surfaces the event in one log and status view instead of a silent loop, and lets you roll back to the last release that didn't OOM while you chase the root cause.

OOM kills surface in one status + log view, not a silent restart loop

Automatic restart keeps the service up while you investigate

One-click rollback to the last release that stayed within memory

Frequently asked questions

What does exit code 137 mean?

The process was terminated by SIGKILL (128 + 9). In containers this is most often the kernel OOM killer reclaiming memory because the container exceeded its limit.

How do I confirm it was OOMKilled?

Run kubectl describe pod and check Reason: OOMKilled, or docker inspect <id> --format '{{.State.OOMKilled}}'.

Should I just raise the memory limit?

Only if usage is genuinely high but stable. If memory climbs without bound, that's a leak — raising the limit delays the crash rather than fixing it.

Why does my app OOM in a container but not locally?

The container's memory limit is lower than your machine's RAM, and some runtimes size their heap to the host unless you give them the ceiling explicitly (e.g. with Node's --max-old-space-size).