Reliability

How Infraveil stays up

A clean test screenshot proves nothing on its own. What matters is how the software behaves when something breaks. This page explains the two components that keep your services running, the launcher and the agent: what each one checks, and what happens when a deploy fails, a server loses contact, or a service stops responding.

Runs on servers you own
Built to recover from failure
Code you can read yourself
The Point

Judge the behavior, not the screenshot

A test result is easy to fake. The thing worth checking is how the software acts under pressure: does it run your services the way your config says, does it verify your code before running it, does it bring services back when they crash, and does it keep going when a server loses contact with us? Here are the four things worth checking.

Launcher

Runs on your server. It compares what should be running against what is actually running, restarts what is missing, and stops trying when a service keeps failing instead of hammering the machine.

Agent

Pulls your code, checks it has not been tampered with, runs it, watches its health, keeps the last working version, and rolls back if a new version turns out to be broken.

Recovery

Failure is expected, not a surprise. The design keeps working through lost connections to us, corrupted downloads, services that stop responding, and repeated crashes.

Proof

The launcher and agent code runs on your own servers. You can read it and confirm that recovery, verification, and rollback work the way this page says.

What Matters

What this page is trying to show

Not enough on its own
A test screenshot, a green pass badge, or a result file with good numbers.
What counts
Does it bring services back on its own, recover safely, and keep your servers in sync when something fails?
Why this matters
Anything works when everything is healthy. What separates real software is how it behaves when things go wrong.
How To Read This

What is worth trusting, and what is not

Weak proof
A success page, a result file, or a chart can be cropped, staged, or shown without context. They are fine as supporting material, but they are easy to fake on their own.
Stronger proof
The actual code that runs your services: how it checks downloads, saves its state, rolls back bad versions, watches health, and limits restarts. That is much harder to fake, because it decides what happens under pressure.
What we claim
Not that nothing ever fails. We claim Infraveil is deliberately built to handle failure (crashes, broken deploys, lost connections, and unhealthy services) and to recover from each one without losing track of what should be running.
What to check
Whether tampered or corrupted code is rejected before it runs, whether restarts have limits, and whether the system can fall back to the last working version instead of just going down.
Check it yourself
Once Infraveil is on your server, the launcher and agent code are right there to read. You do not have to take this page at its word. You can confirm how it verifies code, recovers, and limits restarts directly from the code that is actually running.
The Numbers

The limits built into the software

These are the actual settings the launcher and agent follow. They decide how many times a service is retried, how often health is checked, and when the software stops and waits instead of retrying forever.

Crash-loop window
5 min
The launcher counts how often a service crashes within a five-minute window, so one bad service does not get restarted forever.
Restart limit per service
6 tries
After six failed restarts in that window, the launcher pauses the service and waits, instead of overloading the server with a service that keeps dying.
Failed-deploy limit
4 tries
If a new version of your code fails four times, the agent rolls back to the last working version instead of trusting the new one.
Restart limit per crash
8 tries
A crashed service is restarted up to eight times before the agent gives up and reports it, rather than retrying endlessly.
Health check
Every 30s
The agent checks each service every thirty seconds. A process that is running but not responding still counts as a problem.
State saved to disk
On disk
The launcher writes what it is running to disk, so after a reboot it picks up where it left off instead of starting from scratch.
Code you can read
On your server
The launcher and agent run on your own machine, so you can open the code and confirm these limits for yourself.
One source of truth
One source
Status pages and incident reports are all built from the same reported data: server state, service health, logs, request traffic, and security events.
Claims And Proof

What we claim, and why you can believe it

Claim
The launcher does more than download your code. It actively keeps your services running.
Why
It saves what it is running to disk, counts crashes over time, limits restarts, pauses services that keep failing, and can fall back to the last working version. That is real supervision, not a simple start script.
Claim
The agent checks your code before running it, rather than running whatever it downloads.
Why
Each download is decrypted, checked against a hash from the server, and verified with a signature before it runs. The working copy is saved to disk in one step so it cannot be left half-written. Tampered or corrupted code is rejected, not run.
Claim
The software expects things to break and is built to handle it.
Why
There are firm limits on failed deploys, service restarts, and crash loops. When a limit is hit, the software pauses, rolls back, or saves its state, instead of restarting the same broken thing forever.
Claim
If a server loses contact with us, your app keeps running.
Why
When it cannot reach us, the launcher keeps your services running on the last instructions it had, and the agent keeps using the last working version of your code. A connection problem on our side does not take your app down.
Claim
You do not have to take any of this on faith.
Why
The launcher and agent run on your own server, so the code is right there to read. Every claim on this page about recovery, verification, and restart limits can be checked against the code that is actually running, instead of being hidden on our side.
Claim
The dashboard is built from reported server and service state, not estimates.
Why
The launcher reports server and service state through /launcher/sync, and the agent reports each service's health and version through /agent/heartbeat. From those reports, plus logs, request traffic, and security events, the server builds the status pages and incident reports you see.
How it works, step by step
What the launcher and agent do, written out in plain steps.
Overview
Launcher: keep services running
It checks what should be running and fixes what is not.
POST /launcher/sync
state = desired_state_from_server()
local = inspect_running_agents()

for agent in state:
    if agent.should_stop:
        kill(agent)
    elif agent.must_restart:
        fetch_fresh_payload()
        write_atomically()
        spawn(agent)
    elif agent.crashed:
        if crash_budget_exhausted():
            suspend_with_cooldown()
        else:
            spawn(last_known_good_payload)
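The crash_budget_exhausted() and suspend_with_cooldown() steps above could look roughly like the following. This is a minimal sketch, not the launcher's actual code: the class name CrashBudget, its methods, and the 60-second cooldown are assumptions made for illustration (this page does not say how long a pause lasts), and it treats the sixth crash inside the window as the point where the budget is spent, which may differ from how the launcher counts.

```python
from collections import deque
from typing import Deque, Optional

CRASH_WINDOW_SECONDS = 5 * 60   # crash-loop window stated on this page
MAX_RESTARTS_IN_WINDOW = 6      # restart limit per service stated on this page
COOLDOWN_SECONDS = 60           # ASSUMED value; the page does not give the pause length


class CrashBudget:
    """Hypothetical per-service tracker: counts recent crashes and gates restarts."""

    def __init__(self) -> None:
        self._crashes: Deque[float] = deque()
        self._suspended_until: Optional[float] = None

    def record_crash(self, now: float) -> None:
        self._crashes.append(now)

    def _prune(self, now: float) -> None:
        # Forget crashes that have fallen out of the window.
        while self._crashes and now - self._crashes[0] > CRASH_WINDOW_SECONDS:
            self._crashes.popleft()

    def may_restart(self, now: float) -> bool:
        if self._suspended_until is not None:
            if now < self._suspended_until:
                return False
            # Cooldown is over: start again with a fresh budget.
            self._suspended_until = None
            self._crashes.clear()
        self._prune(now)
        if len(self._crashes) >= MAX_RESTARTS_IN_WINDOW:
            # Budget spent: pause the service instead of restarting it again.
            self._suspended_until = now + COOLDOWN_SECONDS
            return False
        return True
```

Passing the clock in as `now` keeps the sketch easy to test; a real supervisor would more likely read a monotonic clock itself.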
Agent: check before running
Code is verified before it is ever run.
GET /agent/secureportal
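payload = decrypt(encrypted_blob)

if sha256(payload) != server_hash:
    reject_payload()              # corrupted or tampered download
elif hmac_signature_invalid():
    reject_payload()              # signature does not match
else:
    # only a payload that passed both checks is cached and run
    cache_version_atomically(payload)
    launch_services(payload)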
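The hash and signature checks above can be sketched with Python's standard library. This is an illustration under stated assumptions, not the agent's implementation: the function name verify_payload, its parameters, the use of hex strings, and the choice of HMAC-SHA256 are all assumed, and the decryption step is left out because this page does not describe the cipher.

```python
import hashlib
import hmac


def verify_payload(payload: bytes, expected_sha256_hex: str,
                   signature_hex: str, signing_key: bytes) -> bool:
    """Hypothetical check: True only if both the hash and the HMAC match."""
    actual_hash = hashlib.sha256(payload).hexdigest()
    # compare_digest does a constant-time comparison of the two hex strings.
    if not hmac.compare_digest(actual_hash, expected_sha256_hex):
        return False
    # ASSUMPTION: the signature is HMAC-SHA256 over the decrypted payload.
    expected_sig = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected_sig, signature_hex)
```

A caller would run the payload only when this returns True and would otherwise keep the last working version in place.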
Recovery: handle the failures
What the software does when something breaks.
if control_plane_unreachable:
    keep_local_services_alive()

if payload_turns_unstable:
    rollback_to_cached_version()

if service_health_check_fails:
    recycle_service_with_budget()

if launcher_restarts:
    restore_runtime_state_from_disk()
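Restoring state after a launcher restart only works if the state file is never left half-written. One common way to get that property is to write to a temporary file and rename it into place. The sketch below assumes a JSON state file; the function names save_state and load_state and the file layout are hypothetical, not taken from the launcher.

```python
import json
import os
import tempfile
from typing import Any, Dict


def save_state(path: str, state: Dict[str, Any]) -> None:
    """Write state so a reader sees the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic swap into place
    except BaseException:
        # Best-effort cleanup if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_state(path: str) -> Dict[str, Any]:
    """Return the saved state, or an empty state if nothing usable is on disk."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
```

Falling back to an empty state on a missing or unreadable file is a design choice of this sketch; a real supervisor might prefer to raise an alert instead.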
Dashboard: one clear picture
Status and incidents come from launcher and agent reports, not estimates.
POST /launcher/sync
launcher = host_runtime_state()

POST /agent/heartbeat
agent = service_process_state()

GET /client/api/operations
incidents = correlate(
    launcher,
    agent,
    console_logs,
    request_trace,
    security_events,
    pipeline_telemetry
)
Launcher: what it does
It keeps your services running and recovers from crashes.
Run + Recover
saves what it is running to disk
keeps a steady connection to our servers
counts crashes within a 5-minute window
pauses a service after 6 failed restarts
falls back to the last working version
Agent: what it does
It verifies your code, runs it, and watches its health.
Verify + Watch
keeps each version of your code on disk
counts failed deploys over time
rolls back to the last working version after 4 failures
checks each service's health every 30 seconds
restarts a crashed service up to 8 times
Why Each Part Exists

Why each part is there

Why the launcher

Something on the server has to remember what should be running and act on it. The launcher notices when a service has stopped, restarts it safely, and picks up where it left off after a reboot, so a brief problem does not turn into a lasting one.

Why the agent

Downloading code and safely running it are two different jobs. The agent is the safety check: it verifies your code, keeps the versions that worked, watches each running service, and decides whether a new version is safe to keep or should be rolled back.

Why keep old versions

If the only way to run your app is "our servers are up and the newest download worked," then one hiccup takes you down. Keeping the last working version means a short problem on our side leaves you running on what already worked, not offline.

Why restart limits

Restarting a broken service forever is not recovery; it just hides the problem and wears down the server. Firm limits let the software stop, flag the issue, and keep the machine healthy until someone can look at it.

When Things Break

What happens when things go wrong

The server loses contact with us

A dropped connection does not take your app down. The launcher keeps your services running on the last instructions it had, and the agent keeps using the last working version until contact is back.
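The fallback described here amounts to treating a failed sync as "no new instructions" rather than as an error. A minimal sketch, assuming the desired state is a simple mapping of service name to version and that an unreachable server surfaces as ConnectionError or TimeoutError; the function name and the state shape are hypothetical.

```python
from typing import Callable, Dict, Tuple

DesiredState = Dict[str, str]  # ASSUMED shape: service name -> version


def next_desired_state(fetch: Callable[[], DesiredState],
                       last_known: DesiredState) -> Tuple[DesiredState, bool]:
    """Return (state to act on, whether the control plane was reachable)."""
    try:
        return fetch(), True
    except (ConnectionError, TimeoutError):
        # Unreachable: keep acting on the last instructions we received.
        return last_known, False
```

Returning the reachability flag alongside the state lets the caller log the outage without changing what it runs.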

A new deploy is broken

The agent counts how often the new version fails. Once the new version has failed four times, the agent rolls back to the last working version instead of trusting the new one just because it is newer.
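The rollback rule can be pictured as a small counter attached to the version being tried. This is a sketch under assumptions, not the agent's code: the class VersionTracker, its method names, and the idea of promoting a version with an explicit mark_healthy() call are all hypothetical. Only the limit of four comes from this page.

```python
from typing import Optional

FAILED_DEPLOY_LIMIT = 4  # failed-deploy limit stated on this page


class VersionTracker:
    """Hypothetical tracker for one service's candidate and last working version."""

    def __init__(self, last_good: str) -> None:
        self.last_good = last_good
        self.candidate: Optional[str] = None
        self.failures = 0

    def start_deploy(self, version: str) -> None:
        self.candidate = version
        self.failures = 0

    def record_failure(self) -> str:
        """Count a failure and return the version that should run next."""
        if self.candidate is None:
            return self.last_good
        self.failures += 1
        if self.failures >= FAILED_DEPLOY_LIMIT:
            self.candidate = None  # give up on the new version: roll back
            return self.last_good
        return self.candidate

    def mark_healthy(self) -> None:
        """Promote the candidate once it has proven itself (assumed step)."""
        if self.candidate is not None:
            self.last_good = self.candidate
            self.candidate = None
            self.failures = 0
```

In this sketch the first three failures retry the new version, and the fourth switches back to the last working one.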

A service stops responding

The agent checks each service's health every thirty seconds. A service that is running but not responding is treated as a real problem and restarted, up to its restart limit.
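The key point is that "the process exists" and "the service answers" are checked separately. One way to sketch a single health-check pass, assuming the caller supplies the probes: the function name check_service, the three callables, the returned status strings, and resetting the restart count once a service is healthy again are all assumptions for illustration.

```python
from typing import Callable, Tuple

MAX_RESTARTS_PER_CRASH = 8          # restart limit per crash stated on this page
HEALTH_CHECK_INTERVAL_SECONDS = 30  # interval stated on this page; the caller schedules it


def check_service(is_running: Callable[[], bool],
                  responds: Callable[[], bool],
                  restart: Callable[[], None],
                  restarts_used: int) -> Tuple[str, int]:
    """Run one health-check pass and return (status, updated restart count)."""
    # `and` short-circuits, so a dead process is never probed.
    if is_running() and responds():
        return "healthy", 0  # ASSUMPTION: a healthy pass resets the count
    if restarts_used >= MAX_RESTARTS_PER_CRASH:
        return "gave_up", restarts_used  # limit hit: report instead of retrying
    restart()
    return "restarted", restarts_used + 1
```

A scheduler would call this every HEALTH_CHECK_INTERVAL_SECONDS and carry the returned count into the next pass.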

A service keeps crashing

The launcher counts crashes within a five-minute window and pauses a service after six failed restarts, then waits before trying again, so one broken service does not drag the whole server down.

How We Test It

How we actually test it

Step 1

Start with the code: how the launcher keeps services running, how the agent verifies and rolls back code, how state is saved, and how health is checked.

Step 2

Put it under real pressure: heavy traffic, crashing services, a broken deploy, and a deliberate loss of contact with our servers, to see how it actually responds.

Step 3

Capture logs, screenshots, and results to back up what happened. They come after the behavior has already made the case, not in place of it.

Step 4

Be honest about the limits. We do not claim nothing fails. We show how failures are caught, contained, and recovered from.

What Counts As Proof

What counts as proof, in order

Strongest

The code itself, running on your server, showing how it recovers, verifies, and saves state.

Then

Live behavior that matches the code: restarts, rollbacks, recovery of unresponsive services, and staying up when contact with us drops.

Supporting

Logs, screenshots, and result files that back up the story once the behavior has already shown it.

Bottom Line

The short version

The best reason to trust Infraveil is not a green pass badge. It is that the launcher keeps your services running and recovers from crashes, the agent verifies your code before running it and rolls back bad versions, and the whole thing is built to handle failure instead of hoping it never happens.

Logs and screenshots help, but they are supporting material. The behavior is the proof, and you can confirm it from the code running on your own server.

Weak claim
"The test page says it passed."
Strong claim
"Here is exactly how the launcher, agent, rollback, and restart limits keep your services running."