How Infraveil stays up
A clean test screenshot proves nothing on its own. What matters is how the software behaves when something breaks. This page explains the two components that keep your services running, the launcher and the agent: what each one checks, and what happens when a deploy fails, a server loses contact, or a service stops responding.
Judge the behavior, not the screenshot
A test result is easy to fake. The thing worth checking is how the software acts under pressure: does it run your services the way your config says, does it verify your code before running it, does it bring services back when they crash, and does it keep going when a server loses contact with us? Here are the four things worth checking.
The launcher runs on your server. It compares what should be running against what is actually running, restarts what is missing, and stops trying when a service keeps failing instead of hammering the machine.
The agent pulls your code, checks it has not been tampered with, runs it, watches its health, keeps the last working version, and rolls back if a new version turns out to be broken.
Failure is expected, not a surprise. The design keeps working through lost connections to us, corrupted downloads, services that stop responding, and repeated crashes.
The launcher and agent code run on your own servers. You can read it and confirm that recovery, verification, and rollback work the way this page says.
What this page is trying to show
What is worth trusting, and what is not
The limits built into the software
These are the actual settings the launcher and agent follow. They decide how many times a service is retried, how often health is checked, and when the software stops and waits instead of retrying forever.
What we claim, and why you can believe it
The launcher reports the host's runtime state through /launcher/sync, and the agent reports each service's health and version through /agent/heartbeat. From those reports, plus logs, request traffic, and security events, the server builds the status pages and incident reports you see.
POST /launcher/sync
state = desired_state_from_server()
local = inspect_running_agents()
for agent in state:
    if agent.should_stop:
        kill(agent)
    elif agent.must_restart:
        fetch_fresh_payload()
        write_atomically()
        spawn(agent)
    elif agent.crashed:
        if crash_budget_exhausted():
            suspend_with_cooldown()
        else:
            spawn(last_known_good_payload)
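The loop above can be read as a pure planning step: given what the server wants and what is actually running, decide one action per service. The following is a minimal sketch of that idea, not Infraveil's code. The `Desired` and `Running` shapes, the action names, and the final "missing service" branch are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Desired:
    """What the server wants for one service (hypothetical shape)."""
    should_stop: bool = False
    must_restart: bool = False


@dataclass
class Running:
    """What the launcher observes locally (hypothetical shape)."""
    alive: bool = True
    crashed: bool = False
    budget_exhausted: bool = False


def plan_actions(desired: Dict[str, Desired],
                 running: Dict[str, Running]) -> List[Tuple[str, str]]:
    """Return (service, action) pairs; performs no side effects itself."""
    actions = []
    for name, want in desired.items():
        # A service the launcher has never seen counts as not alive.
        have = running.get(name, Running(alive=False))
        if want.should_stop:
            if have.alive:
                actions.append((name, "kill"))
        elif want.must_restart:
            actions.append((name, "fetch_and_spawn"))
        elif have.crashed:
            if have.budget_exhausted:
                actions.append((name, "suspend_with_cooldown"))
            else:
                actions.append((name, "spawn_last_known_good"))
        elif not have.alive:
            # Assumed branch: wanted but missing, so start it.
            actions.append((name, "spawn_last_known_good"))
    return actions
```

Keeping the decision separate from the side effects (killing, fetching, spawning) makes the recovery rules easy to test without touching real processes.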
GET /agent/secureportal
payload = decrypt(encrypted_blob)
if sha256(payload) != server_hash:
    reject_payload()
if hmac_signature_invalid():
    reject_payload()
cache_version_atomically(payload)
launch_services(payload)
if control_plane_unreachable:
    keep_local_services_alive()
if payload_turns_unstable:
    rollback_to_cached_version()
if service_health_check_fails:
    recycle_service_with_budget()
if launcher_restarts:
    restore_runtime_state_from_disk()
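The two integrity checks at the top of that flow can be sketched with Python's standard library. This is an illustration under stated assumptions rather than the agent's actual code: the function name and parameters are hypothetical, and HMAC-SHA256 is assumed because the page names HMAC without saying which hash it uses.

```python
import hashlib
import hmac


def verify_payload(payload: bytes, expected_sha256_hex: str,
                   signature: bytes, key: bytes) -> bool:
    """Return True only if both the hash and the HMAC signature match."""
    digest = hashlib.sha256(payload).hexdigest()
    # compare_digest runs in constant time, so timing does not reveal
    # how much of the value matched.
    if not hmac.compare_digest(digest, expected_sha256_hex.lower()):
        return False
    # Assumption: the signature is HMAC-SHA256 over the decrypted payload.
    expected_sig = hmac.new(key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected_sig, signature)
```

A caller would cache and launch the payload only when this returns True, and otherwise keep running the version it already has.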
POST /launcher/sync
launcher = host_runtime_state()
POST /agent/heartbeat
agent = service_process_state()
GET /client/api/operations
incidents = correlate(
    launcher,
    agent,
    console_logs,
    request_trace,
    security_events,
    pipeline_telemetry
)
Launcher
saves what it is running to disk
keeps a steady connection to our servers
counts crashes within a 5-minute window
pauses a service after 6 failed restarts
falls back to the last working version
Agent
keeps each version of your code on disk
counts failed deploys over time
rolls back to the last working version after 4 failures
checks each service's health every 30 seconds
restarts a crashed service up to 8 times
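To make the numbers above concrete, here is a hedged sketch that collects them in one settings object and applies the rollback limit. The class and field names are invented for this example, and it assumes "after 4 failures" means the fourth recorded failure triggers the rollback. The real agent may count differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Limits:
    """The numbers quoted above; field names are made up for this sketch."""
    crash_window_seconds: int = 5 * 60
    max_failed_restarts: int = 6
    rollback_after_failures: int = 4
    health_check_interval_seconds: int = 30
    max_service_restarts: int = 8


@dataclass
class VersionTracker:
    """Counts deploy failures and decides when to roll back (hypothetical)."""
    limits: Limits = field(default_factory=Limits)
    last_good: Optional[str] = None
    current: Optional[str] = None
    failures: int = 0

    def deploy(self, version: str) -> None:
        self.current = version
        self.failures = 0

    def mark_healthy(self) -> None:
        # A version that proves itself becomes the new rollback target.
        self.last_good = self.current
        self.failures = 0

    def record_failure(self) -> bool:
        """Count one failure; return True if this call triggered a rollback."""
        self.failures += 1
        if (self.failures >= self.limits.rollback_after_failures
                and self.last_good is not None
                and self.current != self.last_good):
            self.current = self.last_good
            self.failures = 0
            return True
        return False
```

With no earlier working version recorded there is nothing to roll back to, so `record_failure` keeps returning False and leaves the decision to the restart limits.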
Why each part is there
Something on the server has to remember what should be running and act on it. The launcher notices when a service has stopped, restarts it safely, and picks up where it left off after a reboot, so a brief problem does not turn into a lasting one.
Downloading code and safely running it are two different jobs. The agent is the safety check: it verifies your code, keeps the versions that worked, watches each running service, and decides whether a new version is safe to keep or should be rolled back.
If the only way to run your app is "our servers are up and the newest download worked," then one hiccup takes you down. Keeping the last working version means a short problem on our side leaves you running on what already worked, not offline.
Restarting a broken service forever is not recovery. It hides the problem and wears down the server. Firm limits let the software stop, flag the issue, and keep the machine healthy until someone can look at it.
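Picking up where it left off after a reboot depends on the launcher's saved state never being left half-written. One common way to get that property is to write to a temporary file and rename it into place. The sketch below shows the pattern. The file layout, JSON format, and function names are assumptions, not Infraveil's actual storage format.

```python
import json
import os
import tempfile
from typing import Any, Dict


def save_state(path: str, state: Dict[str, Any]) -> None:
    """Write state so a crash mid-write never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_state(path: str) -> Dict[str, Any]:
    """Return the saved state, or an empty dict on first boot."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}
```

Because the rename either happens completely or not at all, a reader sees the old state or the new state, never a mix of the two.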
What happens when things go wrong
A dropped connection does not take your app down. The launcher keeps your services running on the last instructions it had, and the agent keeps using the last working version until contact is back.
When a new deploy turns out to be broken, the agent counts how often the new version fails. Once it fails too many times, the agent rolls back to the last working version instead of trusting the new one just because it is newer.
The agent checks each service's health every thirty seconds. A service that is running but not responding is treated as a real problem and restarted, up to its restart limit.
The launcher counts repeated crashes and pauses a service that keeps failing, then waits before trying again, so one broken service does not drag the whole server down.
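The crash handling described above (count crashes in a window, pause, then wait before trying again) can be sketched as a sliding-window budget. The 6-crash limit and 5-minute window come from this page. The class name, method names, and the 10-minute cooldown are assumptions, since the page does not state how long the pause lasts.

```python
from collections import deque
from typing import Deque


class CrashBudget:
    """Sliding-window crash counter with a cooldown once the budget is spent."""

    def __init__(self, max_crashes: int = 6, window_s: float = 300.0,
                 cooldown_s: float = 600.0) -> None:
        self.max_crashes = max_crashes
        self.window_s = window_s
        self.cooldown_s = cooldown_s  # assumed value; not stated on this page
        self._crashes: Deque[float] = deque()
        self._suspended_until = 0.0

    def record_crash(self, now: float) -> None:
        """Record a crash at time `now` (seconds on any monotonic clock)."""
        self._crashes.append(now)
        self._drop_old(now)
        if len(self._crashes) >= self.max_crashes:
            # Budget spent: pause restarts and start counting afresh later.
            self._suspended_until = now + self.cooldown_s
            self._crashes.clear()

    def may_restart(self, now: float) -> bool:
        """True unless the service is currently in its cooldown period."""
        return now >= self._suspended_until

    def _drop_old(self, now: float) -> None:
        # Forget crashes that fell out of the window.
        while self._crashes and now - self._crashes[0] > self.window_s:
            self._crashes.popleft()
```

Passing `now` in explicitly (for example from `time.monotonic()`) keeps the class deterministic, so the window and cooldown can be tested without sleeping.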
How we actually test it
Start with the code: how the launcher keeps services running, how the agent verifies and rolls back code, how state is saved, and how health is checked.
Put it under real pressure: heavy traffic, crashing services, a broken deploy, and a deliberate loss of contact with our servers, to see how it actually responds.
Capture logs, screenshots, and results to back up what happened. They come after the behavior has made the case, not in place of it.
Be honest about the limits. We do not claim nothing fails. We show how failures are caught, contained, and recovered from.
What counts as proof, in order
The code itself, running on your server, showing how it recovers, verifies, and saves state.
Live behavior that matches the code: restarts, rollbacks, health fixes, and staying up when contact with us drops.
Logs, screenshots, and result files that back up the story once the behavior has already shown it.
The short version
The best reason to trust Infraveil is not a green pass badge. It is that the launcher keeps your services running and recovers from crashes, the agent verifies your code before running it and rolls back bad versions, and the whole thing is built to handle failure instead of hoping it never happens.
Logs and screenshots help, but they are supporting material. The behavior is the proof, and you can confirm it from the code running on your own server.