Green Is Not Proof of Life

Jeff Moser, BitSalt

We shipped a feature last month that passed 498 tests while four separate things were wrong with the environment it had to run in. Developer green. QA green. It exited the full pipeline and was declared done, twice. The feature never ran end-to-end. Not once.

The feature was a repo-refresh daemon — its whole job was to keep local copies of our git repositories current by running git inside a container. That’s not a metaphor: the core operation was literally invoking the git binary to clone and fetch. So when we finally ran it in the real environment, it blew up on the first call with a FileNotFoundError — the git binary wasn’t in the runtime image. That’s not a subtle failure. That’s not an edge case. That’s the kind of thing you catch on the first try, if “the first try” ever happens.

It didn’t happen, because the tests mocked the point where the code actually calls git. CI was never testing whether git was there. It was testing the logic around that call — which worked fine. Green every time.


What the tests were actually checking

The feature shelled out to git via a subprocess call. The unit tests mocked that call. So the 498 tests CI ran would pass whether or not the binary was present, the mount was writable, the uids matched, or credentials existed. They checked the logic that surrounds the call to git, not the call to git itself.
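Here is a minimal sketch of how that happens, assuming a Python daemon that calls subprocess.run. The names refresh_repo and mocked_test_passes are hypothetical stand-ins, not the actual code.

```python
import subprocess
from unittest import mock


def refresh_repo(repo_dir):
    """Hypothetical stand-in for the daemon's core step: run `git fetch`."""
    result = subprocess.run(
        ["git", "fetch", "--all"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def mocked_test_passes():
    """Passes on any machine, including one with no git binary installed."""
    fake = subprocess.CompletedProcess(
        args=["git"], returncode=0, stdout="ok", stderr=""
    )
    with mock.patch("subprocess.run", return_value=fake) as run:
        # The directory does not exist, and nothing notices: git never runs.
        out = refresh_repo("/nonexistent/repo")
    # These checks cover the arguments and the return handling only.
    run.assert_called_once()
    return out == "ok" and run.call_args.args[0][:2] == ["git", "fetch"]
```

The mock replaces the only line that touches the environment, so a missing binary, a read-only mount, or a wrong uid cannot change the result.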

Four things needed to be true in the deployed environment for the feature to run. All four were false. None of them were visible to the tests:

  1. The git binary was missing from the runtime image. First real call: FileNotFoundError: [Errno 2] No such file or directory: 'git'.
  2. The container’s bind mount was read-only. Git couldn’t write to .git/FETCH_HEAD: error: cannot open '.git/FETCH_HEAD': Read-only file system.
  3. The repo clone was owned by host uid 1002; the container ran as uid 1001. Git refused: dubious ownership, permission denied.
  4. GitHub credentials didn’t exist inside the container. fatal: could not read Username for 'https://github.com'. The credentials lived with the host deploy user, outside the container. This one was never fixed — it’s what eventually caused us to abandon the approach entirely.

Each of these is mundane. Each is exactly the kind of thing you catch on the first real end-to-end run. All four lived in the one place the tests never actually reached — the real git call itself.
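The four conditions above can be sketched as a preflight check that runs inside the container before the daemon starts. This is an illustration under assumptions, not what we deployed: the Check type and the preflight function are hypothetical, and the GIT_TOKEN environment variable is an invented stand-in for however credentials would actually be supplied.

```python
import os
import shutil
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    ok: bool
    detail: str


def preflight(repo_dir, credential_env="GIT_TOKEN"):
    """Return one Check per environmental precondition (Unix only)."""
    checks = []

    # 1. Is the git binary on PATH in this image?
    git_path = shutil.which("git")
    checks.append(Check("git binary on PATH", git_path is not None,
                        git_path or "not found"))

    # 2. Can this process write to the repo directory?
    #    A read-only mount makes os.access report False for W_OK.
    writable = os.path.isdir(repo_dir) and os.access(repo_dir, os.W_OK)
    checks.append(Check("repo dir writable", writable, repo_dir))

    # 3. Does the repo's owner match the uid the process runs as?
    try:
        owner = os.stat(repo_dir).st_uid
        current = os.getuid()
        checks.append(Check("repo owned by current uid", owner == current,
                            f"owner={owner} current={current}"))
    except OSError as exc:
        checks.append(Check("repo owned by current uid", False, str(exc)))

    # 4. Is a credential present? (Assumed mechanism: an environment variable.)
    has_cred = bool(os.environ.get(credential_env))
    checks.append(Check("credential available", has_cred, credential_env))

    return checks
```

A check like this only reports whether the preconditions hold. It does not replace running the real git call, which is the only thing that proves the feature works.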

The pipeline called it done. It wasn’t done. It had never run.


The same problem, a different way

A few weeks later, a different feature turned up the same failure mode.

The gifts application had a lock/unlock admin page — a control that lets an operator lock the app during maintenance. The backend was fully built: two API endpoints, a controller, middleware, the whole stack. The navigation was wired. The tests were green.

The React component was six lines long and rendered one thing:

<h1>Lock/Unlock Placeholder</h1>

The unit test for it looked like this:

expect(screen.getByRole('heading', { name: /lock\/unlock placeholder/i })).toBeInTheDocument();

The test passed because the placeholder was faithfully rendered. It certified the stub as “covered.” Backend complete, nav wired, all tests green — the feature read as done without anyone explicitly confirming the operator control surface existed.

We found it only when someone read the component in order to write its operating guide.


Why this is structural, not a quality problem

These aren’t stories about agents writing bad code. The code these agents wrote was correct. The tests they wrote were correct, given what the tests were trying to test. The failure was somewhere else.

Agents are good at satisfying formal verification surfaces. Give an agent a test suite, and it will make the tests pass. Give it a linter, and it will make the linter happy. Give it a type checker, and it will give you types that check. This is genuinely useful — it means you spend less time on the mechanical layer of keeping things consistent.

The problem is that “all checks green” is not the same claim as “the thing runs.” And the gap between those two claims is where agentic workflows are structurally weak.

An agent can write a test that mocks the call to git and make it pass. It cannot run the feature in the container and observe whether git is there. An agent can write a unit test that asserts a placeholder heading exists. It cannot open a browser, navigate to the page, and notice that the control surface is missing. The verification surfaces agents work against are static — tests, types, lint — and static verification doesn’t cover execution against a real environment.

This isn’t a complaint about agents. It’s a description of what they are. They’re text processors with access to a code editor and a terminal. They cannot log into staging. They cannot attach to a running container. The gap is architectural, not a quality-of-agent problem.

The consequence is that “CI passed” carries less information than it used to. When a human developer writes a feature, there’s a good chance they ran it at some point during development — not because of process, just because that’s how you develop software. You write a thing, you try the thing, you see what happens. That informal proof-of-life is built into the workflow.

When an agent writes a feature, no such thing happens. The agent writes code, runs tests, makes them pass, and calls it done. The gap between “tests pass” and “runs in production” has to be closed somewhere else, or it doesn’t get closed.


What we added

We now have what we’re calling a proof-of-life gate. The rule is simple: if a feature shells out to a system binary, depends on mount configuration, or needs credentials that only exist in the deployed environment, it isn’t done on green tests alone. Someone — or something — has to prove it ran. In a real environment. With the run captured.

That’s it. Not a new testing framework. Not a new CI stage. Just: did the thing run? Show the output.

For the repo-refresh daemon case, that would have been a log line from a real container, with a real git call, against a real repository. Any one of the four failure layers would have appeared immediately. We’d have caught it before the second pipeline exit, let alone the production deploy.
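One way to capture that kind of evidence is a small wrapper that runs the real command, with no mocks, and records what happened. This is a sketch under assumptions: proof_of_life is a hypothetical helper, and the record layout is invented for illustration.

```python
import subprocess
from datetime import datetime, timezone


def proof_of_life(cmd, cwd=None, timeout=60.0):
    """Run cmd for real (no mocks) and return a record of what happened."""
    record = {
        "command": cmd,
        "started_at": datetime.now(timezone.utc).isoformat(),
        "ran": False,
        "returncode": None,
        "output": "",
    }
    try:
        done = subprocess.run(
            cmd, cwd=cwd, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing binary surfaces here as FileNotFoundError,
        # which is a subclass of OSError.
        record["output"] = f"{type(exc).__name__}: {exc}"
        return record
    record["ran"] = True
    record["returncode"] = done.returncode
    record["output"] = done.stdout + done.stderr
    return record
```

Called inside the deployed container as proof_of_life(["git", "fetch", "--all"], cwd=repo_dir), the returned record is the artifact to attach before calling the feature done. A record with "ran" set to False, or a nonzero return code, is the failure the mocked tests could not see.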

For the placeholder case, it would have been a screenshot or a browser smoke test that navigated to the page. One look at an <h1>Lock/Unlock Placeholder</h1> and the gap is obvious.
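A crude automated version of that look is to scan the rendered page for headings that still say "placeholder". The sketch below assumes the HTML has already been captured from a real browser session. It is not a browser test in itself, and the names HeadingCollector and placeholder_headings are hypothetical.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects the text of every h1-h6 element in an HTML document."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._open = None      # heading tag currently being read, if any
        self._buffer = []      # text fragments inside that heading
        self.headings = []     # finished heading texts

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._open = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open:
            self.headings.append("".join(self._buffer).strip())
            self._open = None


def placeholder_headings(html):
    """Return headings whose text contains 'placeholder' (case-insensitive)."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return [h for h in parser.headings if "placeholder" in h.lower()]
```

A non-empty result on a page that is supposed to be finished is a reason to stop and look. An empty result proves much less, since a stub can be missing its controls without using the word.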

Neither of these is hard to produce. They’re only easy to skip when the process doesn’t require them.


Green is not proof of life.

A feature isn’t done until someone proves it actually runs. And “CI passed” doesn’t count when CI fakes the very call that has to work in production, or when the test asserts the placeholder’s own text. The formal verification surface is valuable — it catches a real class of problems. It just doesn’t catch this one.

If you’re running agentic workflows, this is the gap worth watching. The agents will keep the tests green. Keeping the tests honest is still on you.