Passing Tests Don't Mean Your AI Agent Actually Works
The test suite was green. All 47 tests passed. Then nine features were broken in production.
The problem was obvious in hindsight: the tests mocked the OS file picker. They mocked the accessibility tree responses. They mocked the clipboard. Every system interaction that could fail in the real world was replaced with a predictable stub that always returned exactly what the test expected.
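A minimal sketch of what that over-mocked pattern looks like. All the names here (`copy_to_file`, `open_file_picker`, `read_clipboard`) are hypothetical, invented for illustration:

```python
"""Sketch of an over-mocked agent test. Every name is illustrative."""

def open_file_picker():
    """In production this blocks on a real OS dialog that can take
    800ms to appear, or be cancelled by the user."""
    raise NotImplementedError("real OS call")

def read_clipboard():
    """In production another process can overwrite the clipboard
    between the agent's copy and its paste."""
    raise NotImplementedError("real OS call")

def copy_to_file(picker=open_file_picker, clipboard=read_clipboard):
    """The agent logic under test: pick a destination, grab the clipboard."""
    return picker(), clipboard()

def test_copy_to_file_happy_path():
    # The stubs always return exactly what the assertions expect,
    # so this test can never observe a slow dialog or a stale clipboard.
    path, text = copy_to_file(picker=lambda: "/tmp/out.txt",
                              clipboard=lambda: "hello")
    assert path == "/tmp/out.txt"
    assert text == "hello"
```

The test is green by construction: the only behaviors it can exercise are the ones the stubs were written to produce.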
The Mocking Trap
When you build an AI agent that interacts with a desktop operating system, the interesting failures happen at the boundary between your code and the OS. The file picker dialog that takes 800ms to appear. The accessibility tree that returns stale data after a window resize. The clipboard that gets overwritten by another process between your copy and paste.
Mocking all of these boundaries means you are testing your logic in isolation from the environment where it actually runs. Your tests verify that your code handles the happy path correctly. They say nothing about whether the happy path actually exists on a real machine.
What to Test Instead
Integration tests against real OS APIs are slow and flaky. That is the point: they are flaky because the real environment is flaky, and your agent needs to handle that flakiness.
Run a subset of tests against actual system APIs. Let the file picker open and close. Let the accessibility tree return whatever it returns. Record and replay real interactions instead of writing synthetic mocks.
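The record-and-replay idea above can be sketched in a few lines. This is a minimal illustration, not a real library: the `Recorder` class and its JSON trace format are assumptions, and a production version would need to handle non-JSON-serializable results and per-test trace files:

```python
"""Minimal record/replay sketch for OS interactions. Names are illustrative."""
import json
from pathlib import Path

class Recorder:
    """In 'record' mode, calls the real OS API and logs what it returned.
    In 'replay' mode, feeds those logged results back to the agent."""

    def __init__(self, trace_path, mode="replay"):
        self.path = Path(trace_path)
        self.mode = mode
        self.trace = (json.loads(self.path.read_text())
                      if mode == "replay" else [])

    def call(self, name, real_fn):
        if self.mode == "record":
            result = real_fn()  # the real OS call: slow, flaky, honest
            self.trace.append({"call": name, "result": result})
            self.path.write_text(json.dumps(self.trace))
            return result
        # Replay what the real system actually did, including the stale
        # or surprising answers a synthetic mock would never produce.
        entry = self.trace.pop(0)
        assert entry["call"] == name, f"unexpected call order: {name}"
        return entry["result"]
```

Recorded once against a real machine, the trace captures genuine system behavior; replaying it makes the test deterministic without smoothing the real world's rough edges out of it.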
The goal is not a 100% pass rate. The goal is knowing which failures your agent will encounter in production before your users do.
Coverage Is Not Confidence
Test coverage measures what code paths your tests execute. It does not measure whether those code paths produce correct behavior in the real world. An agent with 95% test coverage and zero integration tests is less reliable than one with 60% coverage and a real-world test suite.
Fazm is an open source macOS AI agent, available on GitHub.