Screenshots Are Better Than LLM Self-Reports for Multi-Agent Verification
Multi-agent systems often use a judge-reflection pattern. One agent performs an action. Another agent evaluates whether it succeeded. The idea is that a second opinion catches errors the first agent missed.
The problem is that both agents are LLMs. The judge is reading the executor's text description of what happened. If the executor says "I clicked the submit button and the form was submitted successfully," the judge has no independent way to verify this. It is evaluating a claim, not observing reality.
Why Text-Based Verification Fails
LLMs are confident narrators. When an agent reports that it completed a task, the report sounds plausible regardless of whether the task actually succeeded. The agent might say it clicked a button that does not exist. It might report a successful form submission when the page actually showed a validation error.
A judge LLM reading this report has the same blind spot. It evaluates whether the text description is internally consistent, not whether it matches reality. Two LLMs agreeing that something happened is not evidence that it happened.
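The blind spot can be made concrete with a toy sketch. Everything in it is hypothetical: `text_judge` stands in for a judge LLM and only checks whether the report claims success, and `ExecutionResult` is an illustrative record, not part of any real framework. The point is structural: the real page state is never part of the judge's input, so the verdict cannot depend on it.

```python
from dataclasses import dataclass


@dataclass
class ExecutionResult:
    """Hypothetical record of one executor step."""
    report: str        # what the executor says happened
    actual_state: str  # what the page really shows (never seen by the judge)


def text_judge(report: str) -> bool:
    """Stand-in for a judge LLM: it can only assess the wording of the report."""
    return "successfully" in report.lower()


report = "I clicked the submit button and the form was submitted successfully."
succeeded = ExecutionResult(report, actual_state="confirmation message")
failed = ExecutionResult(report, actual_state="validation error")

# Same report, different reality, identical verdict.
verdict_ok = text_judge(succeeded.report)
verdict_bad = text_judge(failed.report)
```

Both verdicts come out the same because the only thing that differs between the two runs, `actual_state`, is invisible to the judge.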
Screenshots as Ground Truth
A screenshot captures the actual state of the screen after an action. Did the button click navigate to a new page? The screenshot shows either the new page or the same page. Did the form submit successfully? The screenshot shows either a confirmation message or an error.
The check no longer depends on the executor's self-report. The visual evidence either confirms the action or contradicts it.
Practical Implementation
After every significant action, the agent captures a screenshot. A vision model evaluates the screenshot against the expected outcome. "Expected: confirmation dialog. Actual screenshot shows: error message - email field is required." This is a concrete, verifiable check.
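A minimal sketch of that loop, under stated assumptions: `verify_action`, `Verdict`, and the three callables are hypothetical names introduced here for illustration, not Fazm's API. The capture and vision steps are passed in as plain functions so the sketch runs without a browser or a model, and the substring comparison is a deliberate simplification of what would more likely be a yes/no question put to the vision model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    observed: str  # what the vision model says the screenshot shows


def verify_action(
    action: Callable[[], None],
    capture: Callable[[], bytes],
    describe: Callable[[bytes], str],
    expected: str,
) -> Verdict:
    """Run an action, then check the screen itself rather than the agent's report."""
    action()
    screenshot = capture()            # image bytes taken after the action
    observed = describe(screenshot)   # vision model's description of the image
    # Simplified check (assumption): the expected outcome appears in the description.
    return Verdict(passed=expected.lower() in observed.lower(), observed=observed)


# Stubs (hypothetical) standing in for a real click, screen capture, and vision model.
def fake_click() -> None:
    pass


def fake_capture() -> bytes:
    return b"fake-png-bytes"


def fake_describe(image: bytes) -> str:
    return "error message - email field is required"


verdict = verify_action(fake_click, fake_capture, fake_describe, "confirmation dialog")
```

With these stubs, `verdict.passed` is `False` and `verdict.observed` carries the error text, so the mismatch is reported no matter what the executor would have claimed.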
The cost of screenshot verification is a few cents per image. The cost of an agent confidently reporting success while actually failing is wasted time and broken workflows. Screenshots are cheaper than debugging phantom completions.
Fazm is an open source macOS AI agent, available on GitHub.