Non-Deterministic Agents Need Deterministic Feedback Loops
Non-deterministic agents with deterministic feedback loops - that is the whole trick. The agent itself will never be perfectly predictable. LLMs produce different outputs for the same input. But the system that checks whether the agent did the right thing? That has to be rock solid.
The Problem with Pure Non-Determinism
When both the agent and its verification are non-deterministic, you get chaos. The agent produces a wrong answer, the LLM-based reviewer decides it looks plausible, and the error ships. You have no ground truth. Everything is vibes.
This is not hypothetical. Research on LLM-based evaluation systems consistently finds that "LLM-as-judge" setups have their own failure modes: they hallucinate compliance, they are biased toward verbose outputs, and they can be fooled by confident-sounding wrong answers. Using one LLM to evaluate another LLM's output doubles the non-determinism rather than eliminating it.
The solution is not better LLM reviewers. It is fewer LLM reviewers.
What Deterministic Feedback Actually Looks Like
Good feedback loops have clear, binary outcomes. The key criterion: does verifying this result require any LLM judgment? If yes, you have not built a deterministic feedback loop yet.
Deterministic checks:
- Did the test suite pass? Yes or no. Exit code 0 or non-zero.
- Does the file exist at the expected path? os.path.exists() returns True or False.
- Did the API return a 200 status code? Check the integer, not the meaning.
- Does the output match the expected schema? JSON Schema validation is deterministic.
- Is the database row count what we expect? SELECT COUNT(*) FROM table is deterministic.
- Did the deployment health check pass? HTTP 200 from /health is deterministic.
None of these require asking another LLM. They are assertions that either pass or fail.
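The schema check, for example, is a few lines with a standard validator. Here is a minimal sketch using the jsonschema package; the schema itself is a made-up example:

from jsonschema import ValidationError, validate

# Hypothetical schema for an agent's structured output
SCHEMA = {
    "type": "object",
    "required": ["status", "summary"],
    "properties": {
        "status": {"enum": ["done", "blocked"]},
        "summary": {"type": "string", "minLength": 10},
    },
}

def matches_schema(output: dict) -> bool:
    # validate() raises ValidationError on mismatch - a binary, repeatable outcome
    try:
        validate(instance=output, schema=SCHEMA)
        return True
    except ValidationError:
        return False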
The Three-Layer Architecture
The system that works in production:
Layer 1: Agent (non-deterministic). Receives a task, reasons about it, takes action. Gets creative freedom. This is where the LLM runs.
Layer 2: Verification (deterministic). Checks the result against hard criteria. No LLM involved. Pure programmatic assertions.
Layer 3: Retry routing (deterministic). If verification fails, formats the specific failure reason and routes back to Layer 1. The failure message must be precise - "test suite failed with 3 errors in user_auth.py lines 44, 67, 89" is useful. "Something went wrong" is not.
from dataclasses import dataclass
from pathlib import Path
from subprocess import run

@dataclass
class VerificationResult:
    passed: bool
    failure_details: str

@dataclass
class AgentResult:
    success: bool
    output: object
    attempts: int

def run_subprocess(cmd: list[str]) -> tuple[int, str]:
    # Thin wrapper: returns (exit code, combined stdout/stderr)
    proc = run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr

# `agent` (the LLM runner) and AgentOutput come from your agent framework
async def run_agent_with_verification(task: str, max_retries: int = 3) -> AgentResult:
    for attempt in range(max_retries):
        # Layer 1: Non-deterministic agent execution
        result = await agent.execute(task)

        # Layer 2: Deterministic verification
        verification = verify_result(result)
        if verification.passed:
            return AgentResult(success=True, output=result, attempts=attempt + 1)

        # Layer 3: Deterministic retry routing with specific failure info
        task = f"""
Previous attempt failed. Specific failures:
{verification.failure_details}
Original task: {task}
Fix only the items listed above. Do not change what was already correct.
"""
        print(f"Attempt {attempt + 1} failed: {verification.failure_details[:200]}")

    return AgentResult(success=False, output=None, attempts=max_retries)

def verify_result(result: AgentOutput) -> VerificationResult:
    failures = []

    # These are all deterministic checks - no LLM involved
    if result.type == "code_change":
        exit_code, output = run_subprocess(["npm", "test"])
        if exit_code != 0:
            failures.append(f"Test suite failed:\n{output}")

        exit_code, output = run_subprocess(["npx", "tsc", "--noEmit"])
        if exit_code != 0:
            failures.append(f"TypeScript errors:\n{output}")

    elif result.type == "file_operation":
        for expected_path in result.expected_files:
            if not Path(expected_path).exists():
                failures.append(f"Expected file not found: {expected_path}")

    elif result.type == "api_call":
        if result.status_code not in (200, 201, 204):
            failures.append(f"API returned {result.status_code}: {result.response_body}")

    return VerificationResult(
        passed=len(failures) == 0,
        failure_details="\n".join(failures)
    )
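Assuming the pieces above, kicking it off is a one-liner (the task string is a hypothetical example):

import asyncio

result = asyncio.run(run_agent_with_verification("fix the failing auth tests"))
print(f"success={result.success} after {result.attempts} attempt(s)")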
Practical Examples by Domain
Code changes - the agent modifies files, then npm test / pytest / go test provides deterministic feedback. The test suite was written by humans and verifies human-defined expectations. The agent does not get to grade its own work.
# After agent modifies code
if ! npm test 2>&1; then
  echo "VERIFICATION_FAILED: test suite output above"
  exit 1
fi
echo "VERIFICATION_PASSED"
Email drafts - the agent writes a draft, then a schema validator checks for required fields, length limits, and prohibited content.
def verify_email_draft(draft: dict) -> list[str]:
    errors = []
    if not draft.get("subject") or len(draft["subject"]) < 5:
        errors.append("Subject missing or too short")
    if not draft.get("body") or len(draft["body"]) < 50:
        errors.append("Body missing or too short")
    if len(draft.get("body", "")) > 5000:
        errors.append(f"Body too long: {len(draft['body'])} chars (max 5000)")
    if "PLACEHOLDER" in (draft.get("body", "") + draft.get("subject", "")):
        errors.append("Draft contains unfilled PLACEHOLDER text")
    return errors
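Wiring this into the retry loop is the same pattern as before: an empty list means pass, anything else becomes the failure message. A quick check with a made-up draft:

draft = {"subject": "Q3 report", "body": "PLACEHOLDER - fill in numbers"}
errors = verify_email_draft(draft)
# -> ["Body missing or too short", "Draft contains unfilled PLACEHOLDER text"]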
File organization - the agent organizes files, then a directory listing confirms expected structure. Every file should be somewhere specific. Anything in the wrong location is a failure.
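A minimal sketch of that check, assuming a hypothetical expected-layout manifest mapping filenames to the directories they must live in:

from pathlib import Path

# Hypothetical manifest: filename -> directory it must live in
EXPECTED_LAYOUT = {
    "invoice_2024.pdf": "documents/finance",
    "headshot.png": "images/people",
}

def verify_layout(root: Path) -> list[str]:
    failures = []
    for name, directory in EXPECTED_LAYOUT.items():
        if not (root / directory / name).exists():
            failures.append(f"{name} is not in {directory}/")
    # Any file outside the manifest is also a failure
    expected = {f"{d}/{n}" for n, d in EXPECTED_LAYOUT.items()}
    actual = {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}
    failures += [f"Unexpected file: {p}" for p in sorted(actual - expected)]
    return failures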
Deployments - the agent runs a deployment command, then a health check endpoint confirms it is running. Poll the /health endpoint until it returns 200 or timeout after 60 seconds.
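That polling loop needs nothing beyond the standard library. A sketch, with a placeholder URL:

import time
import urllib.error
import urllib.request

def wait_for_healthy(url: str = "http://localhost:8080/health", timeout: float = 60.0) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # urlopen raises URLError (or its HTTPError subclass) on failure
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except urllib.error.URLError:
            pass
        time.sleep(2)  # brief pause between polls
    return False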
How This Handles the "Eval-Driven Development" Pattern
The research community has converged on what they call Evaluation-Driven Development for LLM agents - adapting TDD principles to non-deterministic systems. The core insight: write the deterministic evaluation criteria before the agent code.
Define what success looks like in machine-checkable terms before you start. "The agent should send the correct email" is not machine-checkable. "The agent should POST to /api/send-email with to, subject, and body fields, receive a 200 response, and the email should appear in the sent folder within 30 seconds" is machine-checkable.
This forces clarity about what you actually want. It also creates the verification logic that your retry layer needs.
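Written as code before the agent exists, that criterion might look like the sketch below. The endpoint matches the example above; the client and mailbox helpers are hypothetical stand-ins for your own system:

import time

def eval_send_email(client, mailbox) -> list[str]:
    failures = []
    resp = client.post("/api/send-email", json={
        "to": "test@example.com", "subject": "Eval run", "body": "Hello",
    })
    if resp.status_code != 200:
        failures.append(f"POST /api/send-email returned {resp.status_code}")
    # The email must appear in the sent folder within 30 seconds
    deadline = time.monotonic() + 30
    while time.monotonic() < deadline:
        if mailbox.sent_contains(subject="Eval run"):
            break
        time.sleep(1)
    else:
        failures.append("Email never appeared in sent folder")
    return failures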
The Takeaway
You cannot make LLMs deterministic. Stop trying. The effort that goes into making LLM reviewers more accurate would be better spent building programmatic assertions, which never have an accuracy problem in the first place.
The agent does not need to be right the first time. It needs to know quickly and precisely when it is wrong. Deterministic feedback gives it that.
Fazm is an open source macOS AI agent, available on GitHub.