Affordable AI Agent Evaluation - Recording and Replaying Tool Call Traces

Fazm Team · 2 min read

How Can You Afford Evals?

The biggest misconception about AI agent evaluation is that it requires expensive infrastructure. It does not. The simplest approach - recording tool call traces and replaying them deterministically - costs almost nothing and catches most regressions.

The Recording Approach

Every time your AI agent runs, it makes a series of tool calls. Log them. Each trace becomes a test case:

  1. Capture the input - the user's request and initial context
  2. Record every tool call - the function name, arguments, and response
  3. Save the final output - what the agent delivered to the user

That is your golden dataset. No synthetic data generation. No expensive annotation. Just real usage patterns captured automatically.
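The three capture steps above can be sketched as a small recorder that writes one JSON file per run. This is a minimal sketch - the class and method names (`TraceRecorder`, `record_tool_call`) are illustrative, not from any particular framework:

```python
import json
import time
from pathlib import Path

class TraceRecorder:
    """Append-only log of one agent run: input, tool calls, final output."""

    def __init__(self, user_input, context=None):
        self.trace = {
            "input": user_input,
            "context": context or {},
            "tool_calls": [],
            "output": None,
        }

    def record_tool_call(self, name, arguments, response):
        # One entry per tool call: function name, arguments, and response.
        self.trace["tool_calls"].append(
            {"name": name, "arguments": arguments, "response": response}
        )

    def finish(self, output, directory="traces"):
        # Save the final output and persist the whole trace as JSON.
        self.trace["output"] = output
        Path(directory).mkdir(exist_ok=True)
        path = Path(directory) / f"trace_{int(time.time() * 1000)}.json"
        path.write_text(json.dumps(self.trace, indent=2))
        return path
```

Each saved file is a complete, self-describing test case - no database or eval platform required.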

Deterministic Replay

The key insight is that you do not need to re-run the LLM to test most regressions. Mock the LLM responses from your trace, then verify that:

  • The tool calls happen in the expected order
  • The arguments match what you recorded
  • The final output is equivalent to the original
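A replay check can be a plain function that runs the agent with the recorded LLM decisions stubbed in, then compares the result against the trace. A sketch, assuming a hypothetical `agent_fn` callable that accepts the recorded tool calls as its mock and returns the calls it would make plus its final output:

```python
def replay_check(trace, agent_fn):
    """Replay one recorded trace and collect every deviation.

    `agent_fn(user_input, mocked_calls)` is a stand-in for your agent
    running with the LLM mocked; it returns (tool_calls, final_output).
    Deviations are collected rather than raised, so one replay reports
    all mismatches at once.
    """
    expected = trace["tool_calls"]
    actual_calls, actual_output = agent_fn(trace["input"], expected)

    problems = []
    # 1. Tool calls happen in the expected order.
    if [c["name"] for c in actual_calls] != [c["name"] for c in expected]:
        problems.append("tool call order changed")
    # 2. Arguments match what was recorded.
    for i, (a, e) in enumerate(zip(actual_calls, expected)):
        if a["arguments"] != e["arguments"]:
            problems.append(f"call {i} ({e['name']}): arguments differ")
    # 3. Final output matches the original.
    if actual_output != trace["output"]:
        problems.append("final output differs")
    return problems
```

An empty list means the trace replayed cleanly; anything else is a regression to investigate.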

When you update your agent's prompt, system instructions, or tool definitions, replay your traces. Any deviation from expected behavior shows up immediately.
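"Equivalent to the original" rarely means byte-identical, since outputs often embed details that legitimately vary between runs. One way to compare is to normalize both sides before checking equality. The rules below (collapsing whitespace, masking ISO timestamps) are example assumptions - tune them to what your agent actually emits:

```python
import re

def outputs_equivalent(recorded, replayed):
    """Compare agent outputs after stripping details that legitimately vary."""
    def normalize(text):
        # Collapse runs of whitespace.
        text = re.sub(r"\s+", " ", text.strip())
        # Mask ISO 8601 timestamps, which change on every run.
        text = re.sub(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}", "<ts>", text)
        return text
    return normalize(recorded) == normalize(replayed)
```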

What This Catches

This approach is surprisingly effective at catching:

  • Prompt regressions - a small wording change that breaks a previously working flow
  • Tool schema changes - renaming a parameter that silently breaks calls
  • Context window issues - traces that worked at 50 tool calls but break at 100

What It Does Not Catch

Replay testing will not find novel failure modes. It only validates known-good behavior. You still need humans reviewing agent outputs periodically. But for the cost - essentially just disk space for JSON logs - it covers a huge percentage of real regressions.

Start recording traces today. Future you will be grateful when a "minor" prompt change breaks something and you catch it in CI instead of production.

Fazm is an open source AI agent for macOS, available on GitHub.
