GenAI Agent Cost Scaling in Production: Token Explosion, Caching, and Observability
Designing production-grade GenAI systems requires going beyond the hype. The demo that costs $0.05 per run can cost $5.00 in production when you account for tool call overhead, retry loops, context accumulation, and edge case handling. This guide provides real cost data from production systems, explains the mechanisms behind token explosion, and covers practical strategies for context pruning, caching, and observability that keep costs manageable at scale.
1. The Token Explosion Problem
Token usage in agentic systems scales non-linearly. A simple question-answer call uses N tokens. An agent that makes tool calls uses roughly N * M tokens, where M is the number of tool interactions, and because each turn re-sends the accumulated context, total input tokens across a session grow roughly quadratically with turn count. But it gets worse:
- Cumulative context - Each turn in an agent loop includes all previous turns. Turn 1 sends the system prompt + user message. Turn 5 sends the system prompt + user message + 4 tool calls + 4 tool results + 4 assistant responses. By turn 20, you are sending 100K+ tokens per inference call.
- Tool result verbosity - A file read returns the entire file content. A database query returns all matching rows. A web scrape returns full page content. These inflate the context rapidly.
- Retry amplification - When a tool call fails and the agent retries with modified parameters, each retry adds both the failed attempt and the new attempt to the context. Three retries of a verbose tool call can add 50K+ tokens.
- Multi-agent multiplication - In multi-agent systems, each agent maintains its own context. Five parallel agents each running 10 turns is 50 independent inference calls, each with growing context.
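The quadratic effect of context accumulation can be sketched with a back-of-envelope model. All per-turn token counts here are illustrative assumptions, not measurements:

```python
# Back-of-envelope model of context accumulation in an agent loop.
# The per-turn token counts are illustrative assumptions.

SYSTEM_PROMPT = 3_000      # system prompt + instructions
TOOL_SCHEMAS = 5_000       # tool definitions re-sent with every request
PER_TURN_GROWTH = 4_000    # tool call + tool result + assistant response

def input_tokens_at_turn(turn: int) -> int:
    """Tokens sent as input on a single inference call at the given turn."""
    return SYSTEM_PROMPT + TOOL_SCHEMAS + (turn - 1) * PER_TURN_GROWTH

def total_input_tokens(turns: int) -> int:
    """Sum of input tokens across all turns -- grows quadratically."""
    return sum(input_tokens_at_turn(t) for t in range(1, turns + 1))

for n in (5, 20, 60):
    print(f"{n:>3} turns: {input_tokens_at_turn(n):>8,} tokens/call, "
          f"{total_input_tokens(n):>10,} total input tokens")
```

With these assumed numbers, a 20-turn session sends ~84K tokens on its final call but ~920K tokens in total, which is why per-call context size understates the real cost.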
Real production numbers from an AI coding agent running typical development tasks:
| Task Complexity | Turns | Total Input Tokens | Total Output Tokens | Approx Cost (Sonnet) |
|---|---|---|---|---|
| Simple (1 file edit) | 3-5 | 15K-30K | 2K-5K | $0.05-0.12 |
| Medium (multi-file feature) | 10-20 | 100K-300K | 10K-30K | $0.40-1.20 |
| Complex (architecture change) | 30-60 | 500K-1.5M | 30K-80K | $2.00-6.00 |
| Complex with retries | 50-100+ | 1M-5M | 50K-200K | $5.00-20.00 |
The critical insight: input tokens dominate costs in agentic systems because of context accumulation. Output tokens are relatively modest. This is why context management is the primary cost lever.
2. Tool Call Overhead: The Hidden Cost Multiplier
Each tool call in an agent loop has a hidden cost structure:
- Tool schema tokens - The tool definitions (names, descriptions, parameter schemas) are sent with every inference call. A typical MCP setup with 20 tools adds 3,000-8,000 tokens to every request.
- Tool result tokens - The results from previous tool calls are included in context. A file read that returns 500 lines of code adds ~2,000 tokens. This accumulates across all previous turns.
- Reasoning tokens - The model reasons about which tool to call and how to interpret results. These are output tokens, but they add up, especially with extended thinking/chain-of-thought.
A concrete example: an agent with 15 available tools, running for 20 turns where each turn reads a file and makes an edit:
- Tool schemas per request: ~5,000 tokens
- Accumulated tool results by turn 20: ~80,000 tokens
- System prompt + instructions: ~3,000 tokens
- Total input at turn 20: ~88,000 tokens
- Sum of all inputs across 20 turns: ~900,000 tokens
This is why a task that seems like it should cost $0.50 ends up costing $3.00. The cumulative context is the primary driver, not any individual tool call.
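Translating the worked example above into dollars makes the gap concrete. Pricing is assumed at Sonnet-class list rates ($3/M input, $15/M output; check current provider pricing), and the output total is an assumption of roughly 1K tokens per turn:

```python
# Dollar cost of the 20-turn example above.
# Assumed Sonnet-class list pricing: $3/M input, $15/M output.

INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00

total_input = 900_000    # sum of inputs across 20 turns (from the example)
total_output = 20_000    # assumed ~1K output tokens per turn

cost = (total_input / 1e6) * INPUT_PRICE_PER_M \
     + (total_output / 1e6) * OUTPUT_PRICE_PER_M
print(f"session cost: ${cost:.2f}")
```

Under these assumptions, input tokens account for $2.70 of the $3.00 total, which matches the rule of thumb that context accumulation, not output, drives agent costs.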
3. Context Pruning Strategies
Context pruning reduces the accumulated context without losing important information. The most effective strategies:
- Summarization checkpoints - After every N turns (typically 5-10), summarize the conversation so far into a compact representation and start a fresh context. Claude Code's /compact command does this manually. Some frameworks do it automatically.
- Tool result truncation - Limit tool result size. If a file read returns 1,000 lines, only include the relevant 50 lines. If a database query returns 500 rows, include the first 20 with a count of total results.
- Sliding window - Only keep the last N turns in context, dropping older turns. Simple but effective for tasks where recent context matters more than historical context.
- Selective retention - Keep tool calls that produced useful results, drop tool calls that failed or returned irrelevant data. This requires classification logic but produces the best results.
- External memory - Store important facts, decisions, and state in an external document (scratch pad, file, or database) rather than keeping them in context. The agent reads this document when needed rather than carrying everything in every request.
Teams implementing context pruning typically see 40-70% reduction in total token usage for complex tasks. The tradeoff is potential quality degradation if important context is pruned. The key is pruning verbosity (full file contents) while retaining decisions (what was changed and why).
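Two of the strategies above, tool result truncation and a sliding window, can be sketched in a few lines. The message shape (`{"role": ..., "content": ...}` dicts) is a simplified assumption:

```python
# Sketches of tool result truncation and a sliding window over turns.
# Message shape is a simplified assumption, not a specific provider's format.

MAX_RESULT_LINES = 50

def truncate_tool_result(text: str, max_lines: int = MAX_RESULT_LINES) -> str:
    """Keep the first max_lines lines and note how much was dropped."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    kept = "\n".join(lines[:max_lines])
    return f"{kept}\n... [{len(lines) - max_lines} more lines truncated]"

def sliding_window(messages: list[dict], keep_turns: int = 10) -> list[dict]:
    """Keep the system prompt plus only the most recent messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_turns:]
```

A production version would truncate around the relevant span rather than always keeping the head of the output, but even this naive cut bounds per-tool-call context growth.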
4. Caching Layers for GenAI Systems
Caching at multiple layers can dramatically reduce costs:
- Prompt caching (provider-level) - Anthropic and OpenAI offer prompt caching that stores the static prefix of your requests. For agent systems where the system prompt + tool schemas are identical across calls, this can reduce input costs by 80-90% for the cached portion. Anthropic charges roughly 10% of standard input pricing for cache reads (cache writes carry a modest premium over standard input pricing).
- Semantic caching (application-level) - Cache responses to similar queries. If the agent asks "what are the files in /src" five times, return the cached result instead of making a new inference call. Tools like GPTCache and custom vector-based caches enable this.
- Tool result caching - Cache the results of deterministic tool calls. A file read, directory listing, or API schema query returns the same result for minutes or hours. Cache these and inject the cached result instead of re-reading.
- Computation caching - If the agent needs to analyze a large codebase, cache the analysis results. The codebase does not change between consecutive agent runs, so the analysis can be reused.
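Tool result caching is straightforward to implement with a TTL keyed on the tool name and arguments. This is a minimal sketch; a real implementation would also invalidate on file modification time rather than relying on time-based expiry alone:

```python
# Minimal TTL cache for deterministic tool calls (file reads, directory
# listings, API schema queries). Time-based expiry only; a production
# version would also invalidate on file mtime or explicit writes.

import time

class ToolResultCache:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[tuple, tuple[float, object]] = {}

    def get_or_call(self, tool_name: str, args: tuple, fn):
        """Return a cached result if fresh, otherwise call fn and cache it."""
        key = (tool_name, args)
        hit = self._store.get(key)
        if hit is not None and time.monotonic() - hit[0] < self.ttl:
            return hit[1]                     # cache hit: no re-execution
        result = fn(*args)
        self._store[key] = (time.monotonic(), result)
        return result
```

Injecting the cached result into context still costs input tokens, but it avoids tool latency and keeps the result byte-identical across turns, which in turn helps provider-level prompt caching.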
Prompt caching is the single biggest win because it applies to every request automatically. For a typical agent setup:
- System prompt: ~2,000 tokens (cacheable)
- Tool schemas: ~5,000 tokens (cacheable)
- CLAUDE.md/project context: ~3,000 tokens (cacheable)
- Total cacheable prefix: ~10,000 tokens
- Cost savings at 90% cache discount: $0.03-0.05 per turn saved
- Over a 20-turn session: $0.60-1.00 saved
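The cacheable prefix above maps onto Anthropic's prompt caching by marking the end of the static content with a `cache_control` breakpoint. This sketch only builds the request payload; the model ID is a placeholder, and field names should be checked against current Anthropic documentation before use:

```python
# Sketch of an Anthropic Messages API payload with a prompt-caching
# breakpoint on the static prefix (system prompt + tool schemas).
# Payload construction only -- no network call. Verify field names
# against current Anthropic docs.

def build_request(system_prompt: str, tools: list[dict], user_msg: str) -> dict:
    # Everything up to and including the last cache_control marker
    # becomes the cached prefix on subsequent calls.
    if tools:
        tools[-1]["cache_control"] = {"type": "ephemeral"}
    return {
        "model": "claude-sonnet-latest",   # placeholder model ID
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}},
        ],
        "tools": tools,
        "messages": [{"role": "user", "content": user_msg}],
    }
```

Because the prefix must be byte-identical to hit the cache, tool schemas and the system prompt should never be reordered or templated with per-request values.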
5. Model Routing and Cost Optimization
Not every agent action requires the most capable (and expensive) model. Model routing directs different types of actions to different models:
| Action Type | Recommended Model | Cost Ratio |
|---|---|---|
| Complex reasoning, architecture decisions | Opus / o1 | 1x (baseline) |
| Code generation, moderate analysis | Sonnet / GPT-4o | 0.2-0.3x |
| Simple edits, formatting, classification | Haiku / GPT-4o-mini | 0.02-0.05x |
| Summarization, context pruning | Haiku / GPT-4o-mini | 0.02-0.05x |
In a typical agent session, only 10-20% of turns require complex reasoning. The rest are navigation, tool execution, and simple decisions that a smaller model handles equally well. Routing these to cheaper models reduces overall costs by 60-80% with negligible quality impact.
Desktop agents benefit from model routing particularly well. Deciding where to click in a UI (given an accessibility tree) is a simple task that Haiku handles effectively. Reasoning about what workflow steps to take next may require Sonnet or Opus. Fazm and similar tools can route different action types to appropriate models automatically.
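A minimal router following the table above is just a lookup from action type to model tier. The model IDs are placeholders, and the classification of an action into a type is assumed to happen upstream (often via heuristics or a cheap model):

```python
# Minimal model router mapping action types to model tiers, following
# the routing table above. Model IDs are placeholders; action-type
# classification is assumed to happen upstream.

ROUTES = {
    "complex_reasoning": "opus-tier",     # architecture decisions
    "code_generation":   "sonnet-tier",   # moderate analysis, codegen
    "simple_edit":       "haiku-tier",    # formatting, classification
    "summarization":     "haiku-tier",    # context pruning, compaction
}

def route_model(action_type: str) -> str:
    """Fall back to the mid-tier model for unrecognized action types."""
    return ROUTES.get(action_type, "sonnet-tier")
```

Defaulting unknown actions to the mid-tier model is a deliberate choice: misrouting a hard task to a weak model costs retries, which usually outweighs the savings.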
6. Observability for Token Economics
You cannot optimize what you cannot measure. Essential observability for GenAI costs:
- Per-request token logging - Log input tokens, output tokens, cached tokens, model used, and latency for every API call. This is your raw data.
- Per-task cost aggregation - Group requests by task/session and compute total cost per task. This reveals which task types are expensive.
- Context growth tracking - Track how context size grows across turns in each session. Identify sessions where context explodes and investigate why.
- Tool call frequency analysis - Which tools are called most? Which return the most data? Which trigger the most retries? These are your optimization targets.
- Cost anomaly alerts - Set alerts for sessions that exceed 3x the median cost for their task type. These are either bugs, infinite loops, or tasks that need architectural changes.
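The first two items above, per-request logging and per-task aggregation, can be sketched as a JSON-lines log plus a fold over it. The record shape and usage field names mirror typical provider responses but are assumptions:

```python
# Per-request token logging and per-task cost aggregation.
# Record shape and usage field names are assumptions modeled on
# typical provider API responses.

import json
import time
from collections import defaultdict

def log_request(task_id: str, model: str, usage: dict, log_path: str) -> None:
    """Append one JSON line per API call -- the raw data for everything else."""
    record = {
        "ts": time.time(),
        "task_id": task_id,
        "model": model,
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        "cached_tokens": usage.get("cache_read_input_tokens", 0),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

def cost_per_task(log_path: str, prices: dict) -> dict:
    """Aggregate logged requests into dollars per task.

    prices maps model -> {"input": $/M tokens, "output": $/M tokens}.
    """
    totals: dict = defaultdict(float)
    with open(log_path) as f:
        for line in f:
            r = json.loads(line)
            p = prices[r["model"]]
            totals[r["task_id"]] += (
                r["input_tokens"] / 1e6 * p["input"]
                + r["output_tokens"] / 1e6 * p["output"]
            )
    return dict(totals)
```

From this per-task view, the median-cost baseline for anomaly alerts falls out directly.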
Tools for GenAI observability:
- Helicone - API proxy that logs everything with zero code changes
- LangSmith - Full trace visualization for LangChain-based systems
- Braintrust - Eval + observability with cost tracking
- Custom dashboards - Grafana/Datadog with token metrics from API logs
7. A Cost Budget Framework for Production Agents
A practical framework for managing GenAI agent costs:
- Per-task budgets - Set a maximum cost for each task type. Simple tasks: $0.50. Medium tasks: $2.00. Complex tasks: $10.00. If the agent exceeds the budget, it stops and reports partial results.
- Per-user daily budgets - Cap what each developer can spend daily. $20-50/day is typical for active AI coding usage. This prevents runaway costs from misconfigured agents.
- Team monthly budgets - Set a team-level monthly cap. Review actual spend weekly and adjust. Typical range for a 5-person engineering team: $500-2,000/month for AI coding tools.
- Cost-per-outcome tracking - The most sophisticated metric. Track cost per PR merged, cost per bug fixed, cost per feature shipped. This shows ROI directly and identifies where AI spending generates value vs. waste.
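Per-task budget enforcement can be sketched as a loop that stops and returns partial results once the running cost would cross the cap. The budget figures mirror the framework above; the step and cost-estimation callables are assumptions standing in for real agent turns:

```python
# Per-task budget enforcement: stop the agent loop and return partial
# results once the running cost would exceed the cap. The step and
# step_cost callables are stand-ins for real agent turns.

BUDGETS = {"simple": 0.50, "medium": 2.00, "complex": 10.00}

def run_with_budget(task_type: str, steps, step_cost) -> list:
    """Execute steps until done or the task budget is exhausted."""
    budget = BUDGETS[task_type]
    spent = 0.0
    results = []
    for step in steps:
        cost = step_cost(step)
        if spent + cost > budget:
            # Budget exhausted: report partial results rather than overrun.
            return results
        results.append(step())
        spent += cost
    return results
```

Stopping cleanly with partial results, rather than raising mid-task, matters in practice: a half-finished agent run with a report is recoverable, while a killed process usually is not.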
The economics of GenAI agents in production are favorable when managed correctly. A $2 agent task that replaces 2 hours of developer work is excellent ROI. A $20 agent task that produces code requiring 3 hours of cleanup is negative ROI. The difference is almost always in task scoping, spec quality, and cost management - not in the model itself.
Cost-Efficient Desktop AI Automation
Fazm uses accessibility APIs instead of screenshots, reducing token costs by 10-20x compared to screenshot-based desktop agents.
Try Fazm Free