# How to Reduce AI Agent Token Costs: MCP Strategies for Code Intelligence
Running five parallel Claude Code agents on a large codebase can burn through $50-100 per hour in token costs. Most of that spend is waste, with agents reading entire files when they only need a function signature, re-parsing the same modules across sessions, and stuffing context windows with irrelevant code. This guide covers practical strategies that teams are using to cut AI coding agent costs by 60-80% without sacrificing output quality. The techniques apply to Claude Code, Cursor, Copilot Workspace, and any LLM-based coding agent.
## 1. Why AI Agent Token Costs Spiral Out of Control
AI coding agents are fundamentally greedy with context. When an agent needs to understand a function, it typically reads the entire file. When it needs to understand how a module is used, it may read every file that imports it. When it encounters an error, it re-reads files it already processed earlier in the conversation. Each of these reads consumes input tokens, and input tokens are the primary cost driver.
The problem compounds with parallel agents. If you have five agents working on different features in the same codebase, each one independently reads the same shared modules, configuration files, and type definitions. There is no shared context layer between agents, so the same 500-line utility file gets tokenized five separate times.
Common sources of token waste in coding agents:
- Full-file reads when partial information suffices - An agent reading a 2,000-line file to understand a single function signature consumes roughly 6,000 tokens when it could get the same information from 50 tokens.
- Redundant context across conversation turns - Long conversations re-send the entire message history with each turn. A file read in turn 3 stays in context through turn 30, consuming tokens every single inference.
- No structural awareness of code - Without a code index, agents grep for patterns and read matching files sequentially. A tree-sitter index could resolve the same query in a single lookup returning just the relevant symbol.
- Overly broad tool calls - Generic file-read tools return entire files. Without tools that support line ranges, symbol lookup, or structural queries, every read is a full read.
- Context window stuffing - Agents often front-load context by reading project documentation, configuration, and type files before starting the actual task. Much of this context goes unused.
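The scale of this waste is easy to estimate. Here is a minimal sketch, using the rough 4-characters-per-token heuristic rather than a real tokenizer, of how much of a full-file read goes unused when only a slice of the file is needed:

```typescript
// Rough token count via the common ~4 characters-per-token heuristic.
// This is an approximation for budgeting, not a real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Fraction of a full-file read that was wasted if only the first
// `neededLines` lines were actually relevant to the task.
function wasteRatio(fileText: string, neededLines: number): number {
  const needed = fileText.split("\n").slice(0, neededLines).join("\n");
  return 1 - estimateTokens(needed) / estimateTokens(fileText);
}
```

On a 2,000-line file where only 20 lines mattered, this puts the waste above 90%, which is where the full-read examples above get their numbers.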
## 2. Real Numbers: What Parallel Agents Actually Cost
Let us look at real token consumption for a common scenario: five parallel Claude Code agents working on a medium-sized TypeScript project (roughly 200 files, 50k lines of code). Each agent is working on a separate feature branch.
| Metric | Per Agent (Naive) | 5 Agents / Hour |
|---|---|---|
| Input tokens | ~800K-1.5M | ~4M-7.5M |
| Output tokens | ~80K-200K | ~400K-1M |
| Cost (Sonnet 4, $3/$15 per 1M) | $3.60-7.50 | $18-37 |
| Cost (Opus 4, $15/$75 per 1M) | $18-37 | $90-187 |
| Wasted tokens (estimated) | 40-70% | 40-70% |
The "wasted tokens" row is the key insight. When you instrument agent sessions and analyze what context actually influenced the output, you find that 40-70% of input tokens were unnecessary. The agent read files it never referenced in its output, re-read files already in context, or loaded entire modules when it only needed a type definition.
At Opus-tier pricing with five parallel agents, that waste translates to $35-130 per hour in tokens that did not contribute to the result. Over an 8-hour workday, you are looking at $280-1,040 in avoidable spend. That is the opportunity cost that makes optimization worth investing in.
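The table's arithmetic can be reproduced in a few lines. The prices and token counts below are the estimates quoted above, not measured values:

```typescript
// Per-million-token pricing for one model tier.
type Price = { input: number; output: number };

// Hourly cost for N parallel agents, given per-agent token counts
// and a price table (all figures per agent-hour).
function hourlyCost(agents: number, inTokens: number, outTokens: number, p: Price): number {
  return agents * ((inTokens / 1e6) * p.input + (outTokens / 1e6) * p.output);
}

const opus: Price = { input: 15, output: 75 };

// Upper-bound scenario from the table: 1.5M input, 200K output per agent.
const worstCase = hourlyCost(5, 1_500_000, 200_000, opus); // 5 * (22.5 + 15) = 187.5
```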
## 3. Signature-Only Retrieval and Tree-Sitter Indexing
The highest-impact optimization is giving agents access to code structure without requiring them to read full files. This is where tools like pitlane-mcp come in.
Signature-only retrieval means returning just function signatures, type definitions, class outlines, and export declarations instead of full implementations. When an agent needs to understand how to call a function, it rarely needs the 50-line implementation body. It needs the function name, parameters, return type, and maybe a one-line docstring.
Tree-sitter indexing makes this possible at scale. Tree-sitter is a fast, incremental parser that builds concrete syntax trees for source code. Unlike regex-based grep, tree-sitter understands code structure. It can extract every function declaration, class definition, import statement, and type alias from a codebase in milliseconds, and it updates incrementally as files change.
Here is what this looks like in practice. Instead of an agent reading a 400-line React component file (roughly 1,200 tokens), a tree-sitter-powered MCP tool returns:
```tsx
// UserProfile.tsx - signatures only (47 tokens)
interface UserProfileProps {
  userId: string;
  onUpdate: (user: User) => void;
  showAvatar?: boolean;
}

export function UserProfile(props: UserProfileProps): JSX.Element
function useProfileData(userId: string): { user: User; loading: boolean }
function handleSubmit(formData: FormData): Promise<void>
```

The agent gets everything it needs to use this component, call its functions, or understand its interface, at 4% of the token cost. If the agent does need the full implementation of a specific function, it can request just that function body as a follow-up.
pitlane-mcp (v0.2.0) implements exactly this pattern as an MCP server. It builds a tree-sitter index of your codebase and exposes tools for symbol lookup, signature retrieval, dependency graphs, and scoped code reads. Any AI agent that supports MCP can use it, including Claude Code, and the agent automatically gets token-efficient code access without needing custom prompting or workflow changes.
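To illustrate the idea (this is not pitlane-mcp's actual implementation), here is a toy signature extractor. A real index walks a tree-sitter syntax tree; this line-based heuristic with brace counting only sketches the behavior:

```typescript
// Toy signature-only extraction: keep top-level declaration lines, drop bodies.
// A real implementation would query a tree-sitter syntax tree instead of
// counting braces line by line.
function extractSignatures(source: string): string {
  const decl = /^\s*(export\s+)?(async\s+)?(function|interface|class|type|const)\s/;
  const out: string[] = [];
  let depth = 0;
  for (const line of source.split("\n")) {
    // Only lines at brace depth 0 are top-level declarations.
    if (depth === 0 && decl.test(line)) out.push(line.replace(/\s*\{\s*$/, ""));
    // Track brace depth so declarations nested inside bodies are skipped.
    for (const ch of line) {
      if (ch === "{") depth++;
      else if (ch === "}") depth = Math.max(0, depth - 1);
    }
  }
  return out.join("\n");
}
```

Run over a component file, this keeps only the declaration lines, which is where the roughly 25x token reduction in the example above comes from.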
## 4. Scoped Context Windows with MCP
The Model Context Protocol (MCP) is not just a tool-calling standard. It is the key enabler for scoped, efficient context management in AI agent workflows.
Without MCP, agents interact with codebases through generic file-system tools: read file, write file, search files, list directory. These tools have no concept of code structure, relevance, or scope. The agent decides what to read based on file names and grep results, which leads to reading far more than necessary.
With purpose-built MCP servers, agents get access to high-level, structured tools that return exactly the context needed:
- Code intelligence MCP servers (like pitlane-mcp) provide symbol lookup, call graphs, type hierarchies, and signature extraction. Instead of grepping for a function name across 200 files, the agent calls a single tool that returns the definition location, signature, and callers.
- Database MCP servers provide schema-aware query tools. The agent sees table schemas and relationships without reading migration files or ORM models in full.
- Documentation MCP servers serve relevant docs on demand. Instead of stuffing the system prompt with 10,000 tokens of API documentation, the agent queries for the specific endpoint or method it needs.
- Git MCP servers provide diff-aware context. When fixing a bug, the agent can see what changed recently in the relevant files instead of reading the entire file history.
The compound effect is significant. A well-configured MCP server stack acts as a context relevance filter between the codebase and the agent. Each tool call returns the minimum viable context for the agent to make progress, and the agent can always request more if needed.
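A scoped tool can be sketched as a lookup over a prebuilt index. The tool name, index shape, and return format below are illustrative assumptions, not any particular server's API:

```typescript
// Hypothetical shape of a scoped "lookup_symbol" tool backed by an
// in-memory code index. Paths and signatures here are illustrative.
interface SymbolInfo {
  file: string;
  signature: string;
  callers: string[];
}

const index = new Map<string, SymbolInfo>([
  ["UserProfile", {
    file: "src/components/UserProfile.tsx",
    signature: "export function UserProfile(props: UserProfileProps): JSX.Element",
    callers: ["src/pages/Account.tsx"],
  }],
]);

// Returns the minimum viable context for one symbol: location, signature,
// and callers, instead of the full contents of every matching file.
function lookupSymbol(name: string): string {
  const info = index.get(name);
  if (!info) return `symbol "${name}" not found`;
  return `${info.file}\n${info.signature}\ncallers: ${info.callers.join(", ")}`;
}
```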
For desktop automation workflows, the same principle applies. Tools like Fazm use MCP to scope their context to the specific application and UI state being automated, rather than capturing full-screen screenshots or entire accessibility trees. An agent automating a specific dialog box only receives context about that dialog, not the entire desktop state.
## 5. Comparison: Full Context vs Signature-Only vs Hybrid
Here is a concrete comparison for a typical coding task: implementing a new API endpoint that uses 3 existing services, 2 database models, and 1 utility module. The agent needs to understand the existing code, then write the new endpoint.
| Approach | Files Read | Input Tokens | Cost (Sonnet 4) | Quality |
|---|---|---|---|---|
| Full context | 12-18 full files | ~45,000 | $0.14 | Baseline |
| Signature-only | 6 signatures + 2 full files | ~12,000 | $0.04 | 95% of baseline |
| Hybrid (signatures first, expand on demand) | 6 signatures + 4 targeted reads | ~18,000 | $0.05 | 98% of baseline |
| Full context + conversation history | 12-18 full files (re-sent each turn) | ~180,000 (over 4 turns) | $0.54 | Baseline |
The hybrid approach is the sweet spot. Agents start with signatures and structural information, then selectively expand to full implementations only for the specific functions they need to understand deeply. This preserves nearly all output quality while cutting token usage by 60-75%.
The last row highlights a critical point: conversation history amplifies the cost of every early file read. A file read in turn 1 gets re-sent as input in turns 2, 3, and 4. With the hybrid approach, early turns use compact signatures, so the conversation history stays lean even as the task progresses.
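History amplification is simple to model: each turn re-sends everything before it as input. The per-turn context numbers below are illustrative, chosen to match the table's scenarios:

```typescript
// Cumulative input tokens across a conversation where every turn
// re-sends the full prior history. `perTurnContext` is the new context
// added at each turn (file reads, signatures, expansions).
function cumulativeInputTokens(perTurnContext: number[]): number {
  let history = 0;
  let total = 0;
  for (const added of perTurnContext) {
    history += added; // context added this turn stays in history
    total += history; // the whole history is re-sent as input
  }
  return total;
}

// Full context: 45K tokens read up front, re-sent for 4 turns.
const fullContext = cumulativeInputTokens([45_000, 0, 0, 0]); // 180,000
// Hybrid: 12K of signatures, then small targeted expansions.
const hybrid = cumulativeInputTokens([12_000, 2_000, 2_000, 2_000]); // 60,000
```

The lean early turns are what keep the hybrid approach cheap: the savings compound on every subsequent inference.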
## 6. Model Routing and Tiered Intelligence
Not every agent action requires the same model capability. Reading a file and extracting a function signature does not need Opus-level reasoning. Deciding on a complex architectural approach does. Model routing exploits this by directing different sub-tasks to different models based on complexity.
A practical tiered approach for coding agents:
- Tier 1 - Fast/cheap model (Haiku, GPT-4o-mini) - File navigation, pattern matching, simple code lookups, formatting, test generation from templates. These are mechanical tasks where speed matters more than reasoning depth. Cost: roughly $0.25-0.50 per million input tokens.
- Tier 2 - Mid-range model (Sonnet, GPT-4o) - Feature implementation, bug fixing, code review, refactoring. The bulk of coding work falls here. Good reasoning with reasonable cost. Cost: roughly $3 per million input tokens.
- Tier 3 - Reasoning model (Opus, o1) - Architecture decisions, complex debugging, security analysis, performance optimization. Reserve for tasks where deep reasoning directly impacts quality. Cost: roughly $15 per million input tokens.
In practice, 60-70% of agent actions fall into Tier 1, 25-30% into Tier 2, and only 5-10% truly need Tier 3. If you are running everything on Opus, switching to a routed approach can reduce costs by 5-8x with minimal quality impact on the final output.
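A minimal router can be a static task-to-tier table. The task categories and model names below are illustrative; production routing would classify tasks dynamically rather than from a fixed map:

```typescript
type Tier = "tier1" | "tier2" | "tier3";

// Illustrative task classification following the tiers above.
const tierByTask: Record<string, Tier> = {
  "file-navigation": "tier1",
  "formatting": "tier1",
  "feature-implementation": "tier2",
  "bug-fix": "tier2",
  "architecture-decision": "tier3",
  "security-analysis": "tier3",
};

const modelByTier: Record<Tier, string> = {
  tier1: "claude-haiku",
  tier2: "claude-sonnet",
  tier3: "claude-opus",
};

function routeModel(task: string): string {
  // Default unknown tasks to the mid-range tier, not the most expensive one.
  return modelByTier[tierByTask[task] ?? "tier2"];
}
```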
Claude Code's built-in model routing already does some of this automatically, using Haiku for tool calls and Sonnet/Opus for reasoning. But you can push this further with custom MCP server configurations that pre-process context through cheaper models before sending refined queries to expensive ones.
## 7. Practical Tips You Can Implement Today
You do not need a complete infrastructure overhaul. Here are concrete steps you can take this week to start reducing token costs:
### Set up a code intelligence MCP server
Install pitlane-mcp or a similar tree-sitter-based code indexer. Configure it in your Claude Code MCP settings so the agent automatically has access to signature-only retrieval. The agent will start using structural queries instead of full-file reads without any prompting changes.
### Write a focused CLAUDE.md file
Your project's CLAUDE.md (or equivalent system prompt) sets the baseline context for every agent session. Keep it under 500 tokens. Remove boilerplate, link to docs instead of inlining them, and focus on information the agent needs for every task. A bloated CLAUDE.md costs tokens on every single inference call.
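A quick guard for that budget, again using the approximate 4-characters-per-token heuristic rather than a real tokenizer:

```typescript
// True if a CLAUDE.md (or any system-prompt file) likely exceeds the
// token budget, estimated at ~4 characters per token.
function overBudget(claudeMd: string, budgetTokens = 500): boolean {
  return Math.ceil(claudeMd.length / 4) > budgetTokens;
}
```

Dropping this into a pre-commit hook keeps the file from quietly bloating over time.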
### Use line-range reads instead of full-file reads
When you know which function or section you need, read just those lines. Claude Code's Read tool supports offset and limit parameters. A targeted 20-line read is 10-50x cheaper than reading the full file. Train your workflow to locate symbols first (via grep or index), then read just the relevant range.
### Scope parallel agents tightly
When running multiple agents, give each one a narrow task description and constrained file scope. An agent told to "implement the payment webhook handler in src/api/webhooks/" will read fewer files than one told to "add Stripe payment support." Narrow scope reduces exploratory file reads, which is the biggest source of wasted tokens in parallel setups.
### Monitor and baseline your token usage
You cannot optimize what you do not measure. Track token usage per task type and build baselines. If your typical feature implementation uses 80K input tokens, investigate any run that exceeds 150K. Common culprits are agents stuck in loops, reading the same files repeatedly, or exploring unrelated parts of the codebase.
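A baseline check can be a few lines over your usage logs. The threshold multiplier below is an illustrative starting point, not a recommendation:

```typescript
// Flag runs whose input-token usage exceeds a multiple of the per-task
// baseline. Tasks with no baseline are never flagged.
function flagAnomalies(
  runs: { task: string; inputTokens: number }[],
  baselines: Record<string, number>,
  multiplier = 1.8,
): string[] {
  return runs
    .filter(r => r.inputTokens > (baselines[r.task] ?? Infinity) * multiplier)
    .map(r => r.task);
}
```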
### Cache frequently accessed context
Anthropic's prompt caching reduces input token costs for repeated content. Ensure your system prompt and frequently referenced files benefit from caching by structuring them as stable prefixes. Files that change rarely (types, configs, API schemas) are ideal caching candidates.
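Structurally, that means putting stable content first and marking it cacheable. This sketch follows the shape of Anthropic's `cache_control` API; the file contents and ordering are illustrative:

```typescript
// Request shaped so the stable context forms a cacheable prefix.
const request = {
  system: [
    {
      type: "text",
      text: "<project CLAUDE.md and stable type definitions here>",
      // Mark the stable prefix as cacheable; subsequent calls that share
      // this prefix read it back at a reduced input-token rate.
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    // Volatile, per-task content goes after the cached prefix.
    { role: "user", content: "Implement the new /users endpoint." },
  ],
};
```

The key design constraint is ordering: anything above the cache marker must be byte-stable across calls, or the cache misses.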
### Clean up conversation history
Long agent sessions accumulate massive context. If you are on turn 20 of a conversation, every message from turns 1-19 is being re-sent as input. Start new sessions for new tasks instead of continuing old ones. For long-running tasks, consider periodic conversation compaction, where a cheaper model summarizes the conversation so far and the agent continues with the summary instead of the full history.
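Compaction itself is mechanical; the summarization call is the only model-dependent piece. In this sketch, `summarize` is a hypothetical stand-in for a call to a cheaper model:

```typescript
interface Turn { role: "user" | "assistant"; content: string }

// Replace all but the most recent `maxTurns` turns with a single summary
// turn. `summarize` stands in for a cheap-model summarization call.
function compact(
  history: Turn[],
  maxTurns: number,
  summarize: (turns: Turn[]) => string,
): Turn[] {
  if (history.length <= maxTurns) return history;
  const old = history.slice(0, -maxTurns);
  const keep = history.slice(-maxTurns);
  return [
    { role: "user", content: `Summary of earlier turns: ${summarize(old)}` },
    ...keep,
  ];
}
```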
## Efficient Agent Automation
Fazm uses MCP-based context scoping and accessibility APIs to keep token costs low while automating desktop workflows across macOS applications.