Stop Burning Money on API Fees

Matthew Diakonov · 15 min read

A single agent running overnight without a budget cap burned $1,200 in API fees. It was stuck in a retry loop, calling the same endpoint thousands of times, getting the same error, and trying again. Nobody noticed until the invoice arrived.

That story is mild. In November 2025, four LangChain agents in a research pipeline entered an infinite conversation loop - two of them ping-ponging requests back and forth for 11 days straight. The bill: $47,000. In February 2026, a data enrichment agent misinterpreted an API error code as "try again with different parameters" and ran 2.3 million API calls over a single weekend. Another $47,000.

These are not edge cases. According to the 2025 State of AI Cost Management Report, 80% of enterprises underestimate their AI infrastructure costs by more than 25%. And with agentic AI going mainstream in 2026, the problem is accelerating. After launch, most organizations report $3,200 to $13,000 per month in operational spend just on LLM API tokens, vector database hosting, monitoring, and prompt tuning.

This guide breaks down exactly how to prevent runaway costs, implement budget controls at every layer, and cut your API spending by 60-80% without sacrificing output quality.

Why AI Agents Are Uniquely Dangerous for Your Wallet

Traditional API usage is predictable. A web app makes a fixed number of calls per user interaction. You can estimate monthly costs from traffic patterns. Agents are different in three fundamental ways.

Agents are persistent by design. They keep trying until they succeed. When the task is impossible or the API is returning errors, persistence becomes expensive. An agent does not know when to quit - it interprets failure as a reason to try harder.

Agents make decisions about their own resource usage. A traditional app calls the API once per request. An agent might decide it needs to call the API 50 times to complete a single task - breaking a problem into sub-tasks, verifying its own output, retrying with different approaches. Each decision multiplies cost.

Agent loops are semantic, not syntactic. Unlike a while(true) bug that a linter can flag, an agent loop looks like legitimate work: the agent is generating thoughts, calling tools, and processing outputs, and it believes it is making progress. In reality it is trapped in a logical cul-de-sac, but from the outside it looks busy. That makes these loops much harder to detect than traditional infinite loops.

A single runaway loop running for two hours at GPT-4o rates can cost $15 to $40 depending on context size. Scale that to a fleet of agents running overnight or over a weekend, and the numbers get ugly fast.

The Six Layers of Budget Protection

Sustainable agent operations require defense in depth. No single control is sufficient. You need multiple overlapping layers, each catching what the others miss.

Layer 1: Per-Request Token Limits

Every API call should have a max_tokens parameter set. This is the most basic control - it caps how much a single response can cost.

# Always set max_tokens on every API call
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=2048,  # Hard ceiling per response
    temperature=0.7
)

For Anthropic's Claude API, the same principle applies:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,  # Required parameter - also your per-response cost ceiling
    messages=messages
)

Set this based on the task. A classification task needs maybe 50 tokens. A code generation task might need 4,000. Do not use the model's maximum context window as your default - that is how you burn money on tasks that should cost pennies.
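
One lightweight way to enforce that is a per-task-type ceiling looked up before each call. The task names and limits below are illustrative defaults, not values from this article - tune them for your own workloads:

# Illustrative per-task token ceilings - tune for your own workloads
TASK_TOKEN_LIMITS = {
    "classification": 50,
    "extraction": 256,
    "summarization": 1024,
    "code_generation": 4096,
}

def max_tokens_for(task_type: str) -> int:
    # Fall back to a conservative default, never the model's full context window
    return TASK_TOKEN_LIMITS.get(task_type, 1024)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=max_tokens_for("classification"),
)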

Layer 2: Per-Task Spending Caps

Wrap each agent task in a budget tracker that sums the cost of every API call within that task. When the budget is exhausted, the task terminates gracefully.

class BudgetExhaustedError(Exception):
    """Raised when a task exceeds its spending cap."""

class TaskBudget:
    def __init__(self, max_cost_usd: float):
        self.max_cost = max_cost_usd
        self.spent = 0.0
        self.call_count = 0

    def track_call(self, input_tokens: int, output_tokens: int, model: str):
        cost = self._calculate_cost(input_tokens, output_tokens, model)
        self.spent += cost
        self.call_count += 1
        if self.spent >= self.max_cost:
            raise BudgetExhaustedError(
                f"Task budget of ${self.max_cost:.2f} exceeded. "
                f"Spent ${self.spent:.2f} across {self.call_count} calls."
            )
        return cost

    def _calculate_cost(self, input_tokens, output_tokens, model):
        # Prices per million tokens as of March 2026
        prices = {
            "gpt-4o":           {"input": 2.50, "output": 10.00},
            "gpt-5.2":          {"input": 1.75, "output": 14.00},
            "gpt-5-mini":       {"input": 0.25, "output": 2.00},
            "claude-opus-4":    {"input": 15.00, "output": 75.00},
            "claude-sonnet-4":  {"input": 3.00, "output": 15.00},
            "claude-haiku-4.5": {"input": 0.80, "output": 4.00},
        }
        p = prices.get(model, {"input": 5.0, "output": 15.0})
        return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

A good starting point: set per-task budgets at 3-5x what you expect the task to cost under normal operation. This gives enough room for retries and edge cases without allowing runaway spending.
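
In practice, you wrap the agent's call loop in the tracker and let the exception end the task. Here is a minimal usage sketch - task, agent_step, and log_and_fail_task are hypothetical stand-ins for your own agent loop:

# Hypothetical usage: agent_step() is whatever issues one model call for your agent
budget = TaskBudget(max_cost_usd=0.50)  # ~3-5x the expected cost of this task

try:
    while not task.done:
        response = agent_step(task)
        budget.track_call(
            input_tokens=response.usage.prompt_tokens,
            output_tokens=response.usage.completion_tokens,
            model="gpt-4o",
        )
except BudgetExhaustedError as e:
    log_and_fail_task(task, reason=str(e))  # Terminate gracefully, keep partial results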

Layer 3: Daily and Monthly Spending Caps

Both OpenAI and Anthropic support monthly spending limits directly in their dashboards. Set these as your absolute backstop. But do not rely on them alone - by the time a dashboard limit kicks in, you may have already burned through more than you wanted.

Implement your own daily caps in your agent orchestration layer:

from datetime import date

class DailyBudgetExhaustedError(Exception):
    """Raised when the daily spending cap is reached."""

class DailyBudget:
    def __init__(self, daily_limit_usd: float):
        self.daily_limit = daily_limit_usd
        self.today = date.today()
        self.spent_today = 0.0

    def check_and_track(self, cost: float):
        # Reset the counter at midnight
        if date.today() != self.today:
            self.today = date.today()
            self.spent_today = 0.0

        self.spent_today += cost
        if self.spent_today >= self.daily_limit:
            self._alert_and_pause()

    def _alert_and_pause(self):
        # send_alert is your notification hook - Slack, PagerDuty, email, etc.
        send_alert(f"Daily budget of ${self.daily_limit} reached.")
        raise DailyBudgetExhaustedError()

Set alerts at 50% and 80% thresholds so you can investigate before hitting the hard limit. A sudden spike from $10/day to $30/day is a signal worth investigating even if it is within your cap.
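
One way to wire that up is to extend check_and_track from the DailyBudget class above with soft thresholds that alert without pausing. A sketch, using the 50% and 80% marks mentioned here:

    # Inside DailyBudget: soft warning thresholds that alert but do not pause
    WARN_THRESHOLDS = (0.5, 0.8)

    def check_and_track(self, cost: float):
        if date.today() != self.today:
            self.today = date.today()
            self.spent_today = 0.0

        before = self.spent_today
        self.spent_today += cost

        # Fire a warning the first time spend crosses each soft threshold
        for threshold in self.WARN_THRESHOLDS:
            mark = self.daily_limit * threshold
            if before < mark <= self.spent_today:
                send_alert(
                    f"Daily spend passed {int(threshold * 100)}% "
                    f"of the ${self.daily_limit} cap."
                )

        if self.spent_today >= self.daily_limit:
            self._alert_and_pause()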

Layer 4: Loop Detection

This is the layer most teams miss. A loop detector watches for repeated patterns in agent behavior and kills the loop before it drains the budget.

The simplest approach is counting consecutive tool calls with the same function name and similar arguments:

class LoopDetectedError(Exception):
    """Raised when the agent appears stuck repeating the same call."""

class LoopDetector:
    def __init__(self, max_similar_calls: int = 5, similarity_threshold: float = 0.9):
        self.recent_calls = []
        self.max_similar = max_similar_calls
        self.threshold = similarity_threshold

    def check(self, tool_name: str, arguments: dict):
        call_signature = {"tool": tool_name, "args": arguments}
        self.recent_calls.append(call_signature)

        # Keep a sliding window
        if len(self.recent_calls) > 20:
            self.recent_calls.pop(0)

        # Count similar recent calls
        similar_count = sum(
            1 for c in self.recent_calls[-self.max_similar:]
            if c["tool"] == tool_name
            and self._args_similar(c["args"], arguments)
        )

        if similar_count >= self.max_similar:
            raise LoopDetectedError(
                f"Agent made {similar_count} similar calls to {tool_name}. "
                f"Likely stuck in a retry loop."
            )

    def _args_similar(self, a: dict, b: dict) -> bool:
        # Simple overlap check: fraction of matching key/value pairs
        if not a and not b:
            return True
        keys = set(a) | set(b)
        matching = sum(1 for k in keys if a.get(k) == b.get(k))
        return matching / len(keys) >= self.threshold

A more sophisticated approach uses embedding similarity over the agent's last N messages. If the cosine similarity between recent messages exceeds 0.95, the agent is probably repeating itself. This catches semantic loops where the agent is rephrasing the same request rather than making identical calls.
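
A minimal sketch of that idea, assuming you already have an embedding function that returns a vector for a piece of text (a sentence-transformers-style encode, for example) - the window size and threshold are yours to tune:

import numpy as np

def is_semantic_loop(recent_messages: list[str], embed, threshold: float = 0.95, window: int = 5) -> bool:
    # embed(text) is assumed to return a 1-D numpy vector (e.g. from a sentence-transformers model)
    if len(recent_messages) < window:
        return False
    vectors = [embed(m) for m in recent_messages[-window:]]

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # If every recent message is near-identical in meaning, the agent is repeating itself
    pairs = [(i, j) for i in range(window) for j in range(i + 1, window)]
    return all(cosine(vectors[i], vectors[j]) >= threshold for i, j in pairs)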

Layer 5: Rate Limiting on External API Calls

Even with budget caps, you want rate limits to slow down runaway agents before they hit those caps. Implement exponential backoff with jitter on all external API calls:

import time
import random

from openai import RateLimitError  # Or your provider SDK's equivalent rate-limit exception

class MaxRetriesExceededError(Exception):
    """Raised when an API call keeps failing after the retry budget is spent."""

def call_with_backoff(func, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError as e:
            if attempt == max_retries - 1:
                # Retrying will not fix a persistent error - surface it and stop
                raise MaxRetriesExceededError(str(e)) from e
            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

The key insight: set max_retries low. Three to five retries is enough for transient errors. If the API is returning errors after five attempts, the problem is not transient and retrying will not fix it. Kill the task and alert a human.

Layer 6: Human Escalation

Every agent system needs a named human who gets paged when thresholds are breached. Not a team. Not a shared channel. A specific person with the authority and context to decide whether a runaway agent should be killed or allowed to continue.

Configure alerts for the following (a small rules sketch follows the list):

  • Any single task exceeding 2x its expected budget
  • Daily spend exceeding 80% of the daily cap
  • Any agent making more than 100 API calls on a single task
  • Spending rate jumping 3x above the trailing 7-day average
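
Here is a minimal sketch of those rules as code. The SpendMetrics snapshot, current_metrics, and page_on_call_owner are hypothetical - plug in your own telemetry and paging hook:

from dataclasses import dataclass

@dataclass
class SpendMetrics:
    # Hypothetical snapshot pulled from your own telemetry
    task_spend: float
    task_expected: float
    task_call_count: int
    daily_spend: float
    daily_cap: float
    spend_rate: float          # e.g. dollars per hour right now
    trailing_7d_rate: float    # same unit, trailing 7-day average

def breached_rules(m: SpendMetrics) -> list[str]:
    rules = {
        "task over 2x expected budget": m.task_spend > 2 * m.task_expected,
        "daily spend over 80% of cap": m.daily_spend > 0.8 * m.daily_cap,
        "task made over 100 API calls": m.task_call_count > 100,
        "spend rate 3x trailing 7-day average": m.spend_rate > 3 * m.trailing_7d_rate,
    }
    return [name for name, breached in rules.items() if breached]

for rule in breached_rules(current_metrics):   # current_metrics: your live snapshot
    page_on_call_owner(rule)                   # Page the named human, not a shared channel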

Model Routing: The Biggest Cost Lever

Budget controls prevent disasters. Model routing prevents waste. The difference between using Claude Opus for every task and routing intelligently can be 60-80% of your total spend.

Current Pricing Landscape (March 2026)

Here is what the major models cost per million tokens:

Model               Input ($/M)   Output ($/M)   Best For
GPT-5.2             $1.75         $14.00         Complex reasoning
GPT-5 mini          $0.25         $2.00          General tasks
GPT-5 nano          $0.05         $0.40          Simple classification
Claude Opus 4       $15.00        $75.00         Deep analysis
Claude Sonnet 4     $3.00         $15.00         Balanced performance
Claude Haiku 4.5    $0.80         $4.00          Fast, simple tasks

The price spread is enormous. Claude Opus output tokens cost 187x more than GPT-5 nano output tokens. If you are routing every task to your most capable model, you are almost certainly overpaying.

Classifier-Based Routing

The most effective routing strategy trains a lightweight classifier to predict which model handles each query best. The classifier analyzes the incoming prompt and routes to the predicted optimal model.

class ModelRouter:
    def __init__(self):
        self.classifier = load_task_classifier()  # A small fine-tuned model

    def route(self, prompt: str, task_metadata: dict) -> str:
        complexity = self.classifier.predict(prompt)

        if complexity == "simple":
            # Classification, extraction, formatting
            return "gpt-5-nano"       # $0.05 / $0.40 per 1M tokens
        elif complexity == "moderate":
            # Summarization, general Q&A, code completion
            return "claude-haiku-4.5"  # $0.80 / $4.00 per 1M tokens
        elif complexity == "complex":
            # Multi-step reasoning, code generation
            return "claude-sonnet-4"   # $3.00 / $15.00 per 1M tokens
        else:
            # Novel problems, creative work, critical decisions
            return "claude-opus-4"     # $15.00 / $75.00 per 1M tokens

Research shows classifier-based routers approach best-single-model performance at significantly lower average cost. Most teams see a 50-70% reduction in API costs from routing alone.

Cascade Routing

An even more cost-effective approach is cascade routing - start with the cheapest model and only escalate when the output quality is insufficient.

async def cascade_call(prompt: str, quality_threshold: float = 0.8):
    models = ["gpt-5-nano", "gpt-5-mini", "claude-sonnet-4", "claude-opus-4"]

    for model in models:
        response = await call_model(model, prompt)
        quality_score = evaluate_response(response, prompt)

        if quality_score >= quality_threshold:
            return response  # Good enough - stop here

    # If we get here, use the last (most expensive) model's response
    return response

The key insight from research: most queries do not need escalation. In typical workloads, 60-70% of tasks are handled by the cheapest model, 20-25% by the mid-tier, and only 5-10% require the most expensive model. Combining routing with cascading achieves roughly 14% better cost-quality tradeoffs than either technique alone.

Prompt Caching: Free Money You Are Leaving on the Table

If your agents send system prompts, tool definitions, or other static content with every request, you are paying full price for the same tokens over and over. Prompt caching fixes this.

Both OpenAI and Anthropic offer prompt caching. On Anthropic, reading cached content costs only 10% of the base input price, while writing to the cache costs 25% more than the base price - which means caching pays for itself after a single cache read. OpenAI discounts cached input tokens automatically.

For agents, the impact is dramatic. A typical agent system prompt with tool definitions might be 3,000-5,000 tokens. If the agent makes 100 calls per task, that is 300,000-500,000 tokens of repeated input. At Claude Sonnet rates ($3/M input tokens), that is $0.90-$1.50 per task in wasted system prompt tokens. With caching, it drops to $0.09-$0.15.

To use prompt caching with Anthropic:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Your long system prompt with tool definitions...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=messages
)

OpenAI applies prompt caching automatically for prompts longer than 1,024 tokens. No code changes needed - just make sure your system prompts are consistent across requests.
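
Because OpenAI's cache matches on the prompt prefix, keep the static content first and the per-request content last. A sketch of that ordering - the variable names are illustrative:

# Static content first (cacheable prefix), per-request content last
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},        # Identical across requests
    {"role": "user", "content": FEW_SHOT_EXAMPLES},       # Also identical across requests
    {"role": "user", "content": current_task_input},      # Varies per request
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=1024,
)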

Real-world results: teams report 60-90% reduction in input token costs from prompt caching alone, with latency improvements of up to 85% for cached prompts.

Semantic Caching: Avoid Calling the API Entirely

Prompt caching reduces the cost of repeated static content. Semantic caching goes further - it avoids calling the API at all for questions the agent has already answered.

Semantic caching works by converting queries into vector embeddings and measuring cosine similarity against previous queries. When similarity exceeds a threshold (typically 0.90-0.95), the system returns the cached response instead of calling the LLM.

from sqlalchemy import Column, Integer, String, Text
from sqlalchemy.orm import declarative_base
from pgvector.sqlalchemy import Vector

Base = declarative_base()

class CacheEntry(Base):
    __tablename__ = "semantic_cache"
    id = Column(Integer, primary_key=True)
    prompt = Column(Text)
    response = Column(Text)
    model = Column(String)
    embedding = Column(Vector(384))  # Dimension must match your embedding model

class SemanticCache:
    def __init__(self, db_session, similarity_threshold: float = 0.92):
        self.db = db_session                    # SQLAlchemy session on Postgres + pgvector
        self.threshold = similarity_threshold
        self.embedding_model = load_embedding_model()  # Any model with an encode() method

    def get_or_call(self, prompt: str, model: str, call_fn):
        # Generate embedding for the prompt
        embedding = self.embedding_model.encode(prompt)

        # Search for the closest cached prompt within the similarity threshold
        cached = (
            self.db.query(CacheEntry)
            .filter(CacheEntry.embedding.cosine_distance(embedding) < (1 - self.threshold))
            .order_by(CacheEntry.embedding.cosine_distance(embedding))
            .first()
        )

        if cached:
            return cached.response  # Cache hit - no API call

        # Cache miss - call the API
        response = call_fn(prompt, model)

        # Store for future cache hits
        self.db.add(CacheEntry(
            prompt=prompt,
            embedding=embedding,
            response=response,
            model=model
        ))
        self.db.commit()
        return response

In production deployments, teams have reported a 38% reduction in direct LLM API calls with a 62% hit rate on cacheable queries. Average latency for cache hits drops to around 250ms, compared to 1.5 seconds or more for live LLM calls.

The caveat: semantic caching works best for tasks with repeatable queries - customer support, data classification, document processing. For novel creative tasks or unique analysis, cache hit rates will be low.

Batch Processing: 50% Off for Non-Urgent Work

Both OpenAI and Anthropic offer batch APIs that process requests asynchronously at a 50% discount. If your agent workload includes tasks that do not need real-time responses - content generation, data classification, report summarization, email drafting - batch them.

# Anthropic batch API example
import anthropic

client = anthropic.Anthropic()

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"task-{i}",
            "params": {
                "model": "claude-sonnet-4-20250514",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}]
            }
        }
        for i, prompt in enumerate(prompts)
    ]
)

# Poll for completion (typically within 24 hours), then stream the results
batch = client.messages.batches.retrieve(batch.id)
if batch.processing_status == "ended":
    for result in client.messages.batches.results(batch.id):
        handle_result(result)  # Your own handler for each completed request

The tradeoff is latency. Batch requests can take up to 24 hours to complete. But for background agent tasks - processing overnight, generating reports, classifying backlogs - that latency is irrelevant and the 50% discount is pure savings.

The Sustainability Equation

An AI agent is only useful if it costs less than the value it produces. Here is a simple framework for evaluating whether your agent spending makes sense:

Calculate the value per task:

  • How long would a human take to do this task?
  • What is that human's fully loaded hourly cost?
  • Value per task = human time x hourly cost

Calculate the agent cost per task:

  • Average API spend per task completion
  • Include failed attempts and retries
  • Include infrastructure and monitoring costs

The math must work (a quick sketch follows the list):

  • A $50/day agent that saves four hours of a $75/hour employee's time produces $300 of value. That is sustainable - 6x ROI.
  • A $500/day agent doing the same job produces the same $300 of value. That is a $200/day loss.
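
A minimal sketch of that break-even check - the numbers mirror the examples above, and the function name is illustrative:

def agent_roi(hours_saved: float, hourly_cost: float, agent_cost_per_day: float) -> float:
    # Value produced per day divided by what the agent costs to run per day
    value_per_day = hours_saved * hourly_cost
    return value_per_day / agent_cost_per_day

print(agent_roi(hours_saved=4, hourly_cost=75, agent_cost_per_day=50))   # 6.0 - sustainable
print(agent_roi(hours_saved=4, hourly_cost=75, agent_cost_per_day=500))  # 0.6 - losing money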

The optimization strategies above - model routing, prompt caching, semantic caching, batch processing - exist to push your agent costs down so the math works for more use cases. A task that is not viable at Claude Opus rates might be perfectly viable with intelligent routing to Haiku for 80% of the sub-tasks.

Implementation Checklist

If you are running AI agents in production or planning to, here is what to implement in priority order:

  1. Set max_tokens on every API call. Five minutes of work. Prevents any single call from being catastrophically expensive.
  2. Enable provider-side spending limits. Go to your OpenAI and Anthropic dashboards right now and set monthly caps. Ten minutes.
  3. Implement per-task budget tracking. Wrap your agent loop in a cost tracker that terminates on budget exhaustion. A few hours of work.
  4. Add loop detection. Count similar consecutive calls and kill loops after 5 repetitions. A few hours.
  5. Set up alerting. Connect budget warnings to Slack, PagerDuty, or email so anomalies are caught in minutes, not days.
  6. Implement model routing. Start simple - use cheap models for classification and extraction, expensive models for reasoning. Iterate from there.
  7. Enable prompt caching. Add cache_control to your system prompts. Minimal code change, major cost reduction.
  8. Evaluate semantic caching. Worth it if your agents handle repeatable query patterns. Not worth the complexity for purely novel tasks.
  9. Move non-urgent work to batch APIs. 50% discount for anything that does not need a real-time response.

Each layer compounds. Model routing cuts costs by 60%. Prompt caching cuts the remaining costs by another 60-90% on input tokens. Semantic caching eliminates another 30-40% of calls entirely. Combined, teams report total cost reductions of 80-90% compared to naive implementations.
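
To see how the layers compound, here is a rough back-of-the-envelope calculation. The input/output split and the per-layer savings are illustrative assumptions, not measurements:

# Illustrative compounding of the layers (assumed figures, not benchmarks)
baseline = 1.00                                  # Normalized spend before optimization

after_routing = baseline * (1 - 0.60)            # Model routing: ~60% off everything
input_share = 0.50                               # Assume half the remaining spend is input tokens
after_caching = after_routing * (1 - 0.75 * input_share)   # Prompt caching: ~75% off input tokens
after_semantic = after_caching * (1 - 0.35)      # Semantic caching: ~35% of calls never happen

print(f"Remaining spend: {after_semantic:.0%} of baseline")  # ~16%, roughly an 84% total reduction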

The Bottom Line

The difference between a sustainable AI agent operation and a money pit is not the agent's capability - it is the controls around it. Every production agent system needs budget caps at multiple levels, intelligent model routing, caching at every layer, and a human in the loop when things go sideways.

The $47,000 overnight bills are not inevitable. They are the result of deploying agents without the controls that any production system requires. Build the controls first, then scale the agents.

Fazm is an open-source macOS AI agent with built-in cost awareness, available on GitHub.
