Custom API Endpoints for AI Agents: Proxy Setups, LLM Routing, and Cost Optimization

A developer recently shared how they built a local Node.js proxy that translates Anthropic-format requests into GitHub Copilot Business calls, effectively running their entire AI agent setup on an existing $19/seat subscription. The reaction was immediate: dozens of developers had been sitting on unused Copilot seats and never thought to do the same thing. This guide covers the full landscape of custom API endpoints for AI agents, from proxy architecture to provider comparison to practical setup steps.

$0 extra API cost

Built a GitHub Copilot proxy to avoid paying for Claude. Local Node proxy translates Anthropic requests to Copilot Business calls. Pointed my agent at it and it just worked.

Developer community

1. Why Developers Build Custom API Proxies

AI agents that control a browser, write code, or automate desktop workflows make a lot of API calls. Every step in a multi-step task (reading context, deciding on an action, verifying the result) requires at least one round trip to a language model. A single hour of active agent use can burn through hundreds of thousands of tokens.

At standard retail API pricing, that volume adds up to real money quickly. But most developers and teams already have LLM capacity they are paying for elsewhere: GitHub Copilot Business seats, Azure OpenAI deployments, AWS Bedrock contracts, or spare GPUs running local models. The proxy pattern lets you route agent traffic through that existing capacity instead of paying again through a direct API.

Cost is the most common driver, but it is not the only one:

  • Compliance: Regulated industries cannot send data to third-party APIs without approval. A corporate gateway keeps all LLM traffic on approved infrastructure while still enabling AI automation.
  • Experimentation: A proxy lets you swap the underlying model without changing any agent configuration. Run A/B tests between Claude and GPT-4o, or try a new model release without touching production code.
  • Rate limit management: Centralized proxies can pool API keys, balance across multiple accounts, and retry intelligently when individual keys hit limits.
  • Audit logging: Corporate gateways log every request with caller identity, making it easy to track which agents are consuming what capacity and catch runaway costs early.

Key insight: An AI agent's value comes from its ability to perceive context and take action, not from which specific LLM endpoint it calls. Decoupling the two gives you control over cost, latency, compliance, and model choice simultaneously.

2. The GitHub Copilot Proxy Pattern

GitHub Copilot Business ($19/seat/month) includes access to Claude and other frontier models through the Copilot API. The access is bundled into the seat price, meaning that if your team already has Copilot seats, you are paying for model capacity you may not be using fully.

The proxy approach works as follows: a lightweight local server (Node.js or Python) listens on a localhost port and accepts requests in the Anthropic messages format. When a request arrives, the proxy translates it to the format the Copilot API expects, forwards it, and maps the response back to Anthropic format before returning it to the agent. The agent never knows a proxy is involved.

The minimum viable version of this proxy is under 100 lines:

// Minimal Anthropic-to-Copilot translation proxy (Node.js 18+, global fetch)
const express = require('express');
const app = express();
app.use(express.json());

// Anthropic message content may be a plain string or an array of content
// blocks; the Copilot / OpenAI chat format expects plain strings.
const flatten = (content) =>
  typeof content === 'string'
    ? content
    : content.filter((b) => b.type === 'text').map((b) => b.text).join('');

app.post('/v1/messages', async (req, res) => {
  const { model, messages, system, max_tokens } = req.body;

  // Translate to Copilot / OpenAI chat completions format
  const openaiMessages = messages.map((m) => ({ role: m.role, content: flatten(m.content) }));
  if (system) openaiMessages.unshift({ role: 'system', content: flatten(system) });

  const response = await fetch('https://api.githubcopilot.com/chat/completions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.COPILOT_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ model: 'gpt-4o', messages: openaiMessages, max_tokens }),
  });

  // Surface upstream failures instead of returning a malformed message
  if (!response.ok) {
    return res.status(response.status).json({ type: 'error', error: await response.text() });
  }

  const data = await response.json();

  // Map back to Anthropic response format
  res.json({
    id: data.id,
    type: 'message',
    role: 'assistant',
    content: [{ type: 'text', text: data.choices[0].message.content }],
    model,
    stop_reason: 'end_turn',
    usage: { input_tokens: data.usage.prompt_tokens, output_tokens: data.usage.completion_tokens },
  });
});

app.listen(8080, '127.0.0.1');

With this running, you point your agent at http://localhost:8080 instead of https://api.anthropic.com and it just works.

Terms of service note: GitHub Copilot Business is licensed for development assistance. Using a proxy to route high-volume automation traffic through it may not fall within fair use depending on your agreement. Review your terms before relying on this pattern in production. Many teams use it for prototyping and switch to a dedicated API plan once they validate the workflow.

Already paying for Copilot or Azure?

Some AI agents support custom base URLs out of the box. Point them at your existing infrastructure and start automating without extra API costs.

Try Fazm Free

3. Endpoint Compatibility and the Anthropic Format Standard

The Anthropic SDK and tools built on it respect an environment variable called ANTHROPIC_BASE_URL. Setting it redirects all API calls to that URL instead of the default endpoint. This single environment variable is the entry point for any custom proxy setup:

export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_API_KEY=any-value-your-proxy-accepts

# Your agent or tool now routes all requests through the proxy
fazm start

For a proxy to be compatible with Anthropic-format clients, it must implement the following minimum interface:

  • POST /v1/messages accepting the Anthropic messages request body (model, messages array, optional system prompt, max_tokens)
  • Response body matching the Anthropic messages response schema (id, type, role, content array, usage)
  • Streaming support via SSE if the agent uses streaming mode, with properly formatted delta events
  • Handling of the anthropic-version header in incoming requests
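The streaming requirement is the part hand-rolled proxies most often get wrong. For a single text block, the SSE event sequence looks roughly like this (fields abbreviated and the message ID illustrative; verify the exact schema against the current Anthropic streaming documentation):

```text
event: message_start
data: {"type":"message_start","message":{"id":"msg_123","type":"message","role":"assistant","content":[]}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":2}}

event: message_stop
data: {"type":"message_stop"}
```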

This interface has become a de facto standard. Claude Code, most MCP-compatible tools, and any application using the official Anthropic SDK can point at any compatible proxy without modification. LiteLLM implements this interface and handles translation to OpenAI, Azure, Bedrock, Vertex, and dozens of other providers automatically.

4. Provider Comparison: Cost and Tradeoffs

The right backend depends on what you already have access to and what you prioritize. Here is how the main options compare for a team running an agent that makes roughly 500 API calls per day (a moderate workload for browser automation and document processing):

| Endpoint | Est. Monthly Cost | Model Quality | Setup Effort |
|---|---|---|---|
| Direct Anthropic API | $50 to $200+ | Highest (Claude 3.5+) | Minimal |
| Copilot Business proxy | $19/seat (flat) | High (GPT-4o, Claude) | Moderate (proxy code) |
| AWS Bedrock (Claude) | $30 to $150+ | Highest (same models) | Moderate (IAM, VPC) |
| Azure OpenAI | $30 to $150+ | High (GPT-4o) | Moderate (enterprise setup) |
| LiteLLM proxy (cloud) | Varies by backend | Depends on routing | Low (config file) |
| Self-hosted (Ollama) | $0 marginal (own hardware) | Lower (open source models) | High (GPU, maintenance) |

The best choice is usually whichever option eliminates incremental cost without requiring new infrastructure purchases. If your team already has Copilot seats, the proxy approach is essentially free. If you are in a regulated environment, Bedrock or Azure is likely the path of least resistance because the compliance setup is already done for other workloads. If you have GPU capacity and care primarily about total cost at scale, self-hosting gives you the lowest marginal rate.

5. LLM Routing Strategies for AI Agents

Once you have a proxy in place, you can implement routing logic that sends different types of requests to different backends. This is where custom endpoints move from a cost-saving trick to a genuine architectural pattern.

Routing by Task Complexity

Simple structured tasks (form filling, data extraction, file renaming) do not need the most capable model available. Routing those to a smaller, cheaper model (Haiku, GPT-4o-mini, or a local Llama variant) while reserving the larger model for complex reasoning can often cut model spend by 60 to 80 percent, with little or no perceptible quality loss on the simple tasks.

A simple routing heuristic:

  • Prompt length under 500 tokens, no tools involved: route to mini or local model
  • Multi-step reasoning, tool use, or code generation: route to full model
  • Ambiguous: check estimated output length. Short outputs are usually simple tasks.
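The heuristic above can be sketched as a small routing function. The model names, the 500-token cutoff, and the characters-per-token estimate are all illustrative assumptions, not prescriptions:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: roughly 4 characters per token for English text.
    return len(text) // 4

def choose_model(request: dict) -> str:
    """Pick a backend model for an Anthropic-format request body."""
    text = ""
    for msg in request.get("messages", []):
        content = msg.get("content", "")
        if isinstance(content, str):
            text += content
        else:  # list of Anthropic content blocks
            text += "".join(b.get("text", "") for b in content if b.get("type") == "text")
    # Tool use or a long prompt signals a complex task: use the full model.
    if request.get("tools") or estimate_tokens(text) >= 500:
        return "claude-3-5-sonnet-20241022"
    # Short prompt, no tools: the cheap model is almost always enough.
    return "claude-3-5-haiku-20241022"
```

In a real proxy this function would run before the upstream call, overwriting the `model` field of the outgoing request.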

Fallback and Retry Routing

A proxy can implement automatic fallback when a backend is unavailable or rate-limited. If the primary endpoint returns a 429 or 503, the proxy retries the request against a secondary provider. This removes one of the most common sources of agent failure in production: transient API outages and rate limits that interrupt multi-step tasks.
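A minimal version of that fallback loop might look like the sketch below. The `send` callable is injected so the routing logic stays testable; in a real proxy it would wrap an HTTP client call, and the retryable status set is an assumption to tune:

```python
# Status codes worth retrying on another backend (illustrative set).
RETRYABLE = {429, 500, 502, 503}

def send_with_fallback(request: dict, backends: list, send):
    """Try each backend in priority order; return the first usable response.

    backends: list of base URLs, primary first.
    send(base_url, request) -> (status_code, body)
    """
    last = None
    for base_url in backends:
        status, body = send(base_url, request)
        if status not in RETRYABLE:
            return status, body        # success or a non-transient error
        last = (status, body)          # remember the failure, keep trying
    return last                        # every backend failed
```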

Load Balancing Across API Keys

If you have multiple API keys (from different accounts or team members), a proxy can round-robin across them, effectively multiplying your rate limit. This is particularly useful for parallel agent workflows where multiple agents run simultaneously. LiteLLM supports this natively through its load balancing configuration.
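The key rotation itself is a few lines. A thread-safe round-robin pool might look like this (the key values are placeholders; each outgoing request would use `pool.next_key()` in its auth header):

```python
import itertools
import threading

class KeyPool:
    """Rotate across a fixed set of API keys, one per outgoing request."""

    def __init__(self, keys):
        self._cycle = itertools.cycle(keys)
        self._lock = threading.Lock()   # itertools.cycle is not thread-safe

    def next_key(self) -> str:
        with self._lock:
            return next(self._cycle)

pool = KeyPool(["sk-key-a", "sk-key-b", "sk-key-c"])
```

Note that this spreads load evenly but does not track per-key limits; LiteLLM's built-in load balancing adds weighted routing and cooldowns on top of the same idea.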

Cost Tracking per Agent

A centralized proxy is the best place to track token usage per agent or per task type. Adding a custom header to each request (e.g., X-Agent-ID) lets the proxy log usage broken down by caller, giving you visibility into which workflows are consuming the most capacity.
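A minimal in-proxy tracker along these lines, keyed on the `X-Agent-ID` convention described above (the header name and class are illustrative; production setups usually export these counters to a metrics system instead of keeping them in memory):

```python
from collections import defaultdict

class UsageTracker:
    """Accumulate token usage per agent, read from a request header."""

    def __init__(self):
        self.totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})

    def record(self, headers: dict, usage: dict):
        # Fall back to "unknown" so unlabeled callers still show up in reports.
        agent = headers.get("X-Agent-ID", "unknown")
        self.totals[agent]["input_tokens"] += usage.get("input_tokens", 0)
        self.totals[agent]["output_tokens"] += usage.get("output_tokens", 0)

tracker = UsageTracker()
tracker.record({"X-Agent-ID": "browser-agent"}, {"input_tokens": 1200, "output_tokens": 300})
tracker.record({"X-Agent-ID": "browser-agent"}, {"input_tokens": 800, "output_tokens": 150})
```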

Want a local AI agent that supports custom endpoints?

Fazm is open source and runs locally on macOS. It supports custom API base URLs, so you can point it at your proxy, Bedrock, or self-hosted model without touching any code.

Try Fazm Free

6. Corporate Gateways and Compliance Requirements

For teams in regulated industries, sending AI agent traffic to a third-party API is often not an option without explicit approval. Corporate API gateways solve this by keeping all LLM traffic within approved infrastructure, while still giving developers access to capable models.

The most common enterprise setups:

  • AWS Bedrock with Claude: Data stays in your VPC, integrates with existing IAM policies and CloudTrail for audit logging. Bedrock exposes Claude models with Anthropic-format API support through the AWS SDK. Using a translation layer like LiteLLM, any Anthropic-compatible agent can route through Bedrock.
  • Azure OpenAI Service: Data residency guarantees by region, enterprise authentication through Azure AD, and content filtering controls. Azure exposes OpenAI-format endpoints, so a thin proxy handles the Anthropic-to-OpenAI translation.
  • LiteLLM as a unified gateway: LiteLLM running on internal infrastructure normalizes multiple provider APIs behind a single Anthropic-format endpoint. Teams get centralized spend tracking, per-user rate limiting, and audit logging without managing multiple provider integrations.
  • Custom proxy with PII redaction: Some teams add a preprocessing step that strips or masks sensitive fields before requests leave the internal network. This is common in healthcare (HIPAA) and financial services (SOC 2, PCI DSS) contexts.

The agent does not need to know about any of this. It sends requests to the configured endpoint and gets responses back. All compliance logic lives in the gateway layer, which is exactly where it should be.

Data Residency

By routing through Bedrock with region pinning or an Azure OpenAI deployment in a specific geography, you can guarantee that request content never leaves a defined boundary. This is often a hard requirement for EU data protection regulations, government workloads, and healthcare systems that operate under strict data localization rules.
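With a LiteLLM gateway, region pinning is a config-level decision. A hypothetical fragment pinning Claude on Bedrock to an EU region might look like this (the model ID and region are illustrative; confirm model availability for your account and region):

```yaml
model_list:
  - model_name: claude-3-5-sonnet
    litellm_params:
      model: bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0
      aws_region_name: eu-central-1   # requests never leave this region
```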

7. Getting Started: Minimal Proxy Setup

The fastest path to a working custom endpoint uses LiteLLM, which handles format translation for most major providers out of the box.

Option A: LiteLLM (Recommended)

Install and configure LiteLLM to proxy through your backend of choice:

# Install
pip install 'litellm[proxy]'   # quotes keep zsh from globbing the brackets

# Create config file (litellm_config.yaml)
model_list:
  - model_name: claude-3-5-sonnet    # Name your agent uses
    litellm_params:
      model: azure/gpt-4o            # Backend model
      api_base: https://your-deployment.openai.azure.com
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-02-01"

general_settings:
  master_key: sk-local-proxy-key     # Key your agent sends

# Start the proxy
litellm --config litellm_config.yaml --port 8080

# Point your agent at it
export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_API_KEY=sk-local-proxy-key

Option B: Custom Proxy (More Control)

If you need custom logic (PII redaction, special auth headers, routing decisions), a minimal FastAPI proxy gives you full control:

from fastapi import FastAPI
from fastapi.responses import JSONResponse
import httpx, os

app = FastAPI()

@app.post("/v1/messages")
async def proxy_messages(request: dict):
    # Add custom logic here: PII redaction, logging, routing decisions
    upstream_url = os.environ["UPSTREAM_BASE_URL"]
    upstream_key = os.environ["UPSTREAM_API_KEY"]

    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{upstream_url}/v1/messages",
            json=request,
            headers={"x-api-key": upstream_key, "anthropic-version": "2023-06-01"},
        )
    return JSONResponse(content=response.json(), status_code=response.status_code)
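The custom-logic placeholder in the proxy above is where a redaction step would go. A toy sketch of what that might look like, assuming string message content (the regex patterns are deliberately naive illustrations; real deployments should use a vetted PII-detection library, and block-array content needs the same treatment):

```python
import re

# Illustrative patterns only: real PII detection needs far more coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)

def redact_request(body: dict) -> dict:
    """Mask sensitive fields in an Anthropic-format request before forwarding."""
    for msg in body.get("messages", []):
        if isinstance(msg.get("content"), str):
            msg["content"] = redact(msg["content"])
    return body
```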

Verifying the Setup

Before trusting the proxy with real work, verify the endpoint with a direct curl:

curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: your-proxy-key" \
  -d '{
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 100,
    "messages": [{"role": "user", "content": "Say hello."}]
  }'

If you get a valid Anthropic-format response, the proxy is working. Set the environment variables and any agent that supports ANTHROPIC_BASE_URL will start routing through your custom endpoint automatically.

Agents with native custom endpoint support

  • Claude Code: respects ANTHROPIC_BASE_URL natively
  • Fazm: exposes a settings field for custom base URLs, no environment variable required
  • Any tool built on the official Anthropic SDK: inherits the base URL override from the SDK configuration
  • LiteLLM-based tooling: can act as both a client and a proxy, supporting arbitrary upstream endpoints

Run your AI agent through any API endpoint

Fazm is open source, runs locally on macOS, and supports custom API endpoints natively. Use your Copilot subscription, corporate gateway, or self-hosted model without touching the agent code.

Try Fazm Free

Free to start. No credit card required.