API Proxy Architecture

Custom API Endpoint Proxies for AI Agents

A developer built a proxy that accepts requests in the Anthropic messages format and routes them to GitHub Copilot Business, allowing their entire AI agent setup to run on an existing $19/seat subscription. This guide covers the architecture patterns, the ANTHROPIC_BASE_URL convention, security considerations, and how to set up your own proxy for corporate or custom API endpoints.

Open Source

Free to start, fully open source, runs locally. Works with any Anthropic-compatible API endpoint.


1. Why Custom API Endpoints Matter for AI Agents

AI desktop agents make a lot of API calls. Every perception step, reasoning step, and action verification involves at least one round trip to a language model. For a moderately busy workflow, that is hundreds of thousands of tokens per day. At standard retail API pricing, costs compound quickly.

The insight that unlocks significant savings is that the agent itself does not care which backend is producing the completions. It sends a request in a standard format and expects a response in the same format. If you can put a proxy in front of a cheaper or already-paid-for backend, the agent keeps working without any changes.

This matters for three distinct reasons depending on your context:

  • Cost: Teams with GitHub Copilot Business seats, negotiated Azure OpenAI contracts, or spare GPU capacity can route agent traffic through existing infrastructure at near-zero marginal cost.
  • Compliance: Regulated industries often cannot send data to third-party APIs. A corporate gateway keeps all LLM traffic within approved infrastructure while still enabling AI automation.
  • Flexibility: Decoupling the agent from a single provider lets you swap backends, run A/B tests on models, or route by task complexity without touching agent code.

Key insight: The value of an AI agent comes from its ability to perceive context and take action, not from which specific API it calls. Decoupling the two gives you leverage over cost, latency, and data residency simultaneously.

2. Common Proxy Architectures

There are three well-established patterns for routing AI agent traffic through custom endpoints, each suited to different infrastructure constraints.

Corporate API Gateways

Enterprise teams often already run a centralized LLM gateway. Amazon Bedrock, Azure OpenAI Service, and open source proxies like LiteLLM all expose standardized endpoints. The agent is configured to point at the internal gateway URL, and the gateway handles authentication, rate limiting, spend tracking, and audit logging. From the agent's perspective, the gateway looks identical to a direct API.

Copilot Routing

One creative approach that surfaced in developer communities is routing agent requests through a GitHub Copilot Business subscription. Copilot Business includes access to Claude and other frontier models through its own API. A local proxy accepts incoming requests in Anthropic format, translates them to Copilot's expected format, and returns the response. Since the Copilot subscription is already paid as a flat seat license, the marginal API cost for agent usage drops to zero.

This pattern generalizes beyond Copilot. Any subscription service that exposes an API and whose terms allow programmatic usage can be wired into an agent through a format-translation proxy.

Local Models

Running open source models locally (via Ollama, vLLM, or similar) gives you a fixed-cost endpoint with no per-token charges. Tools like Ollama expose an OpenAI-compatible API by default. A thin translation layer can bridge from Anthropic format to OpenAI format, making local models drop-in replacements for cloud APIs on simpler agent tasks. The tradeoff is model quality, particularly for complex multi-step reasoning, but for structured data extraction or form filling, local models perform well enough.

Already have API budget elsewhere?

Use your existing Copilot subscription, corporate gateway, or cloud deployment with any compatible AI agent. Your infrastructure, your budget.

Try Fazm Free

3. The ANTHROPIC_BASE_URL Pattern

The Anthropic SDK and most tools that consume Anthropic-compatible APIs respect an environment variable called ANTHROPIC_BASE_URL. When set, all API calls go to that base URL instead of the default https://api.anthropic.com.

Setting up a custom endpoint is a single configuration change:

export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_API_KEY=your-proxy-key

# Your agent or tool now routes through the proxy
fazm start

The proxy just needs to implement the same HTTP endpoints that the Anthropic API exposes (primarily POST /v1/messages) and return responses in the same JSON shape. The agent cannot tell the difference.
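For concreteness, here is an illustrative subset of the request body an agent sends and the response shape the proxy must return. The values are placeholders, and a real response also carries fields such as id, model, and a usage block:

```python
# Minimal Anthropic-style messages request body (illustrative values).
request_body = {
    "model": "claude-3-5-sonnet",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello"}],
}

# The proxy must answer with the same JSON shape a direct API call returns.
response_body = {
    "type": "message",
    "role": "assistant",
    "content": [{"type": "text", "text": "Hi there."}],
    "stop_reason": "end_turn",
}
```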

This pattern is supported by Claude Code, most MCP-compatible tools, and any application that uses the official Anthropic SDK. It is also the mechanism that LiteLLM uses when acting as a transparent proxy in front of multiple providers.

What a minimal proxy must implement

  • POST /v1/messages accepting the Anthropic messages request body
  • Response body matching the Anthropic messages response schema
  • Streaming support via SSE if your agent uses streaming mode
  • Appropriate anthropic-version header handling

For format translation (say, from Anthropic format to OpenAI format), the proxy maps the messages array and system prompt to the OpenAI chat completions format, calls the backend, and maps the response back. Libraries like litellm handle this translation automatically for dozens of providers.
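That mapping can be sketched in plain Python. This is a simplified illustration of the translation, not litellm's actual implementation: tool-use blocks and streaming are omitted, and the field handling covers only the common cases:

```python
def anthropic_to_openai(body: dict) -> dict:
    """Map an Anthropic messages request to the OpenAI chat completions shape."""
    messages = []
    # The Anthropic format carries the system prompt as a top-level field;
    # OpenAI expects it as a leading message.
    if "system" in body:
        messages.append({"role": "system", "content": body["system"]})
    for m in body["messages"]:
        content = m["content"]
        # Anthropic content may be a list of typed blocks; flatten text blocks.
        if isinstance(content, list):
            content = "".join(b["text"] for b in content if b.get("type") == "text")
        messages.append({"role": m["role"], "content": content})
    return {"model": body.get("model"), "messages": messages,
            "max_tokens": body.get("max_tokens")}

def openai_to_anthropic(resp: dict, model: str) -> dict:
    """Map an OpenAI chat completions response back to the Anthropic shape."""
    choice = resp["choices"][0]
    finish = choice.get("finish_reason")
    return {
        "type": "message",
        "role": "assistant",
        "model": model,
        "content": [{"type": "text", "text": choice["message"]["content"]}],
        "stop_reason": "end_turn" if finish == "stop" else finish,
    }
```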

4. Security Considerations for API Proxying

Running a proxy introduces attack surface that a direct API connection does not have. The considerations differ between local proxies (running on the same machine) and remote proxies (running on shared infrastructure).

Local Proxies

A proxy running on localhost is only accessible to processes on the same machine. The main risks are port binding conflicts and credential exposure. Your proxy likely holds credentials for the upstream service (a Copilot token, an Azure API key, etc.), and these should never be committed to version control or logged in plain text.

  • Store upstream credentials in environment variables or a secrets manager, not in config files.
  • Bind the proxy only to 127.0.0.1, not 0.0.0.0, unless you have a specific reason to expose it to the network.
  • Log request metadata (model, token counts, timestamps) but not request content, which may include sensitive data from your desktop.

Remote and Corporate Proxies

Shared gateways (Bedrock, Azure, LiteLLM running on a server) need proper authentication between the agent and the gateway. mTLS or a shared secret header are common approaches. The gateway should also:

  • Authenticate incoming requests before forwarding them upstream.
  • Enforce per-user or per-team rate limits to prevent runaway agent costs.
  • Audit log all requests with caller identity for compliance and debugging.
  • Optionally apply content filtering or PII redaction before requests leave your network.
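The shared-secret variant of that authentication step fits in a few lines. The header name x-gateway-key and the secret value here are hypothetical; hmac.compare_digest is used to avoid timing side channels:

```python
import hmac

# Hypothetical secret; in practice, load this from a secrets manager,
# not a hardcoded constant.
SHARED_SECRET = "sk-gateway-secret"

def authorized(headers: dict) -> bool:
    """Constant-time check of a shared-secret header before forwarding upstream."""
    supplied = headers.get("x-gateway-key", "")
    return hmac.compare_digest(supplied, SHARED_SECRET)
```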

Note on Copilot terms of service: GitHub Copilot Business is licensed for development assistance. Using a proxy to route non-development agent traffic through it at high volume may not fall within the intended use. Review your agreement before relying on this pattern in production.

Data Residency

For regulated use cases, the proxy architecture is a key compliance tool. By routing through Bedrock or Azure OpenAI with region pinning, you can guarantee that request content never leaves a specific geographic boundary. This is often required for healthcare, financial services, and government workloads.

5. Desktop AI Agents with Custom Endpoint Support

Not all desktop AI agents make it easy to swap the underlying API endpoint. Some hardcode the API URL, others require environment variable overrides, and a few expose the setting directly in their configuration interface.

When evaluating an agent for custom endpoint support, look for:

  • Support for ANTHROPIC_BASE_URL or an equivalent config field.
  • Documented behavior when the endpoint returns non-standard errors, so you know how the agent handles failures from a proxy.
  • Ability to pass custom headers if your gateway requires authentication tokens beyond the standard API key.

Fazm is one option in this space: it is open source, runs locally on macOS, and exposes a settings field for custom base URLs. You can point it at Ollama, a LiteLLM instance, or any Anthropic-compatible endpoint without touching the application code. Because it is open source, you can also inspect and modify how it constructs requests if your proxy has specific requirements.

Other agents in this category include Claude Code (which respects the environment variable natively) and any tool built on the official Anthropic SDK (which inherits the base URL override from the SDK configuration). The common thread is that Anthropic-format compatibility has become a de facto standard for this class of tools, making the proxy pattern broadly applicable.

6. Setting Up Your Own Proxy: Practical Steps

Here is the fastest path from zero to a working custom endpoint, assuming you want to use LiteLLM as the translation layer (it handles the most provider variations with the least custom code).

Step 1: Install LiteLLM

pip install 'litellm[proxy]'

Step 2: Create a Config File

Create litellm_config.yaml pointing at your upstream provider:

model_list:
  - model_name: claude-3-5-sonnet
    litellm_params:
      model: azure/gpt-4o
      api_base: https://your-deployment.openai.azure.com
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-02-01"

general_settings:
  master_key: sk-local-proxy-key
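The same config can front multiple backends at once. As a hypothetical example, a second model_list entry routing a cheaper alias to a local Ollama model (assuming Ollama's default port) might look like:

```yaml
  - model_name: local-small
    litellm_params:
      model: ollama/llama3
      api_base: http://localhost:11434
```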

Step 3: Start the Proxy

litellm --config litellm_config.yaml --port 8080

Step 4: Point Your Agent at the Proxy

export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_API_KEY=sk-local-proxy-key

At this point, any agent or tool that respects ANTHROPIC_BASE_URL will send its requests to the proxy, which translates them to your Azure deployment (or whatever backend you configured) and returns responses in Anthropic format. The agent never knows a proxy is involved.

Step 5: Add Health Checks and Monitoring

Before depending on the proxy in production, add a lightweight health check endpoint and set up basic logging. LiteLLM exposes GET /health out of the box. For token usage monitoring, the dashboard UI (available when running with the --ui flag) shows spend per model and per caller.
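A small poller for that health endpoint might look like this; it uses only the standard library and treats any connection error or non-200 status as unhealthy:

```python
import urllib.error
import urllib.request

def proxy_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the proxy's GET /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout: treat all as unhealthy.
        return False
```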

For a minimal custom proxy (without LiteLLM)

If you need more control or want to integrate with a non-standard backend, the minimum viable proxy in Python is around 50 lines: a FastAPI app that accepts POST /v1/messages, transforms the request to your target format, calls the backend, and maps the response back. The most common gotcha is streaming: if your agent uses streaming mode, the proxy must implement SSE (server-sent events) correctly, returning text/event-stream with properly formatted delta events.
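As a sketch of that shape, here is a dependency-free variant built on the standard library's http.server rather than FastAPI, with the upstream call stubbed out and streaming omitted. The call_backend function is a placeholder for whatever your real backend call would be:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def call_backend(openai_request: dict) -> dict:
    # Stub: a real proxy would POST this to your upstream
    # (e.g. an OpenAI-compatible endpoint) and return its JSON.
    return {"choices": [{"message": {"role": "assistant", "content": "ok"}}]}

class MessagesProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/messages":
            self.send_error(404)
            return
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        # Anthropic -> OpenAI: the system prompt becomes a leading message.
        messages = []
        if "system" in body:
            messages.append({"role": "system", "content": body["system"]})
        messages += body["messages"]
        upstream = call_backend({"model": body.get("model"), "messages": messages})
        # Map the OpenAI-style response back to the Anthropic messages shape.
        reply = {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "text",
                         "text": upstream["choices"][0]["message"]["content"]}],
        }
        data = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        pass  # silence default per-request logging

# To run: HTTPServer(("127.0.0.1", 8080), MessagesProxy).serve_forever()
```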

Run AI agents with your own API backend

Fazm is open source, runs locally on macOS, and supports any Anthropic-compatible API endpoint. Point it at your proxy, gateway, or self-hosted model and start automating.

Try Fazm Free

Free to start. No credit card required.