Custom AI API Endpoints: GitHub Copilot Proxy, OpenRouter, and Self-Hosted Models
Most developers pay twice for AI. They pay Anthropic or OpenAI directly for API access, and they also pay for GitHub Copilot Business seats that include model access they never fully use. This guide explains how API translation proxies work, which setups make sense in which situations, and how to configure AI coding tools and desktop agents to route through any compatible gateway via a single environment variable.
> “A developer named Iván built a local proxy that translates Anthropic-format API requests into GitHub Copilot Business calls, then pointed his AI agent at it via ANTHROPIC_BASE_URL. It worked immediately. This was later shipped as an official setting.”
>
> — Community report
1. Why Developers Want Custom API Endpoints
The most common reason is cost. Direct API pricing from Anthropic or OpenAI charges per token, and when you run AI coding assistants and desktop agents throughout the day, those charges stack up quickly. A moderately active setup making a few hundred API calls per day can easily reach $80 to $200 per month in direct API fees per developer.
But cost is only one driver. Here are the main reasons teams route through custom endpoints:
- Using existing capacity. GitHub Copilot Business ($19/seat/month) includes model access. Many teams have dozens of seats and are paying for Claude and GPT-4o access they never tap directly. A proxy lets you use that capacity for more than just in-editor completions.
- Corporate proxy requirements. Enterprise environments often require all outbound API calls to pass through an approved gateway for audit logging, DLP scanning, and rate limiting. Direct API calls to api.anthropic.com may be blocked entirely on managed networks.
- Data residency and compliance. Healthcare, finance, and government teams cannot send certain data to third-party APIs without specific contractual guarantees. AWS Bedrock with Claude models or Azure OpenAI keeps data in a controlled environment with audit trails.
- Privacy. Some developers simply prefer that their code and prompts not leave their machine. Running a local model via Ollama or LM Studio gives complete control over where data goes.
- Fallback and load balancing. Routing through a gateway like OpenRouter lets you automatically fall back to another model if the primary provider has an outage, without changing any configuration in your tools.
The core insight: AI tools that support a configurable base URL let you separate the interface (the tool you use) from the backend (who runs the model). That separation is valuable regardless of which backend you choose.
2. How API Translation Proxies Work
AI tools built on the Anthropic SDK send requests to the Anthropic messages API format. The request body contains a model name, a list of messages, and optional parameters like temperature and max tokens. The response comes back in a predictable structure with a content array.
Different providers use different formats. GitHub Copilot uses a format derived from the OpenAI API. Google Gemini has its own format. Locally run models via Ollama expose an OpenAI-compatible API. None of these natively accept Anthropic-format requests.
A translation proxy sits between your AI tool and the actual model backend. It does three things:
- Receives the request in Anthropic format from the tool (exactly as if it were going to api.anthropic.com)
- Translates the request into the target provider's format, mapping message roles, content types, and parameters
- Returns the response translated back into Anthropic format, so the tool processes it identically to a real Anthropic response
The tool never knows it is talking to a different backend. From its perspective, it sent a request to an API endpoint and got a valid Anthropic-format response. The proxy is completely transparent.
```
AI Tool
  → POST http://localhost:3000/v1/messages                    (Anthropic format)
Proxy
  → POST https://api.github.com/copilot/chat/completions      (Copilot/OpenAI format)
Model Response
  → Proxy translates back to Anthropic format
  → Tool receives valid Anthropic response
```
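The request-direction half of that translation can be sketched as a pure mapping function. This is illustrative, not any library's actual implementation; the function name and the default values below are ours:

```python
def translate_to_openai(anthropic_body: dict) -> dict:
    """Map an Anthropic messages request to OpenAI chat-completions shape."""
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first message in the list.
    if "system" in anthropic_body:
        messages.append({"role": "system", "content": anthropic_body["system"]})
    for msg in anthropic_body["messages"]:
        content = msg["content"]
        # Anthropic content may be a list of typed blocks; flatten text blocks.
        if isinstance(content, list):
            content = "".join(b["text"] for b in content if b.get("type") == "text")
        messages.append({"role": msg["role"], "content": content})
    return {
        "model": anthropic_body["model"],
        "messages": messages,
        "max_tokens": anthropic_body.get("max_tokens", 1024),
        "temperature": anthropic_body.get("temperature", 1.0),
    }
```

Mapping message roles is the easy part; real proxies also have to handle tool-use blocks, images, and streaming, which is where libraries like LiteLLM earn their keep.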
Libraries like LiteLLM make this translation mostly automatic. LiteLLM supports over 100 providers and handles the format mapping for you. For GitHub Copilot specifically, developers have written custom proxy scripts that handle the authentication and translation in under 100 lines of Python or Node.js.
Already paying for Copilot or Bedrock?
Fazm supports custom API endpoints natively via ANTHROPIC_BASE_URL. Point it at your existing infrastructure and automate without additional API costs.
Try Fazm Free
3. GitHub Copilot Business as an AI Backend
This is the setup that got the most attention in developer communities. A developer named Iván built a local proxy that accepts Anthropic-format API requests and translates them into GitHub Copilot Business API calls. He then pointed his AI desktop agent at it by setting a custom base URL. The agent worked without any code changes because it kept receiving valid Anthropic-format responses.
The appeal is clear: GitHub Copilot Business costs $19 per seat per month and includes access to Claude 3.5 Sonnet and GPT-4o through the Copilot API. If you already have those seats for your engineering team, you are paying for model capacity you can redirect to other tools.
A minimal Copilot proxy in Python looks roughly like this:
```python
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

@app.route('/v1/messages', methods=['POST'])
def proxy_messages():
    body = request.json
    # translate body to Copilot format
    copilot_body = translate_to_copilot(body)
    resp = requests.post(
        'https://api.github.com/copilot/chat/completions',
        headers=get_copilot_headers(),
        json=copilot_body
    )
    return jsonify(translate_to_anthropic(resp.json()))
```
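The translation helpers are left undefined in the minimal proxy. A hedged sketch of the response direction, assuming the backend returns OpenAI-style chat-completion bodies (the field mapping is our reading of the two formats, not an official schema):

```python
import uuid

def translate_to_anthropic(openai_resp: dict) -> dict:
    """Map an OpenAI chat-completions response back to Anthropic shape."""
    choice = openai_resp["choices"][0]
    usage = openai_resp.get("usage", {})
    return {
        "id": "msg_" + uuid.uuid4().hex,   # synthetic id; a real backend supplies its own
        "type": "message",
        "role": "assistant",
        "model": openai_resp.get("model", ""),
        "content": [{"type": "text", "text": choice["message"]["content"]}],
        # OpenAI's finish_reason "stop" corresponds to Anthropic's "end_turn"
        "stop_reason": "end_turn" if choice.get("finish_reason") == "stop" else "max_tokens",
        "usage": {
            "input_tokens": usage.get("prompt_tokens", 0),
            "output_tokens": usage.get("completion_tokens", 0),
        },
    }
```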
Terms of service note: Review your GitHub Copilot agreement before using this at scale. Copilot Business is licensed primarily for development assistance, and high-volume automation through a proxy may fall outside intended use. Some teams use this approach for prototyping and validation, then switch to a dedicated API plan once they confirm the workflow is worth the investment.
The broader point is the pattern itself, not the Copilot-specific implementation. Any provider that exposes an API can serve as a backend for tools that support a configurable endpoint. Copilot is just one example that happens to be cost-effective for teams that already have seats.
4. OpenRouter and Other API Gateways
OpenRouter is a unified API gateway that provides a single endpoint for dozens of models from Anthropic, OpenAI, Google, Mistral, and others. You pay OpenRouter per token at rates that are usually at or below direct provider pricing, and you get automatic fallback, load balancing, and a single billing account.
OpenRouter exposes an OpenAI-compatible API, so you need either a translation layer or a tool that supports OpenAI format natively. For Anthropic-format tools, a wrapper like LiteLLM handles the conversion:
```yaml
model_list:
  - model_name: claude-3-5-sonnet
    litellm_params:
      model: openrouter/anthropic/claude-3.5-sonnet
      api_key: os.environ/OPENROUTER_API_KEY
      api_base: https://openrouter.ai/api/v1
```
Other gateway options worth knowing about:
- AWS Bedrock. Hosts Claude models with data staying in your AWS region. Integrates with IAM for authentication and CloudTrail for audit logging. Best for teams already running workloads on AWS.
- Azure OpenAI Service. Hosts GPT-4o and other OpenAI models with Azure-level data residency guarantees. Works well for enterprises with existing Microsoft agreements.
- LiteLLM proxy. Self-hosted gateway that normalizes over 100 provider APIs behind a single endpoint. Adds spend tracking, rate limiting, and fallback logic. The open source version is free to run.
- Portkey. Cloud-hosted AI gateway with routing rules, caching, and observability built in. Useful for teams that want gateway features without maintaining their own infrastructure.
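The fallback behavior these gateways advertise is conceptually simple: try the primary backend, and on failure move down the list. A minimal client-side sketch, with backends modeled as plain callables standing in for HTTP calls:

```python
def call_with_fallback(request_body: dict, backends: list):
    """Try each backend callable in order; return the first success.

    `backends` is a list of functions that either return a response
    dict or raise an exception (hypothetical stand-ins for HTTP calls).
    """
    last_error = None
    for backend in backends:
        try:
            return backend(request_body)
        except Exception as err:
            last_error = err  # remember the failure, try the next backend
    raise RuntimeError("all backends failed") from last_error
```

Gateways add refinements on top of this loop, such as retry budgets, health checks, and per-model rate limits, but the core routing decision is this small.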
5. Self-Hosted and Open Source Models
For developers who want zero external dependencies and complete data privacy, running a local model is the clearest option. The tradeoff is hardware requirements and model capability: current open source models are not yet at Claude 3.5 Sonnet quality for complex reasoning tasks, but they work well for simpler automation and are improving rapidly.
| Tool | API Format | Best For |
|---|---|---|
| Ollama | OpenAI-compatible | Local development, single machine |
| LM Studio | OpenAI-compatible | Desktop GUI, model management |
| vLLM | OpenAI-compatible | High-throughput serving, multi-GPU |
| text-generation-inference | OpenAI-compatible | Production serving with quantization |
All of these tools expose OpenAI-compatible APIs by default. To use them with Anthropic-format tools, you need a translation layer. LiteLLM proxy running locally is the simplest approach: it accepts Anthropic requests, talks to Ollama, and returns Anthropic-format responses.
A practical hybrid setup: route simple, high-frequency tasks (data extraction, form filling, status checks) through a local model, and reserve cloud API calls for complex reasoning and multi-step workflows. A routing proxy can make this decision automatically based on prompt length or task type.
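That routing decision can be a few lines of code. The thresholds and task names below are arbitrary illustrations, not recommendations:

```python
def choose_backend(prompt: str, task_type: str = "general") -> str:
    """Return "local" for short, simple work and "cloud" for everything else.

    The task names and the 2000-character cutoff are made-up examples;
    tune them against your own workload.
    """
    simple_tasks = {"extraction", "form_fill", "status_check"}
    if task_type in simple_tasks and len(prompt) < 2000:
        return "local"
    return "cloud"
```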
6. Configuring ANTHROPIC_BASE_URL
Tools built on the Anthropic SDK respect the ANTHROPIC_BASE_URL environment variable. When set, the SDK sends all API requests to that URL instead of the default https://api.anthropic.com. This is the standard way to redirect any Anthropic-based tool to a custom endpoint.
```bash
# Point at a local proxy
export ANTHROPIC_BASE_URL=http://localhost:3000

# Or point at an enterprise gateway
export ANTHROPIC_BASE_URL=https://llm-gateway.company.internal

# Then run your tool as normal
fazm start
```
Some tools also expose this as a GUI setting rather than requiring an environment variable. Fazm, for example, includes a custom API endpoint setting in its preferences so users can configure it without touching the terminal. This was added directly in response to the community discovering that the proxy approach worked, making it an officially supported workflow.
For the proxy to work correctly, it needs to accept the full Anthropic messages API surface. At minimum, that means:
- POST /v1/messages with the standard request body
- Streaming support via Server-Sent Events if your tool uses streaming
- Correct handling of the anthropic-version header (most proxies pass this through or ignore it)
LiteLLM handles all of this out of the box. Custom proxies need to implement streaming correctly to avoid timeouts on longer responses.
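Getting streaming right mostly means emitting well-formed Server-Sent Events. A minimal formatter for a single event, following the event/data wire shape that SSE uses (the function name is ours):

```python
import json

def format_sse_event(event: str, data: dict) -> str:
    """Serialize one Server-Sent Event: an `event:` line, a `data:` line
    carrying JSON, and a blank line terminating the event."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"
```

A custom proxy would yield strings like this from a generator as chunks arrive from the backend, rather than buffering the whole response, which is what causes the timeouts mentioned above.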
Testing your setup: Before relying on a custom endpoint for production workflows, test it with a simple curl request to verify the proxy is translating correctly and returning valid Anthropic-format responses.
-H "Content-Type: application/json" \
-H "x-api-key: dummy" \
-d '{"model":"claude-3-5-sonnet-20241022","max_tokens":100,"messages":[{"role":"user","content":"say hi"}]}'
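The proxy's reply can also be sanity-checked in code. This is a loose structural check of the response shape, not the official schema:

```python
def looks_like_anthropic_response(resp: dict) -> bool:
    """Loosely verify a dict has the shape of an Anthropic messages
    response: type "message", role "assistant", and a content list
    of typed blocks. Not an official schema validation."""
    return (
        resp.get("type") == "message"
        and resp.get("role") == "assistant"
        and isinstance(resp.get("content"), list)
        and all("type" in block for block in resp.get("content", []))
    )
```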
7. Security Considerations
Custom API endpoints introduce a new component between your tool and the model. That component can see all prompts, responses, and any sensitive data passing through. Before adopting a custom endpoint setup, consider a few things:
- Who runs the proxy. A locally running LiteLLM or Ollama instance is fully under your control. A third-party gateway like OpenRouter has its own data handling policies. Read them before routing sensitive data.
- Authentication on local proxies. A proxy running on localhost is accessible to any process on your machine. If you share a machine, or if other processes could make requests to it, add a simple API key check to prevent unintended access.
- TLS for remote gateways. Any gateway running over a network should use HTTPS. Plain HTTP is only acceptable for localhost connections.
- Logging and retention. If you are using a proxy that logs requests for debugging, make sure those logs do not persist sensitive content longer than needed. This is especially important in regulated industries.
- Model capability differences. Routing to a different model changes the tool's behavior. A proxy that routes to a weaker model might produce less reliable results for complex tasks. Always test end-to-end with realistic inputs before deploying to production workflows.
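The API key check suggested for local proxies can be a few lines. A sketch using a constant-time comparison, with the x-api-key header name matching what Anthropic-format tools already send:

```python
import hmac

def check_api_key(headers: dict, expected_key: str) -> bool:
    """Compare the x-api-key request header against a locally configured
    secret. hmac.compare_digest avoids timing side channels that a plain
    == comparison could leak."""
    provided = headers.get("x-api-key", "")
    return hmac.compare_digest(provided, expected_key)
```

In a Flask proxy this would run in a before-request hook, rejecting unauthenticated calls with a 401 before any translation happens.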
For most individual developers running a local proxy on their own machine, the security surface is minimal. The risk profile increases when you introduce remote gateways, shared infrastructure, or sensitive data. Scale your precautions to match your actual threat model.
AI desktop agents like Fazm operate with elevated permissions on your system (reading your screen, controlling applications). It is worth being deliberate about what data flows through the API during those operations. Running through a local proxy rather than a remote one keeps that data on your machine, which is a meaningful privacy benefit for developers who work with sensitive codebases or business data.
Use your own AI backend with Fazm
Fazm supports custom API endpoints via ANTHROPIC_BASE_URL and a built-in settings UI. Route through GitHub Copilot, OpenRouter, AWS Bedrock, or a local Ollama instance. Free, open source, runs natively on macOS.
Try Fazm Free
No credit card required. Download and start automating in minutes.