Why We Still Don't Have a Proper Control Plane for LLM Usage

Matthew Diakonov · 5 min read

Every cloud service has a control plane - a management layer for provisioning, monitoring, and governing resource usage. Databases have them. Compute has them. Networking has them. LLM usage does not have a real one yet, and teams are paying for it in surprise bills and wasted tokens.

The LLM API market has matured fast on model capability but not on operational infrastructure. Providers compete on benchmark performance and context window size. The operational management layer - the infrastructure that lets you actually govern how models are used at scale - is mostly something you build yourself.

What the Current State Looks Like

Right now, most teams cobble together:

  • A spreadsheet or Notion page tracking monthly API spend
  • Budget alerts set at 80% and 100% of a monthly cap (if they remembered to set them)
  • A shared API key with no per-project or per-user attribution
  • Manual review when costs spike

Some teams have added a proxy server - an intermediate layer that all LLM calls route through. That is getting closer. But the proxy usually does rate limiting and basic logging without the feedback loops that make it a real control plane.

Platforms like Langfuse, LLM Ops (Cloudidr), and similar tools are filling the gap with token tracking and budget alerts. They are useful. But they are monitoring tools, not control planes. A control plane does not just observe - it governs.

What a Real Control Plane Would Provide

Rolling budgets with automatic enforcement

A weekly or monthly spend limit that automatically throttles requests as you approach it. Not an alert that you might see later - actual enforcement. Requests that would push you over budget get queued or downgraded instead of executing.
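
A minimal sketch of that decision, as a pure function over spend inside the current rolling window (the thresholds and the queue/downgrade split are illustrative):

type BudgetAction = "allow" | "downgrade" | "queue";

// Decide what happens to a request given spend inside the current
// rolling window. The 20% cutoff mirrors the downgrade rule below.
function budgetAction(spentUsd: number, budgetUsd: number): BudgetAction {
  const remaining = 1 - spentUsd / budgetUsd;
  if (remaining <= 0) return "queue";      // over the cap: queue, don't execute
  if (remaining < 0.2) return "downgrade"; // getting tight: cheaper model tier
  return "allow";
}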

Automatic model downgrade

When budget gets tight, route requests to cheaper models instead of failing. This is the most operationally valuable feature. Claude Sonnet 4.6 at $3/$15 per MTok handles most tasks that Opus 4.6 at $5/$25 handles. GPT-5 mini at $0.25/$2 per MTok handles a large fraction of what GPT-5.2 at $1.75/$14 handles.

An automatic downgrade rule - "if 80% of monthly budget is consumed, route to the next cheaper tier" - keeps the service running without burning through your remaining budget on premium models.
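
One way to encode that rule is an ordered tier list walked downward once the threshold is crossed; a sketch (the model IDs are illustrative, not exact API identifiers):

// Cheapest-last tier order; adjust to the models you actually use.
const TIERS = ["claude-opus-4-6", "claude-sonnet-4-6", "claude-haiku-4-5"];

// Past 80% of budget, route each request to the next cheaper tier.
function pickModel(requested: string, budgetConsumed: number): string {
  if (budgetConsumed < 0.8) return requested;
  const i = TIERS.indexOf(requested);
  return i >= 0 && i < TIERS.length - 1 ? TIERS[i + 1] : requested;
}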

Per-project and per-user attribution

Knowing your total monthly API spend is not enough. You need to know which project, which feature, and which user is responsible for which portion of that spend. Without attribution, you cannot make informed optimization decisions.
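
In practice, attribution is just metadata the calling service attaches and the gateway records; a sketch of an application-side call, assuming the proxy host and the header names used in the gateway example below:

// Attribution travels as headers; the proxy logs them with token counts.
const response = await fetch("http://llm-proxy.internal/v1/messages", {
  method: "POST",
  headers: {
    "content-type": "application/json",
    "x-project-id": "checkout-assistant", // which project pays for this call
    "x-user-id": "u_4821",                // which user triggered it
  },
  body: JSON.stringify({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [{ role: "user", content: "Summarize this order history." }],
  }),
});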

Usage analytics at the task level

Token consumption should be visible at the task level, not just the request level. A task that calls three tools and iterates four times has a token profile that is meaningfully different from a single-shot query. Understanding which workflows are expensive tells you where to invest in prompt optimization or local model routing.
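
A sketch of that rollup, assuming each logged request carries a taskId (a hypothetical field your services would set):

interface RequestLog {
  taskId: string;
  inputTokens: number;
  outputTokens: number;
}

// Aggregate per-request logs into per-task totals so multi-step
// workflows (tool loops, retries) show their true cost.
function tokensByTask(logs: RequestLog[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const { taskId, inputTokens, outputTokens } of logs) {
    totals.set(taskId, (totals.get(taskId) ?? 0) + inputTokens + outputTokens);
  }
  return totals;
}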

Building It Yourself (For Now)

Until mature vendor solutions exist, here is the architecture that covers the most ground with the least complexity:

1. Proxy all LLM calls through a single gateway

Every service that calls an LLM goes through the proxy. No direct API key usage in application code. The proxy is where you log, route, and enforce limits.

A minimal proxy sketch in Node.js (TypeScript); getBudgetRemaining and logRequest are assumed helpers backed by your own usage store, and the upstream target is illustrative:

import express from "express";
import { createProxyMiddleware, fixRequestBody } from "http-proxy-middleware";
import { randomUUID } from "node:crypto";

// Assumed helpers backed by your own usage store.
declare function getBudgetRemaining(project: string): Promise<number>;
declare function logRequest(entry: object): Promise<void>;

const app = express();
app.use(express.json()); // populate req.body so we can inspect the model

app.use("/v1", async (req, res, next) => {
  const project = req.headers["x-project-id"] as string;
  const remaining = await getBudgetRemaining(project);

  // Auto-downgrade when under 20% budget remaining
  if (remaining < 0.2 && req.body?.model?.includes("opus")) {
    req.body.model = req.body.model.replace("opus", "sonnet");
    console.log(`[budget] downgraded to sonnet for project ${project}`);
  }

  // Log the request before proxying
  const requestId = randomUUID();
  await logRequest({ requestId, project, model: req.body?.model, timestamp: Date.now() });

  // Store requestId for response logging
  res.locals.requestId = requestId;
  next();
});

// Forward to the provider (Anthropic here as an example); fixRequestBody
// re-serializes the parsed - possibly downgraded - body onto the proxied request.
app.use("/v1", createProxyMiddleware({
  target: "https://api.anthropic.com",
  changeOrigin: true,
  onProxyReq: fixRequestBody,
}));

app.listen(8080);

2. Log every request with full attribution

Each log entry should capture: project ID, user ID (if applicable), model used, input token count, output token count, latency in milliseconds, and whether this was a downgraded request.

Estimated token count is good enough for routing decisions. Exact counts come back in the response and can update the record.
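
A sketch of that two-phase accounting; the four-characters-per-token estimate is a rough heuristic, updateRequestLog is a hypothetical helper, and the usage field names follow Anthropic's response shape:

// Cheap pre-flight estimate, good enough for routing decisions.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // ~4 chars/token for English text
}

// Hypothetical helper that patches a stored log entry.
declare function updateRequestLog(id: string, patch: object): Promise<void>;

// Once the response arrives, replace the estimate with exact counts.
async function reconcile(
  requestId: string,
  usage: { input_tokens: number; output_tokens: number },
) {
  await updateRequestLog(requestId, {
    inputTokens: usage.input_tokens,
    outputTokens: usage.output_tokens,
    estimated: false,
  });
}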

3. Set tiered budget alerts

Three thresholds: 50%, 80%, and 100% of your target. The 50% alert is informational. The 80% alert triggers automatic downgrade. The 100% alert triggers queue mode - requests are queued and processed at a rate that keeps spend flat.
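
Expressed as data, so the thresholds stay easy to tune; a sketch:

// Ordered thresholds; the highest one crossed determines the action.
const ALERT_TIERS = [
  { at: 0.5, action: "notify" },    // informational
  { at: 0.8, action: "downgrade" }, // route to cheaper tier
  { at: 1.0, action: "queue" },     // hold requests, keep spend flat
] as const;

function tierFor(budgetConsumed: number) {
  return [...ALERT_TIERS].reverse().find((t) => budgetConsumed >= t.at) ?? null;
}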

4. Weekly review ritual

Automate a weekly digest: total spend by project, top 10 most expensive task types, percentage of requests that were downgraded, and any anomalies (single requests consuming abnormal tokens). The anomalies usually point to prompts or workflows that need optimization.
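
The anomaly check is the only non-obvious part of the digest; a sketch that flags requests consuming far more tokens than typical (the 5x-median cutoff is an arbitrary starting point):

// Flag requests whose total tokens exceed a multiple of the median.
function anomalousRequests(tokenCounts: number[], factor = 5): number[] {
  const sorted = [...tokenCounts].sort((a, b) => a - b);
  const median = sorted[Math.floor(sorted.length / 2)] ?? 0;
  return tokenCounts.filter((t) => t > median * factor);
}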

Why This Matters for Desktop Agents

For desktop AI agents running autonomously, the absence of a control plane is especially dangerous. An agent that loops on a task, calls tools in a recursive pattern, or receives unexpectedly large context can burn through API budget in minutes.

Budget-aware agents that check remaining budget before starting expensive operations, and automatically select cheaper models for lower-priority tasks, are meaningfully more sustainable to run. Building that budget-awareness into the agent's tool selection logic - not just monitoring it from the outside - is the difference between a production-viable agent and one you have to watch constantly.
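
A sketch of that agent-side check, assuming the agent can query remaining budget from the proxy and tags each task with a priority (all names here are hypothetical):

type Priority = "high" | "low";

// Hypothetical helper: fraction of budget left, queried from the proxy.
declare function remainingBudget(): Promise<number>;

// Model selection happens inside the agent, before the request is made,
// not just in monitoring from the outside.
async function modelForTask(priority: Priority): Promise<string> {
  const remaining = await remainingBudget();
  if (remaining <= 0.05) throw new Error("budget exhausted: defer this task");
  if (priority === "low" || remaining < 0.3) return "claude-haiku-4-5"; // cheap tier
  return "claude-sonnet-4-6"; // illustrative default
}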

The cloud industry solved this problem for compute a decade ago. LLM infrastructure is following the same path. In the meantime, a proxy with per-project attribution, automatic downgrade, and tiered alerts covers most of what you need.

Fazm is an open source macOS AI agent, available on GitHub.
