Agentic AI token economics, the variable everyone misses is the per-turn input
Most writing on agentic AI cost covers pricing tiers, caching, model routing, and CFO-grade TCO analyses. All useful. None of it touches the one variable that actually decides the bill on a computer-use agent: what gets re-sent on every loop iteration. The math is simple, the lever is concrete, and you can verify it in three constants in an open-source bridge.
Direct answer (verified 2026-05-05)
An agent is a loop. On every iteration the runtime re-sends the system prompt, the tool schema, the conversation so far, and a fresh dump of world state to the model. Cost equals iteration count multiplied by per-turn input size. Input tokens dominate output by 30 to 100 times. For computer-use agents the per-turn input is dominated by screen-state representation, where accessibility-tree text runs 6 to 10 times cheaper than screenshots of the same window. That ratio is the whole economics conversation in one variable.
The unit economics
Cost = iterations times per-turn input
For a single chat completion you pay roughly the prompt plus the answer. For an agent loop you pay the prompt every iteration, plus a slowly growing history, plus a fresh chunk of whatever the agent needs to see this turn. Output is a rounding error. The arXiv paper How Do AI Agents Spend Your Money measured agentic coding tasks at roughly 1,000 times the token cost of equivalent chat reasoning, with input tokens, not output, making up the dominant share.
The mental model worth holding: a 500-token first prompt does not cost 500 tokens. It costs 500 tokens on turn one, 1,200 on turn two, 2,500 on turn three, and so on, because every prior turn is now part of the input. By turn ten you can be at 30,000 tokens per call. Where exactly that curve sits depends on three things: how big the static prefix is, how aggressive the history pruning is, and how big the per-turn fresh payload is. The third one is where the asymmetric leverage lives.
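That curve is easy to sketch. A minimal toy cost model in TypeScript, with every token count an illustrative assumption rather than a provider measurement:

```typescript
// Toy cost model for an agent loop. All numbers are illustrative assumptions.
const PREFIX_TOKENS = 500;          // system prompt + tool schema, re-sent every turn
const FRESH_PAYLOAD_TOKENS = 300;   // world-state dump, e.g. an accessibility tree
const OUTPUT_TOKENS_PER_TURN = 100; // one tool call

// Input for turn n: static prefix + everything emitted so far + fresh payload.
// Each prior turn contributes its fresh payload and its tool call to history.
function perTurnInput(turn: number): number {
  const history = (turn - 1) * (FRESH_PAYLOAD_TOKENS + OUTPUT_TOKENS_PER_TURN);
  return PREFIX_TOKENS + history + FRESH_PAYLOAD_TOKENS;
}

function cumulativeInput(turns: number): number {
  let total = 0;
  for (let t = 1; t <= turns; t++) total += perTurnInput(t);
  return total;
}

console.log(perTurnInput(1));     // 800
console.log(perTurnInput(10));    // 4400
console.log(cumulativeInput(10)); // 26000
```

Even with a modest 300-token fresh payload, the ten-turn loop bill is over thirty times the single-chat bill. Swap the 300 for a 2,500-token screenshot and every line of that output scales accordingly.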
What is in the input on every turn
Anatomy of one agent iteration
Useful to draw this once. The diagram below is one turn of a computer-use agent. The runtime concatenates four payloads, ships them to the model, the model emits one tool call, the tool runs, the result gets appended, and the next iteration starts. Every box is input-token cost.
One iteration, four payloads, the model sees them all
Three of the four input payloads are largely outside your control. The system prompt is whatever the agent author shipped. The tool schema is whatever tools you registered. The history is whatever happened in the loop already, modulo a pruning policy. The fourth, the fresh world-state dump, is the one you can change with a representation choice. For coding agents that chunk is retrieved code; for browsing agents it is page content; for desktop computer-use agents it is the active window. The encoding decision on that fourth payload swings the whole bill.
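The four payloads above reduce to one assembly step. The shapes and names in this sketch are hypothetical, not any particular runtime's API, but the structure is the same everywhere:

```typescript
// One turn's input, assembled from the four payloads. Names are hypothetical.
interface TurnPayloads {
  systemPrompt: string; // fixed by the agent author
  toolSchema: string;   // fixed by the registered tools
  history: string[];    // grows each turn, modulo pruning
  worldState: string;   // the one payload a representation choice controls
}

function buildTurnInput(p: TurnPayloads): string {
  // The model sees all four, concatenated, on every single iteration.
  return [p.systemPrompt, p.toolSchema, ...p.history, p.worldState].join("\n");
}

// Rough token estimate at ~4 characters per token (a common rule of thumb).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}
```

The point of writing it out: three of the four fields are fixed before the loop starts, and only `worldState` is a per-turn design decision.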
The biggest single lever
Screenshot vs accessibility tree, the 6 to 10x decision
The same active window can be encoded two ways. Way one is a PNG screenshot, handed to a vision model, which tokenises it into roughly 1,500 to 3,000 input tokens per turn depending on resolution and the provider tile policy. Way two is a compact accessibility tree where each interactive element is a tiny record of role, text, and bounding box, which fits in the 200 to 400 token range for the same window. Same information, different encoding, multiplicative cost gap.
Same window, two encodings, one decision that decides the bill
Render the active window as a PNG, hand it to a vision model. Easy to set up, works on any app, default in most computer-use research demos. Pays the cost on every turn.
- 1,500 to 3,000 input tokens per turn
- Vision model required for every reasoning step
- 150,000 cumulative tokens over a 10-step task
- Loop dominated by prefill on consumer Apple Silicon
Two consequences worth being explicit about. On a hosted model the difference is a 6 to 10x reduction in input dollars per run, which compounds with every long-running task. On a local model the dollars are zero either way, but prefill seconds are not, and prefill at 200 input tokens is qualitatively different from prefill at 3,000 input tokens. The accessibility-tree path is the only path that makes a 13B model on a laptop feel snappy in a real loop.
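To make the gap concrete, here is a minimal accessibility-tree serializer under the ~4-characters-per-token rule of thumb. The element shape is hypothetical; a real tree comes from the OS accessibility API:

```typescript
// Minimal accessibility-tree serializer. Element shape is a hypothetical
// stand-in for what an OS accessibility API returns.
interface AXElement {
  role: string;                             // "button", "textfield", ...
  label: string;                            // visible text
  bounds: [number, number, number, number]; // x, y, w, h
}

function serializeTree(elements: AXElement[]): string {
  // One compact line per interactive element: role, label, bounding box.
  return elements
    .map((e) => `${e.role} "${e.label}" @${e.bounds.join(",")}`)
    .join("\n");
}

// A 30-element window serializes to a few hundred tokens at the
// ~4 chars/token heuristic, versus 1,500-3,000 for a screenshot of it.
const activeWindow: AXElement[] = Array.from({ length: 30 }, (_, i) => ({
  role: "button",
  label: `Item ${i}`,
  bounds: [10, 40 * i, 120, 32],
}));
const tree = serializeTree(activeWindow);
console.log(Math.ceil(tree.length / 4)); // lands in the 200-400 range
```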
How this looks in real code
Three constants and two filters in the Fazm bridge
Most of the literature talks about token economics in the abstract. Easier to point at concrete code. Fazm ships an ACP bridge that sits between the desktop chat surface and the underlying agent runtime, and the bridge has explicit token-economics architecture decisions encoded as constants. All of these live in the open-source repo at github.com/m13v/fazm under acp-bridge/src/index.ts.
Reading top to bottom. MAX_SCREENSHOT_DIM caps any inadvertent screenshot at 1920 pixels, because Retina captures would otherwise breach the provider's 2000-pixel ceiling. MAX_IMAGE_TURNS hard-caps the number of image-bearing turns in a single session at twenty, because the API tightens its multi-image limits once a session carries many images. The Playwright MCP server is spawned with --output-mode file and --image-responses omit so its tool results carry text snapshots of the page, not inlined base64 PNGs. And the bridge result handler iterates over every MCP content array and forwards only the text-typed items to the model, so even when an MCP tool insists on returning an image the model never has to tokenise it.
The point is not that those exact constants are the right numbers for every project. The point is that the architecture made a deliberate decision about which payload the model is allowed to see on every turn, and the decision is visible in code. Most agentic systems have not made that decision; they just plumb everything to the model and pay the bill.
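The text-only filtering pass can be sketched in a few lines. The content shape follows the MCP convention of typed content blocks, but treat the exact field names here as an assumption rather than the bridge's actual code:

```typescript
// Sketch of a text-only filter over MCP tool-result content. Field names
// follow the MCP typed-content-block convention; treat them as assumptions.
type McpContent =
  | { type: "text"; text: string }
  | { type: "image"; data: string; mimeType: string };

function textOnly(content: McpContent[]): string[] {
  // Image items are dropped here, so the model never tokenises base64 payloads
  // even when a tool insists on returning one.
  return content
    .filter((c): c is Extract<McpContent, { type: "text" }> => c.type === "text")
    .map((c) => c.text);
}
```

The design choice worth copying is the placement: the filter sits between the tool and the model, so no registered tool can unilaterally inflate the per-turn input.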
The other levers, in order
What to optimise after representation
Once the per-turn fresh payload is sane, the remaining levers come into play in roughly this order. None of them beats the representation choice if the representation choice is wrong; all of them stack with it once it is right.
Levers ranked by impact for computer-use agents
- Screen-state representation. Accessibility tree where the OS supports it, screenshot only as the explicit fallback. Single biggest variable. 6 to 10x.
- Tool schema discipline. Agents pay for the full schema every turn. A 50-tool catalogue with verbose JSON schemas can be 5,000 tokens of static cost forever. Keep the surface lean.
- Prompt caching. Static prefix (system prompt + tool schema) is a perfect cache candidate. Provider cache reads run at roughly 10 percent of input cost on Anthropic and OpenAI.
- History pruning. Long sessions accumulate tool results no future turn will reference. Trim or summarise old turns explicitly; do not let the input grow forever.
- Model routing. Cheap model for narrow turns, frontier model for the hard ones. Real wins, but only after representation, schema, and caching are in order.
- Output token discipline. Output is a small share of total spend on agents. Worth tightening, not the headline lever.
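Of these, history pruning is the easiest to show in code. A minimal sketch, assuming a flat turn list and an arbitrary keep-recent budget:

```typescript
// Sketch of explicit history pruning: keep the most recent turns verbatim,
// collapse everything older into a one-line stub. Budget is an assumption.
interface Turn {
  role: "assistant" | "tool";
  text: string;
}

function pruneHistory(history: Turn[], keepRecent: number): Turn[] {
  if (history.length <= keepRecent) return history;
  const old = history.slice(0, history.length - keepRecent);
  const recent = history.slice(history.length - keepRecent);
  // Old tool results are the bulk of the dead weight; stub them out.
  const stub: Turn = { role: "tool", text: `[${old.length} earlier turns elided]` };
  return [stub, ...recent];
}
```

A summarisation call instead of a flat stub is the richer variant, but even the stub stops the input from growing without bound.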
Anchor metric
“MAX_IMAGE_TURNS is the hard cap on image-bearing turns per session enforced by the open-source Fazm bridge. Above this the runtime stops sending screenshots and falls back to text representations to keep the API call under multi-image limits.”
github.com/m13v/fazm, acp-bridge/src/index.ts line 1242
The fact that this number exists at all is the interesting part. A system without an explicit cap is implicitly trusting that no session will ever exceed the provider limit. That is a financial decision, not a technical one. Most production agents discover the cap the first time a long task fails with a 400 error, after paying full prefill on every turn that led there. The cheaper education is to write the constant first.
How to audit your own agent
Three measurements that surface the hot variable
You cannot optimise what you do not measure, and most agent dashboards stop at total tokens per session. That is too coarse to find the leak. Three small additions to your logging surface the answer in one task run.
- Log per-turn input and output tokens, not per-session. Most providers return them on every response. Plot the input count by turn index. A flat line means your loop is sane. A rising line means history pruning is missing.
- Decompose the per-turn input into prefix, history, and fresh payload. Three numbers per turn. The prefix should be constant, the history monotonically growing, the fresh payload should sit in a tight band. If the fresh payload is the largest of the three, you are paying for the world-state representation; that is the lever to flip.
- Compare a representative task on two representations. Run the same 10-step task with screenshots, then with accessibility-tree text, log both. The ratio you see is yours, not the literature average. Decide based on your ratio, not someone else's.
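The decomposition in step two reduces to three numbers per turn and one comparison. A sketch, with the token counts standing in for whatever your provider's usage fields return:

```typescript
// Per-turn audit record: the three components of input, logged separately.
// Counts would come from your provider's usage fields; here they are plain numbers.
interface TurnLog {
  turn: number;
  prefixTokens: number;  // system prompt + tool schema (should be flat)
  historyTokens: number; // conversation so far (grows unless pruned)
  freshTokens: number;   // world-state dump (should sit in a tight band)
}

// Which component dominates this turn's input, ties going to the later check.
function hotVariable(log: TurnLog): "prefix" | "history" | "fresh" {
  const { prefixTokens, historyTokens, freshTokens } = log;
  if (freshTokens >= prefixTokens && freshTokens >= historyTokens) return "fresh";
  if (historyTokens >= prefixTokens) return "history";
  return "prefix";
}
```

Run it over one task's logs and the lever to flip falls out: "fresh" points at representation, "history" at pruning, "prefix" at schema discipline and caching.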
After step three the conversation about caching, routing, and pricing tier becomes a real engineering conversation grounded in measurements. Before step three it is vibes.
Want to run the per-turn audit on a real agent?
Twenty minutes, screen share, your agent. We instrument per-turn input decomposition, identify your hot variable, and decide whether representation, schema discipline, or caching is the right first move.
Questions people ask after reading this
What is agentic AI token economics in one sentence?
It is the cost structure that emerges when an LLM is wrapped in a loop that re-sends the system prompt, the full tool schema, the running conversation, and a fresh dump of world state on every iteration. Cost equals iteration count times per-turn input size. Output tokens are a rounding error compared to that input size, often 30 to 100 times smaller. The whole field of optimisation is about either shrinking the per-turn input or shrinking the iteration count.
Why do agentic workflows cost so much more than chat?
Chat is one round trip; the model sees the input once, emits an answer, and the conversation moves on. An agent is a feedback loop. On every turn the runtime concatenates the system prompt, the entire tool schema, the conversation so far, and the freshest available world state, then the model emits a tool call, the runtime executes it, appends the result, and goes around again. A 500-token first prompt commonly inflates to 30,000 tokens of context by turn ten. The arXiv paper How Do AI Agents Spend Your Money found that agentic coding tasks consume roughly 1,000 times more tokens than chat for equivalent reasoning, and input tokens dominate the bill.
Is the answer really just caching and model routing?
No, those are second-order. They help, but they only matter if the per-turn input is well-shaped to begin with. Caching reduces the cost of re-sending an identical prefix; it does nothing for the trailing world-state chunk that genuinely changes every turn. Model routing helps if some turns are easy and some are hard, but it does not change the size of the input that goes to either model. The first-order question for any agentic system is what gets re-sent on every turn unchanged and what gets re-sent on every turn but slightly different. Optimise that first; cache the static parts second.
Why does screen-state representation dominate the math for computer-use agents?
A computer-use agent has to know what is on the screen before it can decide what to do next. There are two viable encodings. A screenshot fed to a vision model lands somewhere in the 1,500 to 3,000 input-token range per turn depending on resolution and tile policy. A compact accessibility tree of the same window, where each interactive element becomes a tiny record of role, text, and bounding box, lands in the 200 to 400 token range. Same window, same information. One is six to ten times more expensive on every turn. Multiply by ten turns and you have moved from 30,000 to 3,000 input tokens or vice versa. That is the entire economics conversation in one variable.
Where can I see this trade-off in real production code?
Fazm is fully open source on GitHub at github.com/m13v/fazm and the relevant constants live in three places in acp-bridge/src/index.ts. Line 1242 declares MAX_IMAGE_TURNS = 20, a hard cap on how many image-bearing turns a single session can rack up before the bridge stops sending screenshots to keep the API call under multi-image limits. Line 1037 declares MAX_SCREENSHOT_DIM = 1920, a resize watcher that prevents Retina screenshots from blowing past the 2000px API ceiling. Line 1491 pushes --image-responses omit and --output-mode file into the spawned Playwright MCP server so screenshots get written to /tmp/playwright-mcp instead of inlined as base64 in tool results. Lines 3370 to 3384 then iterate over MCP content arrays and only forward type:"text" items to the model. Three constants and two filtering passes is what an explicit token-economics decision looks like in code.
What other levers are there besides representation?
Four, in rough order of impact for computer-use agents. One, screen-state representation, the 6-to-10x lever covered above. Two, tool schema discipline; agents pay tokens for the full schema on every turn, so over-broad tool catalogues bleed input tokens forever. Three, conversation pruning; long-running sessions accumulate dead tool results that no future turn will reference, and a deliberate trim or summarise step claws back kilobytes per iteration. Four, prompt caching; once the leading static prefix is stable, providers like Anthropic and OpenAI offer cache reads at a fraction of input cost. Cache reads are genuinely the cheapest input tokens the model can consume, but only on the static prefix. The dynamic suffix still pays full freight, which is why representation matters first.
Are output tokens ever the bottleneck?
Rarely for agents. The model emits one tool call per turn, usually 20 to 200 output tokens. On a 10-turn task that is 2,000 output tokens. The same task fed accessibility-tree state runs 25,000 input; fed screenshot state, 150,000 input. Output is 1 to 8 percent of total tokens in either case. People who optimise output token count for agents are sanding the hood while the engine is on fire.
Does running the model locally change the economics?
It changes the units, not the structure. Locally, you stop paying dollars per million tokens and start paying prefill seconds per turn. On consumer Apple Silicon a 13B model prefills somewhere in the low hundreds of tokens per second. A screenshot-based agent feeding 2,000 input tokens per turn spends 10 to 20 seconds per turn just on prefill, before the model emits a single output token. An accessibility-tree agent feeding 300 tokens per turn spends 1 to 3 seconds. Across a 10-step task the screenshot agent finishes in roughly the time the accessibility-tree agent finishes step three. The currency changes from dollars to wall-clock seconds; the variable that decides the bill is identical.
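That arithmetic is worth writing down once. The prefill rate below is an assumed figure for a 13B model on consumer Apple Silicon; substitute your own measurement:

```typescript
// Local-model arithmetic: dollars become prefill seconds.
// The rate is an assumption; measure your own hardware and runtime.
const PREFILL_TOKENS_PER_SEC = 200;

function prefillSeconds(inputTokens: number): number {
  return inputTokens / PREFILL_TOKENS_PER_SEC;
}

console.log(prefillSeconds(2000)); // 10  -> screenshot turn, before any output
console.log(prefillSeconds(300));  // 1.5 -> accessibility-tree turn
```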
What about prompt caching, does it solve this?
Caching helps a lot on the static prefix and almost nothing on the dynamic tail. The system prompt and tool schema are perfect cache candidates because they do not change between turns within a session; providers can flag those bytes and read them at roughly 10 percent of input cost. The conversation history and the latest world-state dump are different on every turn by definition. No cache hits there. So caching turns a 30,000-token-per-turn agent into something like 12,000 effective tokens per turn, which is real money, but the dynamic tail is still the dominant cost. Shrink the tail first, then cache the static head.
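The arithmetic behind that 30,000-to-12,000 claim, with an illustrative split of static prefix versus dynamic tail:

```typescript
// Cache arithmetic: static prefix billed at a discounted read rate,
// dynamic tail at full price. Rates and the 20k/10k split are illustrative.
const CACHE_READ_DISCOUNT = 0.1; // ~10% of input price on cached prefix reads

function effectiveTokens(prefixTokens: number, dynamicTokens: number): number {
  return prefixTokens * CACHE_READ_DISCOUNT + dynamicTokens;
}

// A 30,000-token turn split as 20,000 static + 10,000 dynamic:
console.log(effectiveTokens(20000, 10000)); // ≈12,000 effective tokens
```

Note which term the discount touches: only the prefix. Shrinking the 10,000-token dynamic tail is the lever caching cannot reach.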
How do I sanity-check the token bill on my own agent today?
Three steps. First, log input_tokens and output_tokens for every turn, not just every session; most providers return them on every response. Second, instrument the runtime to also log how many tokens of that input were prefix (system + tool schema), history, and current-turn payload (world-state chunk, latest tool result). Third, run one task, plot the three numbers per turn, and you will see exactly which one is your hot variable. For computer-use agents it is almost always the world-state chunk. For coding agents it is usually history plus retrieved file context. Without that breakdown every optimisation is guesswork.
Deeper dives
Related, by lever
Local LLM workflow literacy, the five primitives that turn a chatbox into work
The agent loop, screen-state representation, the swappable reasoner, skills, and memory. Five primitives that decide whether a local model feels like a toy or like work.
Reduce AI agent token costs with MCP strategies for code intelligence
Signature-only retrieval, tree-sitter indexing, and scoped MCP context for coding agents. Token discipline at the retrieval layer.
Accessibility tree vs screenshots, the representation a computer-use agent should ship with
The same window, two encodings, one decision that decides whether a 13B model on your laptop is fast or unusable.