GTC 2026: Inference Is Eating the World
The biggest takeaway from GTC 2026 is not about new hardware. It is about economics. Inference - the cost of running AI models to produce output - is becoming the dominant cost in software systems that use AI agents.
Training is a one-time cost. Inference is a recurring tax on every operation.
The Recurring Tax
Every time an agent decides what to do next, that is an inference call. Every tool selection, every response generation, every error recovery decision costs tokens. A desktop agent that runs continuously might make hundreds of LLM calls per hour. At scale, this adds up fast.
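The arithmetic is easy to sketch. The call rates, token counts, and price below are illustrative assumptions, not measured figures for any particular model or provider:

```python
# Back-of-envelope inference cost for a continuously running agent.
# All three constants are assumptions for illustration only.

CALLS_PER_HOUR = 300        # assumed: tool selection, planning, error recovery
TOKENS_PER_CALL = 2_000     # assumed: prompt + completion per call
PRICE_PER_1K_TOKENS = 0.01  # assumed blended price in dollars

def hourly_cost(calls: int, tokens_per_call: int, price_per_1k: float) -> float:
    """Dollars spent per hour of agent runtime."""
    return calls * tokens_per_call / 1000 * price_per_1k

cost = hourly_cost(CALLS_PER_HOUR, TOKENS_PER_CALL, PRICE_PER_1K_TOKENS)
monthly = cost * 24 * 30  # an always-on agent, running around the clock
print(f"${cost:.2f}/hour, ${monthly:.2f}/month")
```

Under these assumptions a single always-on agent costs $6 an hour, over $4,000 a month. Even if the real numbers are off by an order of magnitude, the shape of the curve is the point: cost tracks activity, not deployment.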
The companies building AI agents are discovering that their compute costs scale linearly with usage in a way that traditional software's do not. A web app serving a million users costs roughly the same whether those users are active or idle; its marginal cost per request is near zero. An AI agent spends money every time it thinks.
Minimize Round Trips
The most effective cost optimization for agents is reducing the number of LLM round trips per task. Every technique that lets the agent accomplish more per inference call - better prompting, more context in each call, smarter tool design - directly reduces cost.
This is why local models matter. A small local model handling routine decisions while a cloud model handles complex ones can cut inference costs by 80% or more. The local model is effectively free after the hardware purchase. The cloud model is pay-per-token.
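A minimal routing sketch makes the split concrete. The complexity heuristic, prices, and token estimate here are placeholder assumptions; a real router would use a learned classifier or explicit task tags:

```python
# Hypothetical local/cloud router. The heuristic, the chars-per-token
# estimate, and the cloud price are assumptions for illustration.

def is_routine(prompt: str) -> bool:
    # Assumed heuristic: short prompts without code blocks are "routine".
    return len(prompt) < 500 and "```" not in prompt

def route(prompt: str) -> str:
    """Return which model tier should handle this prompt."""
    return "local" if is_routine(prompt) else "cloud"

def marginal_cost(prompt: str, cloud_price_per_1k: float = 0.01) -> float:
    """Dollar cost of one call under this routing policy."""
    if route(prompt) == "local":
        return 0.0  # local inference: no marginal cost after hardware
    tokens = len(prompt) / 4  # rough chars-per-token assumption
    return tokens / 1000 * cloud_price_per_1k
```

If most decisions an agent makes are routine (pick a tool, confirm a state change), the local tier absorbs the bulk of call volume and the cloud bill shrinks accordingly.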
What This Means for Agents
Agent architectures that minimize LLM calls will outcompete those that do not. Caching previous decisions, using deterministic logic for predictable situations, and batching context to reduce round trips are not optimizations - they are requirements for sustainable agent economics.
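Two of those techniques, caching and deterministic fast paths, can be sketched in a few lines. The `llm_decide` stub below stands in for a real (expensive) model call; the situation keys and the fast-path rule are invented for illustration:

```python
# Sketch of a decision cache plus a deterministic fast path.
# Identical situations reuse a prior decision instead of paying
# for a fresh inference call.

from functools import lru_cache

CALLS = {"llm": 0}  # counter to show how many model calls actually happen

def llm_decide(situation: str) -> str:
    # Placeholder for an expensive LLM call.
    CALLS["llm"] += 1
    return f"plan-for:{situation}"

@lru_cache(maxsize=1024)
def cached_decide(situation: str) -> str:
    # Repeated situations hit the cache; only novel ones reach the model.
    return llm_decide(situation)

def decide(situation: str) -> str:
    # Deterministic fast path: predictable cases need no model at all.
    if situation == "target_file_already_exists":
        return "skip-download"
    return cached_decide(situation)
```

The fast path costs nothing, the cache makes repeated situations cost nothing after the first hit, and only genuinely novel situations pay the inference tax.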
Fazm is an open-source macOS AI agent, available on GitHub.