Inference Optimization Is a Distraction for AI Agent Builders
We spent two months optimizing inference latency. Prompt caching, model distillation, batch processing, streaming responses. We shaved 400ms off every API call. Then we measured the end-to-end task completion time and it barely moved.
The reason is embarrassing in hindsight. A typical agent workflow looks like this: model call (800ms), wait for page to load (2-4 seconds), model call (800ms), wait for animation to finish (1-2 seconds), model call (800ms), wait for element to become clickable (1-3 seconds). That is 2.4 seconds of inference against 4 to 9 seconds of waiting, so the model calls are a minority of total time, and shaving them barely dents the end-to-end number.
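To make the arithmetic concrete, here is a back-of-envelope calculation using the illustrative timings above (these are the example numbers from the workflow, not measurements):

```python
# Back-of-envelope breakdown of the example workflow above. Timings are
# the illustrative numbers from the text, not measurements.

MODEL_CALL = 0.8                     # seconds per inference call
model_time = 3 * MODEL_CALL          # 2.4 s of inference

wait_fast = 2 + 1 + 1                # best-case UI waits: 4 s
wait_slow = 4 + 2 + 3                # worst-case UI waits: 9 s

total_fast = model_time + wait_fast  # 6.4 s
total_slow = model_time + wait_slow  # 11.4 s

print(f"inference share: {model_time / total_slow:.0%}-{model_time / total_fast:.0%}")
# → inference share: 21%-38%

# A 400 ms speedup per call saves only 1.2 s end to end:
saved = 3 * 0.4
print(f"end-to-end gain: {saved / total_slow:.0%}-{saved / total_fast:.0%}")
# → end-to-end gain: 11%-19%
```

Even in the best case, halving inference latency recovers less than a fifth of total runtime, which matches what we saw when we measured.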
The Real Bottleneck
Action execution dominates agent runtime. Clicking a button and waiting for the resulting page load. Typing text and waiting for autocomplete suggestions to appear. Opening a dropdown and waiting for options to render. Scrolling and waiting for lazy-loaded content.
These waits are not optional. If the agent acts before the UI is ready, it fails. And these waits are measured in seconds, not milliseconds.
Where to Actually Optimize
Reduce unnecessary actions. If the agent can fill three form fields and then tab between them instead of clicking each one individually, that eliminates two click-and-wait cycles. If the agent can use keyboard shortcuts instead of navigating menus, that skips multiple UI transitions.
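A rough model of the form-filling example, with assumed per-step timings (the constants are illustrative, not benchmarks):

```python
# Sketch: estimate the time saved by tabbing between form fields instead
# of clicking into each one. All timings are assumptions for illustration.

CLICK_AND_WAIT = 1.5   # click + wait for focus/UI to settle (assumed)
TAB = 0.1              # keyboard Tab, nothing to wait for (assumed)
TYPE = 0.5             # typing one field's value (assumed)

def fill_with_clicks(n_fields: int) -> float:
    """Click into every field before typing it."""
    return n_fields * (CLICK_AND_WAIT + TYPE)

def fill_with_tabs(n_fields: int) -> float:
    """Click once into the first field, then Tab to the rest."""
    return CLICK_AND_WAIT + TYPE + (n_fields - 1) * (TAB + TYPE)

print(fill_with_clicks(3))  # 6.0
print(fill_with_tabs(3))    # 3.2
```

Two eliminated click-and-wait cycles nearly halve the form-filling time, which is more than any realistic inference optimization can claim.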
Parallelize verification. Instead of waiting for each action to complete before verifying, take a screenshot while the next action is being planned. Overlap thinking time with waiting time.
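A minimal sketch of the overlap, assuming an async agent loop; `capture_screenshot` and `plan_next_action` are hypothetical stand-ins whose latency is simulated with sleeps:

```python
import asyncio
import time

# Sketch of overlapping verification with planning. capture_screenshot
# and plan_next_action are hypothetical stand-ins; asyncio.sleep
# simulates their latency.

async def capture_screenshot() -> str:
    await asyncio.sleep(0.2)          # simulated screen-capture latency
    return "screenshot"

async def plan_next_action() -> str:
    await asyncio.sleep(0.2)          # simulated model "thinking" time
    return "click submit"

async def sequential() -> float:
    start = time.monotonic()
    await capture_screenshot()        # wait, then think
    await plan_next_action()
    return time.monotonic() - start

async def overlapped() -> float:
    start = time.monotonic()
    # Verify the last action while planning the next one.
    await asyncio.gather(capture_screenshot(), plan_next_action())
    return time.monotonic() - start

seq = asyncio.run(sequential())       # ~0.4 s
par = asyncio.run(overlapped())       # ~0.2 s
print(f"sequential {seq:.2f}s, overlapped {par:.2f}s")
```

The overlapped version takes roughly the maximum of the two latencies instead of their sum, so the planning time effectively disappears whenever the UI wait is longer.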
Predict UI state. If the agent knows a button click will open a specific dialog, it can start planning the next action before the dialog fully renders.
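One way to sketch this is speculative planning with a validation step: plan against the predicted dialog state while the real one renders, then check the prediction before acting. All names here (`predict_dialog`, `plan_on`, and the dialog shape) are hypothetical:

```python
# Sketch of speculative planning: plan against the *predicted* dialog
# state while the real one renders, then validate before acting.
# All names and the dialog-state shape are hypothetical.

def predict_dialog(button: str) -> dict:
    # The agent knows (e.g. from a previous run) what this click opens.
    return {"dialog": "save-as", "fields": ["filename", "format"]}

def plan_on(state: dict) -> str:
    return f"type into {state['fields'][0]}"

def act_with_prediction(button: str, rendered: dict) -> str:
    predicted = predict_dialog(button)
    plan = plan_on(predicted)          # planned before rendering finished
    if rendered == predicted:
        return plan                    # prediction held: act immediately
    return plan_on(rendered)           # mismatch: replan on the real state

# The rendered dialog matches the prediction, so the pre-computed
# plan is used with zero additional planning latency.
rendered = {"dialog": "save-as", "fields": ["filename", "format"]}
print(act_with_prediction("Save", rendered))  # type into filename
```

When the prediction holds, the render wait and the planning happen concurrently; when it misses, the agent falls back to planning on the actual state and loses nothing.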
Each of these optimizations saves whole seconds per task, more than any inference speedup ever will. Focus on the parts that are actually slow, not the parts that feel like they should be slow.
Fazm is an open-source macOS AI agent, available on GitHub.