Inference Optimization Is a Distraction for AI Agent Builders
We spent two months optimizing inference latency. Prompt caching, model distillation, batch processing, streaming responses. We shaved 400ms off every API call. Then we measured the end-to-end task completion time and it barely moved.
The reason is embarrassing in hindsight. A typical agent workflow looks like this: model call (800ms), wait for page to load (2-4 seconds), model call (800ms), wait for animation to finish (1-2 seconds), model call (800ms), wait for element to become clickable (1-3 seconds). That is 2.4 seconds of inference against 4 to 9 seconds of waiting, so the model calls are a minority of total time, and shaving them barely dents the end-to-end number.
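To make the arithmetic concrete, here is a back-of-envelope calculation using the illustrative timings above (these are the example numbers from the workflow, not measurements):

```python
# Back-of-envelope breakdown of the example workflow above. Timings are
# the illustrative numbers from the text, not measurements.

MODEL_CALL = 0.8                     # seconds per inference call
model_time = 3 * MODEL_CALL          # 2.4 s of inference

wait_fast = 2 + 1 + 1                # best-case UI waits: 4 s
wait_slow = 4 + 2 + 3                # worst-case UI waits: 9 s

total_fast = model_time + wait_fast  # 6.4 s
total_slow = model_time + wait_slow  # 11.4 s

print(f"inference share: {model_time / total_slow:.0%}-{model_time / total_fast:.0%}")
# → inference share: 21%-38%

# A 400 ms speedup per call saves only 1.2 s end to end:
saved = 3 * 0.4
print(f"end-to-end gain: {saved / total_slow:.0%}-{saved / total_fast:.0%}")
# → end-to-end gain: 11%-19%
```

Even in the best case, halving inference latency recovers less than a fifth of total runtime, which matches what we saw when we measured.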
The Real Bottleneck
Action execution dominates agent runtime. Clicking a button and waiting for the resulting page load. Typing text and waiting for autocomplete suggestions to appear. Opening a dropdown and waiting for options to render. Scrolling and waiting for lazy-loaded content.
These waits are not optional. If the agent acts before the UI is ready, it fails. And these waits are measured in seconds, not milliseconds.
Where to Actually Optimize
Reduce unnecessary actions. If the agent can fill three form fields and then tab between them instead of clicking each one individually, that eliminates two click-and-wait cycles. If the agent can use keyboard shortcuts instead of navigating menus, that skips multiple UI transitions.
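A rough model of the form-filling example, with assumed per-step timings (the constants are illustrative, not benchmarks):

```python
# Sketch: estimate the time saved by tabbing between form fields instead
# of clicking into each one. All timings are assumptions for illustration.

CLICK_AND_WAIT = 1.5   # click + wait for focus/UI to settle (assumed)
TAB = 0.1              # keyboard Tab, nothing to wait for (assumed)
TYPE = 0.5             # typing one field's value (assumed)

def fill_with_clicks(n_fields: int) -> float:
    """Click into every field before typing it."""
    return n_fields * (CLICK_AND_WAIT + TYPE)

def fill_with_tabs(n_fields: int) -> float:
    """Click once into the first field, then Tab to the rest."""
    return CLICK_AND_WAIT + TYPE + (n_fields - 1) * (TAB + TYPE)

print(fill_with_clicks(3))  # 6.0
print(fill_with_tabs(3))    # 3.2
```

Two eliminated click-and-wait cycles nearly halve the form-filling time, which is more than any realistic inference optimization can claim.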
Parallelize verification. Instead of waiting for each action to complete before verifying, take a screenshot while the next action is being planned. Overlap thinking time with waiting time.
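A minimal sketch of the overlap, assuming an async agent loop; `capture_screenshot` and `plan_next_action` are hypothetical stand-ins whose latency is simulated with sleeps:

```python
import asyncio
import time

# Sketch of overlapping verification with planning. capture_screenshot
# and plan_next_action are hypothetical stand-ins; asyncio.sleep
# simulates their latency.

async def capture_screenshot() -> str:
    await asyncio.sleep(0.2)          # simulated screen-capture latency
    return "screenshot"

async def plan_next_action() -> str:
    await asyncio.sleep(0.2)          # simulated model "thinking" time
    return "click submit"

async def sequential() -> float:
    start = time.monotonic()
    await capture_screenshot()        # wait, then think
    await plan_next_action()
    return time.monotonic() - start

async def overlapped() -> float:
    start = time.monotonic()
    # Verify the last action while planning the next one.
    await asyncio.gather(capture_screenshot(), plan_next_action())
    return time.monotonic() - start

seq = asyncio.run(sequential())       # ~0.4 s
par = asyncio.run(overlapped())       # ~0.2 s
print(f"sequential {seq:.2f}s, overlapped {par:.2f}s")
```

The overlapped version takes roughly the maximum of the two latencies instead of their sum, so the planning time effectively disappears whenever the UI wait is longer.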
Predict UI state. If the agent knows a button click will open a specific dialog, it can start planning the next action before the dialog fully renders.
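One way to sketch this is speculative planning with a validation step: plan against the predicted dialog state while the real one renders, then check the prediction before acting. All names here (`predict_dialog`, `plan_on`, and the dialog shape) are hypothetical:

```python
# Sketch of speculative planning: plan against the *predicted* dialog
# state while the real one renders, then validate before acting.
# All names and the dialog-state shape are hypothetical.

def predict_dialog(button: str) -> dict:
    # The agent knows (e.g. from a previous run) what this click opens.
    return {"dialog": "save-as", "fields": ["filename", "format"]}

def plan_on(state: dict) -> str:
    return f"type into {state['fields'][0]}"

def act_with_prediction(button: str, rendered: dict) -> str:
    predicted = predict_dialog(button)
    plan = plan_on(predicted)          # planned before rendering finished
    if rendered == predicted:
        return plan                    # prediction held: act immediately
    return plan_on(rendered)           # mismatch: replan on the real state

# The rendered dialog matches the prediction, so the pre-computed
# plan is used with zero additional planning latency.
rendered = {"dialog": "save-as", "fields": ["filename", "format"]}
print(act_with_prediction("Save", rendered))  # type into filename
```

When the prediction holds, the render wait and the planning happen concurrently; when it misses, the agent falls back to planning on the actual state and loses nothing.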
Each of these optimizations saves whole seconds per task, more than any inference speedup ever will. Focus on the parts that are actually slow, not the parts that feel like they should be slow.
Fazm is an open-source macOS AI agent, available on GitHub.