The 3-Tool-Call Problem - Why Desktop Agents Plateau at Basic Tasks

Matthew Diakonov·March 18, 2026·2 min read

tool-calls action-space desktop-agent multi-step reliability

Desktop AI agents demo beautifully. Open an app, click a button, type some text. Three tool calls, done. But try to extend that to a realistic workflow - open the app, navigate to the right section, find the correct record, update three fields, save, verify, close - and reliability drops off a cliff.

Why 3 Calls Is the Ceiling

Each tool call succeeds with a probability below 100%. If each call succeeds 95% of the time and failures are independent, the probability that all 10 calls succeed is 0.95^10, or about 60%. At 90% per-call reliability, 10 calls give you about 35% end-to-end success. The math is brutal.
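The arithmetic above is easy to check. Here is a minimal sketch, assuming every call succeeds independently with the same probability; the function name is made up for illustration, and real agents rarely have uniform per-call reliability.

```python
def end_to_end_success(per_call: float, calls: int) -> float:
    """Probability that every call in a chain succeeds.

    Assumes each call succeeds independently with the same probability.
    """
    if not 0.0 <= per_call <= 1.0:
        raise ValueError("per_call must be between 0 and 1")
    if calls < 0:
        raise ValueError("calls must be non-negative")
    return per_call ** calls


# Compare a 3-call demo with a 10-call workflow at two reliability levels.
for per_call in (0.95, 0.90):
    for calls in (3, 10):
        rate = end_to_end_success(per_call, calls)
        print(f"per-call {per_call:.0%}, {calls} calls: {rate:.0%} end-to-end")
```

At 95% per call, three calls still land around 86%, which is roughly why short demos look solid while ten-call workflows do not.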

But it is not just probability multiplication. Beyond 3 calls, the agent also faces an action space explosion. After each action, the screen state changes. The agent needs to re-evaluate what to do next based on the new state. The number of possible states grows exponentially with each step.

The Action Space Explosion

At step 1, the agent sees a screen and chooses from maybe 20 possible actions. At step 2, depending on what happened, there might be 20 different screens, each with 20 possible actions. By step 5, the theoretical action space is enormous. The agent cannot have seen all these states during training.
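To put a rough number on "enormous", here is a small sketch that takes the figure of 20 actions per screen from the paragraph above and assumes, as a simplification, that the branching factor stays constant at every step. The function name is hypothetical.

```python
def action_sequences(branching: int, steps: int) -> int:
    """Count distinct action sequences of a given length.

    Assumes a constant number of available actions at every step.
    """
    if branching < 1 or steps < 0:
        raise ValueError("branching must be >= 1 and steps must be >= 0")
    return branching ** steps


# Growth of the theoretical action space over the first five steps.
for step in range(1, 6):
    print(f"step {step}: {action_sequences(20, step):,} possible sequences")
```

Under that assumption, five steps already allow 3,200,000 distinct sequences. Real interfaces have fewer sensible paths, but the growth is still exponential.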

This is why agents that work perfectly in demos fail on real tasks. The demo follows a known path through a known action space. Real usage introduces unexpected dialogs, loading states, error messages, and UI variations the agent has never encountered.

Breaking Through the Plateau

The solution is not better models (though they help). It is reducing the effective action space. Constrain which applications the agent interacts with. Pre-define the workflow steps and let the agent handle each step independently. Use checkpoints so a failure at step 7 does not require restarting from step 1.
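The checkpoint idea can be sketched as a runner that records how far a pre-defined workflow got and resumes from that point. This is an illustrative sketch, not how any particular agent implements it: the step functions, the `state` dictionary, and `run_with_checkpoints` are all hypothetical names, and it assumes each step is safe to retry.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict], None]]


def run_with_checkpoints(
    steps: List[Step], state: Dict, start_at: int = 0, max_attempts: int = 3
) -> int:
    """Run steps in order, starting at index start_at.

    Returns the index of the first step that still fails after
    max_attempts tries, or len(steps) if every step succeeded.
    The returned index is the checkpoint to resume from.
    """
    for index in range(start_at, len(steps)):
        _name, action = steps[index]
        for _attempt in range(max_attempts):
            try:
                action(state)
                break  # step succeeded, move on to the next one
            except Exception:
                continue  # retry this step only
        else:
            return index  # attempts exhausted, stop here
    return len(steps)


# Hypothetical steps: the save step fails the first time it is called.
save_calls = {"count": 0}


def open_app(state: Dict) -> None:
    state["app_open"] = True


def flaky_save(state: Dict) -> None:
    save_calls["count"] += 1
    if save_calls["count"] < 2:
        raise RuntimeError("save dialog not ready")
    state["saved"] = True


workflow = [("open_app", open_app), ("save", flaky_save)]
state: Dict = {}

first_stop = run_with_checkpoints(workflow, state, max_attempts=1)
second_stop = run_with_checkpoints(workflow, state, start_at=first_stop, max_attempts=1)
print(first_stop, second_stop, state)
```

The second run picks up at the failed step instead of reopening the app, which is the point of a checkpoint. A real runner would also need to persist the checkpoint and confirm the screen is still in the expected state before resuming.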

Think of it as transforming one 10-step workflow into three segments of three or four steps each, with human verification between them. Each segment stays close to the reliability sweet spot.
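As a rough illustration of the segmentation, the sketch below splits a 10-step list into three near-equal segments and prints each segment's success rate at an assumed 95% per-call reliability. Ten steps do not divide evenly by three, so one segment gets four steps. The helper name and the generic step labels are placeholders.

```python
from typing import List, Sequence


def split_evenly(steps: Sequence[str], segments: int) -> List[List[str]]:
    """Split steps into the given number of contiguous, near-equal segments."""
    if segments < 1 or segments > len(steps):
        raise ValueError("segments must be between 1 and len(steps)")
    base, extra = divmod(len(steps), segments)
    result: List[List[str]] = []
    start = 0
    for i in range(segments):
        size = base + (1 if i < extra else 0)  # earlier segments absorb the remainder
        result.append(list(steps[start:start + size]))
        start += size
    return result


workflow = [f"step_{n}" for n in range(1, 11)]
segments = split_evenly(workflow, 3)

PER_CALL = 0.95  # assumed per-call reliability, as in the earlier arithmetic
for segment in segments:
    print(f"{len(segment)} steps: {PER_CALL ** len(segment):.0%} chance of a clean run")
```

Each segment runs cleanly about 81% to 86% of the time under these assumptions. The overall gain comes from the verification between segments: a failure costs a retry of one short segment rather than the whole workflow.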

Fazm is an open source macOS AI agent, available on GitHub.

More on This Topic

Related Posts

Why Claude CoWork Feels Like Your Worst Coworker - VM Reliability Issues

Real Problems AI Agents Solve vs Demo Magic - Edge Cases and Reliability

We Tested 5 AI Desktop Agents on 100 Real Tasks - Here's What Actually Works
