The Hardest Part of Building AI Agents Is Execution, Not Planning
Give a modern LLM a task description and it will produce a solid plan. "To book a flight, I need to open the airline website, enter the departure city, select dates, choose a flight, fill in passenger details, and complete payment." The plan is correct. The model knows what to do.
The problem is doing it.
Where Execution Breaks
Browser and UI interaction is where everything falls apart. A button that was at coordinates (450, 320) when the agent took its screenshot has moved to (450, 380) because a banner loaded. The agent clicks, hits the wrong element, and the entire flow derails.
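One way to guard against a target that moves between the screenshot and the click is to re-locate it until its position stops changing, and only then click. The sketch below is a minimal illustration in Python, not how any particular agent does it: `Box`, `click_when_stable`, and the injected `locate` and `click` callables are all hypothetical names standing in for whatever your automation layer provides.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Bounding box of a UI element in screen coordinates (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int

    def center(self) -> Tuple[int, int]:
        return (self.x + self.width // 2, self.y + self.height // 2)


def click_when_stable(
    locate: Callable[[], Optional[Box]],
    click: Callable[[int, int], None],
    settle_checks: int = 2,
    max_attempts: int = 10,
    pause: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Click the element only after `locate` has returned the same box
    `settle_checks` times in a row. Returns False if it never settles."""
    last: Optional[Box] = None
    stable = 0
    for _ in range(max_attempts):
        box = locate()
        if box is not None and box == last:
            stable += 1
        else:
            # First sighting or the element moved: restart the count.
            stable = 1 if box is not None else 0
        last = box
        if box is not None and stable >= settle_checks:
            click(*box.center())
            return True
        sleep(pause)
    return False
```

With the banner example above, the first lookup would report the button centered at (450, 320), the next two would agree on (450, 380), and the click would land on the second position. The cost is one extra lookup per click, which is usually cheaper than recovering from a misclick.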
Page load timing is another constant source of failure. The agent sees a loading spinner, waits what it thinks is long enough, takes another screenshot, and the page is still not ready. Or worse, it looks ready but a JavaScript bundle is still initializing, so clicking a button does nothing.
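A fixed sleep cannot tell "looks ready" from "is ready". A hedged alternative is to wait for two signals together: the screen has stopped changing, and a cheap probe says the page responds. The function below is a sketch under those assumptions; `capture` (returns raw screenshot bytes) and `is_interactive` (for example, checking that a target control reports itself as enabled) are hypothetical callables you would supply.

```python
import hashlib
import time
from typing import Callable, Optional


def wait_until_ready(
    capture: Callable[[], bytes],
    is_interactive: Callable[[], bool],
    timeout: float = 10.0,
    quiet_polls: int = 3,
    interval: float = 0.2,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the screen has matched the previous frame for
    `quiet_polls` consecutive polls AND the interactivity probe passes.
    Return False if `timeout` seconds elapse first."""
    deadline = clock() + timeout
    previous: Optional[str] = None
    unchanged = 0
    while clock() < deadline:
        digest = hashlib.sha256(capture()).hexdigest()
        # Count consecutive frames identical to the one before them.
        unchanged = unchanged + 1 if digest == previous else 0
        previous = digest
        if unchanged >= quiet_polls and is_interactive():
            return True
        sleep(interval)
    return False
```

Exact hashing treats any pixel change as "still loading", so an animated spinner keeps the wait going, but so does a blinking cursor. A real implementation would probably compare a cropped region or tolerate small differences.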
Then there are modals. Cookie consent popups, newsletter signup overlays, chat widgets that expand on hover, notification permission requests. Every one of these interrupts the expected flow and the agent needs to recognize and dismiss them before continuing with the actual task.
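A common pattern for this is a guard that runs before every planned step and clears whatever is in the way. The sketch below assumes each kind of interruption can be described by a detector and a dismisser; the `Interruption` type and `clear_interruptions` function are illustrative names, not part of any real library.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Interruption:
    """One kind of overlay the agent knows how to get rid of (hypothetical)."""
    name: str
    is_present: Callable[[], bool]
    dismiss: Callable[[], None]


def clear_interruptions(handlers: List[Interruption], max_passes: int = 5) -> List[str]:
    """Dismiss every detected interruption, re-checking after each pass
    because closing one overlay can reveal another. Returns the names
    dismissed, in order. Raises if overlays keep coming back."""
    dismissed: List[str] = []
    for _ in range(max_passes):
        active = [h for h in handlers if h.is_present()]
        if not active:
            return dismissed
        for handler in active:
            handler.dismiss()
            dismissed.append(handler.name)
    still_open = [h.name for h in handlers if h.is_present()]
    if still_open:
        raise RuntimeError(f"could not clear: {still_open}")
    return dismissed
```

Keeping the handlers in a plain list means a new kind of popup is a new entry rather than a change to the plan, and the bounded number of passes stops the agent from fighting an overlay that reopens forever.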
Why This Is Fundamentally Hard
The core issue is that UIs are designed for humans, who get continuous visual feedback and can react within a fraction of a second. An agent that works in discrete screenshot-action cycles is always operating on stale information. By the time it processes a screenshot, decides what to click, and executes the click, the UI state may have changed.
Retry logic helps but introduces its own problems. How many times do you retry a failed click before concluding the element genuinely is not there? How do you distinguish between "still loading" and "something went wrong"?
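There is no universal answer to these questions, but one workable heuristic is to watch whether the UI is still changing between attempts. If it is, the page is probably still loading and waiting makes sense. If nothing changes across several attempts, the element is probably not coming. The sketch below encodes that heuristic with exponential backoff; `attempt` and `observe` are hypothetical callables, and the thresholds are arbitrary starting points rather than recommended values.

```python
import time
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    STILL_LOADING = "still_loading"  # budget spent, but the UI kept changing
    FAILED = "failed"                # UI stopped changing and the action never worked


def retry_action(
    attempt: Callable[[], bool],
    observe: Callable[[], str],
    max_attempts: int = 5,
    max_stalled: int = 2,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> Outcome:
    """Retry `attempt` with exponential backoff. `observe` returns any
    fingerprint of the UI state (for example, a screenshot hash)."""
    last_state: Optional[str] = None
    stalled = 0
    for n in range(max_attempts):
        if attempt():
            return Outcome.SUCCEEDED
        state = observe()
        # An unchanged fingerprint means no progress since the last try.
        stalled = stalled + 1 if state == last_state else 0
        last_state = state
        if stalled >= max_stalled:
            return Outcome.FAILED
        if n < max_attempts - 1:
            sleep(base_delay * 2 ** n)
    return Outcome.STILL_LOADING
```

Returning three outcomes instead of a boolean lets the caller respond differently: `STILL_LOADING` might justify a longer wait, while `FAILED` should go straight to error recovery.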
The teams making real progress on agent reliability are not improving the planning step. They are building better execution infrastructure: faster screenshot processing, smarter wait strategies, and robust error recovery that does not require re-planning from scratch.
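To make "recovery without re-planning" concrete, here is one possible shape for it: give every step a postcondition, and when a step fails, run a recovery hook and resume from that same step instead of asking the model for a new plan. This is a sketch under assumptions, not a description of any team's system; `Step`, `execute_plan`, and the `recover` hook are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One planned action plus a check that it actually took effect."""
    name: str
    run: Callable[[], None]
    is_done: Callable[[], bool]


def execute_plan(
    steps: List[Step],
    recover: Callable[[Step, Exception], None],
    max_recoveries: int = 3,
) -> List[str]:
    """Run steps in order. On failure, call `recover` and retry the same
    step; steps whose postcondition already holds are skipped."""
    log: List[str] = []
    recoveries = 0
    index = 0
    while index < len(steps):
        step = steps[index]
        if step.is_done():
            log.append(f"skip {step.name}")
            index += 1
            continue
        try:
            step.run()
            if not step.is_done():
                raise RuntimeError(f"postcondition not met: {step.name}")
        except Exception as error:
            recoveries += 1
            if recoveries > max_recoveries:
                raise
            log.append(f"recover {step.name}")
            recover(step, error)  # e.g. dismiss a modal, wait, re-locate
            continue
        log.append(f"done {step.name}")
        index += 1
    return log
```

Because completed steps are detected by their postconditions rather than by bookkeeping, the same loop can also resume a flow that was interrupted partway through.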
Fazm is an open source macOS AI agent, available on GitHub.