The Hardest Part of Building AI Agents Is Execution, Not Planning

Matthew Diakonov

Updated March 19, 2026

Give a modern LLM a task description and it will produce a solid plan. "To book a flight, I need to open the airline website, enter the departure city, select dates, choose a flight, fill in passenger details, and complete payment." The plan is correct. The model knows what to do.

The problem is doing it.

Where Execution Breaks

Browser and UI interaction is where everything falls apart. A button that was at coordinates (450, 320) when the agent took its screenshot has moved to (450, 380) because a banner loaded. The agent clicks, hits the wrong element, and the entire flow derails.
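One way to reduce this class of failure is to treat screenshot coordinates as a hint and re-resolve the target immediately before clicking. The following is a minimal Python sketch of that idea, not a real automation API: `FakePage`, `locate`, `click_at`, and `click_fresh` are hypothetical names standing in for whatever your automation layer provides.

```python
from dataclasses import dataclass, field


@dataclass
class FakePage:
    """Hypothetical stand-in for a live UI: maps element names to (x, y)."""
    elements: dict = field(default_factory=dict)
    clicks: list = field(default_factory=list)

    def locate(self, name):
        # Return the element's current position, or None if it is absent.
        return self.elements.get(name)

    def click_at(self, x, y):
        # Record which element (if any) sits at the clicked point.
        hit = next((n for n, pos in self.elements.items() if pos == (x, y)), None)
        self.clicks.append(hit)
        return hit


def click_fresh(page, name, planned_xy):
    """Click `name`, re-resolving its position right before the click.

    `planned_xy` is where the element was in the screenshot the agent
    planned from; it is only used to report whether the layout shifted.
    """
    current = page.locate(name)
    if current is None:
        raise LookupError(f"{name!r} is not on the page")
    shifted = current != planned_xy
    hit = page.click_at(*current)
    return hit, shifted


page = FakePage(elements={"submit": (450, 320)})
# A banner loads after the screenshot and pushes the button down by 60px.
page.elements["banner"] = (450, 320)
page.elements["submit"] = (450, 380)

hit, shifted = click_fresh(page, "submit", planned_xy=(450, 320))
print(hit, shifted)
```

Clicking the planned coordinates here would have hit the banner. Looking the element up again narrows the window in which the layout can change, but it does not close it, so the `shifted` flag is worth logging.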

Page load timing is another constant source of failure. The agent sees a loading spinner, waits what it thinks is long enough, takes another screenshot, and the page is still not ready. Or worse, it looks ready, but a JavaScript bundle is still initializing, so clicking a button does nothing.
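A fixed sleep is the usual culprit here. A common alternative is to poll a readiness check and only proceed once it has held for several consecutive polls, with a hard timeout. The sketch below assumes a caller-supplied `probe` function (hypothetical) that returns True when the page looks ready; the function name and defaults are illustrative.

```python
import time


def wait_until_stable(probe, *, stable_for=3, interval=0.2, timeout=10.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll `probe()` until it returns True `stable_for` times in a row.

    Returns True once the page has looked ready for `stable_for`
    consecutive polls, or False if `timeout` seconds pass first.
    """
    deadline = clock() + timeout
    streak = 0
    while True:
        # Any non-ready reading resets the streak.
        streak = streak + 1 if probe() else 0
        if streak >= stable_for:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Simulated readings: the page flickers to "ready" once before settling.
readings = iter([False, True, False, True, True, True])
ready = wait_until_stable(lambda: next(readings), sleep=lambda _: None)
print(ready)
```

Requiring a streak filters out pages that look ready for a moment and then change again. It does not by itself catch the case where the page looks ready but is not yet interactive. For that, the probe would need to check something other than appearance, such as whether a click produces an observable change.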

Then there are modals: cookie consent popups, newsletter signup overlays, chat widgets that expand on hover, notification permission requests. Every one of these interrupts the expected flow, and the agent needs to recognize and dismiss them before continuing with the actual task.
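One possible structure is to run an overlay-clearing pass before every step of the actual task. The sketch below is an assumption-heavy illustration: the overlay kinds, the `DISMISS_ACTIONS` table, and the `detect_overlay` and `perform` callbacks are all hypothetical and would be backed by the agent's own perception and action layer.

```python
# Hypothetical overlay kinds mapped to the action that closes them.
DISMISS_ACTIONS = {
    "cookie_consent": "click 'Reject all'",
    "newsletter_signup": "click close button",
    "notification_prompt": "click 'Block'",
}


def clear_overlays(detect_overlay, perform, max_rounds=5):
    """Dismiss known overlays until none is detected.

    `detect_overlay()` returns the kind of the topmost overlay or None.
    `perform(action)` executes a dismiss action. Both are supplied by
    the caller and are assumptions of this sketch.
    Returns the list of overlay kinds that were dismissed.
    """
    dismissed = []
    while (kind := detect_overlay()) is not None:
        if len(dismissed) >= max_rounds:
            raise RuntimeError("overlays kept reappearing")
        action = DISMISS_ACTIONS.get(kind)
        if action is None:
            raise RuntimeError(f"unknown overlay: {kind}")
        perform(action)
        dismissed.append(kind)
    return dismissed


# Simulated page with two stacked overlays, topmost first.
stack = ["cookie_consent", "newsletter_signup"]
log = []


def detect():
    return stack[0] if stack else None


def perform(action):
    log.append(action)
    stack.pop(0)


cleared = clear_overlays(detect, perform)
print(cleared)
```

The `max_rounds` bound matters: an overlay that reappears after every dismissal would otherwise trap the agent in a loop instead of surfacing an error.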

Why This Is Fundamentally Hard

The core issue is that UIs are designed for humans, who get continuous visual feedback and react within a fraction of a second. An agent that works in discrete screenshot-action cycles is always operating on stale information. By the time it processes a screenshot, decides what to click, and executes the click, the UI state may have changed.

Retry logic helps but introduces its own problems. How many times do you retry a failed click before concluding the element genuinely is not there? How do you distinguish between "still loading" and "something went wrong"?
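There is no universal answer to these questions, but one heuristic is to look at whether the page is still changing between attempts. The sketch below assumes each poll yields a pair of (element present, page fingerprint), where the fingerprint is some hypothetical hash of the visible state. The heuristic itself is an illustration, not an established rule.

```python
from enum import Enum


class Outcome(Enum):
    FOUND = "found"
    STILL_LOADING = "still_loading"
    FAILED = "failed"


def classify_attempts(observations, patience=3):
    """Classify a sequence of (element_present, page_fingerprint) polls.

    Heuristic (an assumption of this sketch): if the element is absent
    but the page fingerprint keeps changing, the page is still loading;
    if the fingerprint is unchanged for `patience` consecutive polls,
    the page has settled without the element, so treat it as a failure.
    """
    unchanged = 0
    previous = None
    for present, fingerprint in observations:
        if present:
            return Outcome.FOUND
        unchanged = unchanged + 1 if fingerprint == previous else 0
        previous = fingerprint
        if unchanged >= patience:
            return Outcome.FAILED
    return Outcome.STILL_LOADING


changing = [(False, "a"), (False, "b"), (False, "c")]
settled = [(False, "a"), (False, "a"), (False, "a"), (False, "a")]
appeared = [(False, "a"), (False, "b"), (True, "c")]

print(classify_attempts(changing).value)
print(classify_attempts(settled).value)
print(classify_attempts(appeared).value)
```

A page with an animated background or a ticking clock would never settle under this check, so in practice the fingerprint would need to ignore regions that change continuously.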

The teams making real progress on agent reliability are not improving the planning step. They are building better execution infrastructure: faster screenshot processing, smarter wait strategies, and robust error recovery that does not require re-planning from scratch.
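As a rough illustration of recovery without re-planning, the sketch below retries a failing step in place and leaves completed steps alone. The `run_plan` function and the step names are hypothetical, and it assumes each step is safe to repeat.

```python
def run_plan(steps, max_attempts=3):
    """Run `steps` in order, retrying a failing step in place.

    Each step is a (name, callable) pair. A failure retries only that
    step; completed steps are not re-run and the plan is not rebuilt.
    Returns a log of (name, attempts_used) for the completed steps.
    """
    log = []
    for name, action in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                action()
            except Exception:
                # Broad catch for brevity; real code would catch
                # only errors known to be transient.
                if attempt == max_attempts:
                    raise
            else:
                log.append((name, attempt))
                break
    return log


calls = {"fill_form": 0}


def open_site():
    pass


def fill_form():
    # Simulated flaky step: fails twice, then succeeds.
    calls["fill_form"] += 1
    if calls["fill_form"] < 3:
        raise TimeoutError("form not interactive yet")


result = run_plan([("open_site", open_site), ("fill_form", fill_form)])
print(result)
```

The attempt counts in the returned log show which steps are flaky, which is useful input for deciding where better wait strategies are needed.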

This post was inspired by a discussion on r/AI_Agents by u/Beneficial-Cut6585.
