
DOM Manipulation vs Screenshots for Browser Automation Agents

Fazm Team · 2 min read
dom-manipulation · screenshot · browser-automation · speed · reliability

DOM Manipulation vs Screenshots

There are two approaches to browser automation with AI agents: the screenshot loop and direct DOM manipulation. For standard web tasks, DOM manipulation is dramatically faster and more reliable.

The Screenshot Loop

The screenshot approach works like this: take a screenshot, send it to a vision model, get back coordinates of where to click, execute the click, take another screenshot, repeat.
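As a minimal sketch, the loop can be written as a function that takes its three moving parts as parameters. Every name here (`run_screenshot_loop`, `capture`, `locate`, `click`) is a hypothetical stand-in rather than any real library's API, and the fakes at the bottom exist only so the sketch runs without a browser or a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical types for illustration only.
Screenshot = bytes
Point = Tuple[int, int]


@dataclass
class LoopResult:
    clicks: List[Point]  # coordinates that were clicked, in order
    steps: int           # number of screenshot + model round trips


def run_screenshot_loop(
    capture: Callable[[], Screenshot],
    locate: Callable[[Screenshot, str], Optional[Point]],
    click: Callable[[Point], None],
    goal: str,
    max_steps: int = 10,
) -> LoopResult:
    """Repeat capture -> vision model -> click until the model returns None."""
    clicks: List[Point] = []
    for step in range(1, max_steps + 1):
        image = capture()             # 1. take a screenshot
        target = locate(image, goal)  # 2. model returns coordinates, or None when done
        if target is None:
            return LoopResult(clicks, step)
        click(target)                 # 3. click at those pixel coordinates
        clicks.append(target)
    return LoopResult(clicks, max_steps)


# Fake components: the "model" replays three planned clicks, then reports done.
planned = iter([(120, 80), (120, 140), (450, 320), None])
result = run_screenshot_loop(
    capture=lambda: b"fake-png",
    locate=lambda image, goal: next(planned),
    click=lambda point: None,
    goal="fill in and submit the form",
)
print(result.steps, result.clicks)
```

Note that every click costs one full capture and one full model call, and the agent needs one extra round trip just to learn that it is finished.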

Every step in this loop is slow. Capturing a screenshot takes 100 to 500 milliseconds. Sending it to a vision model and getting a response takes 1 to 3 seconds. The vision model sometimes misidentifies elements or returns wrong coordinates. A single form fill that takes a human 10 seconds can take a screenshot-based agent 30 to 60 seconds.

It is also fragile. If the page layout shifts slightly, a popup appears, or the resolution changes, the coordinates no longer match the page and the agent clicks the wrong thing.
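To make the failure mode concrete, here is a small simulation under assumed numbers: two buttons with made-up bounding boxes, and a 50-pixel banner that appears between the screenshot and the click. The `Box`, `element_at`, and `shift_down` names are illustrative, not part of any browser API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    name: str
    x: int
    y: int
    width: int
    height: int

    def contains(self, px: int, py: int) -> bool:
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def element_at(layout: List[Box], px: int, py: int) -> Optional[str]:
    """Return the name of the first element whose box contains the point."""
    for box in layout:
        if box.contains(px, py):
            return box.name
    return None


def shift_down(layout: List[Box], dy: int) -> List[Box]:
    """Simulate a banner or popup pushing every element down by dy pixels."""
    return [Box(b.name, b.x, b.y + dy, b.width, b.height) for b in layout]


# Layout as it looked in the screenshot the model saw (invented values).
before = [
    Box("Cancel", 400, 260, 100, 40),
    Box("Submit", 400, 310, 100, 40),
]
click = (450, 320)  # coordinates the model picked for "Submit"

# A 50px banner appears before the click lands.
after = shift_down(before, 50)

print(element_at(before, *click))  # Submit
print(element_at(after, *click))   # Cancel
```

The coordinates were correct for the screenshot and wrong for the page, and nothing in the click itself signals the mismatch.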

Direct DOM Manipulation

The DOM approach is different. The agent gets the page's accessibility tree or DOM structure as text. It sees every element (buttons, inputs, links, text) along with its properties and state. It knows exactly what exists on the page without needing to interpret an image.

Clicking a button is not "click at coordinates (450, 320)." It is "click the element with ref e15" or "click the Submit button." Because the agent targets the element itself rather than a pixel location, a layout shift does not send the click to the wrong place.
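Here is a toy model of ref-based targeting, assuming a flat list of nodes and a made-up text format for the snapshot. The `Node` and `Page` classes are hypothetical and only mimic the idea; a real agent would get the tree from the browser or an automation library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    ref: str         # handle such as "e15" (format assumed for illustration)
    role: str        # "button", "textbox", "link", ...
    name: str        # accessible name
    value: str = ""
    clicks: int = 0


class Page:
    def __init__(self, nodes: List[Node]) -> None:
        self.nodes = nodes

    def snapshot(self) -> str:
        """Serialize the tree as text that an agent can read."""
        return "\n".join(f'[{n.ref}] {n.role} "{n.name}"' for n in self.nodes)

    def find(self, ref: str) -> Node:
        for node in self.nodes:
            if node.ref == ref:
                return node
        # An unknown ref fails loudly instead of clicking something else.
        raise LookupError(f"no element with ref {ref!r}")

    def fill(self, ref: str, text: str) -> None:
        self.find(ref).value = text

    def click(self, ref: str) -> None:
        self.find(ref).clicks += 1


page = Page([
    Node("e12", "textbox", "Email"),
    Node("e15", "button", "Submit"),
])

print(page.snapshot())
page.fill("e12", "user@example.com")
page.click("e15")
```

The snapshot is two short lines of text instead of an image, and an action on a ref that no longer exists raises an error rather than landing on a neighboring element.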

Speed Comparison

A typical form-fill workflow:

  • Screenshot loop: 4 screenshots, 4 vision model calls, 4 click actions = 15 to 25 seconds
  • DOM manipulation: read accessibility tree once, execute 4 actions = 2 to 4 seconds

That is roughly a 5 to 10x speed difference on every interaction.
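A quick check of the arithmetic, using only the estimated ranges from the comparison above. Pairing every low and high value against each other is an assumption made here to show the spread, not a measurement.

```python
# Estimated totals in seconds, taken from the comparison above.
screenshot_loop = (15.0, 25.0)
dom_manipulation = (2.0, 4.0)

# Every combination of low/high estimates, sorted from smallest to largest ratio.
ratios = sorted(s / d for s in screenshot_loop for d in dom_manipulation)
print(ratios)  # [3.75, 6.25, 7.5, 12.5]

# Like-for-like pairs: best case vs best case, worst case vs worst case.
best_case = screenshot_loop[0] / dom_manipulation[0]
worst_case = screenshot_loop[1] / dom_manipulation[1]
print(best_case, worst_case)  # 7.5 6.25
```

The extremes run from about 4x to about 12x, and the like-for-like pairs sit at 6.25x and 7.5x, so "5 to 10x" is a fair rounded summary of these estimates.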

When Screenshots Still Make Sense

Screenshots are useful for visual verification: confirming that a page looks right after actions are complete. They are also necessary for canvas-based applications, where the interface is drawn as pixels and there are no individual elements in the DOM to inspect.

But for standard web automation (filling forms, clicking buttons, reading text, navigating pages), DOM manipulation wins on speed, reliability, and accuracy.
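One way to combine the two is a simple rule: default to DOM actions, fall back to the screenshot loop only when there is nothing in the tree to act on, and optionally finish with a single screenshot for verification. The sketch below is an assumption about how such a rule might look; `PageInfo`, `plan_steps`, and the step names are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageInfo:
    interactive_nodes: int  # buttons, inputs, links found in the accessibility tree
    canvas_only: bool       # UI is drawn on a canvas, so the tree has nothing to target


def plan_steps(page: PageInfo, verify_visually: bool = False) -> List[str]:
    """Pick an automation strategy for a page (hypothetical step names)."""
    if page.canvas_only or page.interactive_nodes == 0:
        steps = ["screenshot_loop"]   # no elements to target, so fall back to pixels
    else:
        steps = ["dom_actions"]       # the default for standard web pages
    if verify_visually:
        steps.append("final_screenshot_check")  # one screenshot at the end, not per step
    return steps


print(plan_steps(PageInfo(interactive_nodes=12, canvas_only=False), verify_visually=True))
print(plan_steps(PageInfo(interactive_nodes=0, canvas_only=True)))
```

Even with verification turned on, the standard path takes one screenshot at the end instead of one per action.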

Fazm is an open source macOS AI agent, available on GitHub.
