How AI Agents Actually See Your Screen - DOM Control vs Screenshots Explained

Matthew Diakonov · 11 min read


Every AI desktop agent needs to answer the same fundamental question: how do I understand what is on the screen right now?

The answer splits the entire industry into two camps. One camp takes screenshots and feeds pixels to a vision model. The other reads the actual structure of the interface - the DOM, the accessibility tree, the real elements behind what you see. This is not an academic distinction. It determines how fast, accurate, reliable, and private your AI agent can be.

Let us walk through both approaches, how they work under the hood, and why the difference matters more than most people realize.

The Screenshot Approach - Looking at Pixels

The screenshot approach works the way you might intuitively expect. The agent takes a picture of your screen, sends it to a multimodal vision model, and asks: "What do you see? Where should I click?"

The model looks at the image and tries to figure out what buttons, text fields, menus, and content exist on screen. It identifies coordinates for where to click or type. Then it executes those actions - typically by simulating mouse movements and keyboard input at specific pixel locations.

This is how OpenAI Operator and Claude Computer Use work. The agent sees exactly what you see - a flat image of pixels.

How it works step by step

  1. Capture a screenshot of the current screen state
  2. Send the image (often 1-3 MB) to a vision-capable LLM
  3. The model interprets the image and identifies UI elements visually
  4. The model outputs pixel coordinates for where to click or what to type
  5. The agent simulates mouse/keyboard input at those coordinates
  6. Take another screenshot to verify the action worked
  7. Repeat
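The loop above can be sketched in a few lines. Note that `capture_screen`, `vision_model`, and `simulate_click` are hypothetical stand-ins here: a real agent would use an OS screen-capture API and a multimodal LLM endpoint in their place.

```python
def capture_screen() -> bytes:
    # Stub: a real capture returns raw image bytes, often 1-3 MB.
    return b"<png bytes>"

def vision_model(image: bytes) -> dict:
    # Stub: a real vision model interprets the image and returns
    # pixel coordinates for the next action.
    return {"action": "click", "x": 640, "y": 360}

def simulate_click(x: int, y: int) -> tuple:
    # Stub: a real agent injects a mouse event at these coordinates.
    return (x, y)

def screenshot_step() -> tuple:
    image = capture_screen()                        # 1. capture screen state
    plan = vision_model(image)                      # 2-4. send image, get coords
    clicked = simulate_click(plan["x"], plan["y"])  # 5. act at pixel location
    capture_screen()                                # 6. re-capture to verify
    return clicked
```

Every iteration ships a full image to the model and acts on raw coordinates, which is where the failure modes below come from.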

Where screenshots struggle

The core problem is that a screenshot is a lossy representation. A 1920x1080 image flattened into pixels loses all semantic information. The model does not know that a particular rectangle is a button, or that some text is editable, or that a dropdown menu has 47 options behind it. It has to guess from visual patterns.

This creates several failure modes:

Resolution ambiguity. Small text, thin borders, and closely spaced elements become hard to distinguish. A submit button next to a cancel button might be only 30 pixels apart. The model clicks the wrong one.

Hidden state. Dropdown menus, tooltips, hover states, off-screen content - none of this exists in a screenshot. The agent cannot see what it cannot see.

Coordinate drift. If the window moves, resizes, or if a notification pushes content down by 20 pixels between taking the screenshot and executing the click, the action hits the wrong target.

Massive token consumption. A single screenshot can consume 1,000 to 3,000 tokens in a vision model's context window. An agent that takes a screenshot before and after every action burns through context rapidly. After 10 actions, you have used 20,000-60,000 tokens just on images - leaving less room for reasoning, memory, and planning.
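The context-window arithmetic is easy to check (a rough estimate; real per-image token counts vary by model, resolution, and encoding):

```python
def screenshot_token_cost(actions: int, tokens_per_shot: int,
                          shots_per_action: int = 2) -> int:
    """Context tokens consumed by screenshots alone,
    assuming a before/after capture for each action."""
    return actions * shots_per_action * tokens_per_shot

# 10 actions at 1,000-3,000 tokens per screenshot:
low = screenshot_token_cost(10, 1000)   # 20,000 tokens
high = screenshot_token_cost(10, 3000)  # 60,000 tokens
```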

The DOM/Accessibility Tree Approach - Reading the Structure

The alternative is to skip the visual layer entirely and read the actual structure of the interface. Operating systems expose this structure through accessibility APIs - the same APIs that screen readers use for visually impaired users.

Instead of seeing pixels, the agent gets a structured tree of elements: buttons with labels, text fields with current values, menus with all their options, checkboxes with their checked/unchecked state. Every element has a type, a name, a position, and its current state.

This is how Fazm works. Rather than asking "what does this look like?" the agent asks "what is this, and what can I do with it?"

How it works step by step

  1. Query the OS accessibility API for the current window's element tree
  2. Parse the structured data - element types, labels, states, positions
  3. Send this lightweight text representation to the LLM
  4. The model selects elements by reference (e.g., "click the Save button") rather than by pixel coordinate
  5. The agent targets the actual element through the accessibility API
  6. Query the tree again to verify the state changed
  7. Repeat
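Selecting by reference rather than coordinate looks roughly like this. The tree shape below is illustrative; real trees come from platform accessibility APIs (AXUIElement on macOS, UI Automation on Windows, AT-SPI on Linux).

```python
# A toy slice of an accessibility tree: typed elements with labels and state.
tree = [
    {"role": "button", "label": "Save", "enabled": True},
    {"role": "textfield", "label": "email", "value": "user@example.com"},
]

def find_element(tree: list, role: str, label: str):
    # Select by identity (role + label), never by pixel coordinate.
    return next(
        (el for el in tree if el["role"] == role and el["label"] == label),
        None,
    )

def press(element: dict) -> str:
    # Stub: a real agent invokes the element's press action via the API.
    return f"pressed {element['label']}"

save = find_element(tree, "button", "Save")
```

Because the element is addressed by identity, the action lands on the same control regardless of where the window happens to sit on screen.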

Why structure beats pixels

Precision. The agent does not guess where a button is. It knows. An accessibility tree element has an exact reference. Clicking "the Save button" through the accessibility API hits the Save button every time, regardless of screen resolution, window position, or font rendering.

Speed. A text representation of a UI tree is typically 2-10 KB. A screenshot is 500 KB to 3 MB. The DOM approach sends 100-500x less data per action. This means faster LLM responses and far more efficient token usage.

Hidden state access. The accessibility tree includes information that is invisible in a screenshot. Dropdown options before they are opened. Whether a checkbox is checked. Whether a text field is editable or disabled. The agent knows the full state, not just what happens to be rendered on screen at that moment.

Reliability across environments. A screenshot-based agent trained on macOS light mode might fail on dark mode, a different display resolution, or a non-English locale. The accessibility tree does not change based on visual theme. A button labeled "Save" has that label regardless of whether it is rendered in light or dark mode, at 1080p or 4K.

Head-to-Head Comparison

| Factor | Screenshot (Pixel-Based) | DOM/Accessibility Tree |
|---|---|---|
| Accuracy | ~85-90% on clear UIs, drops on complex layouts | ~95-99% element targeting precision |
| Speed per action | 2-5 seconds (image capture + vision inference) | 0.3-1 second (tree query + text inference) |
| Tokens per action | 2,000-6,000 (image encoding) | 200-800 (structured text) |
| Hidden state | Cannot see dropdowns, off-screen content, disabled states | Full access to element states and properties |
| Resolution sensitivity | High - small elements become ambiguous | None - elements are referenced by identity |
| Cross-platform consistency | Varies with theme, resolution, locale | Consistent across visual configurations |
| Privacy exposure | Sends full screen pixels to cloud (may include sensitive content) | Sends only element structure and labels |
| Setup complexity | Low - just needs screen capture | Moderate - requires accessibility permissions |

The Privacy Factor Nobody Talks About

Here is something that does not get enough attention: screenshot-based agents send images of your entire screen to a cloud API. Every time the agent takes an action, it captures everything visible - your email, your Slack messages, your bank balance, your medical records, whatever happens to be on screen.

Even if the agent is focused on one application, the screenshot captures the full desktop. Notification banners, background windows, menu bar items - all of it goes to the model provider's servers.

The DOM approach sends structured text describing UI elements. It sends "Button: Save", "TextField: email - value: user@example.com", "Menu: File > Edit > View". This is still data leaving your machine, but it is dramatically less than a full pixel capture of your screen. And it is much easier to filter - you can strip sensitive field values while keeping the structural information the agent needs.
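That filtering step can be a single pass over the tree before anything leaves the machine. A minimal sketch, assuming a hand-picked set of sensitive labels (the set itself is an assumption, not a standard):

```python
# Labels whose values should never leave the machine (assumed examples).
SENSITIVE_LABELS = {"password", "ssn", "card number"}

def redact(elements: list) -> list:
    """Strip values from sensitive fields while keeping the
    structural information the agent needs."""
    out = []
    for el in elements:
        el = dict(el)  # copy so the original tree is untouched
        if el.get("label", "").lower() in SENSITIVE_LABELS and "value" in el:
            el["value"] = "[REDACTED]"
        out.append(el)
    return out

fields = [
    {"role": "textfield", "label": "email", "value": "user@example.com"},
    {"role": "textfield", "label": "Password", "value": "hunter2"},
]
```

A screenshot offers no equivalent hook: once the pixels are captured, the sensitive content is already in the image.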

For any organization handling sensitive data - healthcare, finance, legal, government - this is not a minor consideration. It can be the difference between an AI agent that is deployable and one that compliance will never approve.

The Hybrid Middle Ground

Some agents try to combine both approaches. Simular uses a hybrid model - primarily DOM-based but falling back to screenshots when the accessibility tree is incomplete or when visual context helps with ambiguous situations.

The hybrid approach has theoretical appeal. Use structure when it is available, fall back to vision when it is not. In practice, the complexity of maintaining two perception systems and deciding when to switch between them adds engineering overhead and introduces new failure modes at the boundaries.

The more practical question is: how good does the primary approach need to be? If the DOM/accessibility tree covers 95%+ of interactions, the remaining edge cases might be better handled by improving tree coverage rather than bolting on a second perception system.

Real-World Performance Differences

Consider a concrete workflow: logging into a web app, navigating to settings, changing a configuration value, and saving.

Screenshot-based agent:

  • Screenshot 1: Identify login form (2-4 seconds)
  • Type username and password (with coordinate-based targeting)
  • Screenshot 2: Verify login succeeded (2-4 seconds)
  • Screenshot 3: Find settings link (2-4 seconds)
  • Click settings (coordinate-based)
  • Screenshot 4: Verify settings page loaded (2-4 seconds)
  • Screenshot 5: Find the right configuration field (2-4 seconds)
  • Modify the value (coordinate-based)
  • Screenshot 6: Find and click Save (2-4 seconds)
  • Screenshot 7: Verify save succeeded (2-4 seconds)
  • Total: 7 screenshots, ~14-28 seconds of perception overhead, ~7,000-21,000 tokens on images alone

DOM-based agent:

  • Read element tree: Identify login form fields by type and label (0.3-1 second)
  • Type credentials into identified fields (direct element targeting)
  • Read tree: Verify navigation state changed (0.3-1 second)
  • Read tree: Find settings element by label (0.3-1 second)
  • Click settings element directly
  • Read tree: Find configuration field by name (0.3-1 second)
  • Modify value in identified field
  • Read tree: Find Save by label, click by element reference (0.3-1 second)
  • Read tree: Verify state change (0.3-1 second)
  • Total: 6 tree reads, ~1.8-6 seconds of perception overhead, ~1,200-4,800 tokens on structure

That is a 3-5x speed difference and roughly a 5-10x token efficiency difference. Over hundreds of actions per day, this compounds enormously.

When Screenshots Still Win

To be fair, there are cases where the screenshot approach has advantages:

Custom-rendered UIs. Some applications render their interfaces using custom drawing (games, design tools, some Electron apps with heavy custom rendering). These may not expose full accessibility trees, making the visual approach the only option.

Visual verification tasks. If the agent needs to confirm that a chart looks correct, or that an image uploaded properly, or that a PDF rendered correctly - these are inherently visual tasks where pixels carry the information.

Universal applicability. Every application can be screenshotted. Not every application has a complete accessibility tree. The screenshot approach works everywhere, even if it works less precisely.

Which Approach Is Right for You?

If you are evaluating AI desktop agents, here is how to think about this:

Choose a DOM/accessibility tree agent (like Fazm) if:

  • You need high reliability for business-critical workflows
  • Speed matters - you are running hundreds of automated actions per day
  • You handle sensitive data and need to minimize what leaves your machine
  • You work primarily with standard applications (browsers, productivity tools, business software)
  • Token costs and API efficiency are a concern

Choose a screenshot-based agent (like Operator or Claude Computer Use) if:

  • You need to work with heavily custom-rendered applications
  • Your tasks are primarily visual in nature (design review, visual QA)
  • You need the broadest possible application compatibility and are willing to trade speed for coverage

Consider a hybrid if:

  • You work across both standard and custom applications regularly
  • You can tolerate additional complexity in exchange for broader coverage

For most business automation use cases - the kind where you are automating repetitive workflows across standard software - the DOM approach is strictly better. It is faster, more accurate, more private, and more token-efficient. The screenshot approach is a brute-force solution to a problem that has a more elegant answer.

The real question is not which approach is theoretically better. It is which one works reliably for your specific workflows. If you want to see the difference firsthand, try Fazm on a workflow you care about. The speed and accuracy difference is hard to appreciate in theory - but obvious the moment you see it in practice.


Want to dive deeper into how AI agents interact with your computer? Read our guides on what computer use AI actually is and how accessibility APIs compare to screenshot-based control.
