How AI Agents Actually See Your Screen - DOM Control vs Screenshots Explained

Matthew Diakonov · 11 min read


Every AI desktop agent needs to answer the same fundamental question: how do I understand what is on the screen right now?

The answer splits the entire industry into two camps. One camp takes screenshots and feeds pixels to a vision model. The other reads the actual structure of the interface - the DOM, the accessibility tree, the real elements behind what you see. This is not an academic distinction. It determines how fast, accurate, reliable, and private your AI agent can be.

Let us walk through both approaches, how they work under the hood, and why the difference matters more than most people realize.

The Screenshot Approach - Looking at Pixels

The screenshot approach works the way you might intuitively expect. The agent takes a picture of your screen, sends it to a multimodal vision model, and asks: "What do you see? Where should I click?"

The model looks at the image and tries to figure out what buttons, text fields, menus, and content exist on screen. It identifies coordinates for where to click or type. Then it executes those actions - typically by simulating mouse movements and keyboard input at specific pixel locations.

This is how OpenAI Operator and Claude Computer Use work. The agent sees exactly what you see - a flat image of pixels.

How it works step by step

  1. Capture a screenshot of the current screen state
  2. Send the image (often 1-3 MB) to a vision-capable LLM
  3. The model interprets the image and identifies UI elements visually
  4. The model outputs pixel coordinates for where to click or what to type
  5. The agent simulates mouse/keyboard input at those coordinates
  6. Take another screenshot to verify the action worked
  7. Repeat
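The loop above can be sketched in a few lines. Note that `capture_screen`, `vision_model`, and `simulate_click` are hypothetical stand-ins here: a real agent would use an OS screen-capture API and a multimodal LLM endpoint in their place.

```python
def capture_screen() -> bytes:
    # Stub: a real capture returns raw image bytes, often 1-3 MB.
    return b"<png bytes>"

def vision_model(image: bytes) -> dict:
    # Stub: a real vision model interprets the image and returns
    # pixel coordinates for the next action.
    return {"action": "click", "x": 640, "y": 360}

def simulate_click(x: int, y: int) -> tuple:
    # Stub: a real agent injects a mouse event at these coordinates.
    return (x, y)

def screenshot_step() -> tuple:
    image = capture_screen()                        # 1. capture screen state
    plan = vision_model(image)                      # 2-4. send image, get coords
    clicked = simulate_click(plan["x"], plan["y"])  # 5. act at pixel location
    capture_screen()                                # 6. re-capture to verify
    return clicked
```

Every iteration ships a full image to the model and acts on raw coordinates, which is where the failure modes below come from.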

Where screenshots struggle

The core problem is that a screenshot is a lossy representation. A 1920x1080 image flattened into pixels loses all semantic information. The model does not know that a particular rectangle is a button, or that some text is editable, or that a dropdown menu has 47 options behind it. It has to guess from visual patterns.

This creates several failure modes:

Resolution ambiguity. Small text, thin borders, and closely spaced elements become hard to distinguish. A submit button next to a cancel button might be only 30 pixels apart. The model clicks the wrong one.

Hidden state. Dropdown menus, tooltips, hover states, off-screen content - none of this exists in a screenshot. The agent cannot see what it cannot see.

Coordinate drift. If the window moves, resizes, or if a notification pushes content down by 20 pixels between taking the screenshot and executing the click, the action hits the wrong target.

Massive token consumption. A single screenshot can consume 1,000 to 3,000 tokens in a vision model's context window. An agent that takes a screenshot before and after every action burns through context rapidly. After 10 actions, you have used 20,000-60,000 tokens just on images - leaving less room for reasoning, memory, and planning.
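The context-window arithmetic is easy to check (a rough estimate; real per-image token counts vary by model, resolution, and encoding):

```python
def screenshot_token_cost(actions: int, tokens_per_shot: int,
                          shots_per_action: int = 2) -> int:
    """Context tokens consumed by screenshots alone,
    assuming a before/after capture for each action."""
    return actions * shots_per_action * tokens_per_shot

# 10 actions at 1,000-3,000 tokens per screenshot:
low = screenshot_token_cost(10, 1000)   # 20,000 tokens
high = screenshot_token_cost(10, 3000)  # 60,000 tokens
```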

The DOM/Accessibility Tree Approach - Reading the Structure

The alternative is to skip the visual layer entirely and read the actual structure of the interface. Operating systems expose this structure through accessibility APIs - the same APIs that screen readers use for visually impaired users.

Instead of seeing pixels, the agent gets a structured tree of elements: buttons with labels, text fields with current values, menus with all their options, checkboxes with their checked/unchecked state. Every element has a type, a name, a position, and its current state.

This is how Fazm works. Rather than asking "what does this look like?" the agent asks "what is this, and what can I do with it?"

How it works step by step

  1. Query the OS accessibility API for the current window's element tree
  2. Parse the structured data - element types, labels, states, positions
  3. Send this lightweight text representation to the LLM
  4. The model selects elements by reference (e.g., "click the Save button") rather than by pixel coordinate
  5. The agent targets the actual element through the accessibility API
  6. Query the tree again to verify the state changed
  7. Repeat
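Selecting by reference rather than coordinate looks roughly like this. The tree shape below is illustrative; real trees come from platform accessibility APIs (AXUIElement on macOS, UI Automation on Windows, AT-SPI on Linux).

```python
# A toy slice of an accessibility tree: typed elements with labels and state.
tree = [
    {"role": "button", "label": "Save", "enabled": True},
    {"role": "textfield", "label": "email", "value": "user@example.com"},
]

def find_element(tree: list, role: str, label: str):
    # Select by identity (role + label), never by pixel coordinate.
    return next(
        (el for el in tree if el["role"] == role and el["label"] == label),
        None,
    )

def press(element: dict) -> str:
    # Stub: a real agent invokes the element's press action via the API.
    return f"pressed {element['label']}"

save = find_element(tree, "button", "Save")
```

Because the element is addressed by identity, the action lands on the same control regardless of where the window happens to sit on screen.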

Why structure beats pixels

Precision. The agent does not guess where a button is. It knows. An accessibility tree element has an exact reference. Clicking "the Save button" through the accessibility API hits the Save button every time, regardless of screen resolution, window position, or font rendering.

Speed. A text representation of a UI tree is typically 2-10 KB. A screenshot is 500 KB to 3 MB. The DOM approach sends 100-500x less data per action. This means faster LLM responses and far more efficient token usage.

Hidden state access. The accessibility tree includes information that is invisible in a screenshot. Dropdown options before they are opened. Whether a checkbox is checked. Whether a text field is editable or disabled. The agent knows the full state, not just what happens to be rendered on screen at that moment.

Reliability across environments. A screenshot-based agent trained on macOS light mode might fail on dark mode, a different display resolution, or a non-English locale. The accessibility tree does not change based on visual theme. A button labeled "Save" has that label regardless of whether it is rendered in light or dark mode, at 1080p or 4K.

Head-to-Head Comparison

| Factor | Screenshot (Pixel-Based) | DOM/Accessibility Tree |
|---|---|---|
| Accuracy | ~85-90% on clear UIs, drops on complex layouts | ~95-99% element targeting precision |
| Speed per action | 2-5 seconds (image capture + vision inference) | 0.3-1 second (tree query + text inference) |
| Tokens per action | 2,000-6,000 (image encoding) | 200-800 (structured text) |
| Hidden state | Cannot see dropdowns, off-screen content, disabled states | Full access to element states and properties |
| Resolution sensitivity | High - small elements become ambiguous | None - elements are referenced by identity |
| Cross-platform consistency | Varies with theme, resolution, locale | Consistent across visual configurations |
| Privacy exposure | Sends full screen pixels to cloud (may include sensitive content) | Sends only element structure and labels |
| Setup complexity | Low - just needs screen capture | Moderate - requires accessibility permissions |

The Privacy Factor Nobody Talks About

Here is something that does not get enough attention: screenshot-based agents send images of your entire screen to a cloud API. Every time the agent takes an action, it captures everything visible - your email, your Slack messages, your bank balance, your medical records, whatever happens to be on screen.

Even if the agent is focused on one application, the screenshot captures the full desktop. Notification banners, background windows, menu bar items - all of it goes to the model provider's servers.

The DOM approach sends structured text describing UI elements. It sends "Button: Save", "TextField: email - value: user@example.com", "Menu: File > Edit > View". This is still data leaving your machine, but it is dramatically less than a full pixel capture of your screen. And it is much easier to filter - you can strip sensitive field values while keeping the structural information the agent needs.
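That filtering step can be a single pass over the tree before anything leaves the machine. A minimal sketch, assuming a hand-picked set of sensitive labels (the set itself is an assumption, not a standard):

```python
# Labels whose values should never leave the machine (assumed examples).
SENSITIVE_LABELS = {"password", "ssn", "card number"}

def redact(elements: list) -> list:
    """Strip values from sensitive fields while keeping the
    structural information the agent needs."""
    out = []
    for el in elements:
        el = dict(el)  # copy so the original tree is untouched
        if el.get("label", "").lower() in SENSITIVE_LABELS and "value" in el:
            el["value"] = "[REDACTED]"
        out.append(el)
    return out

fields = [
    {"role": "textfield", "label": "email", "value": "user@example.com"},
    {"role": "textfield", "label": "Password", "value": "hunter2"},
]
```

A screenshot offers no equivalent hook: once the pixels are captured, the sensitive content is already in the image.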

For any organization handling sensitive data - healthcare, finance, legal, government - this is not a minor consideration. It can be the difference between an AI agent that is deployable and one that compliance will never approve.

The Hybrid Middle Ground

Some agents try to combine both approaches. Simular uses a hybrid model - primarily DOM-based but falling back to screenshots when the accessibility tree is incomplete or when visual context helps with ambiguous situations.

The hybrid approach has theoretical appeal. Use structure when it is available, fall back to vision when it is not. In practice, the complexity of maintaining two perception systems and deciding when to switch between them adds engineering overhead and introduces new failure modes at the boundaries.

The more practical question is: how good does the primary approach need to be? If the DOM/accessibility tree covers 95%+ of interactions, the remaining edge cases might be better handled by improving tree coverage rather than bolting on a second perception system.

Real-World Performance Differences

Consider a concrete workflow: logging into a web app, navigating to settings, changing a configuration value, and saving.

Screenshot-based agent:

  • Screenshot 1: Identify login form (2-4 seconds)
  • Type username and password (with coordinate-based targeting)
  • Screenshot 2: Verify login succeeded (2-4 seconds)
  • Screenshot 3: Find settings link (2-4 seconds)
  • Click settings (coordinate-based)
  • Screenshot 4: Verify settings page loaded (2-4 seconds)
  • Screenshot 5: Find the right configuration field (2-4 seconds)
  • Modify the value (coordinate-based)
  • Screenshot 6: Find and click Save (2-4 seconds)
  • Screenshot 7: Verify save succeeded (2-4 seconds)
  • Total: 7 screenshots, ~14-28 seconds of perception overhead, ~7,000-21,000 tokens on images alone

DOM-based agent:

  • Read element tree: Identify login form fields by type and label (0.3-1 second)
  • Type credentials into identified fields (direct element targeting)
  • Read tree: Verify navigation state changed (0.3-1 second)
  • Read tree: Find settings element by label (0.3-1 second)
  • Click settings element directly
  • Read tree: Find configuration field by name (0.3-1 second)
  • Modify value in identified field
  • Read tree: Find Save by label, click by element reference (0.3-1 second)
  • Read tree: Verify state change (0.3-1 second)
  • Total: 6 tree reads, ~1.8-6 seconds of perception overhead, ~1,200-4,800 tokens on structure

That is a 3-5x speed difference and roughly a 5-10x token efficiency difference. Over hundreds of actions per day, this compounds enormously.

When Screenshots Still Win

To be fair, there are cases where the screenshot approach has advantages:

Custom-rendered UIs. Some applications render their interfaces using custom drawing (games, design tools, some Electron apps with heavy custom rendering). These may not expose full accessibility trees, making the visual approach the only option.

Visual verification tasks. If the agent needs to confirm that a chart looks correct, or that an image uploaded properly, or that a PDF rendered correctly - these are inherently visual tasks where pixels carry the information.

Universal applicability. Every application can be screenshotted. Not every application has a complete accessibility tree. The screenshot approach works everywhere, even if it works less precisely.

Which Approach Is Right for You?

If you are evaluating AI desktop agents, here is how to think about this:

Choose a DOM/accessibility tree agent (like Fazm) if:

  • You need high reliability for business-critical workflows
  • Speed matters - you are running hundreds of automated actions per day
  • You handle sensitive data and need to minimize what leaves your machine
  • You work primarily with standard applications (browsers, productivity tools, business software)
  • Token costs and API efficiency are a concern

Choose a screenshot-based agent (like Operator or Claude Computer Use) if:

  • You need to work with heavily custom-rendered applications
  • Your tasks are primarily visual in nature (design review, visual QA)
  • You need the broadest possible application compatibility and are willing to trade speed for coverage

Consider a hybrid if:

  • You work across both standard and custom applications regularly
  • You can tolerate additional complexity in exchange for broader coverage

For most business automation use cases - the kind where you are automating repetitive workflows across standard software - the DOM approach is strictly better. It is faster, more accurate, more private, and more token-efficient. The screenshot approach is a brute-force solution to a problem that has a more elegant answer.

The real question is not which approach is theoretically better. It is which one works reliably for your specific workflows. If you want to see the difference firsthand, try Fazm on a workflow you care about. The speed and accuracy difference is hard to appreciate in theory - but obvious the moment you see it in practice.


Want to dive deeper into how AI agents interact with your computer? Read our guides on what computer use AI actually is and how accessibility APIs compare to screenshot-based control.
