
How AI Agents Actually See Your Screen: DOM Control vs Screenshots Explained

Fazm Team · 17 min read
technical · ai-agents · dom-control · explainer


AI agents that can control your computer are no longer a research demo. They are real products you can download and use today. ChatGPT Atlas browses the web for you. Anthropic's Claude can operate a virtual desktop. Open-source tools like Fazm take voice commands and execute real actions on your Mac.

But here is a question most people never think to ask: how does the agent actually see what is on your screen?

This is not a philosophical question. It is a deeply practical one. The approach an AI agent uses to perceive and interact with your computer affects everything - how fast it moves, how often it makes mistakes, how much it costs to run, and whether your screen content gets sent to a cloud server.

There are two fundamentally different approaches, and understanding them will change how you evaluate any AI agent.

The Two Approaches at a Glance

Think of it this way. You are standing outside a building and you need to find Room 204.

Approach 1: Take a photo of the building. You snap a picture, hand it to someone, and ask them to figure out where Room 204 might be based on what the building looks like from the outside. They squint at the photo, make their best guess about the floor plan, and point you toward a window that they think is roughly the right area.

Approach 2: Read the blueprint. You pull up the building's floor plan. Room 204 is on the second floor, third door on the left. You walk straight there.

That is the core difference between how AI agents interact with your screen. Some agents look at a photo (screenshot) and guess. Others read the blueprint (the DOM or accessibility tree) and know exactly what is there.

Let's dig into how each one works.

How Screenshot-Based Agents Work

The screenshot approach - sometimes called the "vision" approach or "pixel-based" approach - is the most common method used by major AI agents today. Here is what happens every time you ask a screenshot-based agent to do something.

Step 1: Capture a screenshot

The agent takes a screenshot of your entire screen or the active window. This produces an image file - typically a PNG or JPEG - that represents exactly what you see on your monitor at that moment.

Step 2: Send the image to a vision model

That screenshot gets uploaded to a large language model with vision capabilities - models like GPT-4V, Claude Vision, or Gemini. The image is encoded and transmitted to a cloud server where the model runs.

Step 3: The model analyzes the image

The vision model "looks" at the screenshot the way a human would look at a photo. It identifies visual elements - buttons, text fields, menus, links, icons, text content. It reads the words on screen and tries to understand the layout and what each element does.

Step 4: The model outputs coordinates

Based on its analysis, the model decides what to do next. If it needs to click a button, it outputs pixel coordinates - something like "click at position x=342, y=518." If it needs to type, it might say "click the text field at x=200, y=400, then type 'hello world'."

Step 5: The agent executes the action

The agent takes those coordinates and physically moves the mouse cursor to that position on screen, then performs the click. For typing, it simulates keyboard input.

Step 6: Take another screenshot

After the action completes, the agent takes a brand new screenshot to see what changed. Did the button click work? Did a new page load? Did a dropdown appear?

Repeat for every single action

And here is the critical part - this entire cycle repeats for every action the agent takes. Click a button? Screenshot, analyze, act. Type in a field? Screenshot, analyze, act. Scroll down a page? Screenshot, analyze, act.

A simple task like filling out a five-field form might require 15 to 20 of these cycles. Each one involves capturing an image, uploading it, waiting for model inference, parsing the response, and executing the action.
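The cycle above can be sketched in a few lines of Python. Every function body here is a hypothetical stand-in (a real agent would call a screen-capture API and a cloud vision model); the point is the shape of the loop: one full capture-infer-act round trip per action.

```python
# Sketch of the screenshot-agent loop. All function bodies are
# hypothetical stand-ins, not any product's real implementation.

def capture_screenshot():
    # Stand-in: a real agent returns PNG bytes of the current screen.
    return b"fake-png-bytes"

def ask_vision_model(image, goal):
    # Stand-in: a real agent uploads the image to a vision model and
    # gets back the next action, e.g. {"type": "click", "x": 342, "y": 518}.
    return {"type": "click", "x": 342, "y": 518}

def execute(action):
    # Stand-in: a real agent moves the mouse or sends keystrokes here.
    pass

def run_task(goal, steps):
    actions = []
    for _ in range(steps):                        # one full cycle per action
        image = capture_screenshot()              # capture: 100-500ms
        action = ask_vision_model(image, goal)    # upload + inference: 1-6s
        execute(action)                           # act: 100-300ms
        actions.append(action)
    return actions

# A five-field form (~11 actions) means ~11 full capture-infer-act cycles.
print(len(run_task("fill out the form", steps=11)))  # → 11
```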

Products using this approach: ChatGPT Atlas, Anthropic's Computer Use demo, OpenAI Operator, many open-source agents built on top of vision models.

How DOM-Based Agents Work

The DOM approach - sometimes called "structured access" or "programmatic control" - works in a completely different way. Instead of looking at pixels, the agent reads the underlying structure of what is on screen.

What is the DOM?

If you have never heard the term, the DOM (Document Object Model) is the structured representation of a web page that your browser maintains internally. When you see a button on a website, that button exists in the DOM as an element with specific properties - it has a type ("button"), a label ("Submit"), a position, a size, and various attributes. The DOM is the source of truth for everything on the page. The pixels you see are just a visual rendering of this structure.

For native desktop applications, the equivalent concept is the accessibility tree - a structured representation of the app's UI elements that macOS maintains for assistive technologies like screen readers.

Step 1: Access the page or app structure directly

Instead of taking a screenshot, the agent connects to the browser's DOM (via a browser extension or automation protocol) or to the operating system's accessibility API. It gets direct access to the actual UI structure.

Step 2: Read the real elements

The agent reads the DOM tree and gets precise information about every element on the page. Not "there appears to be a blue rectangle at coordinates (342, 518) that looks like it might be a button." Instead: "there is a button element with the text 'Submit', ID 'submit-form', class 'primary-btn', and it is currently enabled."

Step 3: Identify the target by its properties

When the agent needs to click the Submit button, it does not need to figure out where the button is visually. It identifies the element by its actual properties - its ID, its text content, its role in the page structure. There is no ambiguity. A button is a button. A text input is a text input.

Step 4: Interact with the element directly

The agent calls the element's native interaction methods. For a web page, that means calling element.click() or element.value = 'text' directly through the browser's API. No mouse movement simulation. No coordinate guessing. The interaction happens at the programmatic level.

No screenshots taken. No images uploaded. No vision model inference. No coordinate math.
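A minimal sketch of steps 2 through 4, using an invented dict-based element tree. Real agents read the live DOM through a browser extension or automation protocol; the toy structure here just illustrates property-based targeting and direct interaction.

```python
# Toy element tree standing in for a live DOM. Targeting happens by
# property (the id), and interaction happens on the element itself.

form = [
    {"tag": "input", "id": "email", "value": ""},
    {"tag": "button", "id": "submit-form", "text": "Submit", "enabled": True},
]

def find_by_id(elements, element_id):
    return next(e for e in elements if e["id"] == element_id)

# "Type" into the field by setting its value directly - no coordinates,
# no simulated keystrokes.
find_by_id(form, "email")["value"] = "hello@example.com"

# "Click" by addressing the element, not a screen position.
target = find_by_id(form, "submit-form")
assert target["enabled"]
print(form[0]["value"])  # → hello@example.com
```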

Products using this approach: Fazm, browser automation tools like Playwright and Puppeteer, many browser extensions, and some specialized automation platforms.

Speed: Why DOM Control Is Dramatically Faster

The performance difference between these two approaches is not incremental. It is an order of magnitude.

Screenshot approach timing (per action)

| Step | Time |
|------|------|
| Capture screenshot | 100 - 500ms |
| Upload image to API | 200 - 1,000ms |
| Vision model inference | 1,000 - 5,000ms |
| Parse response and calculate coordinates | 50 - 100ms |
| Move mouse and execute action | 100 - 300ms |
| Total per action | 1,500 - 7,000ms |

That is roughly 2 to 7 seconds for a single click or keystroke.

DOM approach timing (per action)

| Step | Time |
|------|------|
| Parse element tree | 10 - 50ms |
| Find target element | 1 - 10ms |
| Execute interaction | 10 - 50ms |
| Total per action | 20 - 100ms |

That is under a tenth of a second.

What this means in practice

Consider a realistic workflow: filling out a contact form with 5 fields (name, email, phone, company, message) and clicking Submit. That is roughly 11 actions - click each field, type in it, then click submit.

  • Screenshot-based agent: 11 actions at 2-7 seconds each = 22 to 77 seconds
  • DOM-based agent: 11 actions at 20-100ms each = 0.2 to 1.1 seconds

The form that takes a screenshot agent over a minute to fill out gets completed in about one second with DOM control. And this gap only widens with more complex workflows. A 30-step task that takes a vision agent 2 to 3 minutes finishes in a few seconds with direct element access.
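The arithmetic behind that comparison is simple enough to check directly (times taken from the per-action ranges above, converted to seconds):

```python
# Workflow totals for 11 actions at each approach's per-action cost.
actions = 11

screenshot_per_action_ms = (2_000, 7_000)   # 2-7s per capture-infer-act cycle
dom_per_action_ms = (20, 100)               # 20-100ms per direct interaction

screenshot_total = tuple(t * actions / 1000 for t in screenshot_per_action_ms)
dom_total = tuple(t * actions / 1000 for t in dom_per_action_ms)

print(screenshot_total)  # → (22.0, 77.0) seconds
print(dom_total)         # → (0.22, 1.1) seconds
```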

This is not a theoretical comparison. If you have used a screenshot-based AI agent, you have experienced the waiting - the agent takes a screenshot, pauses while the model thinks, moves the mouse slowly to a coordinate, clicks, pauses again for the next screenshot. With DOM control, actions execute at native speed. Forms fill out as fast as if you were pasting data in. Pages navigate instantly. The experience feels fundamentally different.

Accuracy: Why DOM Control Makes Fewer Mistakes

Speed is one thing, but reliability might matter even more. And this is where the gap between the two approaches gets really interesting.

The problems with screenshot-based perception

When a vision model looks at a screenshot, it is doing something remarkably difficult - trying to understand a complex visual scene and map it back to interactive elements. Here are the failure modes that come up regularly:

Overlapping elements. Modern web pages have layers - tooltips, dropdown menus, modal dialogs, sticky headers. When elements overlap in a screenshot, the vision model can misidentify what is clickable or select the wrong layer.

Dynamic content. Animations, loading spinners, auto-scrolling content, and transition effects can produce screenshots that capture the page in a mid-state. The model sees a blurred button or a half-loaded form and gets confused.

Resolution and scaling. Different monitors have different pixel densities. A Retina display renders elements at 2x resolution. The coordinates the model calculates might be correct for the image dimensions but wrong for the actual screen because of DPI scaling differences.
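A small worked example of the scaling trap, with invented coordinates: on a 2x Retina display the screenshot is captured at twice the logical resolution, so image-space coordinates must be divided by the backing scale factor before clicking. Skipping that step lands the click in a different part of the screen entirely.

```python
# Converting screenshot-pixel coordinates to screen points on a 2x
# display. The coordinates here are illustrative, not from a real model.

scale = 2.0                    # Retina backing scale factor
image_coords = (684, 1036)     # where the model saw the button in the image

# Correct: divide by the scale factor to get logical screen points.
screen_coords = tuple(c / scale for c in image_coords)

# Wrong: using image coordinates as-is would click at (684, 1036),
# far from the actual button.
print(screen_coords)  # → (342.0, 518.0)
```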

Visual similarity. Two buttons that look nearly identical - same size, same color, similar text - can easily be confused in a screenshot. "Save" and "Save As" look very similar at a glance. "Submit" on a form and "Submit" in a navigation menu might be indistinguishable in certain layouts.

Dark mode, themes, and custom styling. The same web page can look completely different depending on browser theme, system dark mode, custom CSS, or accessibility settings. A model trained primarily on light-mode screenshots may struggle with dark-mode interfaces and vice versa.

The "off by 5 pixels" problem. Even when the model correctly identifies the right element, the coordinates it outputs might be slightly off. Five pixels to the left and you click the adjacent menu item. Five pixels up and you hit the toolbar instead of the input field. This happens more often than you would expect, especially with small or densely packed UI elements.

Why DOM control avoids these problems

With DOM access, none of these visual ambiguities exist. The agent is not interpreting an image. It is reading a data structure.

  • Overlapping elements? The DOM tree represents each element's exact position in the layer hierarchy. The agent knows which element is on top.
  • Dynamic content? The DOM reflects the current state of the page in real time. The agent waits for the element to be ready, not for a static image to be captured.
  • Resolution and scaling? Irrelevant. The agent interacts with the element object directly, not with pixel coordinates.
  • Visual similarity? Every DOM element has unique identifiers, attributes, and positions in the tree. Two buttons that look identical visually have different IDs, different parent elements, and different positions in the markup.
  • Dark mode and themes? The DOM structure does not change when you switch themes. A button's ID, text, and role stay the same regardless of its color.
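The visual-similarity point is easy to demonstrate with a toy tree (invented markup): two buttons share the label "Submit", but their positions in the structure make them trivially distinguishable.

```python
# Disambiguating two identically labeled buttons by parent context.

tree = {"tag": "body", "children": [
    {"tag": "nav", "children": [
        {"tag": "button", "text": "Submit", "id": "nav-submit", "children": []}]},
    {"tag": "form", "children": [
        {"tag": "button", "text": "Submit", "id": "form-submit", "children": []}]},
]}

def find_in(node, parent_tag, **props):
    """Return the first element matching props whose parent has parent_tag."""
    for child in node.get("children", []):
        if node.get("tag") == parent_tag and \
                all(child.get(k) == v for k, v in props.items()):
            return child
        hit = find_in(child, parent_tag, **props)
        if hit:
            return hit
    return None

# The form's Submit, not the nav's - no pixels consulted.
button = find_in(tree, "form", tag="button", text="Submit")
print(button["id"])  # → form-submit
```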

The result is that DOM-based agents have dramatically lower error rates for web-based tasks. They click the right button the first time. They type in the correct field. They do not get confused by page layouts, pop-ups, or visual styles.

Privacy and Cost: The Hidden Differences

Beyond speed and accuracy, the two approaches have very different implications for privacy and operating cost. These are the factors people tend to overlook, but they matter a lot.

What happens to your data

Screenshot approach: Every single action requires sending a full-resolution screenshot of your screen to a cloud API. Think about what is on your screen at any given moment - emails, documents, banking information, private messages, medical records, code with API keys. All of that gets transmitted as an image to a remote server for processing.

Over the course of a 10-step workflow, that is 10 screenshots of your screen sent to a third party. Over a full day of agent usage, it could be hundreds of images containing your most sensitive information.

DOM approach: No images are captured or transmitted. The agent reads element properties locally - button labels, input field values, page structure. For an agent like Fazm, screen analysis happens on your machine. Only the intent (what you want to do) gets sent to an AI model for planning, not images of your screen content.

For anyone working with sensitive data - medical professionals, lawyers, financial advisors, anyone handling PII or confidential business information - this distinction is significant. The DOM approach is fundamentally more private by design.

What it costs to run

Screenshot approach: Every action triggers a vision model API call. Vision model calls are expensive - processing an image through GPT-4V or Claude Vision costs significantly more than a text-only call. At scale, a heavy agent user might trigger hundreds of vision API calls per day. That cost adds up quickly, whether you pay it directly or it is baked into a subscription price.

DOM approach: No vision model calls needed for perception. The agent might still call an LLM for planning and decision-making, but those are text-only calls that cost a fraction of vision calls. The perception step - understanding what is on screen - is essentially free because it is just reading a data structure.

This cost difference is part of why some screenshot-based agents are limited in how many actions you can run per session or per day. The economics of sending every screenshot through a vision model are challenging at scale.
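A back-of-the-envelope illustration of why the economics diverge. The per-call prices and the one-planning-call-per-ten-actions ratio below are purely illustrative assumptions, not any provider's actual pricing:

```python
# Hypothetical daily cost comparison. Prices are invented for
# illustration only.
vision_call = 0.01      # assumed $ per screenshot analyzed
text_call = 0.001       # assumed $ per text-only planning call

actions_per_day = 500

# Screenshot agent: one vision call per action.
screenshot_agent = actions_per_day * vision_call

# DOM agent: assume one text-only planning call per ~10 actions.
dom_agent = (actions_per_day // 10) * text_call

print(screenshot_agent)        # → 5.0 (dollars/day)
print(round(dom_agent, 2))     # → 0.05 (dollars/day)
```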

When Screenshots Still Make Sense

To be fair, the screenshot approach is not without advantages. There are real scenarios where it is the better - or the only - option.

Native desktop applications without accessibility support. Not every app exposes a usable accessibility tree. Older applications, certain games, and some custom-built software may have limited or no structured UI access. In these cases, looking at a screenshot may be the only way for an agent to understand what is on screen.

Canvas-based and graphical applications. Design tools, games, video editors, and other applications that render content on a canvas element do not have a traditional DOM. There is no button element to query - just pixels. Screenshot analysis is necessary for interacting with these kinds of applications.

Visual verification. Sometimes you genuinely need to verify what something looks like, not just what it is. Checking if a design renders correctly, confirming that a chart displays the right data visually, or verifying that a PDF looks right - these are tasks where screenshot analysis adds real value.

Universal compatibility. The screenshot approach works on virtually any surface - any operating system, any application, any interface. It does not require a browser extension, an accessibility API, or any special integration. You just take a picture of the screen and work with it. This universality is a genuine advantage for agents that need to work across wildly different environments.

The Hybrid Approach: Getting the Best of Both

The smartest AI agents do not commit to one approach exclusively. They use the right tool for the right situation.

A hybrid approach looks like this:

  • Web pages in a browser? Use DOM control. The page structure is readily available through browser APIs or extensions. Every element is identifiable and directly interactable. There is no reason to resort to screenshots.
  • Native macOS applications? Use the accessibility tree. macOS provides a comprehensive accessibility API that exposes app UI elements in a structured format - the same infrastructure that screen readers and assistive technologies use.
  • Apps with limited accessibility support? Fall back to screenshot analysis. Use vision capabilities for the rare cases where structured access is not available.
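The dispatch logic above can be sketched in a few lines. The surface descriptions and capability flags are hypothetical; a real agent would query its browser extension and the macOS accessibility API to make this decision.

```python
# Sketch of hybrid perception dispatch: prefer structured access,
# fall back to screenshots only when nothing structured is available.

def choose_perception(surface):
    if surface["type"] == "browser":
        return "dom"                  # structured access via extension
    if surface.get("has_accessibility_tree"):
        return "accessibility"        # macOS accessibility API
    return "screenshot"               # last-resort vision fallback

print(choose_perception({"type": "browser"}))                                  # → dom
print(choose_perception({"type": "native", "has_accessibility_tree": True}))   # → accessibility
print(choose_perception({"type": "native", "has_accessibility_tree": False}))  # → screenshot
```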

This is the approach Fazm takes. For browser-based tasks, it uses direct DOM control through a browser extension, which gives it native-speed interactions with any web page. For native Mac apps, it leverages macOS accessibility APIs to read and interact with application interfaces. The screenshot approach is reserved for edge cases where neither DOM nor accessibility access is available.

The result is an agent that is fast and accurate for the 90%+ of tasks that involve web pages and standard Mac apps, while still being able to handle the occasional app that does not expose structured UI data.

What This Means for You

If you are evaluating AI agents - or just curious about how they work - the perception approach is one of the most important technical details to understand. It tells you more about the agent's real-world performance than the underlying language model does.

Here is a quick cheat sheet:

| Factor | Screenshot-Based | DOM-Based |
|--------|-----------------|-----------|
| Speed per action | 2-7 seconds | Under 100ms |
| 10-step workflow | 20-70 seconds | Under 1 second |
| Accuracy on web tasks | Good but imperfect | Near-perfect element targeting |
| Privacy | Screen images sent to cloud | Local element data only |
| Cost per action | High (vision model calls) | Low (text-only or free) |
| Works on any surface | Yes | Requires DOM or accessibility API |
| Handles visual verification | Yes | No |

A fast, accurate, private agent built on DOM control will outperform a screenshot-based agent using the most powerful vision model in the world for the vast majority of everyday computer tasks. The model matters, but the architecture matters more.

As AI agents become a standard part of how people use computers - and that shift is happening right now - understanding these mechanics helps you choose the right tool and set the right expectations. Next time you see an AI agent demo, pay attention to how long each action takes. That pause between steps? That is the screenshot-analyze-respond cycle in action. If the agent moves at native speed with no visible delay, it is probably using direct DOM or accessibility tree access.

Try the DOM Approach for Yourself

Fazm is free, open source, and built on the DOM-first architecture described in this article. If you want to experience the speed and accuracy difference firsthand:

  • Download Fazm from fazm.ai/download
  • Star the repo on GitHub to follow development
  • Join the waitlist at fazm.ai for early access to new features

Give it a voice command and watch how fast the actions execute. Once you see DOM control in action, the screenshot approach feels like watching a slideshow.