
How AI Agents Actually See Your Screen: Privacy Implications

When you give an AI agent permission to interact with your computer, how much does it actually see? The answer depends entirely on the underlying technology. Screenshot-based agents capture a full image of your display and send it to a vision model for interpretation. Accessibility API agents read structured metadata about UI elements without ever seeing the pixels. The privacy implications of this difference are substantial. This guide explains both approaches, what data flows where, and how to make informed decisions about which tools to trust with your screen.


Fazm uses real accessibility APIs instead of screenshots, so it interacts with any app on your Mac quickly and reliably. Free to start, fully open source.

fazm.ai

1. The Data Flow: Screenshots vs Accessibility APIs

Understanding the privacy implications starts with understanding the data flow. When a screenshot-based agent needs to understand what is on your screen, it calls the operating system's screen capture API, receives a full-resolution image of your display (or a specific window), and sends that image to a cloud-hosted vision model for interpretation. The model returns a description of what it sees, and the agent uses that description to decide its next action.
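The loop above can be sketched in a few lines. This is a minimal illustration with stubbed functions, not any real agent's code: `capture_screen` and `describe_image` stand in for the OS capture API and a cloud vision model, respectively.

```python
# Hypothetical stand-ins: in a real agent, capture_screen() calls the OS
# screen-capture API and describe_image() uploads the frame to a vision model.
def capture_screen() -> bytes:
    # A real implementation returns a full-resolution PNG of the display.
    return b"\x89PNG...every visible window, notification, and background tab"

def describe_image(png: bytes) -> str:
    # A real implementation sends `png` across the network for interpretation.
    return "A browser window with a Submit button at (412, 630)"

def screenshot_agent_step() -> str:
    frame = capture_screen()             # full pixels captured from the OS
    description = describe_image(frame)  # full pixels cross the wire
    return description                   # agent plans its next action from text

print(screenshot_agent_step())
```

The key point the sketch makes concrete: everything in `frame` leaves the machine, even though the agent only needed the one sentence that comes back.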

When an accessibility API-based agent needs the same information, it queries the operating system's accessibility tree. This tree contains structured metadata: element types (button, text field, label), element names, element values, and element positions. No images are involved. The data is a text-based tree structure that describes the UI hierarchy.
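A simplified model of that tree makes the contrast visible. The real macOS API exposes opaque `AXUIElement` references with attributes like `kAXRoleAttribute`; the dataclass below only mirrors the shape of that metadata, with illustrative field names.

```python
from dataclasses import dataclass, field

# Simplified model of an accessibility tree node: structured metadata only,
# no pixel data anywhere in the structure.
@dataclass
class AXElement:
    role: str                     # e.g. "AXButton", "AXTextField"
    title: str = ""
    value: str = ""
    position: tuple = (0, 0)
    children: list = field(default_factory=list)

window = AXElement("AXWindow", title="Contact Form", children=[
    AXElement("AXTextField", title="Email", value="", position=(40, 120)),
    AXElement("AXButton", title="Submit", position=(40, 200)),
])
```

Walking this structure tells the agent everything it needs to click the button or fill the field, yet nothing about how the window actually looks.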

The critical difference is what crosses the wire. Screenshot data includes everything visible on screen: personal messages, financial information, private documents, notification content, browser tabs, and background applications. Accessibility data includes only the metadata of UI elements the agent is interacting with. A password field in the accessibility tree shows up as "secure text field, name: Password" rather than showing the actual password characters.

2. What Screenshot Agents Actually Capture

A single screenshot at 1440p resolution is roughly 2 to 4 MB as a PNG or 500 KB to 1 MB as a compressed JPEG. An active agent taking screenshots every few seconds generates hundreds of images per session, each containing whatever was on your screen at that moment.
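A back-of-envelope calculation shows how quickly this adds up. The numbers below are assumptions drawn from the ranges above (one compressed JPEG every 3 seconds at roughly 0.75 MB), not measurements of any specific tool:

```python
# Assumed session parameters: one ~0.75 MB JPEG every 3 seconds for 30 minutes.
seconds_per_shot = 3
session_minutes = 30
mb_per_jpeg = 0.75

shots = session_minutes * 60 // seconds_per_shot   # number of screenshots
upload_mb = shots * mb_per_jpeg                    # total data sent to the cloud

print(f"{shots} screenshots, ~{upload_mb:.0f} MB uploaded to the vision API")
```

Under these assumptions, a single half-hour session ships 600 full-screen images, several hundred megabytes of whatever happened to be visible, to an external service.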

The incidental capture problem is significant. While the agent is automating a task in one application, the screenshot captures everything else too. A Slack notification with a private message appears in the screenshot. A background browser tab showing your bank account is captured. An email preview in your notification center is included. The agent did not need any of this information, but it was sent to the cloud for processing anyway.

Some screenshot agents mitigate this by capturing only the active window rather than the full screen. This helps but does not eliminate the problem. Even within a single application window, there may be more data visible than the agent needs. A spreadsheet application shows many rows and columns; the agent might need only one cell, but the screenshot captures the entire visible sheet.

Data retention is another concern. When screenshots are sent to a cloud vision API, the cloud provider's data retention policies apply. Some providers retain input data for model improvement. Some retain it for abuse detection. Even providers that do not retain data still process it on their servers, which means the data exists (however briefly) outside your control. For sensitive data, even brief external processing may violate your organization's data handling policies.

Privacy-first screen interaction

Fazm reads structured accessibility metadata, not screenshots. Your screen data stays on your Mac.

Try Fazm Free

3. What Accessibility API Agents Actually Read

The macOS accessibility API (part of the ApplicationServices framework) provides a tree of UI elements for each application. Each element has a role (button, text field, static text, table, row), optional attributes (label, value, enabled state, position, size), and child elements. The tree mirrors the visual hierarchy of the application but contains only metadata, not rendered pixels.

An accessibility API agent reading a contact form sees: "text field, label: First Name, value: (empty)" and "text field, label: Email, value: (empty)" and "button, label: Submit." It does not see the form's visual design, the background color, the font, or any other visual content on the page that is not part of the form.
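The contact-form example above can be reproduced directly. The tuples below are an illustrative flattening of the accessibility tree, and `describe` renders each element exactly the way the text describes it:

```python
# What an accessibility-based agent "sees" for a contact form: role, label,
# and value per element. A button has no value, so it is marked None.
form_elements = [
    ("text field", "First Name", ""),
    ("text field", "Email", ""),
    ("button", "Submit", None),
]

def describe(elements):
    """Render each element as the metadata string the agent reasons over."""
    lines = []
    for role, label, value in elements:
        desc = f"{role}, label: {label}"
        if value == "":
            desc += ", value: (empty)"
        lines.append(desc)
    return lines

for line in describe(form_elements):
    print(line)
```

Nothing in the output, or in the underlying data, encodes fonts, colors, layout, or any other content on the page.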

This scoping is inherent to the technology, not a policy layer that can be bypassed. The accessibility API simply does not provide pixel data. Even if an agent wanted to capture visual information through the accessibility API, it could not. This is a structural privacy guarantee that does not depend on the agent being well-behaved.

The trade-off is that accessibility API agents cannot perform visual tasks. They cannot verify that a chart looks correct, check that colors match brand guidelines, or confirm that a layout is visually acceptable. For these tasks, some form of visual capture is necessary. But for the vast majority of automation tasks (clicking buttons, filling forms, reading text, navigating menus), the accessibility tree provides all necessary information without any visual data exposure.

4. Cloud Processing vs Local Processing

Where data is processed matters as much as what data is captured. Screenshot agents typically send images to cloud-hosted vision models because running large vision models locally is computationally expensive. This means your screen data traverses the network and is processed on servers you do not control.

Accessibility API data is lightweight text that can be processed locally. An agent that reads the accessibility tree and makes decisions based on element labels and values can run entirely on your machine. The data never leaves your computer. Even when the agent needs to call an LLM for decision-making, the data it sends is structured text (element names and types), not images of your screen.
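A sketch of what such a metadata-only payload might look like, with field names that are illustrative rather than any real API's schema:

```python
import json

# Hypothetical cloud request from a locally running accessibility-based agent:
# structured element metadata serialized as JSON, with no pixel data anywhere.
elements = [
    {"role": "AXTextField", "label": "Email", "value_filled": False},
    {"role": "AXButton", "label": "Submit", "enabled": True},
]

payload = json.dumps({"task": "fill and submit the form", "elements": elements})

print(f"payload is {len(payload)} bytes of plain text")
```

Compare this with the megabytes-per-screenshot figures above: the entire cross-network payload is a short text string, and everything in it is something the agent actually needed.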

For organizations with strict data sovereignty requirements, this distinction is decisive. Financial institutions, healthcare providers, and government agencies often cannot send screen content to cloud services. Accessibility API-based agents that process data locally meet these requirements by design. Screenshot-based agents would need on-premises vision model deployment, which is technically possible but significantly more complex and expensive.

Fazm and similar tools that use accessibility APIs keep your data local by default. When cloud processing is needed (for complex reasoning about what action to take), only the structured element metadata is sent, not raw screen content. This minimizes data exposure while still leveraging powerful cloud models for decision-making.

5. Making Practical Privacy Decisions

When evaluating AI agent tools for screen interaction, ask these specific questions: Does the tool capture screenshots or use accessibility APIs? If screenshots, does it capture the full screen or just the active window? Where are screenshots processed, on your machine or in the cloud? What is the data retention policy for any cloud-processed data? Is the tool open source so you can verify its data handling claims?

For personal use on non-sensitive tasks, screenshot-based agents are convenient and the privacy trade-offs may be acceptable. For professional use with any sensitive data (customer information, financial records, proprietary documents), the structural privacy advantages of accessibility API agents are significant.

If you must use a screenshot-based agent, take precautions: close applications with sensitive data before running the agent, disable notifications during agent sessions, use the agent on a dedicated virtual desktop with only the relevant application visible, and review the agent provider's data retention policies carefully.

For teams deploying AI agents across multiple employee machines, establish a clear policy about which tools are approved and which types of data they may access. Monitor agent activity logs to verify compliance. And prefer tools that produce structured audit trails (another advantage of accessibility API approaches) so that policy compliance can be verified automatically rather than relying on manual review.

The privacy landscape for AI agents is still evolving. Regulatory frameworks are catching up to the technology. Making informed choices now, based on the actual data flows rather than marketing claims, positions you well for whatever regulations emerge.

Automate your Mac without exposing your screen

Fazm uses accessibility APIs to interact with your apps, keeping your visual data private and local.

Try Fazm Free

Open source. Free to start. No screenshots sent to the cloud.