Proactive AI Agents and Local Sensing: Moving Beyond Reactive Assistants
A post on r/singularity captured a frustration that many people feel but struggle to articulate: "Every AI assistant built is reactive by design." You have to open a chat window, type a prompt, wait for a response. The AI never initiates. It never notices that you have been staring at a spreadsheet for 20 minutes, or that an email arrived that needs urgent attention, or that the same manual workflow is happening for the third time this week. The entire paradigm is built around the user asking and the AI responding. But the most useful assistant would not wait to be asked.
1. The Reactive Default and Why It Exists
Reactive design is not an oversight - it is the path of least resistance. Building a reactive assistant requires solving one problem: given a user prompt, generate a good response. Building a proactive assistant requires solving three additional problems: what is the user doing right now, is this a moment where intervention would be helpful, and what action should be taken without being asked.
Each of those problems is harder than the original one. Knowing what the user is doing requires continuous context awareness - monitoring their screen, their applications, their calendar, their communication channels. Deciding when to intervene requires judgment about when help is welcome versus annoying. Deciding what action to take without a prompt requires understanding the user's goals and preferences deeply enough to anticipate needs.
The reactive model sidesteps all of this complexity. The user tells you what they want, you do it. No sensing required, no judgment about timing, no risk of unwanted interruptions. It is simpler to build, easier to evaluate, and harder to get catastrophically wrong - which is why nearly every AI assistant shipped in 2023, 2024, and most of 2025 was reactive.
2. What Proactive Actually Means
Proactive does not mean the AI randomly interrupts you with suggestions. That is the Clippy failure mode, and it is what most people imagine when they hear "proactive AI." Real proactive behavior is more subtle and more useful:
- Context-aware readiness - the agent knows what you are working on and has relevant context pre-loaded. When you do ask for help, the response is immediate and contextually informed, not starting from scratch.
- Pattern recognition - the agent notices you performing the same manual workflow repeatedly and offers to automate it. Not on the first occurrence, but after the third or fourth time, when the pattern is clear.
- Ambient monitoring - the agent watches for trigger conditions you have defined. An important email from a specific sender. A calendar conflict appearing. A file change in a monitored directory. When the trigger fires, the agent acts or alerts.
- Predictive preparation - the agent knows your Monday morning routine involves pulling reports from three dashboards. It starts compiling the data before you sit down, so the summary is ready when you are.
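The pattern-recognition behavior above can be sketched in a few lines. This is an illustrative toy, not any product's real API: the workflow signatures and the threshold of three occurrences are assumptions chosen to match the text.

```python
from collections import Counter

class WorkflowPatternDetector:
    """Toy sketch: notice a repeated manual workflow and offer to
    automate it only after it has recurred - not on first sight."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.counts = Counter()

    def observe(self, workflow_signature: str) -> bool:
        """Record one occurrence; return True exactly at the
        threshold, so the user is asked once, not on every repeat."""
        self.counts[workflow_signature] += 1
        return self.counts[workflow_signature] == self.threshold

detector = WorkflowPatternDetector()
events = ["export-csv>paste-sheet>send-email"] * 4
offers = [detector.observe(e) for e in events]
# Offers automation on the third occurrence only: [False, False, True, False]
```

Firing exactly once at the threshold, rather than on every occurrence past it, is what separates this from the Clippy failure mode: the agent speaks up when the pattern is clear, then stays quiet.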
Design principle: A proactive agent should make the user feel like the assistant "just knows" - not like it is constantly watching them. The sensing should be invisible. The value should be obvious.
3. Local Sensing vs. Cloud Screenshots
There are two fundamentally different architectures for giving an AI agent awareness of what the user is doing. They have radically different implications for cost, privacy, latency, and reliability.
The cloud screenshot approach takes a screenshot of the user's screen at regular intervals, sends it to a cloud vision model for interpretation, and uses the model's output to build context. Products like Rewind AI (now Limitless) pioneered this approach. It works but has significant limitations.
The local sensing approach uses operating system APIs - primarily accessibility APIs - to read structured information about what is on screen without taking screenshots. The agent knows you have a spreadsheet open, which cell is selected, what the values are, and which menu is active. All of this information is available locally, with no cloud round trip required.
| Dimension | Cloud Screenshots | Local Accessibility APIs |
|---|---|---|
| Latency | 1 to 5 seconds per screenshot analysis | Milliseconds for UI tree queries |
| Privacy | Screenshots leave the device | All data stays on device |
| Cost per query | $0.01 to $0.03 per vision API call | Effectively zero - OS API call |
| Reliability | Depends on vision model accuracy, breaks on unusual UIs | Deterministic - structured data from the OS |
| Information richness | Visual appearance only | Semantic structure - element types, labels, states, hierarchy |
| Continuous monitoring cost | $5 to $50 per day at 1 screenshot per minute | Negligible CPU usage |
The cost difference is the killer. Continuous cloud screenshot analysis at any reasonable frequency is expensive; local accessibility API queries are effectively free. This makes the cloud approach viable for occasional analysis but impractical for the kind of always-on ambient awareness that proactive agents require.
4. Accessibility APIs as a Sensing Layer
Accessibility APIs were designed for screen readers and assistive technology. They expose a structured tree of every UI element in every application: buttons, text fields, labels, menu items, table cells, and their relationships to each other. On macOS, this is the Accessibility framework. On Windows, it is UI Automation. On Linux, it is AT-SPI.
For AI agents, this is the most underutilized sensing capability available. The accessibility tree tells you not just what is visible on screen, but the semantic meaning of what is visible. A screenshot shows pixels. The accessibility tree shows "a text field labeled Email Address containing john@example.com, focused, inside a form titled Contact Information, inside a window titled CRM - New Contact."
This semantic richness is what makes proactive behavior possible without vision models. The agent can monitor for specific conditions - "a dialog appeared with the word Error in its title" or "the user has been in the same spreadsheet cell for more than 5 minutes" - without any cloud calls.
- Element-level monitoring without screenshots
- Application state changes detected in real time
- Text content readable without OCR
- UI hierarchy gives context about what the user is doing, not just what they see
- Works across all native applications, not just the browser
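To make the semantic-structure point concrete, here is a toy model of an accessibility tree and a query over it. The real trees come from AXUIElement on macOS or UI Automation on Windows; the `UIElement` class, role names, and `find` helper below are hypothetical stand-ins for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class UIElement:
    # Toy stand-in for one node of an OS accessibility tree.
    role: str                  # e.g. "window", "text_field", "dialog"
    label: str = ""
    value: str = ""
    children: list = field(default_factory=list)

    def walk(self):
        """Yield this element and all descendants, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()

def find(root, predicate):
    """Query the tree for elements matching a semantic condition -
    no pixels, no OCR, just structured data from the OS."""
    return [el for el in root.walk() if predicate(el)]

# A miniature tree mirroring the CRM example from the text.
tree = UIElement("window", label="CRM - New Contact", children=[
    UIElement("form", label="Contact Information", children=[
        UIElement("text_field", label="Email Address",
                  value="john@example.com"),
    ]),
    UIElement("dialog", label="Error - Save failed"),
])

# Trigger condition: "a dialog appeared with Error in its title".
error_dialogs = find(tree, lambda el: el.role == "dialog"
                     and "Error" in el.label)
```

The query is deterministic and runs in microseconds against in-memory structure - which is exactly why trigger conditions like this can be evaluated continuously without cloud calls.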
The limitation is coverage. Not all applications expose complete accessibility trees. Electron apps are generally good. Some native apps are sparse. Web content in browsers is well-covered through the DOM accessibility layer. The practical coverage on macOS in 2026 is roughly 85 to 90% of the applications most knowledge workers use daily.
5. Cost Analysis - Local vs. Cloud Approaches
Let us run the numbers on continuous context awareness for a single user over a working month (22 days, 8 hours per day):
| Approach | Frequency | Monthly Cost | Data Sent Off Device |
|---|---|---|---|
| Cloud screenshots (1/min) | 10,560 screenshots/month | $105 to $317 (vision API) | ~30 GB of screenshots |
| Cloud screenshots (1/5min) | 2,112 screenshots/month | $21 to $63 | ~6 GB of screenshots |
| Local accessibility sensing | Event-driven (continuous) | $0 (local CPU only) | None |
| Hybrid (local sensing + occasional cloud LLM) | LLM called only on trigger events | $2 to $15 (depends on trigger frequency) | Text summaries only, not screenshots |
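The screenshot rows in the table follow from simple arithmetic; a quick check, using the article's assumed $0.01 to $0.03 per vision call:

```python
# One user, one working month: 22 days x 8 hours x 60 minutes.
MINUTES_PER_MONTH = 22 * 8 * 60          # 10,560 working minutes

# One screenshot per minute.
shots_1min = MINUTES_PER_MONTH           # 10,560 screenshots
cost_low = shots_1min * 0.01             # ~$105.60
cost_high = shots_1min * 0.03            # ~$316.80

# One screenshot every 5 minutes.
shots_5min = MINUTES_PER_MONTH // 5      # 2,112 screenshots
cost_low_5 = shots_5min * 0.01           # ~$21.12
cost_high_5 = shots_5min * 0.03          # ~$63.36
```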
The hybrid approach is where the economics make sense for proactive agents. Local sensing handles the continuous monitoring at zero marginal cost. When something interesting happens - an event that needs reasoning or action - the agent calls an LLM with a text description of the context, not a screenshot. The LLM decides what to do. Total cost: a few dollars per month instead of hundreds.
This is not a minor efficiency gain. It is the difference between a product that can offer always-on proactive behavior sustainably and one that hemorrhages money on cloud vision API calls.
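The hybrid loop itself is structurally simple. A sketch, with the trigger check and LLM call mocked as plain callbacks (`is_trigger` and `summarize` are hypothetical names, not a real API):

```python
def hybrid_dispatch(events, is_trigger, summarize):
    """Inspect every UI event locally at zero marginal cost;
    escalate only trigger events to a (mocked) cloud LLM call,
    passing a text summary rather than a screenshot."""
    llm_calls = []
    for event in events:
        if is_trigger(event):                    # cheap local check
            llm_calls.append(summarize(event))   # rare paid escalation
    return llm_calls

# 500 routine focus changes and one error dialog.
events = (
    [{"kind": "focus_change", "app": "Mail"}] * 500
    + [{"kind": "dialog", "title": "Error - disk full"}]
)
calls = hybrid_dispatch(
    events,
    is_trigger=lambda e: e["kind"] == "dialog",
    summarize=lambda e: f"Dialog appeared: {e['title']}",
)
# 501 local events, exactly 1 escalation to the LLM.
```

The cost model falls out directly: spend scales with trigger frequency, not with monitoring frequency, which is why the hybrid row in the table is dollars per month rather than hundreds.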
6. Privacy and Trust Tradeoffs
A proactive agent that monitors your screen raises obvious privacy concerns. The architectural choice between cloud and local processing directly determines the privacy profile:
Cloud screenshot approaches send images of your entire screen to remote servers. These images capture everything visible: private messages, financial data, medical records, passwords you are in the middle of typing. Even with end-to-end encryption at rest, the images must be decrypted for vision model processing. The privacy surface area is enormous.
Local sensing via accessibility APIs keeps everything on device. The agent reads structured UI data from the OS, processes it locally, and only sends text summaries to cloud LLMs when reasoning is needed - and those summaries can be filtered to exclude sensitive fields. You can strip credit card numbers, passwords, and private messages before any data leaves the device.
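The filtering step can be an ordinary redaction pass over the text summary before it leaves the device. A minimal sketch - the two patterns below (card-like digit runs and password-labeled fields) are examples only, not an exhaustive or production-grade filter:

```python
import re

# Redaction rules applied to every outbound text summary.
PATTERNS = [
    # 13-16 digit runs, optionally separated by spaces or hyphens.
    (re.compile(r"\b\d(?:[ -]?\d){12,15}\b"), "[REDACTED-CARD]"),
    # Anything following a "password:" or "password=" label.
    (re.compile(r"(?i)(password\s*[:=]\s*)\S+"), r"\1[REDACTED]"),
]

def sanitize(summary: str) -> str:
    """Strip sensitive fields from a summary before any cloud call."""
    for pattern, replacement in PATTERNS:
        summary = pattern.sub(replacement, summary)
    return summary

clean = sanitize(
    "User typed password: hunter2 in field; "
    "card 4111 1111 1111 1111 visible"
)
# -> "User typed password: [REDACTED] in field; card [REDACTED-CARD] visible"
```

Because the input here is structured text rather than pixels, this filtering is reliable in a way that redacting regions of a screenshot never is.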
For enterprise adoption, this is not a nice-to-have distinction. It is a dealbreaker. Security teams will not approve a tool that streams screenshots of employee screens to a third-party cloud service. They will consider a tool that processes screen context locally and sends only sanitized text summaries for LLM reasoning.
7. The Current Landscape of Proactive Agents
As of early 2026, the shift from reactive to proactive is still early. Most AI assistants remain prompt-response systems. But a few products are exploring proactive patterns:
- Limitless (formerly Rewind) - pioneered continuous screen recording and transcription for personal context. Cloud-based approach with a pendant device for audio. Primarily focused on meeting summaries and recall, not proactive action.
- Apple Intelligence on-device - Apple's approach uses on-device models for context awareness within its own apps. Limited to Apple's ecosystem but demonstrates the local-first architecture at scale.
- GitHub Copilot workspace - proactive in the coding context, suggesting next steps and identifying issues before the developer asks. Domain-specific proactivity rather than general-purpose.
- Fazm - a macOS agent that uses accessibility APIs for local sensing, enabling proactive context awareness across all native applications without cloud screenshots. The local-first architecture keeps monitoring costs at zero while enabling always-on ambient awareness of user activity.
- Custom enterprise solutions - some companies are building internal proactive agents using accessibility APIs and local models for specific high-value workflows. These are bespoke but point to where the market is heading.
The trajectory is clear. As local models improve and accessibility API coverage expands, proactive agents will become the norm rather than the exception. The question is not whether this shift happens, but which architectural foundation - cloud screenshots or local sensing - becomes the standard.
The cost and privacy analysis strongly favors local sensing. The agents that can monitor continuously without racking up cloud bills or shipping screenshots off-device will be the ones that can offer true ambient intelligence sustainably.
Try a local-first proactive agent
Fazm uses macOS accessibility APIs for ambient context awareness - no cloud screenshots, no per-query costs. See what a proactive desktop agent actually feels like.
Get Fazm for macOS