Microsoft's computer use agent runs in a Cloud PC. Your Mac already has the API to do the same job locally.
Copilot Studio's Computer Use is a Windows 365 Cloud PC pool gated by Microsoft Entra and Microsoft Intune, driven by a vision model that takes a screenshot every step. That shape is right for a company replacing RPA at thirty-seat scale. It is the wrong shape for one person on a Mac who just wants their own desktop automated. This guide walks through what Microsoft actually ships, the trade-offs of the vision loop, and what the local Mac-native counterpart looks like when the observation surface is the accessibility tree instead of a screenshot.
What Microsoft actually ships, stripped of the marketing
Inside Copilot Studio, Computer Use is a tool an agent can call that lets it propose UI actions (click, type, scroll, request a new screenshot) and adapt by re-evaluating the latest screenshot after each action. The model underneath is either OpenAI's Computer-Using Agent or Anthropic's Claude Sonnet 4.5, depending on which you pick. Both are multimodal vision models reasoning directly over images.
The execution environment is not a user's laptop. It is a Cloud PC pool, powered by Windows 365 for Agents, that auto-scales based on workload demand. Each agent task gets a Cloud PC instance, signs in with credentials the admin stored in Copilot Studio's built-in credential vault, drives the apps inside that Cloud PC, then logs everything (including screenshot replays) into Microsoft Purview. Identity flows through Microsoft Entra. Device policy flows through Microsoft Intune.
That shape is a reasonable answer to a specific problem: an enterprise that wants to replace old RPA scripts with an agent that can adapt to UI changes. It is also the wrong shape for someone who already owns a Mac and just wants to automate their own desktop. The rest of this guide unpacks why.
Microsoft's Computer Use, end to end
The per-step cost of the vision loop
Every action the Microsoft agent takes pays the same tax: take a screenshot of the Cloud PC desktop, upload it, have the vision model reason over it, get back an action, execute the action, take another screenshot. That is the loop by design; it is how the agent adapts when a button moves or a dialog changes. It is also why Purview session replays come for free (the screenshots are already the input signal). And it is why Copilot Studio's computer use is compute-heavy per step in a way a local accessibility-based loop simply is not.
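That fixed per-step shape can be sketched as a loop whose observation cost is constant no matter what the action is. This is a toy model, not Microsoft's actual pipeline; the types and names are illustrative:

```swift
// Toy sketch of the per-step vision loop: every iteration pays one
// screenshot and one model call on an image payload, regardless of
// whether the action is a click, a single keystroke, or a scroll.
struct VisionAction {
    let kind: String   // "click", "type", "scroll", "screenshot", ...
}

func runVisionLoop(steps: Int,
                   observe: () -> [UInt8],             // screenshot bytes
                   decide: ([UInt8]) -> VisionAction,  // vision-model call
                   execute: (VisionAction) -> Void) {
    for _ in 0..<steps {
        let frame = observe()       // the per-step screenshot tax
        let action = decide(frame)  // image in, action out
        execute(action)             // then the loop repeats
    }
}
```

The point of the sketch is that `observe` sits inside the loop unconditionally: typing one character costs the same round trip as opening a dialog.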
The local Mac-native counterpart, in one probe
Fazm is a consumer Mac app that drives your own desktop by reading the macOS accessibility tree. It is the same tree VoiceOver reads, the same tree Shortcuts uses, the same tree your screen reader would use if you enabled one. macOS already exposes it. The question is just whether you trust the API.
Fazm runs the same probe on every cold launch, and again every five seconds while the app is open, to confirm the permission works: three API calls, no cloud, no screenshot.
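A minimal sketch of that probe, reconstructed from the description in this guide (the real implementation lives in AppState.swift; the function name here is illustrative):

```swift
import AppKit
import ApplicationServices

// Launch-time accessibility probe, sketched from the prose description.
// Returns the raw AXError; the caller branches on it (see below).
func probeAccessibility() -> AXError {
    // Call 1: get the frontmost app's process ID.
    guard let app = NSWorkspace.shared.frontmostApplication else {
        return .cannotComplete
    }
    // Call 2: wrap the PID in an AXUIElement.
    let element = AXUIElementCreateApplication(app.processIdentifier)
    // Call 3: perform a real AX read; the resulting AXError,
    // not AXIsProcessTrusted(), is the health signal.
    var focusedWindow: CFTypeRef?
    return AXUIElementCopyAttributeValue(element,
                                         kAXFocusedWindowAttribute as CFString,
                                         &focusedWindow)
}
```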
Why the AXError branching in that probe actually matters
A naive probe would just check AXIsProcessTrusted() and be done. On macOS 14 and later that boolean can lie: macOS granted the permission, but the per-process cache has not caught up. Fazm instead performs a real AX call and reads the AXError. .success, .noValue, .notImplemented, and .attributeUnsupported all mean the API is alive. .apiDisabled means accessibility is globally off. .cannotComplete is the interesting one: some Mac apps genuinely do not implement accessibility (Qt-based apps, some Python UIs, PyMOL). If the frontmost app happens to be one of those, a single probe cannot tell the difference between "my permission is broken" and "this specific app does not expose AX".
So Fazm falls back to a second probe against Finder, which is guaranteed to implement accessibility. If Finder also fails, the permission really is stuck and Fazm asks you to quit and reopen. If Finder succeeds, the original .cannotComplete was app-specific and the permission is fine. If Finder is not running, Fazm falls back one more level to an event-tap probe that bypasses the TCC cache entirely.
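That ladder is pure branching logic. Here is a sketch of it using a stand-in enum in place of ApplicationServices' AXError, so the logic reads standalone; the types and names are illustrative, not Fazm's actual code:

```swift
// Stand-in for the subset of AXError cases the probe cares about.
enum AXProbeError {
    case success, noValue, notImplemented, attributeUnsupported
    case apiDisabled, cannotComplete
}

enum Verdict {
    case healthy           // permission works
    case globallyDisabled  // accessibility is off system-wide
    case appSpecific       // the frontmost app just lacks AX
    case stuckRelaunch     // TCC cache is stuck; ask user to relaunch
    case useEventTap       // Finder not running; fall back to event tap
}

func verdict(frontmost: AXProbeError, finder: AXProbeError?) -> Verdict {
    switch frontmost {
    case .success, .noValue, .notImplemented, .attributeUnsupported:
        return .healthy          // any of these means the API is alive
    case .apiDisabled:
        return .globallyDisabled
    case .cannotComplete:
        // Ambiguous: re-probe Finder, which always implements AX.
        guard let finder = finder else { return .useEventTap }
        switch finder {
        case .success, .noValue, .notImplemented, .attributeUnsupported:
            return .appSpecific  // the original failure was app-specific
        default:
            return .stuckRelaunch
        }
    }
}
```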
Nothing about that loop is possible in the Copilot Studio model, because the agent is not on your Mac; it is in a Cloud PC on the other side of a screenshot pipeline. The local approach gets diagnostics the hosted approach cannot.
“The frontmost app's AXError is the entire permission health check. Three lines, no cloud, no screenshot.”
Desktop/Sources/AppState.swift lines 433 to 463
The two architectures, side by side
Copilot Studio Computer Use vs Fazm local agent
One honest comparison. One column fits a 30-seat enterprise replacing RPA; the other fits one person on a Mac.
| Feature | Copilot Studio | Fazm (local) |
|---|---|---|
| Where the agent actually runs | Windows 365 Cloud PC pool (Microsoft-hosted VM) | Your own signed .app on your own Mac |
| What the agent reads on each step | A fresh screenshot of the Cloud PC desktop | The live macOS accessibility tree, plus a structured diff |
| Underlying model | OpenAI CUA or Claude Sonnet 4.5, always on an image payload | Any model you want, reasoning over attribute-level deltas |
| Identity and admin | Microsoft Entra + Intune policies, Purview logging | macOS TCC permission prompt, logs stay on your disk |
| What you automate | Windows apps and web pages inside the Cloud PC | Any AX-compliant Mac app: Mail, Calendar, Finder, Messages, Notes, Safari, Catalyst apps |
| Per-step latency driver | Image encode + upload + decode + vision reasoning | Accessibility tree walk (tens to low hundreds of ms) |
| Credentials | Re-entered into a Copilot Studio credential vault | Whatever you are already signed in to on this Mac |
| Licensing shape | Microsoft 365 + Copilot Studio + Windows 365 seats | Consumer Mac app, signed and notarized, self-serve install |
| Right fit | Enterprise RPA replacement at 30+ seat scale | One person, one Mac, automating their own desktop |
Both are valid. Pick by where you actually want the automation to live.
Why the observation surface changes the per-step cost shape
A vision loop costs roughly the same per step no matter what the step does. Click a button: image in, image out. Type a single character: image in, image out. Scroll one row: image in, image out. That is the shape of "take a screenshot and reason over it". It is predictable, it is auditable, it is expensive per action.
A tree-diff loop costs almost nothing for small changes and scales with the complexity of the change for big ones. Type a character into a text field: the diff is one AXTextField whose AXValue attribute went from one string to another. Open a new dialog: the diff is the set of new AXButton / AXTextField / AXStaticText children inside the AXSheet. The agent gets attribute-level deltas instead of pixels, which means the next decision does not pay any image decode to know what happened.
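A toy sketch of what "attribute-level deltas" means, assuming a flattened snapshot of nodes keyed by a stable identifier. These types are illustrative, not Fazm's internals:

```swift
// One node in a flattened snapshot of the accessibility tree.
struct AXNode: Equatable {
    let role: String   // e.g. "AXButton", "AXTextField"
    let id: String     // stable identifier across snapshots (assumed)
    let value: String  // e.g. the AXValue of a text field
}

enum Delta: Equatable {
    case added(AXNode)
    case removed(AXNode)
    case valueChanged(id: String, from: String, to: String)
}

// Diff two snapshots: nodes added, nodes removed, attribute values changed.
func diff(_ before: [AXNode], _ after: [AXNode]) -> [Delta] {
    let beforeByID = Dictionary(uniqueKeysWithValues: before.map { ($0.id, $0) })
    let afterByID  = Dictionary(uniqueKeysWithValues: after.map { ($0.id, $0) })
    var deltas: [Delta] = []
    for node in after where beforeByID[node.id] == nil { deltas.append(.added(node)) }
    for node in before where afterByID[node.id] == nil { deltas.append(.removed(node)) }
    for node in after {
        if let old = beforeByID[node.id], old.value != node.value {
            deltas.append(.valueChanged(id: node.id, from: old.value, to: node.value))
        }
    }
    return deltas
}
```

Typing one character produces exactly one `valueChanged` delta; opening a dialog produces one `added` per new child. The cost tracks the size of the change, not the size of the screen.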
Whether that matters to you depends on what you are optimizing. If you are an enterprise with Purview budget, vision-loop cost is a rounding error. If you are one person with an LLM bill you pay yourself, every image encode adds up.
Same task, two execution models
User says 'file my expense report from last week'. Copilot Studio provisions a Cloud PC, signs in with the vault credential, and opens the expense web app inside the Cloud PC. It takes a screenshot; the CUA model reasons over it and returns 'click the New Report button'; the Cloud PC executes the click, takes another screenshot, and the loop continues. Every step logs a screenshot to Purview. The user's actual Mac is only rendering the Cloud PC session.
- Every action pays one screenshot round trip
- Credentials live in Copilot Studio's vault
- All execution happens inside Windows 365
- Purview gets a full replay by design
- Requires an Entra tenant and Copilot Studio license
What the local-on-your-Mac shape actually gets you
No Cloud PC pool in the middle
The agent is not remoting into a Microsoft-hosted Windows VM to drive a copy of the apps you already own. It drives the apps you already have open, on the Mac in front of you.
No Entra / Intune admin
macOS already gates accessibility through TCC. One native permission prompt, scoped to this bundle ID, revocable from System Settings. No tenant, no policy file.
No screenshot per step
The primary read path is the accessibility tree. A screenshot is only captured when the tree is bare (games, some Electron apps), and only for that one step.
Works with any AX-compliant Mac app
Finder, Mail, Calendar, Messages, Notes, Safari, Settings, plus Catalyst apps like WhatsApp. If VoiceOver can read it, the agent can drive it.
The automation engine is open source
mcp-server-macos-use on GitHub, a single Swift binary. Fazm bundles it pre-signed; nothing stops you from running it against Claude Desktop, Cline, or your own MCP client.
Consumer install path
Download, drag to Applications, grant accessibility. No tenant to provision. No Cloud PC pool to attach. No per-seat license to justify.
Pick Copilot Studio when...
- You already have Entra, Intune, Purview, and Power Platform licensing.
- You are replacing RPA scripts at multi-seat scale.
- You need session-level replay audits inside Purview.
- You need deterministic OS image / Chrome version / DPI across runs.
- The target apps are Windows-only anyway.
Pick a local Mac agent when...
- You are one person and the Mac in front of you is the machine you want automated.
- The apps you care about are Finder, Mail, Calendar, Messages, Notes, Safari, Catalyst apps.
- You do not want your credentials re-stored in a third-party vault.
- You would rather pay for the LLM call than the Cloud PC seat.
- You want diagnostics (AX probes, event-tap probes) that only a local process can run.
A note on screenshots: Fazm uses them too, just not every step
The honest version of the "no screenshots" story is that Fazm does capture screenshots; it just does not use them as the primary observation surface. ScreenCaptureManager.swift lines 30 to 43 show the capture path: given a PID, find the frontmost window, call CGWindowListCreateImage with .boundsIgnoreFraming + .bestResolution, save the PNG. That is used when the accessibility tree is genuinely bare (games, Canvas-rendered web content, some Electron apps) and when the user explicitly asks for a visual.
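A sketch of that capture path under those options, reconstructed from the description above (the real code lives in ScreenCaptureManager.swift; the helper name is illustrative):

```swift
import AppKit
import CoreGraphics

// Fallback capture path: given a PID, find its frontmost on-screen window
// and grab one full-resolution image of it, without the window frame.
func captureFrontWindow(of pid: pid_t) -> CGImage? {
    // Enumerate on-screen windows and find the first one owned by the PID.
    let windows = CGWindowListCopyWindowInfo([.optionOnScreenOnly],
                                             kCGNullWindowID) as? [[String: Any]] ?? []
    guard let entry = windows.first(where: {
              ($0[kCGWindowOwnerPID as String] as? Int) == Int(pid)
          }),
          let number = entry[kCGWindowNumber as String] as? Int
    else { return nil }

    // One screenshot, for this step only: best resolution, frame excluded.
    return CGWindowListCreateImage(.null,
                                   .optionIncludingWindow,
                                   CGWindowID(number),
                                   [.boundsIgnoreFraming, .bestResolution])
}
```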
The cost model flips: screenshots are expensive per step, so they fire once when needed, not continuously. The Microsoft model is the opposite: screenshots are the input, so they fire continuously because the pipeline is built around them. Neither is universally better. They just make different trade-offs.
Can you run the engine without Fazm?
Yes. The automation engine is an open source MCP server, github.com/mediar-ai/mcp-server-macos-use. It is a single Swift binary that speaks Model Context Protocol over stdio. Any MCP client (Claude Desktop, Cline, Zed's ACP bridge, your own harness) can drive Mac apps through it. Fazm bundles it at Contents/MacOS/mcp-server-macos-use inside the signed .app so a non-developer gets a one-click install instead of a toolchain.
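For example, pointing Claude Desktop at a copy of the binary is one `mcpServers` entry in `claude_desktop_config.json`, assuming the standard Claude Desktop MCP config shape; the binary path here is illustrative:

```json
{
  "mcpServers": {
    "macos-use": {
      "command": "/usr/local/bin/mcp-server-macos-use"
    }
  }
}
```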
The rough Copilot Studio analog is "CUA exposed as a tool to an agent". The execution difference is the one this whole guide is about: hosted Cloud PC vs your own hardware.
Want the 15-minute version of this, live on your Mac?
I will share a screen, run the AX probe, and walk through what Fazm would and would not automate on your actual desktop. If a Windows 365 Cloud PC is the right answer for you, I will say so.
Book a call →
Frequently asked questions
What is Microsoft's computer use agent, in one paragraph?
Computer use is a capability inside Microsoft Copilot Studio that lets an agent drive a Windows desktop or a web page by reasoning over screenshots. Microsoft provides the execution environment: a Cloud PC pool powered by Windows 365 for Agents, auto-scaling based on workload, with sign-in and admin policies flowing through Microsoft Entra and Microsoft Intune. Under the hood the agent uses a computer-using model (either Anthropic's Claude Sonnet 4.5 or OpenAI's CUA) that takes a screenshot, decides what to click or type, sends the action back, and takes another screenshot. Session replays with screenshots land in Microsoft Purview for traceability. It is a hosted, enterprise-shaped product.
Why does it run in a Windows 365 Cloud PC pool instead of on the user's own machine?
Two reasons, both reasonable for the buyer Microsoft is building for. First, isolation: putting the agent on a user's real workstation means giving the agent the user's sign-ins, tokens, and local permissions, which an enterprise security team cannot audit. A Cloud PC is a disposable, policy-bound surface that can be reset, monitored, and shut down. Second, determinism: Microsoft can guarantee the OS image, the Chrome version, the fonts, and the DPI scale in a Cloud PC. On a real user laptop none of that is stable. The cost is that you are paying for Windows 365 seats for machines that only exist to host the agent, and every user credential the agent needs has to land inside Microsoft's built-in credentials vault.
How does the Copilot Studio agent actually observe the screen?
It takes a screenshot. The CUA model combines vision capabilities with an agent framework that plans and reasons over the image, then proposes the next UI action (click, type, scroll, request a new screenshot). After the action runs inside the Cloud PC, the tool takes a fresh screenshot and the loop repeats. This is why session replays with screenshots work so cleanly in Purview: the screenshots are already the input signal, so logging them costs nothing extra. It is also why the approach is compute-heavy per step: every action pays one image encode, one model call on an image payload, one image decode.
What would it cost me, as one person with a Mac, to use Microsoft's version?
Copilot Studio is licensed through Microsoft 365 and the Power Platform, which is not a one-person consumer product. You would need a tenant with Microsoft Entra, a Copilot Studio license, Windows 365 capacity for the agent host PCs, and an admin (you, if you are the only person) to configure Intune policies and credential vaults. Then you would run the agent against a remote Windows desktop that your Mac is only displaying. If what you actually want is 'drive the apps that are already open on this Mac', you are paying enterprise overhead for a Windows VM that re-creates a worse copy of your own machine.
What does Fazm do differently, in one paragraph?
Fazm is a consumer-facing Mac app, not a SaaS platform. It runs as a signed .app on your own Mac, reads the live macOS accessibility tree of whatever app you point it at, and drives that app directly through the same APIs VoiceOver and Mission Control use. No Cloud PC. No Entra admin. No screenshot on every step. The agent gets a structured accessibility-tree diff back after every click, type, or keypress, so the next decision is planned against attribute-level deltas instead of a picture. Screenshots are a fallback for apps that ignore accessibility (games, some Electron apps), not the primary observation surface.
How does Fazm actually check that accessibility is working when it launches?
It runs a three-line probe. AppState.swift lines 433 to 485: take the frontmost app's process ID, wrap it with AXUIElementCreateApplication, call AXUIElementCopyAttributeValue against kAXFocusedWindowAttribute, then branch on AXError. .success, .noValue, .notImplemented, or .attributeUnsupported all mean the permission works. .apiDisabled means accessibility is globally off. .cannotComplete is ambiguous (some apps genuinely do not implement accessibility), so Fazm re-probes against Finder, which is a known AX-compliant app; if Finder also fails, the permission is truly stuck and Fazm asks the user to quit and reopen. That loop runs on every cold launch, plus on a 5-second retry timer if the permission drifts mid-session. No cloud round trip. No screenshot.
When is the Microsoft approach actually the right answer?
When you are replacing robotic process automation (RPA) inside a large company, you want the Microsoft approach. You already have Entra, Intune, Purview, and Power Platform licensing. You want isolation because the agent is driving customer systems on behalf of thirty employees. You want auditable session replays in Purview. You want a predictable OS image so the same workflow runs the same way for every user. And you do not want the agent driving the sales team's actual laptops. Copilot Studio's computer use is built for that buyer, and it is a very reasonable fit for that buyer.
When is a local Mac-native agent the right answer?
When you are one person, on your own machine, with your own apps already signed in, and what you want is 'automate my own desktop'. That is the consumer shape. You do not want to rebuild your life inside a Windows 365 Cloud PC just to have an agent click the Compose button in Mail. You also do not want to pay per-step vision-model latency on every action when your Mac already knows exactly which AXButton is under the cursor. This is where Fazm fits. It is the local counterpart to Copilot Studio's computer use: same underlying idea (an agent drives a GUI), opposite execution model (your hardware, your accessibility tree, your apps).
Can I run Fazm's automation engine outside Fazm, the same way Copilot Studio uses CUA models through MCP?
Yes. The engine is an open source MCP server, github.com/mediar-ai/mcp-server-macos-use, a Swift binary Fazm bundles at Contents/MacOS/mcp-server-macos-use inside the signed app. Anything that speaks Model Context Protocol (Claude Desktop, Cline, Zed's ACP bridge, your own agent harness) can drive Mac apps through it. Fazm bundles it plus a Claude-powered chat UI plus permission handling plus auto-updater plus audio capture so that a non-developer on a Mac gets a one-click install instead of a toolchain. The Microsoft equivalent of this is CUA-as-a-tool inside Copilot Studio, but the execution is Cloud PC rather than local.
Does Fazm support Windows?
No. Fazm is macOS-only. The whole design premise is that macOS ships a rich, first-class accessibility API that every well-behaved Mac app supports (Finder, Mail, Calendar, Messages, Notes, Settings, Safari, plus Catalyst apps like WhatsApp), and that API is the right observation surface for a local agent. Windows has UI Automation, which is similar in spirit but different in practice. A Windows local agent would be a separate product with a separate tree walker. If your target is Windows, Copilot Studio's computer use running against a Windows 365 Cloud PC is, today, the most mature path.