Local AI desktop agent · macOS · Passive observer

A local AI desktop agent that watches first, proposes one task, then drives any app on your Mac

Most desktop AI agents wait for a prompt, screenshot the screen, and hope the model clicks the right pixel. Fazm runs a rolling 60-minute observer on your active-window video, asks Gemini which single task an AI could actually take off your plate, and then executes it through a Claude Code subprocess that drives apps via the macOS accessibility tree. Every number on this page is pulled from the public Fazm source tree.

Fazm
12 min read
Free Mac app, public source
60-minute rolling observer is a literal constant at GeminiAnalysisService.swift:69
17 bundled Agent Skills auto-installed to ~/.claude/skills/ on first launch
Accessibility-native control works with any Mac app that exposes AX, not only the browser

Four numbers you can verify yourself

These are not rounded. They are Swift constants and a directory count pulled straight out of the public source tree. Clone the repo, open the files, and the numbers match.

60 min: Rolling observation window (targetDurationSeconds = 3600)
120: Max chunks in the active-window buffer (maxChunks)
17: Bundled skills auto-installed to ~/.claude/skills/
1.5 MB: Inline cutoff before Gemini resumable upload kicks in

Sources: Desktop/Sources/GeminiAnalysisService.swift lines 67-71, Desktop/Sources/SkillInstaller.swift, and ls Desktop/Sources/BundledSkills/.

The shape of the observer

60 min of video → 1 task proposal → 0 duplicates

The whole pipeline is designed around one number on the way in and one on the way out. Sixty minutes of active-window clips go up, and at most one actionable task comes back. The de-duplication gate is enforced inside the prompt itself: the model has to SELECT from observer_activity and chat_messages before returning a verdict.

What a local AI desktop agent usually looks like, and what Fazm does instead

The category has converged on a single shape: type a prompt, screenshot the screen, ask a vision model where to click, hope it works. Fazm inverts the direction of information flow. The agent is always watching first; you only intervene when you want to accept or dismiss a proposal.

Feature | Typical local AI desktop agent | Fazm
Trigger model | User types a prompt | Silent 60-minute observer proposes one task at a time
Primary input to the action model | Screenshots of the screen, parsed by a vision model | macOS accessibility tree via AXUIElement (structured text)
App coverage | Usually browser-only or a hardcoded allow-list | Any macOS app that exposes AX, via the generic macos-use MCP
Where state lives | Server-side session, often behind a proxy | Local: fazm.db SQLite, Application Support chunks, ~/.claude/skills/
Where the 'policy' is defined | Black-box system prompt in a hosted agent | 17 markdown skill files in ~/.claude/skills/ you can read and edit
Proactive de-duplication | Not applicable; no proactive loop | Observer SELECTs observer_activity before proposing, returns NO_TASK on overlap
Install and setup | Dev framework, API keys, Docker, or a dashboard signup | Consumer Mac app, one-click installer, onboarding chat walks permissions
Inference location | Varies; some hosted, some local | Cloud for Claude and Gemini, local for everything else

Anchor fact: the literal 60-minute rolling observer

The whole proactive model rides on three Swift constants. They are in the public source. They are small. The interesting thing is the prompt around them, which forces the model to de-duplicate against your local database before proposing anything at all.

3600s

Trigger analysis after 60 min of recordings

Desktop/Sources/GeminiAnalysisService.swift:69 (inline comment next to targetDurationSeconds)

Desktop/Sources/GeminiAnalysisService.swift

Three numbers, one cap, one model, one guardrail prompt. That is the whole observer in about 30 lines of real code. Everything else (the Discovered Tasks UI, the chat flow, the tool invocations) is consequence.
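The three constants the page keeps citing can be collected in one place. This is an illustrative sketch based on the values and line numbers quoted on this page, not a reproduction of the shipped GeminiAnalysisService.swift; the enclosing type name is an assumption.

```swift
// Sketch of the observer constants this page cites from
// Desktop/Sources/GeminiAnalysisService.swift lines 67-69.
// The enclosing ObserverConfig type is a hypothetical container.
import Foundation

enum ObserverConfig {
    static let model = "gemini-pro-latest"                 // analysis model (line 67)
    static let maxChunks = 120                             // rolling buffer cap (line 68)
    static let targetDurationSeconds: TimeInterval = 3600  // trigger analysis after 60 min (line 69)
}
```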

How the observer turns video into one proposal

Five steps from a passive background buffer to a single task you can accept. Everything between your screen and the task row in the UI is local, except for the Gemini call itself.

1

A rolling video buffer accumulates in the background

Fazm records the active window only (not the full screen), in short chunks, and writes them to Application Support. The buffer is capped at 120 chunks and ~60 minutes of wall clock. The metadata for each chunk includes app name, window title, and frame count so the analysis model knows which app you were in.

2

Gemini gets the buffer and the observer prompt

After 60 minutes, Fazm uploads the chunks to Gemini's File API (resumable for anything over 1.5 MB, inline base64 for smaller clips) and runs the observer prompt against gemini-pro-latest. The prompt is explicit about what a good task looks like: concrete, completable, 5x faster for an agent, not a visual-inspection task.

3

Gemini queries your local database before proposing anything

The prompt forces two mandatory SELECTs against fazm.db: recent observer_activity rows and recent chat_messages. If a similar task was already proposed, or the agent is already doing it, the verdict is NO_TASK. This is the de-duplication gate.

4

A single task proposal, or nothing at all

If the model is confident, it returns TASK_FOUND with a title, a 3-to-5 sentence description, and a full markdown recommended-approach document. If not, it returns UNCLEAR (do not guess) or NO_TASK (nothing to do here). The proposal appears in the Discovered Tasks tab inside Fazm as a single row you can expand, discuss, or dismiss.

5

You tap Discuss, and a Claude Code subprocess takes it from there

Accepting a task opens the chat with the task title and description pre-filled. From there, a Claude Code subprocess with shell access and MCP-registered accessibility tools drives whatever apps the task needs. The observer keeps running in parallel, already watching for the next suggestion.
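The de-duplication gate in step 3 can be sketched as a single decision over the rows pulled from the two mandatory queries. The table names, verdicts, and the same-app/same-category similarity rule come from this page; everything else (types, function name, data shapes) is an assumed illustration, not the shipped prompt logic.

```swift
// Hedged sketch of the step-3 de-duplication gate. Verdict names and the
// "same app + same category counts as similar" rule are from this page;
// the Swift types and function are hypothetical.
import Foundation

enum Verdict { case taskFound, noTask, unclear }

struct ProposedTask { let app: String; let category: String }

func dedupVerdict(candidate: ProposedTask,
                  recentlyProposed: [ProposedTask],  // rows from observer_activity
                  inFlight: [ProposedTask]) -> Verdict {  // rows from chat_messages
    // Same app + same type of work counts as similar, even if details differ;
    // when in doubt, the prompt errs toward NO_TASK.
    let overlaps = (recentlyProposed + inFlight).contains {
        $0.app == candidate.app && $0.category == candidate.category
    }
    return overlaps ? .noTask : .taskFound
}
```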

End-to-end dataflow

Three inputs, one decision hub, three outputs. The hub is the observer prompt plus your local SQLite database. The outputs are the three verdicts: a proposal you can act on (TASK_FOUND), a deliberate pass (NO_TASK), or an admitted 'I cannot tell' (UNCLEAR).

Observer pipeline: active-window video + app/window metadata + local fazm.db → Gemini observer → TASK_FOUND | NO_TASK | UNCLEAR

See the observer run on your own machine

Install Fazm, grant accessibility and screen recording, and let it watch for an hour. The first proposal appears in the Discovered Tasks tab. Dismiss it, accept it, or ignore it.

Download Fazm

The 17 skills that ship with the app

When you first launch Fazm, SkillInstaller.swift walks Desktop/Sources/BundledSkills/, picks up every *.skill.md file, and copies it into ~/.claude/skills/<name>/SKILL.md. The current set, straight from the directory listing, one entry per file:

ai-browser-profile, canvas-design, deep-research, doc-coauthoring, docx, find-skills, frontend-design, google-workspace-setup, pdf, pptx, social-autoposter, social-autoposter-setup, telegram, travel-planner, video-edit, web-scraping, xlsx

Because these are flat markdown files, the contract between the app and the agent is readable. If you do not like what web-scraping does, open ~/.claude/skills/web-scraping/SKILL.md and edit it. On next launch, the installer runs a SHA-256 check against the bundled version and only overwrites if the bundle changed. Your edits survive.

The installer, in one function

The function that puts the 17 skills on your machine is in the public source. Nothing is hidden behind a server call. Nothing is pulled from a registry at runtime. The skills ship inside the app bundle and get copied to your home directory.

Desktop/Sources/SkillInstaller.swift
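As a hedged sketch of the behavior described above (auto-discover *.skill.md files, copy into ~/.claude/skills/<name>/SKILL.md, skip the copy when the bundled SHA-256 is unchanged so user edits survive), not the shipped SkillInstaller.swift: the paths and naming convention come from this page, while every identifier, and the idea of a sidecar file holding the last-installed hash, are assumptions.

```swift
// Illustrative sketch of the install behavior, NOT the real SkillInstaller.
// The .bundled-sha256 sidecar is a hypothetical way to remember which
// bundled version was last installed.
import Foundation
import CryptoKit

func installBundledSkills(from bundleDir: URL, to skillsDir: URL) throws {
    let fm = FileManager.default
    for file in try fm.contentsOfDirectory(at: bundleDir, includingPropertiesForKeys: nil)
    where file.lastPathComponent.hasSuffix(".skill.md") {
        let name = String(file.lastPathComponent.dropLast(".skill.md".count))
        let skillDir = skillsDir.appendingPathComponent(name)
        let dest = skillDir.appendingPathComponent("SKILL.md")
        let hashFile = skillDir.appendingPathComponent(".bundled-sha256")

        let bundled = try Data(contentsOf: file)
        let newHash = SHA256.hash(data: bundled).description

        // If the bundled copy has not changed since the last install,
        // skip the copy so any user edits to SKILL.md survive.
        if let old = try? String(contentsOf: hashFile, encoding: .utf8), old == newHash {
            continue
        }
        try fm.createDirectory(at: skillDir, withIntermediateDirectories: true)
        try bundled.write(to: dest)
        try newHash.write(to: hashFile, atomically: true, encoding: .utf8)
    }
}
```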

What the log looks like on first run

Captured from /tmp/fazm-dev.log on a fresh install. The observer boots, the skill installer runs, the permission probes report in, and the rolling buffer starts filling.

fazm-dev.log

Why the consumer-app shape matters

Local AI desktop agents tend to ship as developer frameworks, CLI tools, or demo pages with a cloud backend. Fazm is built around the opposite assumption: the person running the agent never wants to touch a terminal. Six choices follow from that assumption.

Passive observer, not a prompt box

The agent does not wait for you to describe what you want. It watches your active-window video in 120-chunk rolling buffers and proposes work when it sees a pattern.

Accessibility-native control

Actions run through AXUIElementCreateApplication on the macOS AX tree. Structured role+label input, sub-second per action, model-neutral.

Skills you can read

17 markdown skill files get installed to ~/.claude/skills/ on first launch. Open them in any editor. The installer uses SHA-256 to preserve your edits.

Any Mac app, not only the browser

Native SwiftUI, AppKit, and Catalyst apps are covered by the generic macos-use MCP. Playwright, whatsapp, google-workspace fill in the non-AX gaps.

De-duplication built into the prompt

Before a verdict, Gemini must SELECT from observer_activity. If a similar task was already proposed, the answer is NO_TASK.

Consumer app, not a framework

No API keys to paste at install time, no Docker, no CLI. Download, grant accessibility and screen recording, done.
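The accessibility-native card above can be made concrete with the public AX C API. AXUIElementCreateApplication and AXUIElementCopyAttributeValue are the real macOS calls; the traversal below is a minimal illustration of the structured role+label input the page describes, not the macos-use MCP server's implementation, and it requires the Accessibility permission to return anything.

```swift
// Minimal sketch of reading an app's role+label tree via the public AX API.
// Illustrative only; not Fazm's macos-use server.
import ApplicationServices

func dumpAXTree(_ el: AXUIElement, depth: Int = 0) {
    func attr(_ name: String) -> String? {
        var value: CFTypeRef?
        AXUIElementCopyAttributeValue(el, name as CFString, &value)
        return value as? String
    }
    // Structured role + title, instead of raw screenshot pixels.
    print(String(repeating: "  ", count: depth),
          attr(kAXRoleAttribute) ?? "?",
          attr(kAXTitleAttribute) ?? "")

    var children: CFTypeRef?
    AXUIElementCopyAttributeValue(el, kAXChildrenAttribute as CFString, &children)
    for child in (children as? [AXUIElement]) ?? [] {
        dumpAXTree(child, depth: depth + 1)
    }
}

// Usage: dumpAXTree(AXUIElementCreateApplication(targetPid))
```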

The thing no other 'local AI desktop agent' page says

A desktop agent that watches first is a fundamentally different product from one that waits for a prompt

Prompt-driven agents reward users who already know what they want and can articulate it. The problem is that most of the tasks worth automating on a Mac are invisible to the user until someone points them out: three copy-paste steps between Notion and Linear, the same weekly spreadsheet refresh, the Google Workspace sync that drifts by 15 minutes because nobody remembered to rerun the script. A prompt box cannot surface any of those. A passive observer can.

The other half of the story is execution. Once a task is identified, Fazm runs a Claude Code subprocess with shell, filesystem, and accessibility access. The policy for how to act in any given app lives in 17 markdown files in ~/.claude/skills/, which you can read, edit, and audit. That combination (a passive proposer plus a legible executor) is the shape of the page you are reading. It is also the shape of the public source tree.

Frequently asked questions

What actually makes Fazm a local AI desktop agent and not just another chat window?

Two things. First, the control plane runs on your Mac: a Swift app holds the permission state, a Node-based ACP bridge invokes tools, a Claude Code subprocess does the thinking, and an MCP server calls macOS accessibility APIs on your real apps. Nothing routes through a server to get back to you. Second, the trigger is local: a rolling 60-minute active-window video buffer is analyzed by Gemini in the background to propose one concrete task you could hand to the agent. The inference endpoints themselves (Claude, Gemini) are still cloud-hosted, but every byte of state about what you are doing lives on your machine until you accept a suggestion.

Where does the 60-minute number actually come from?

GeminiAnalysisService.swift:69 in the public Fazm source sets targetDurationSeconds: TimeInterval = 3600 with the inline comment 'Trigger analysis after 60 min of recordings'. The same file, line 68, caps the rolling buffer at maxChunks = 120 (which translates to 120 roughly 30-second focused-window clips). The model used for analysis is gemini-pro-latest (line 67). These are literal Swift constants, not marketing language.

How does Fazm decide which task to propose?

The prompt at GeminiAnalysisService.swift:12 asks Gemini for the ONE most impactful task an AI agent could take off the user's plate, and forces it to run two queries before deciding: a SELECT against observer_activity to avoid proposing work it already proposed, and a SELECT against chat_messages to avoid proposing work the user is already having the agent do. If the suggestion overlaps with a previous task's app and category, the verdict returned is NO_TASK. If the video is too blurry or the user's intent is unclear, the verdict is UNCLEAR. Only a high-confidence match returns TASK_FOUND with a description and a markdown recommended-approach document.

What actually happens when I accept a proposed task?

Fazm spawns a Claude Code subprocess. That subprocess has shell access, file system access, and a registered set of MCP servers (macos-use for accessibility-driven app control, whatsapp, google-workspace, playwright, and any you install). It also has access to the skills in ~/.claude/skills/ that the Fazm installer dropped there on first launch. When the agent decides it needs to click a button in Mail or type into Notes or drive a browser, it does that through the accessibility tree exposed by the macos-use MCP server. The result is a desktop agent that can touch any Mac app that exposes AX, not just the browser.

What are the 17 bundled skills, and why does that matter?

SkillInstaller.swift auto-discovers every *.skill.md file in Desktop/Sources/BundledSkills/ and copies it into ~/.claude/skills/<name>/SKILL.md. The current list is: ai-browser-profile, canvas-design, deep-research, doc-coauthoring, docx, find-skills, frontend-design, google-workspace-setup, pdf, pptx, social-autoposter, social-autoposter-setup, telegram, travel-planner, video-edit, web-scraping, xlsx. Each is a markdown file with YAML frontmatter that Claude Code reads at the start of a session. That means your 'agent' does not rely on a black-box policy. You can open the file, read what it does, edit it, delete it. The installer runs a SHA-256 checksum compare on every launch and only overwrites files whose bundled version actually changed.

Is the inference local? Do I need a GPU or Ollama?

No. The Claude Code subprocess uses your Anthropic account or subscription to hit api.anthropic.com. The observer's task-discovery analysis calls Gemini via Google's Generative AI endpoint. The 'local' part is that the agent loop, the permission state, the screen recording buffer, the active-window metadata, the fazm.db SQLite database, and the entire set of tools run on your Mac. No proxy server in between. If you prefer on-device inference, the Claude Code side can be pointed at a compatible local endpoint; the observer currently depends on Gemini's multimodal video input.

How is this different from Anthropic's computer use or Simular or Manus?

Those three are prompt-driven: you type an instruction, the agent screenshots, the model reads pixels, it clicks, repeat. Fazm flips that direction. It watches silently and proposes work, which means you get suggestions for tasks you would not have thought to ask about. It also uses the macOS accessibility tree as the primary input for actions, not screenshots, which is faster and gives the model a structured role+label tree rather than raw pixels. Finally, Fazm is shipped as a consumer Mac app with a one-click installer, not a framework or a playground. There is no SDK step.

Where does the screen recording data live?

Active-window video chunks are written to Application Support on your Mac (chunksDir in GeminiAnalysisService.swift, line 81) and survive app restarts via a JSON buffer index. Older screenshots are cleaned up after a configurable retention window in ScreenCaptureManager.swift. The discovered tasks themselves are stored in the observer_activity table inside fazm.db, which is a local SQLite file. When analysis fires, the relevant chunk files are uploaded to Gemini for multimodal inference and then remain on disk until the retention cleanup runs or the buffer wraps.

Can I run Fazm without granting screen recording?

Yes, but the proactive-task observer is the thing that needs screen recording plus accessibility. Without screen recording, Fazm still works as a reactive agent: you can open the floating bar with Cmd+\, hit your configured Ask Fazm shortcut, and have the agent drive apps for you. Without accessibility, the agent cannot click buttons in external apps and falls back to shell and filesystem operations only. The two permissions unlock different modes, and the onboarding flow walks through each separately.

How does Fazm avoid proposing the same task over and over?

The analysis prompt at GeminiAnalysisService.swift:30-34 makes two mandatory SELECT queries before deciding: one against observer_activity (the discovered_tasks log) and one against chat_messages (the user's recent agent conversations). The instruction is explicit: 'If the observer_activity query returns tasks that are similar to what you would suggest (same app, same type of work, same general activity), return NO_TASK.' It goes further and treats same-app same-category as similar, even when the details differ, and errs toward NO_TASK when in doubt. The result is that after the first week, you get fewer suggestions, not more, because the system compounds its knowledge of what you have already been offered.

What is the floor on 'any app on your Mac'? Are there apps that do not work?

Apps that expose the macOS accessibility tree work out of the box through the generic macos-use MCP server (click_and_traverse, type_and_traverse, scroll_and_traverse, press_key_and_traverse, refresh_traversal). That covers nearly every native SwiftUI, AppKit, and Catalyst app. The soft spots are Electron apps with partial AX trees, some Chromium-based tools, and very old pre-notarized apps. For those, Fazm ships specific MCP servers that bypass AX: playwright for browser automation, whatsapp for the WhatsApp Catalyst app, google-workspace for REST-based Google surfaces. If an app genuinely exposes no AX and no documented API, the agent falls back to shell commands or explains the limit instead of hallucinating an action.

Is Fazm open source and can I audit these numbers myself?

Yes. The Fazm desktop source tree is public. The 60-minute constant is at Desktop/Sources/GeminiAnalysisService.swift line 69. The 120-chunk cap is at line 68. The bundled skill installer is at Desktop/Sources/SkillInstaller.swift. The 17 bundled skills are files in Desktop/Sources/BundledSkills/. The accessibility probe code (Finder fallback + CGEvent tap) is at Desktop/Sources/AppState.swift lines 468 and 490. You can build the app yourself and verify that the observer actually runs the rolling buffer the way this page describes. The consumer app is free to install; the paid tier exists for extended usage and priority support.

Let an AI watch an hour of your Mac and tell you what to hand off

Free Mac app. Accessibility-native control of any app on your Mac. 17 skills you can read and edit. The observer runs in the background and proposes one task per window. Dismiss it, accept it, move on.

Download Fazm
fazm: AI Computer Agent for macOS
© 2026 fazm. All rights reserved.
