AI desktop agent · Proactive observer · Gemini tool-use loop

The AI desktop agent that watches itself so it never suggests work it is already doing

Every other page for this keyword describes reactive agents: you type a command, the agent clicks and types. Fazm ships something no SERP page describes. A second agent, a Gemini observer, watches a 60-minute rolling video buffer of your active window, runs an agentic loop with three introspection tools, and refuses to interrupt you with a suggestion that duplicates work the first agent is already performing.

Matthew Diakonov, Written with AI

Published April 20, 2026 · 11 min read


4.9 from 200+ Mac users

Every constant anchored to a specific line in GeminiAnalysisService.swift

Grounded in the 1,181-line public observer source

No marketing, just the six constants and the three tools

The observer runs its own agentic loop

It watches. It queries its own database. It refuses redundant work.

60 minutes of screen video accumulate in a 120-chunk buffer

Gemini gets the video plus three introspection tools

Mandatory SELECTs against observer_activity and chat_messages run first

If the agent is already doing it, the verdict is NO_TASK

Only TASK_FOUND rows ever reach the overlay above the floating bar


The wire values, in one marquee

Every chip below is a constant, table, or verdict that appears verbatim in Desktop/Sources/GeminiAnalysisService.swift. Nothing invented. If a competitor's AI desktop agent does not have these values in its source, it is not running this loop.

model = "gemini-pro-latest" · maxChunks = 120 · targetDurationSeconds = 3600 · inlineSizeLimit = 1_500_000 · retryCooldown = 300 · maxTurns: 5 · query_database(sql) · read_dev_log(lines) · get_active_sessions() · observer_activity · chat_messages · ai_user_profiles · indexed_files · VERDICT: NO_TASK · VERDICT: TASK_FOUND · VERDICT: UNCLEAR · blockedSQLKeywords: 9 · SELECT-or-WITH-only · buffer-index.json · chunks_analyzed

Four numbers that bound the loop

1,181 is the total line count of the observer service. 120 is the chunk ceiling. 3600 is the trigger duration in seconds. 5 is the agentic turn cap. Four numbers are all you need to argue that this is a bounded, auditable loop, not a freewheeling agent.

1,181 lines in GeminiAnalysisService.swift

120 maxChunks ceiling

3600 targetDurationSeconds (60 min)

5 maxTurns in the agentic loop

The core idea

60 min of rolling video, 3 self-introspection tools

A reactive agent answers prompts. A proactive observer watches work that already happened and decides whether to offer to take any of it off the user's plate. The three introspection tools (query_database, read_dev_log, get_active_sessions) are what let it do that without duplicating work the reactive agent has already handled.

Reactive desktop AI agent vs. a desktop AI agent with a self-observing loop

The table below is a toggle between the two mental models. One is how every SERP result describes an AI desktop agent. The other is what Fazm's source actually ships, with the concrete file:line anchors in the right column.

Two definitions of 'AI desktop agent'

Waits for a prompt. Has no knowledge of prior suggestions. Has no knowledge of what the reactive agent is doing right now. Every task suggestion is a best guess from a single moment.

Trigger: one user message
No dedup against prior agent output
No self-awareness of concurrent agent activity
Same suggestion can surface twice in one hour

Anchor fact: six constants, twelve lines

The observer loop is bounded by a short block of constants at the top of GeminiAnalysisService.swift. The constants at lines 67 through 78, plus maxTurns at line 750 (a default argument, enforced in the agentic for-loop at line 760), define every boundary of the loop: the model, the buffer ceilings, the upload split, the cooldown, and the turn cap.

60 min

“private let targetDurationSeconds: TimeInterval = 3600”

Desktop/Sources/GeminiAnalysisService.swift:69

Desktop/Sources/GeminiAnalysisService.swift (lines 67-78)
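As a rough reconstruction of that block, the values quoted on this page can be gathered in one place. This is a sketch rather than the verbatim source: the container name `ObserverLimits` is hypothetical, and in the source maxTurns is a default argument rather than a stored constant.

```swift
import Foundation

// Reconstructed from the values quoted on this page; not the verbatim source.
// `ObserverLimits` is a hypothetical container used only for this sketch.
struct ObserverLimits {
    let model = "gemini-pro-latest"                 // observer model
    let maxChunks = 120                             // hard cap on buffered chunks
    let targetDurationSeconds: TimeInterval = 3600  // 60 minutes of video triggers analysis
    let inlineSizeLimit = 1_500_000                 // bytes; larger chunks use the File API
    let retryCooldown: TimeInterval = 300           // 5 minutes between failed retries
    let maxTurns = 5                                // agentic turn cap
}

let limits = ObserverLimits()
print("Trigger after \(Int(limits.targetDurationSeconds / 60)) min, "
      + "cap \(limits.maxChunks) chunks, \(limits.maxTurns) turns")
```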

Six constants, one bento

Each card is a named constant in the public source. Together they define the observer as a bounded process. If any of them were missing, the loop would be unbounded, uncooperative, or unsafe to ship.

maxChunks = 120

Hard cap on buffered chunks. Oldest are dropped from disk first. Prevents unbounded growth if the 60-minute trigger can't fire.

targetDurationSeconds = 3600

Gate for triggerAnalysis(). bufferedDuration sums endTimestamp - startTimestamp per ChunkEntry.

inlineSizeLimit = 1_500_000

Boundary between inline base64 and Gemini File API upload. Above 1.5 MB, the service uploads and waits for processing via waitForProcessing.

retryCooldown = 300

Five minutes between failed analysis retries. Buffer is preserved on failure so no video is lost during cooldown.

maxTurns = 5

The agentic loop for the observer is bounded to five tool-call turns. Beyond that, the service logs exhausted turns and returns the last raw output.

model = gemini-pro-latest

The observer is not Claude, it is Gemini. The Fazm runtime separates the chat agent (your primary assistant) from the observer agent that watches for unmet work.
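The inlineSizeLimit card describes a size-based fork, which can be sketched in a few lines. The `UploadPath` enum and `uploadPath(for:)` function are hypothetical names for illustration, and treating a chunk of exactly 1,500,000 bytes as inline is an assumption; the page only says larger chunks go through the File API.

```swift
import Foundation

// Hypothetical type for this sketch; the source's own types are not shown on this page.
enum UploadPath: Equatable {
    case inlineBase64(String)  // payload embedded directly in the request
    case fileAPI               // caller uploads via the File API, then waits for processing
}

let inlineSizeLimit = 1_500_000  // bytes, as quoted from the source

/// Chooses how a chunk's bytes travel to Gemini, based on size alone.
func uploadPath(for data: Data, limit: Int = inlineSizeLimit) -> UploadPath {
    if data.count > limit {
        return .fileAPI
    }
    return .inlineBase64(data.base64EncodedString())
}

let smallChunk = Data(repeating: 0, count: 1_000)
let largeChunk = Data(repeating: 0, count: 2_000_000)
print(uploadPath(for: largeChunk) == .fileAPI)  // the 2 MB chunk takes the upload route
```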

What happens to every chunk

A chunk is one short MP4 of whichever app window had focus when it was recorded. Six steps take it from SessionRecordingManager to the observer's final verdict. Every step lives in the public source.

SessionRecordingManager finalizes a chunk

A short video of the active app window lands on disk. The chunk ships with app name, window title and frame count in ActiveAppInfo, so the observer knows which app was focused during which seconds of the clip.

handleChunk copies it to Application Support/Fazm/gemini-analysis/chunks

The original temp file is deleted after the session recorder is done with it, so the service reads the bytes and writes them to a stable path named chunk_<index>_<epoch>.mp4 before the recorder cleans up.

Buffer index is persisted as JSON on every append

ChunkEntry[] is encoded to buffer-index.json via atomic write. This survives app restarts, crashes, and reboots, so 60 minutes of observation is not lost if the Mac goes to sleep mid-buffer.

Buffer is capped at 120 chunks, oldest dropped with its file

The maxChunks = 120 cap is a safety ceiling. If the buffer ever exceeds it (e.g., after a long offline run), the oldest ChunkEntry files are removed from disk and the array is trimmed from the front.

Once bufferedDuration >= 3600s, triggerAnalysis() fires

bufferedDuration is the sum of (endTimestamp - startTimestamp) across every entry. At or above one hour, analysis starts unless the last failure was less than retryCooldown = 300 seconds ago.

On failed analysis, the buffer is kept; on success, only analyzed chunks are cleared

Success removes the Set of analyzedURLs from the buffer and deletes those files from disk. New chunks that arrived during analysis are preserved. Failure sets lastFailedAnalysis = Date() and waits out the 5-minute cooldown before retry.
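The six steps above can be sketched as a small value type. This is a minimal illustration under stated assumptions, not the source: `ChunkBuffer` and its method names are hypothetical, the `ChunkEntry` fields follow the names used on this page, and the 30-second chunks and /tmp paths in the usage lines are made up for the demo.

```swift
import Foundation

// Hypothetical shapes; field names follow this page's description of ChunkEntry.
struct ChunkEntry: Codable, Equatable {
    let fileURL: URL
    let startTimestamp: Date
    let endTimestamp: Date
}

struct ChunkBuffer {
    let maxChunks: Int
    private(set) var entries: [ChunkEntry] = []

    init(maxChunks: Int = 120) { self.maxChunks = maxChunks }

    /// Sum of (endTimestamp - startTimestamp) across every entry.
    var bufferedDuration: TimeInterval {
        entries.reduce(0) { $0 + $1.endTimestamp.timeIntervalSince($1.startTimestamp) }
    }

    /// Appends a chunk and returns the entries evicted from the front, oldest first,
    /// so the caller can delete their files from disk.
    @discardableResult
    mutating func append(_ entry: ChunkEntry) -> [ChunkEntry] {
        entries.append(entry)
        let overflow = max(0, entries.count - maxChunks)
        let evicted = Array(entries.prefix(overflow))
        entries.removeFirst(overflow)
        return evicted
    }

    /// After a successful analysis, drops only the chunks that were analyzed.
    mutating func removeAnalyzed(_ analyzedURLs: Set<URL>) {
        entries.removeAll { analyzedURLs.contains($0.fileURL) }
    }

    /// Persists the index with an atomic write so a crash cannot leave a half-written file.
    func persist(to indexURL: URL) throws {
        let data = try JSONEncoder().encode(entries)
        try data.write(to: indexURL, options: .atomic)
    }
}

// Demo with a tiny cap of 2 so the eviction is visible.
var buffer = ChunkBuffer(maxChunks: 2)
let epoch = Date(timeIntervalSince1970: 0)
for i in 0..<3 {
    buffer.append(ChunkEntry(
        fileURL: URL(fileURLWithPath: "/tmp/chunk_\(i).mp4"),
        startTimestamp: epoch.addingTimeInterval(Double(i) * 30),
        endTimestamp: epoch.addingTimeInterval(Double(i + 1) * 30)))
}
print(buffer.entries.map { $0.fileURL.lastPathComponent }, buffer.bufferedDuration)
```

Returning the evicted entries from `append` keeps file deletion in the caller, which is one way to keep the buffer logic testable without touching disk.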

Anchor fact 2: the three self-introspection tools

Most desktop AI agents expose tools to call out to the world (browser, shell, file system). Fazm's observer has three tools pointed inward. They exist to give Gemini enough context to refuse to suggest work that is redundant with the first agent's recent activity.

3 tools

“"name": "get_active_sessions" — Check if Fazm's AI agent is currently processing any tasks.”

Desktop/Sources/GeminiAnalysisService.swift:555-562

Desktop/Sources/GeminiAnalysisService.swift (lines 529-565)
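To make the tool surface concrete, here is a hedged sketch of how the three tool names could be routed to handlers. Only the tool names and the read_dev_log bounds (default 50, maximum 200) come from this page; the `ObserverTool` enum, the `dispatch` function, the placeholder return strings, and the lower bound of 1 line are assumptions.

```swift
import Foundation

// Hypothetical dispatcher; the three tool names come from the source,
// the handler bodies are placeholders.
enum ObserverTool: String {
    case queryDatabase = "query_database"
    case readDevLog = "read_dev_log"
    case getActiveSessions = "get_active_sessions"
}

/// Clamps the requested tail length: default 50, maximum 200.
/// The lower bound of 1 is an assumption of this sketch.
func devLogLineCount(from args: [String: Any]) -> Int {
    let requested = args["lines"] as? Int ?? 50
    return min(max(requested, 1), 200)
}

func dispatch(name: String, args: [String: Any]) -> String {
    guard let tool = ObserverTool(rawValue: name) else {
        return "error: unknown tool \(name)"
    }
    switch tool {
    case .queryDatabase:
        let sql = args["sql"] as? String ?? ""
        return "would run guarded query: \(sql)"
    case .readDevLog:
        return "would tail \(devLogLineCount(from: args)) log lines"
    case .getActiveSessions:
        return "would report live sessions"  // takes no arguments
    }
}

print(dispatch(name: "read_dev_log", args: ["lines": 500]))
```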

What each tool is for

Five cards: the three tools, the mandatory pre-check SELECTs embedded in the prompt, and the structured verdict format. Together they are what the observer uses to introspect before saying anything.

query_database(sql)

Gemini can run SELECT (or WITH) against fazm.db. The exposed tables are chat_messages, observer_activity, ai_user_profiles, and indexed_files. Nine SQL keywords are denied before the query reaches SQLite: DROP, ALTER, TRUNCATE, CREATE, ATTACH, DETACH, VACUUM, REINDEX, PRAGMA. Multi-statement is also blocked.

read_dev_log(lines)

Gemini can tail Fazm's dev log (max 200 lines, default 50) to see whether the agent is currently in the middle of an ACP session. This is how the observer tells whether automated-looking activity on screen is work the agent is already performing.

get_active_sessions()

No args. Returns whether Fazm's ACP bridge has any live sessions and what tools it recently called. The observer uses this as a second-order signal before suggesting a task: if the agent is already running, most suggestions collapse to NO_TASK.

Mandatory pre-checks in the prompt template

Before Gemini returns a verdict, the prompt forces two SELECTs: observer_activity (last 10 gemini_analysis rows) and chat_messages (last 10 messages). If the proposed task is similar to anything in observer_activity, the prompt tells the model to return NO_TASK.

Verdict format is structured, not free-form

The prompt locks the response to three verdicts: NO_TASK, TASK_FOUND, UNCLEAR. Only TASK_FOUND triggers persistAndShowOverlay, which writes the row into observer_activity and displays the overlay above the floating bar.
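A structured verdict is easy to parse defensively. The sketch below assumes the model's reply contains a line of the form "VERDICT: NO_TASK" (the spelling shown in the marquee); the `Verdict` enum, `parseVerdict`, and `shouldShowOverlay` are hypothetical names, and the real parser in the source may differ.

```swift
import Foundation

enum Verdict: String {
    case noTask = "NO_TASK"
    case taskFound = "TASK_FOUND"
    case unclear = "UNCLEAR"
}

/// Finds the first line of the form "VERDICT: <VALUE>" and maps it to a Verdict.
/// Returns nil when there is no verdict line or its value is not recognized.
func parseVerdict(from response: String) -> Verdict? {
    for line in response.split(whereSeparator: \.isNewline) {
        let trimmed = line.trimmingCharacters(in: .whitespaces)
        guard trimmed.hasPrefix("VERDICT:") else { continue }
        let value = trimmed.dropFirst("VERDICT:".count)
            .trimmingCharacters(in: .whitespaces)
        return Verdict(rawValue: value)
    }
    return nil
}

/// Only TASK_FOUND should reach the overlay.
func shouldShowOverlay(_ verdict: Verdict?) -> Bool {
    verdict == .taskFound
}

let reply = "Agent is already editing this file.\nVERDICT: NO_TASK\n"
print(shouldShowOverlay(parseVerdict(from: reply)))
```

Treating an unparseable reply as nil, and nil as "do not show", keeps a malformed response from ever surfacing a suggestion.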

Inputs, hub, outputs

Three signals go into the observer. One runtime decides the verdict. Three outputs leave. The diagram below is a one-glance view of how the observer interacts with the rest of Fazm.

Observer signal flow

Wire-level trace for one analysis turn

The sequence diagram below traces one 60-minute buffer going through the observer end to end. Three tool calls happen before the verdict; two of them are the mandatory dedup SELECTs.

One observer turn, end to end
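The bounded tool-use loop behind that trace can be sketched with a stubbed model. Everything here except the default of five turns and the `1...maxTurns` iteration is an assumption: `ToolCall`, `ModelReply`, `LoopOutcome`, and `runAgenticLoop` are hypothetical names, and the real request and response shapes are not shown on this page.

```swift
import Foundation

// Hypothetical types for this sketch.
struct ToolCall {
    let name: String
    let args: [String: String]
}

enum ModelReply {
    case toolCalls([ToolCall])  // the model wants tool results before continuing
    case text(String)           // the model produced its final answer
}

enum LoopOutcome: Equatable {
    case finished(String)
    case exhausted              // no final answer within maxTurns
}

/// Runs a bounded tool-use loop. `model` receives the transcript so far and returns a reply;
/// `runTool` executes one tool call and returns its result as text.
func runAgenticLoop(
    maxTurns: Int = 5,
    model: ([String]) -> ModelReply,
    runTool: (ToolCall) -> String
) -> LoopOutcome {
    guard maxTurns >= 1 else { return .exhausted }
    var transcript: [String] = []
    for _ in 1...maxTurns {
        switch model(transcript) {
        case .text(let answer):
            return .finished(answer)
        case .toolCalls(let calls):
            // One turn may carry several tool calls; answer all of them before the next turn.
            for call in calls {
                transcript.append("\(call.name) -> \(runTool(call))")
            }
        }
    }
    return .exhausted
}

// Stub model: asks for one dedup query, then returns a verdict.
let outcome = runAgenticLoop(
    model: { transcript in
        if transcript.isEmpty {
            return .toolCalls([ToolCall(
                name: "query_database",
                args: ["sql": "SELECT content FROM observer_activity LIMIT 10"])])
        }
        return .text("VERDICT: NO_TASK")
    },
    runTool: { _ in "[]" }
)
print(outcome)
```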

Anchor fact 3: the mandatory dedup SELECTs

Dedup is not a post-processing filter. It is enforced in the prompt that the observer reads on every run. Before any verdict, the observer must run two SELECTs and prefer NO_TASK when it sees similar prior tasks. This is the exact text from the prompt template.

analysisPromptTemplate (lines 30-34 inside the multiline string)

Anchor fact 4: the SQL guard

The observer's query_database tool does not trust the LLM. A query containing any of nine SQL keywords is rejected before execution; the query must start with SELECT or WITH; multi-statement input is rejected. The executor also auto-appends LIMIT if missing. This is the whole surface area Gemini gets against the user's local database.

Desktop/Sources/GeminiAnalysisService.swift (lines 569-617)
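A guard with those four rules might look like the sketch below. It is an illustration, not the source: `guardedSQL` and `SQLGuardError` are hypothetical names, the default LIMIT of 100 is a made-up value (the page does not state the real one), and matching keywords as whole words is an assumption chosen so that a column such as createdAt does not trip the CREATE rule.

```swift
import Foundation

enum SQLGuardError: Error, Equatable {
    case blockedKeyword(String)
    case notReadOnly
    case multipleStatements
}

// The nine denied keywords listed on this page.
let blockedSQLKeywords = ["DROP", "ALTER", "TRUNCATE", "CREATE", "ATTACH",
                          "DETACH", "VACUUM", "REINDEX", "PRAGMA"]

/// Validates a model-supplied query and returns the SQL that may be executed.
/// `defaultLimit` is a hypothetical value for this sketch.
func guardedSQL(_ raw: String, defaultLimit: Int = 100) throws -> String {
    let trimmed = raw.trimmingCharacters(in: .whitespacesAndNewlines)
    let upper = trimmed.uppercased()

    // Tokenize into identifier-like words so matching is whole-word, not substring.
    let words = Set(upper
        .split(whereSeparator: { !($0.isLetter || $0.isNumber || $0 == "_") })
        .map { String($0) })
    if let hit = blockedSQLKeywords.first(where: { words.contains($0) }) {
        throw SQLGuardError.blockedKeyword(hit)
    }
    guard upper.hasPrefix("SELECT") || upper.hasPrefix("WITH") else {
        throw SQLGuardError.notReadOnly
    }
    // Split on ';' and reject anything with more than one non-empty statement.
    let statements = trimmed.split(separator: ";")
        .map { $0.trimmingCharacters(in: .whitespacesAndNewlines) }
        .filter { !$0.isEmpty }
    guard statements.count == 1 else { throw SQLGuardError.multipleStatements }

    let statement = statements[0]
    return words.contains("LIMIT") ? statement : "\(statement) LIMIT \(defaultLimit)"
}

let dedupQuery = "SELECT content, status, createdAt FROM observer_activity "
    + "WHERE type='gemini_analysis' ORDER BY createdAt DESC LIMIT 10"
print((try? guardedSQL(dedupQuery)) ?? "rejected")
```

Note that splitting on ';' is deliberately blunt: a semicolon inside a string literal would also be rejected, which errs on the safe side for a tool driven by model output.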

The buffer lifecycle in code

The buffer is capped, persisted, and gate-kept by the six constants above. The 20 lines below are the heart of the trigger logic, lifted verbatim from lines 196 to 215. Everything else in the 1,181-line file is upload, API, parsing, and overlay plumbing.

Desktop/Sources/GeminiAnalysisService.swift (lines 196-215)
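The trigger gate itself reduces to two comparisons. The following is a sketch of the rule as this page describes it, not the verbatim lines: `shouldTriggerAnalysis` is a hypothetical function, and the `isAnalyzing` re-entrancy flag is an assumption added for illustration.

```swift
import Foundation

let targetDurationSeconds: TimeInterval = 3600  // one hour of buffered video
let retryCooldown: TimeInterval = 300           // five minutes after a failure

/// Decides whether analysis should start now.
/// `isAnalyzing` is a hypothetical re-entrancy flag added for this sketch.
func shouldTriggerAnalysis(
    bufferedDuration: TimeInterval,
    lastFailedAnalysis: Date?,
    isAnalyzing: Bool = false,
    now: Date = Date()
) -> Bool {
    guard !isAnalyzing else { return false }
    guard bufferedDuration >= targetDurationSeconds else { return false }
    if let lastFailure = lastFailedAnalysis,
       now.timeIntervalSince(lastFailure) < retryCooldown {
        return false  // still cooling down; the buffer stays intact
    }
    return true
}

let checkTime = Date()
let failedTwoMinutesAgo = checkTime.addingTimeInterval(-120)
print(shouldTriggerAnalysis(bufferedDuration: 4000,
                            lastFailedAnalysis: failedTwoMinutesAgo,
                            now: checkTime))
```

Passing `now` as a parameter keeps the gate a pure function, so the cooldown can be tested without waiting five minutes.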

Three greps anyone can run to verify the page

This guide is not marketing. Every claim is independently checkable. The terminal session below is the simplest way to verify the anchor facts against the public Fazm source tree.

verifying the observer constants and tools

Side by side

Nine rows, each one anchored to a specific constant, tool, or line range in the observer source. The left column is how every SERP page for this keyword defines an AI desktop agent. The right column is what the observer actually does.

Feature	Reactive AI desktop agent	Fazm observer
Trigger	User types a command in a chat box	60 minutes of buffered video duration (targetDurationSeconds = 3600)
Input signal	A single prompt string	Up to 120 video chunks of the active app window, each tagged with app name and window title
Self-awareness	None. The agent does not know it is already running	Observer calls read_dev_log + get_active_sessions to check if the agent is already doing the suggested work
Dedup	Usually none. Same suggestion can surface repeatedly	Mandatory SELECT from observer_activity LIMIT 10 before verdict. Similar tasks return NO_TASK
SQL surface	Full read/write access, or no database access at all	SELECT or WITH only, 9-keyword denylist, single-statement enforced at lines 607-617
Failure handling	Retry immediately, often in a loop	5-minute retryCooldown, buffer preserved on failure, cleaned only on success
Upload path	Often ships full screenshots to a server	Files above 1.5 MB use Gemini File API resumable upload; smaller ones go inline base64
Output surface	Chat message in a window	One row in observer_activity + a Discovered Tasks overlay above the floating bar
Model	Whatever the chat provider uses	gemini-pro-latest, with agentic function calling, maxTurns = 5

Grep-verifiable anchor checklist

Twelve claims, each greppable in the public Fazm tree at the exact line ranges cited. If any item fails, this page is wrong and should be corrected. If all pass, the page is a code tour, not a product pitch.

Twelve grep-verifiable claims

model = "gemini-pro-latest" is declared at GeminiAnalysisService.swift line 67
maxChunks = 120 is declared at GeminiAnalysisService.swift line 68
targetDurationSeconds: TimeInterval = 3600 is declared at GeminiAnalysisService.swift line 69
inlineSizeLimit = 1_500_000 is declared at GeminiAnalysisService.swift line 71
retryCooldown: TimeInterval = 300 is declared at GeminiAnalysisService.swift line 78
maxTurns: Int = 5 is the default arg of callGenerateContentAgentic at line 750, enforced in the for-loop at line 760
query_database, read_dev_log, get_active_sessions are declared in toolDeclarations at lines 533, 544, 555
The blockedSQLKeywords array (9 keywords) is at lines 569-571
The SELECT-or-WITH guard is at line 607
The multi-statement guard splits on ';' and rejects >1 statement at lines 612-617
TASK_FOUND verdicts are persisted to observer_activity and shown as an overlay at lines 273-275
Uploaded Gemini File API files are deleted after each run at the end of runAnalysis

Want to watch the observer refuse a redundant suggestion live?

Book a walkthrough. We show the 60-minute buffer triggering, the three introspection tools firing against a real observer_activity table, and the NO_TASK verdict that follows when the agent is already doing the work.

Frequently asked questions

What makes an AI desktop agent different from a chatbot with a Mac window?

A chatbot in a Mac window runs on a single trigger: the user types something. Fazm's desktop AI agent adds a second runtime that runs on a different trigger (60 minutes of buffered screen video, not a prompt), on a different model (gemini-pro-latest, not the chat model), with different tools (query_database, read_dev_log, get_active_sessions) that let it introspect the first agent before deciding whether to interrupt you. Verify at Desktop/Sources/GeminiAnalysisService.swift lines 67 through 78.

Why does the observer have read_dev_log and get_active_sessions as tools?

So the observer can refuse to suggest work the agent is already doing. The prompt at line 28 says: 'If you see terminal, IDE, or browser activity that looks automated (fast typing, command sequences, file edits happening rapidly), call read_dev_log or get_active_sessions FIRST to check whether Fazm's AI agent is already doing that work. Do NOT suggest automating something that is already being automated.' This is the most common false positive class, and the tools exist to eliminate it.

Why is the observer on Gemini and not the same model as the chat agent?

Gemini's multimodal File API can ingest up to 120 chunks of 1-minute MP4 video in one request. The alternative would be sending 120 frames as separate images, which is worse for temporal reasoning. The model constant at line 67 is gemini-pro-latest; the chat agent is a separate ACP-bridged model chosen by the user. Having a separate observer model is a deliberate split, not a convenience.

What does the observer actually query from the database?

Four tables, read-only, via SELECT or WITH only. Nine SQL keywords are denied (DROP, ALTER, TRUNCATE, CREATE, ATTACH, DETACH, VACUUM, REINDEX, PRAGMA) and multi-statement is blocked. The tables are chat_messages (recent conversation), observer_activity (past discovered tasks, for dedup), ai_user_profiles (the user's inferred profile), and indexed_files (a 500MB-filtered, depth-3 scan of Downloads, Documents, Desktop, Developer, Projects, Code, src, repos, Sites, and /Applications).

How does the agent avoid showing the same suggestion twice?

The prompt at lines 31-34 forces the observer to run SELECT content, status, createdAt FROM observer_activity WHERE type='gemini_analysis' ORDER BY createdAt DESC LIMIT 10 before it forms a verdict. If the proposed task is similar to anything in those 10 rows (same app + same category of work), the prompt requires NO_TASK. It is a dedup enforced at the LLM layer, not a post-hoc filter.

What happens when the Gemini call fails?

lastFailedAnalysis is set to Date(), the buffer is preserved intact, and no chunks are deleted from disk. On the next chunk arrival, the analyzer will see it is still inside the 300-second retryCooldown window and skip the retry. After cooldown, the next chunk that pushes bufferedDuration past 3600s will retry on the same buffer.

Why does Fazm split video chunks above 1.5 MB onto the Gemini File API instead of sending everything inline?

Gemini's generateContent endpoint has a hard inline size limit. The inlineSizeLimit = 1_500_000 constant at line 71 is the split point: smaller chunks are base64-encoded inline; larger ones go through the resumable upload flow and waitForProcessing polls until the file is ready. Uploaded files are deleted after each run so no residue remains on Google's servers.

Does the observer run in the cloud or on device?

The buffer, the chunk files, the buffer index JSON, and the SQL query execution all run on device. Only the MP4 chunks go to the Gemini API, and only for the turn the analysis covers. The uploaded files are deleted server-side after each run. The observer's verdict is written back to the local observer_activity table; the overlay UI reads it from the local SQLite file.

How do I turn the observer off?

handleChunk at line 148 checks UserDefaults.standard.object(forKey: "shortcut_screenObserverEnabled") and returns early if it is false. Toggling that setting disables chunk buffering immediately. Existing chunks are cleaned up by the orphan sweep on next init.
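The early return can be sketched as follows. The key name is the one quoted above; `isObserverEnabled` and `handleChunkSketch` are hypothetical names, and treating a missing value as "enabled" is an assumption, since the page only describes what happens when the value is false.

```swift
import Foundation

let observerEnabledKey = "shortcut_screenObserverEnabled"  // key quoted on this page

/// Returns whether chunks should be buffered.
/// Treating a missing value as "enabled" is an assumption of this sketch.
func isObserverEnabled(in defaults: UserDefaults = .standard) -> Bool {
    guard defaults.object(forKey: observerEnabledKey) != nil else { return true }
    return defaults.bool(forKey: observerEnabledKey)
}

/// Stand-in for the chunk handler: returns early when the observer is switched off.
func handleChunkSketch(in defaults: UserDefaults = .standard) -> String {
    guard isObserverEnabled(in: defaults) else { return "skipped" }
    return "buffered"
}

print(handleChunkSketch())
```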

How many turns can the observer take in one analysis?

Five. The callGenerateContentAgentic function is declared with maxTurns: Int = 5 at line 750 and the for-loop iterates 1...maxTurns at line 760. Each turn can contain multiple tool calls in parallel. If the model does not emit a verdict within five turns, the service logs 'exhausted agentic turns' and returns nil.

What is the overlap with 'Computer Use' style desktop agents?

Computer Use agents are reactive: a command comes in, they plan clicks and keystrokes, they execute. Fazm's desktop agent has a reactive half too (the chat agent), but it pairs it with a proactive observer whose job is to detect unsolicited work on the screen and surface it as a Discovered Task. Both halves share the same fazm.db SQLite file, which is how the observer can see what the reactive agent has been doing.

Is the observer's output private?

Only TASK_FOUND verdicts are persisted. The prompt enforces UNCLEAR as the return value when the video is ambiguous, and explicitly prefers UNCLEAR over hallucinated suggestions. PostHog tracks verdict, chunks_analyzed, tool_call_count, turns_used, and input/output tokens, but not the raw video. The raw chunk files never leave the local Application Support directory except during an active Gemini call, after which uploaded files are deleted.

Adjacent guides on the two halves of a Mac desktop AI agent: the reactive one and the observer.

Keep reading

Architecture

Desktop AI agent, structural primitives

The two Mac primitives (borderless NSWindow at .floating + Carbon RegisterEventHotKey) that separate a real desktop agent from a menu bar app.


Tradeoff

Accessibility API vs screenshots

Why the primary Fazm agent reads the AX tree instead of screenshots, and why the observer is the one exception that uses screen video.


Guide

macOS AI agent development guide

The AX permission flow, the screen recording permission flow, and how to test the observer from a terminal command.

