
The unglamorous tool that makes an AI agent trustworthy is ask_followup

Every article about AI agents asking clarifying questions stops at the idea. Fazm ships it as a named tool in a signed macOS app, with three system-prompt rules that force the agent to read context first, render the question as clickable buttons, and end its turn immediately after asking. This is a walkthrough of how that actually works on a Mac, at the file-and-function level.

Matt Diakonov
10 min read
From the Fazm agent internals series
Tool: ask_followup
Source: ChatPrompts.swift L204-209
100% open source (MIT)

The idea most writing stops at

If you look at what's been written about agents asking clarifying questions over the past year, you get two genres. One is academic: Bayesian experimental design, preference elicitation, information gain. The other is prompt-engineering advice: tell your model to ask when the intent is ambiguous, give it a rubric, hope it listens. Both are useful. Neither tells you what happens on the user's screen when an agent decides to ask.

On a Mac, the answer has to be concrete: does a question appear as plain text? As a modal? Does the agent keep working while the user thinks? Does it remember the answer? Does it know when to NOT ask, because the information is already on screen in the focused window's accessibility tree? Those decisions get baked into one specific tool, and a small set of rules that govern when the model is allowed to call it.

The anchor fact

Fazm defines exactly one tool for this purpose. It is named ask_followup. You can open the file at acp-bridge/src/fazm-tools-stdio.ts and read it on line 406. The input schema accepts a question string and an options array of 2 to 3 button labels. That's the entire surface area.
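To make that surface area concrete, here is a sketch of what the tool definition plausibly looks like. Only the parts stated above are certain: the name `ask_followup`, a `question` string, and an `options` array of 2 to 3 button labels. The JSON-Schema envelope is the common tool-definition convention and may differ from the exact shape in `fazm-tools-stdio.ts`.

```typescript
// Sketch of the ask_followup tool definition. The name, `question`, and
// 2-3 item `options` array come from the article; the surrounding
// JSON-Schema envelope is an assumption based on common tool conventions.
const askFollowupTool = {
  name: "ask_followup",
  description:
    "Ask the user a clarifying question, rendered as a chat bubble with " +
    "clickable buttons. The user can click a button or type a free reply.",
  inputSchema: {
    type: "object",
    properties: {
      question: {
        type: "string",
        description: "Rendered as the chat bubble above the buttons",
      },
      options: {
        type: "array",
        items: { type: "string" },
        minItems: 2,
        maxItems: 3,
        description: "Button labels the user can click",
      },
    },
    required: ["question", "options"],
  },
};
```

Everything the renderer needs (one question, two or three labels) fits in those two fields; there is nowhere for the model to smuggle extra prose.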

acp-bridge/src/fazm-tools-stdio.ts

The interesting part is not the schema. It is the three rules in the system prompt that tell the model when it is allowed to call this, and how the call has to be shaped.

ask_followup MUST be the absolute LAST tool call in your turn, never generate any text, tool calls, or content after it. Your turn ends when ask_followup returns.

ChatPrompts.swift line 207, verbatim

The three rules that make it trustworthy

These are not aspirational guidance. They are in the system prompt that loads on every session, under the heading "CRITICAL", in all caps. Lines 204 to 209 of Desktop/Sources/Chat/ChatPrompts.swift.

Rule 1. Don't repeat yourself

The question parameter IS the rendered chat bubble. Writing it again as plain text makes it appear twice. The prompt says: do NOT also write the question as text, it will appear twice.

Rule 2. No text bullets

Every question or choice MUST use ask_followup. Never present options as plain text bullets, because the user can't click them.

Rule 3. End of turn

ask_followup MUST be the absolute last tool call. No text, no tool calls, no content after it. The turn ends when it returns.

Rule 0. Look before you ask

Before the three rules above kick in, line 133 of the same file instructs the model: prefer looking things up over asking the user. Use browser profile, local files, the database. Only ask when the answer is load-bearing and not already retrievable.

What happens in a single turn

Here is the pipeline a single "the agent decides to ask" moment flows through. The left side is the agent's input: a user message and the accessibility tree of the currently focused window. The hub is the model turn itself. The right side is what reaches the screen.

One clarifying-question turn, end to end

User message + focused window (AX tree) + knowledge graph → model turn → chat bubble + button row → turn ends

Why this isn't just a chatbot asking a question

A clarifying question in a generic chatbot is text. Text is cheap. It invites the user to type a reply, which invites ambiguity, which sometimes means the model then has to ask again. A tool call is structured. It has a fixed shape, a renderer that draws buttons, and a protocol rule that ends the turn. You get fewer round-trips for the same information.

Feature | Generic chatbot prompt | Fazm ask_followup tool
Shape of the question | Free-form text in an assistant message | Structured tool call with question + options fields
How options render | Plain-text bullets the user has to retype | Clickable pill buttons wired to the input
End-of-turn behavior | Model can keep speculating past the question | Turn stops; UI waits for click or typed reply
Anti-duplication rule | None; model often repeats the question | System prompt forbids writing the question as plain text
Context check before asking | Asks whenever uncertainty is felt | Prefers lookup via AX tree, browser profile, local files
Works outside the browser | Confined to the chat surface it lives in | AX tree is read for any focused Mac app

Rows describe Fazm as of the 2.4.x release line and the prompt text in ChatPrompts.swift at the time of writing.

Look-before-ask: the rule nobody else writes down

This is the line that changes the agent's character. Without it, a helpful-by-default model asks too much, because asking is cheap from the model's perspective. With it, the agent treats the user's attention as expensive and tries to answer its own question first.

Line 133 of ChatPrompts.swift, verbatim:

Prefer looking things up over asking the user — use browser profile, local files, or the database when you expect the answer is there. But don't exhaustively check every data source before asking a simple clarifying question.

Concretely, here are the sources Fazm checks before it will fire ask_followup, in order:

Sources the agent consults before asking

  • The focused window's accessibility tree (button labels, field values, window title)
  • The user's knowledge graph (name, language, role, tools, projects)
  • The browser profile (autofill data, saved logins, bookmarks, history)
  • Local files and the on-device database
  • If all four draw a blank, then and only then: ask_followup

The screen-context signal: AX tree, not screenshot

The difference between a screenshot agent and an accessibility agent shows up exactly here. A screenshot-based agent sees pixels: when there are two buttons both visually labeled "Save", it either guesses or asks, because pixels can't disambiguate without OCR and geometry heuristics.

Fazm reads the Mac's accessibility tree through AXUIElement calls. Every button has a role, a label, a parent window title, and bounds. The agent sees the structure directly. Most of the time it can resolve what the user means without asking. When it can't, the ambiguity is still concrete: "I see two buttons labeled Save, one in the Mail compose window and one in the Notes draft. Which one?" That's the exact shape of a good clarifying question, and it's only possible because the read path is structured.

Ask vs proceed, on the AX tree

1. AX tree read: mcp-server-macos-use returns a structured list of elements in the focused window with roles, labels, and bounds.

2. Match against intent: does exactly one element match the user's phrasing? If yes, act. If more than one, continue.

3. ask_followup: build options from the matched candidates (e.g. 'Save in Mail', 'Save in Notes'). Fire the tool. End the turn.
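The decision in those three steps can be sketched over the returned element list. The `AXElement` field names here mirror what the article says the MCP server returns (roles, labels, window titles) but are assumptions, as is the `decide` function itself.

```typescript
// Sketch of the ask-vs-proceed decision. Field names are illustrative
// assumptions; the logic (one match = act, otherwise ask with one button
// per candidate) follows the flow described in the article.
interface AXElement {
  role: string;        // e.g. "AXButton"
  label: string;       // e.g. "Save"
  windowTitle: string; // e.g. "Mail" or "Notes"
}

type Decision =
  | { action: "act"; target: AXElement }
  | { action: "ask_followup"; question: string; options: string[] };

function decide(elements: AXElement[], userPhrase: string): Decision {
  const matches = elements.filter(
    (el) => el.label.toLowerCase() === userPhrase.toLowerCase(),
  );
  if (matches.length === 1) {
    return { action: "act", target: matches[0] }; // unambiguous: just do it
  }
  // Zero or several candidates: surface the ambiguity as clickable options.
  return {
    action: "ask_followup",
    question: `I see ${matches.length} elements matching "${userPhrase}". Which one?`,
    options: matches.map((el) => `${el.label} in ${el.windowTitle}`),
  };
}
```

Note that the options are built from structured fields, which is why the resulting question is concrete ("Save in Mail" vs "Save in Notes") rather than a vague "which one?".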

What the agent actually sends, in JSON

Because ask_followup is a standard tool call, you can see the payload on the wire. It's just a tool-use block. The UI renders the bubble + button row from these fields; no magic.

Example tool-use block
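A plausible payload for one such call looks like this. The `question` and `options` fields follow the schema described earlier; the tool-use envelope (`type`, `id`) follows the common convention for tool calls and the id is illustrative, not taken from the source.

```typescript
// A plausible wire payload for one ask_followup call. Envelope shape and
// id are illustrative assumptions; the input fields match the schema
// described in the article.
const toolUseBlock = {
  type: "tool_use",
  id: "toolu_01_example", // illustrative, not a real id
  name: "ask_followup",
  input: {
    question: "I see two buttons labeled Save. Which one did you mean?",
    options: ["Save in Mail", "Save in Notes"],
  },
};
```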

A minute with Fazm, shell-view

If you were instrumenting a run from the outside, the trace would look like this. The agent tries to resolve on its own, fails to disambiguate, fires ask_followup, ends the turn, waits.

Agent trace

When the agent is required to ask

There is one flow where the prefer-lookup rule is flipped: macOS permission grants. The agent cannot guess whether the user wants to grant microphone access or screen recording, and it can't read the answer out of any data source. So the prompt forces an explicit ask with three canonical options: Grant, Why?, Skip.

Permission grant sequence, from ChatPrompts.swift lines 320-337

1. Explain in one sentence: for each permission, emit a single short message saying what it unlocks. Example for mic: 'Mic access lets you talk instead of typing'.

2. Fire ask_followup with three options: options: ['Grant Microphone', 'Why?', 'Skip']. Turn ends.

3. Handle 'Why?' gracefully: if the user clicks 'Why?', send a 1-sentence reason, then re-ask the same permission with the same three options. The agent never proceeds without a click.

4. On grant, fire the actual permission request: once the user clicks Grant, call the macOS API that triggers the system prompt. On Skip, move on.

The small numbers that matter

3 hard rules in the system prompt
2-3 options per ask_followup call
1 tool dedicated to clarifying questions
0 tool calls permitted after ask_followup

Read the prompt yourself

The entire system prompt is in the open-source repo. It's not hidden behind a server. Clone the repo, open the Swift file, and read the same text the model reads on every session.

Verify locally

Why this design generalizes

If you're building your own agent, the three rules above compose into a pattern you can lift wholesale.

One tool, not prose. Give the model exactly one way to express "I need the user's input before I can continue", and make the tool name obvious enough that the model uses it reliably. Fazm's tool is literally called ask_followup, not askUserQuestion or requestClarification.

Hard end-of-turn. Put "this must be the last tool call" in the prompt in all caps. Enforce it in the renderer if you have to. The model will be tempted to keep working past the question; don't let it.

Prefer lookup to ask. Write down the priority order: which sources the model should consult before asking. Without it, the agent will over-ask. With it, you preserve the user's attention for questions that actually matter.

Compared to the usual advice

Most published guidance stops one level above the implementation. Pattern-matched on what common articles cover vs what Fazm actually does:

What common articles cover: Bayesian elicitation, prompting a rubric for asking, measuring information gain, adding a "confidence threshold", AskUserQuestionTool (Spring AI), "ask before acting" blog posts.
What Fazm actually does: one tool, three prompt rules, an AX-tree read path, and an anti-duplication rule in the prompt.

The theoretical work is real and useful. But a consumer desktop agent has a different constraint: the question has to land visually in a way a non-developer can click through in under a second. That constraint drives the three rules, and those rules drive the tool shape.

The bit that makes Fazm feel competent isn't the model. It's that it knows when to stop and ask, and when to just act. Most agents get that wrong in both directions.
Fazm internals
Observation from building the onboarding flow

Headline takeaway

"AI agent ask clarifying questions" is a behavior, but in Fazm it's also a file path. Two files, actually: ChatPrompts.swift for the rules, and fazm-tools-stdio.ts for the tool schema. Every time the agent pauses to ask, those two files are why. Every time it doesn't pause, line 133 is why. That is the whole mechanism.

Because the source is MIT and the app works with any Mac app through the real accessibility tree (not screenshots), the same pattern works for Mail, Notes, Finder, Xcode, Figma, or a custom internal tool nobody on the internet has ever heard of.

3 rules. 1 tool. 2 files.

Want to see the ask_followup flow running on your Mac?

Book a 15-minute walkthrough. I'll screen-share a live Fazm session, show the prompt and tool definitions side by side, and answer any question about how the AX-tree context flows into the decision to ask.

Frequently asked questions

Where is the clarifying-question tool actually defined in Fazm's source?

Two files. The tool schema lives in acp-bridge/src/fazm-tools-stdio.ts around line 406 under the name ask_followup, with a required question string and an options array of 2 to 3 button labels. The behavioral rules that govern when the agent uses it live in Desktop/Sources/Chat/ChatPrompts.swift. The 'prefer looking things up over asking the user' rule is on line 133 of that file, and the three hard rules about rendering, duplication, and end-of-turn placement are on lines 204 to 209.

Does Fazm guess when it is unsure, or does it always ask?

Neither extreme. The onboardingChat prompt in ChatPrompts.swift explicitly tells the agent to prefer data lookups over questions: try the browser profile, local files, and the user's knowledge graph before firing ask_followup. It only stops and asks when the answer is materially load-bearing and not already retrievable. On the other hand, during permission requests (mic, accessibility, screen recording), the agent must ask; the prompt forbids proceeding without a clicked option.

How is this different from a chatbot just typing a question in the chat?

A chatbot types 'Did you mean A or B?' and waits for the user to type back. Fazm's ask_followup is a tool call with a structured payload: the question renders as a chat bubble AND renders a row of clickable buttons. The system prompt has an explicit anti-duplication rule: the agent must not also write the question as plain text, because the question parameter is already rendered as the chat bubble. Typing the question inline creates a doubled-message bug. The buttons also count as tool output, so the agent's turn ends the moment they render.

Why does the system prompt require ask_followup to be the last tool call of the turn?

Because a clarifying question has no meaningful value unless the next token of work is the user's reply. If the agent asked 'which folder?' and then kept chaining file-system calls before the user answered, it would either act on a guess or spew speculative work that has to be thrown away. The rule 'ask_followup MUST be the absolute LAST tool call in your turn, never generate any text, tool calls, or content after it' hard-stops the model so the UI can wait cleanly for the click or the free-form reply.

How does macOS accessibility context factor into the decision?

Because Fazm reads the focused window through real AX APIs (AXUIElement, kAXFocusedWindowAttribute, kAXRoleAttribute), the agent sees exact button labels, window titles, field names, and element hierarchy before it acts. When the tree is unambiguous (a single 'Send' button in the focused Mail compose window), the agent proceeds without asking. When it is ambiguous (two buttons both labeled 'Save', two Safari tabs matching the user's phrasing, a cannotComplete status from the AX subsystem), the prompt instructs the agent to surface the ambiguity through ask_followup with each candidate as a button option.

Does this work outside the browser, or only for web tasks?

Outside the browser is the point. Fazm bundles mcp-server-macos-use at Contents/MacOS/mcp-server-macos-use inside the app bundle. That binary reads any macOS app's accessibility tree: Finder, Mail, Messages, Settings, Notes, Xcode, Figma, Photoshop, Logic, anything that exposes AX. The ask_followup tool therefore works the same way whether the agent is clarifying which browser tab to open or which email draft to send. Every other published playbook on this topic implicitly assumes a web UI or a terminal; Fazm's implementation is app-agnostic because the read path is.

What happens if the user ignores the buttons and types a free reply?

It still works. The tool description in fazm-tools-stdio.ts states explicitly: 'The user can click a button OR type their own reply.' The buttons are a convenience that lower the cost of a common answer; the chat input stays live. If the user types something unexpected, the agent processes the text as if it had come from a regular chat turn, updates its plan, and either continues or asks a more precise follow-up. The prompt at ChatPrompts.swift line 247 even acknowledges this pattern: 'do NOT use ask_followup, the user will type freely' for steps where an open-ended answer is expected.

Can I see the rules from the system prompt verbatim?

Yes. From Desktop/Sources/Chat/ChatPrompts.swift lines 204 to 209: 'CRITICAL, ask_followup RENDERS THE QUESTION: The question parameter is displayed as a chat bubble above the buttons. Do NOT also write the question as text, it will appear twice. Every question or choice MUST use ask_followup. Never present options as plain text bullets, the user can't click them. ask_followup MUST be the absolute LAST tool call in your turn, never generate any text, tool calls, or content after it. Your turn ends when ask_followup returns.' Line 133: 'Prefer looking things up over asking the user, use browser profile, local files, or the database when you expect the answer is there. But don't exhaustively check every data source before asking a simple clarifying question.'