Field notes from one shipping macOS agent

Voice agent to desktop workflow handoff: the three code paths nobody describes

Most writing on voice agents stops at the moment of input: how the microphone opens, how the transcript lands, how the model picks a tool. That is the easy part. The interesting part is the seam between voice input and a long-running desktop run that is already executing when the human opens their mouth again. Below are the three concrete paths that seam takes in one shipping macOS agent, the methods that implement each, and what state survives the handoff.

Matthew Diakonov

Direct answer (verified 2026-05-07)

A handoff is not one operation. It is three. While a desktop run is in flight, a new voice utterance can take any of three paths: enqueue without interrupt and drain when the current run finishes, interrupt and replace by cancelling the current tool call and pushing the new message to the front of the queue, or stop without replace and keep the partial output on screen. All three are scoped to a session key so a sibling pop-out window’s run is never killed.

Source: Desktop/Sources/Providers/ChatProvider.swift, three methods around lines 2652, 2718, and 2734. The push to talk machine sits in Desktop/Sources/FloatingControlBar/PushToTalkManager.swift.

The shape of the seam

Picture the runtime as two coupled systems. On one side a four-state push to talk machine in PushToTalkManager.swift turns the Option key plus a microphone into a final transcript. On the other a chat provider holds the conversation state, drives an Agent Client Protocol bridge, and issues tool calls into MCP servers that walk the macOS accessibility tree, drive the browser, and read Google Workspace. The handoff is the small piece of code that decides what to do with a new transcript when the chat provider is mid-run.

The naive answer is “disable the mic until the run is done.” That loses the whole point of a voice agent. People narrate. They notice the run is heading the wrong way ten seconds into a thirty-second tool chain and want to redirect it without losing what already happened. They want to drop a follow-up while the agent is still working, knowing it will be picked up next. They want to abort and read what got done. All three are legitimate. All three need a different code path.

The three methods below sit on a single Swift class, ChatProvider, in Desktop/Sources/Providers/ChatProvider.swift. Their names are the API surface; their bodies are where the interesting decisions live. I will quote enough of each that you can search for them in the public repo.

Path 1: enqueue without interrupt

Trigger: user speaks, target session is busy, user (or UI default) chose queue mode. Effect: the new utterance lands as a queue chip in the floating bar, the current tool call keeps running, and the moment the run completes the natural drain path picks the next item up.

// Desktop/Sources/Providers/ChatProvider.swift
/// Enqueue a message to be sent after the current query finishes.
/// Does NOT interrupt the current query — it will be picked up automatically.
/// Pass the caller's `sessionKey` so the message runs on the correct session
/// (not the currently-active one, which may belong to a different window).
func enqueueMessage(_ text: String, sessionKey: String? = nil) {
    let trimmedText = text.trimmingCharacters(in: .whitespacesAndNewlines)
    guard !trimmedText.isEmpty else { return }
    pendingMessages.append((
        text: trimmedText,
        sessionKey: sessionKey ?? activeSessionKey,
        userMessageAdded: false
    ))
    log("ChatProvider: message enqueued (\(pendingMessages.count) pending)...")
}

The important detail is the second argument. The queue is keyed by session, not globally. A second floating bar in another window can be running its own tool call on its own session, and the queue entry remembers which one it belongs to. When that session’s drain path fires, it filters by sessionKey and pulls only messages tagged for it. This is what makes concurrent runs in two windows actually concurrent, rather than one global queue everyone fights over.

The other detail is what is missing: there is no call to acpBridge.interrupt. The current tool call is untouched. The browser session, the macOS app the accessibility tree was reading, the partial assistant text already streamed back, all of it stays where it is. Queue mode is the minimum-violence option.
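To make the session-keyed queue and its drain filter concrete, here is a minimal sketch under assumed names: `QueueSketch` and `drainNext(for:)` are illustrative inventions; only `pendingMessages` and the tuple shape mirror the source.

```swift
import Foundation

// Illustrative stand-in for the queued tuple described above.
struct PendingMessage {
    let text: String
    let sessionKey: String?
    let userMessageAdded: Bool
}

final class QueueSketch {
    private(set) var pendingMessages: [PendingMessage] = []

    /// Append without touching any in-flight run: the minimum-violence path.
    func enqueue(_ text: String, sessionKey: String?) {
        let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)
        guard !trimmed.isEmpty else { return }
        pendingMessages.append(
            PendingMessage(text: trimmed, sessionKey: sessionKey, userMessageAdded: false))
    }

    /// When a run on `sessionKey` finishes, pull only messages tagged for it,
    /// leaving a sibling window's queued messages alone.
    func drainNext(for sessionKey: String?) -> PendingMessage? {
        guard let idx = pendingMessages.firstIndex(where: { $0.sessionKey == sessionKey })
        else { return nil }
        return pendingMessages.remove(at: idx)
    }
}
```

Two windows can enqueue interleaved messages, and each drain pass sees only its own tags; that filter is the whole trick.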

Path 2: interrupt and replace

Trigger: user speaks, the run is wrong, they want it cancelled and replaced now. Effect: the current ACP tool call is unwound, partial assistant text is preserved, the new message is inserted at the front of the queue, and the drain path runs it on the same session immediately.

// Desktop/Sources/Providers/ChatProvider.swift
/// Interrupt the current query and send a message immediately.
/// If the AI is idle, sends the message directly without interrupting.
func interruptAndSend(_ text: String, sessionKey: String? = nil) async {
    let trimmedText = text.trimmingCharacters(in: .whitespacesAndNewlines)
    guard !trimmedText.isEmpty else { return }
    let targetKey = sessionKey ?? activeSessionKey

    // Drop any duplicate queued copy of the same message on this session
    if let existingIdx = pendingMessages.firstIndex(where: {
        $0.text == trimmedText && $0.sessionKey == targetKey
    }) {
        pendingMessages.remove(at: existingIdx)
    }

    // If THIS session isn't busy, send directly as a follow-up
    let targetBusy: Bool = {
        if let key = targetKey { return sendingSessionKeys.contains(key) }
        return isSending
    }()
    if !targetBusy {
        await sendMessage(trimmedText, isFollowUp: false, sessionKey: targetKey)
        return
    }

    // Insert at front of queue, then interrupt this one session only
    pendingMessages.insert((
        text: trimmedText,
        sessionKey: targetKey,
        userMessageAdded: true
    ), at: 0)
    if let key = targetKey {
        await acpBridge.interrupt(sessionKey: key)
    } else {
        await acpBridge.interrupt()
    }
}

Three things in here are easy to miss. First, the dedup. If the user tapped “Send Now” on a message that was already in the queue chip, the method removes the queued copy first so the bridge does not execute the same instruction twice. Second, the target-busy check is per session. The global isSending flag is true if any session is sending; what matters here is whether this one is. A busy sibling pop-out should not push this session through the interrupt path when this session is actually idle.

Third, the interrupt is scoped. The session-keyed call to acpBridge.interrupt(sessionKey:) is the difference between cancelling one tool call and killing every concurrent run in the app. A user who has two pop-out chats running and interrupts one of them should still be able to watch the other one finish.
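A toy model makes the scoping visible. The real ACP bridge sends a cancel over the protocol channel; here a per-session set of in-flight keys stands in for it, and `BridgeSketch`, `begin`, and `isRunning` are invented names.

```swift
// Sketch only: models which sessions have a running tool call, so the
// difference between a scoped and a global interrupt is testable.
final class BridgeSketch {
    private var inFlight: Set<String> = []   // session keys with a running tool call

    func begin(sessionKey: String) { inFlight.insert(sessionKey) }

    /// Scoped interrupt: unwinds only the named session's tool call.
    func interrupt(sessionKey: String) { inFlight.remove(sessionKey) }

    /// Global interrupt: the "stop everything visible" gesture.
    func interrupt() { inFlight.removeAll() }

    func isRunning(_ sessionKey: String) -> Bool { inFlight.contains(sessionKey) }
}
```

The overload pair mirrors the call sites in the article: interrupting one pop-out leaves its sibling untouched; the no-argument variant takes everything down.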

Interrupt and replace, same session

  • Mic → ChatProvider: transcript "cancel, do X instead"
  • ChatProvider: remove dup from pendingMessages
  • ChatProvider: insert X at front of queue
  • ChatProvider → ACP bridge: interrupt(sessionKey)
  • ACP bridge → MCP tool: cancel in-flight tool call
  • MCP tool → ACP bridge: partial output
  • ACP bridge → ChatProvider: ack interrupt
  • ChatProvider: drain path picks front of queue
  • ChatProvider → ACP bridge: send X on same session

Path 3: stop without replace

Trigger: user does not say anything new. They tap the stop control or press a global shortcut. Effect: the current run is cancelled, partial output stays on screen, the queue is preserved, the upstream session ID is not rolled.

// Desktop/Sources/Providers/ChatProvider.swift
/// Stop the running agent, keeping partial response
func stopAgent() {
    guard isSending else { return }
    isStopping = true
    pendingCountAtStop = pendingMessages.count
    log("ChatProvider: user stopped agent, sending interrupt...")
    Task { await acpBridge.interrupt() }
    // Result flows back normally through the bridge with partial text
}

/// Stop the running agent for a specific session only.
/// Other concurrent sessions continue.
func stopAgent(sessionKey: String) {
    guard sendingSessionKeys.contains(sessionKey) else { return }
    Task { await acpBridge.interrupt(sessionKey: sessionKey) }
}

The reason there are two methods is the same reason interruptAndSend has a sessionKey argument: a stop on one window’s run should not touch another window. The global stopAgent is the “everything I have visible right now” gesture; the keyed variant is the per-session button on a specific floating bar.

The pendingCountAtStop snapshot is a small thing that matters. After the bridge unwinds and posts the partial response, the provider knows how many queued items existed at the moment of stop. If new items were added during the unwind (the user kept talking), they are not silently flushed. The user explicitly chose to stop the run, not the queue.
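A sketch of why the snapshot matters, with invented names around the real `pendingCountAtStop`; assume the bridge calls back once the unwind completes.

```swift
// Sketch only: stop snapshots the queue size instead of flushing the queue,
// so items the user queued during the unwind are still there afterward.
final class StopSketch {
    var pendingMessages: [String] = []
    private(set) var pendingCountAtStop = 0
    private(set) var isStopping = false

    /// User tapped stop: snapshot the queue size, do not flush anything.
    func stopAgent() {
        isStopping = true
        pendingCountAtStop = pendingMessages.count
    }

    /// Bridge posted the partial response. Anything queued during the unwind
    /// (the user kept talking) is still present and still ordered.
    /// Returns whether new items arrived while the run was unwinding.
    func onUnwindComplete() -> Bool {
        defer { isStopping = false }
        return pendingMessages.count > pendingCountAtStop
    }
}
```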

What survives a handoff

The single biggest source of bugs in this part of the system is treating an interrupt like a process kill. It is not. Cancelling an ACP tool call unwinds the in-flight call and nothing else. Everything below sticks around and is the reason the user can pick up where the run left off rather than starting over.

Preserved across an interrupt

  • Partial assistant text already streamed to the UI
  • Persisted message rows in the chat backend
  • Concurrent sessions in other floating bars or pop-outs
  • MCP server processes (macos-use, playwright, fazm_tools)
  • Open browser tab and accessibility tree cache
  • Upstream session ID and the session chain in UserDefaults
  • The pending queue, except the message just consumed

The one thing thrown away is the specific tool call that was mid-flight. If the model was three steps into a five-step run, the first three steps and any partial output of the fourth are still visible, the run is just paused. This is what lets a user say “wait, do that with the August invoice instead” without losing the July invoice work that already finished.

The voice side, in three constants

A handoff is only as clean as the voice machine producing the trigger. The push to talk side runs a four-state machine in PushToTalkManager.swift: idle, listening, lockedListening, finalizing. Three constants in that file carry most of the ergonomics:

// Desktop/Sources/FloatingControlBar/PushToTalkManager.swift

// Double-tap detection
private let doubleTapThreshold: TimeInterval = 0.4    // line 40

// Safety: max recording duration to prevent stuck PTT (5 minutes)
private let maxPTTDuration: TimeInterval = 300        // line 65

// Debounce: minimum interval between PTT activations to prevent
// rapid start/stop cycling that can crash the audio subsystem.
private let pttDebounceInterval: TimeInterval = 0.5   // line 71

The 0.4 second threshold is what separates “hold to talk” from “double tap to lock.” Tap Option twice within 400 ms and you are in lockedListening, which is the right mode for narrating a follow up while the desktop run is going. Hold Option for longer than 400 ms and you are in plain listening, which finalizes the moment you let go.

The 5-minute cap is a safety against a stuck modifier; the 500 ms debounce stops a fluttering Option key from cycling the audio subsystem, which on earlier builds could deadlock CoreAudio. None of these are exotic numbers, but they are the difference between a voice input that feels disposable and one that you can actually run all day alongside a busy desktop agent.
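The tap logic can be sketched as a pure state machine with an injected clock so the 400 ms threshold is testable. The state names match the four-state machine described above; everything else (`PTTSketch`, the method names, the exact tap-detection scheme) is an illustrative reconstruction, not the real event-handling code.

```swift
import Foundation

enum PTTState { case idle, listening, lockedListening, finalizing }

final class PTTSketch {
    let doubleTapThreshold: TimeInterval = 0.4   // mirrors the source constant
    private var downTime: TimeInterval = -.infinity
    private var lastTapUp: TimeInterval = -.infinity
    private(set) var state = PTTState.idle

    func keyDown(at now: TimeInterval) {
        switch state {
        case .idle:
            // A second tap within 400 ms of the first tap's release locks the mic.
            state = (now - lastTapUp < doubleTapThreshold) ? .lockedListening : .listening
            downTime = now
        case .lockedListening:
            state = .finalizing              // a further tap ends locked mode
        default:
            break
        }
    }

    func keyUp(at now: TimeInterval) {
        guard state == .listening else { return }
        if now - downTime < doubleTapThreshold {
            state = .idle                    // quick tap: wait for a possible second tap
            lastTapUp = now
        } else {
            state = .finalizing              // plain hold-to-talk: release finalizes
        }
    }
}
```

Held past the threshold, release finalizes; two quick taps land in lockedListening, the right mode for narrating alongside a running desktop agent.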

The handoff is not the voice. It is the queue, the interrupt, and the partial. Everything else is plumbing.

From the field notes that produced this guide

When each path is the right one

Use enqueue when the new utterance is a follow up, not a correction. The run is doing the right thing, you just want to add “and then file this in the 2026 folder” without breaking flow. The user experience is a queue chip with a count badge; the runtime experience is a tuple appended to pendingMessages.

Use interrupt and replace when the run is on the wrong path. The user said “the August invoices” and saw the agent open July; they want to redirect now. The user experience is “send now” on a queued message or a long press of the PTT key with a configured interrupt mode; the runtime experience is a single acpBridge.interrupt(sessionKey:) plus a front-of-queue insert.

Use stop when the user wants to think. They saw enough partial output to know the answer or the next step is not what they expected, and they want to read what got done before saying anything else. The user experience is a stop button on the bar; the runtime experience is the same interrupt without the queue insert.

Run this on your own Mac

Fazm is a free, open source macOS agent. Talk through what you want to automate and we will walk through the queue, interrupt, and stop paths on a real workflow of yours.

Frequently asked questions

What does a voice agent desktop workflow handoff actually mean?

It is the seam between two systems that share one human. The voice agent owns the moment of input: a microphone, a push to talk key, a transcript. The desktop workflow owns the long tail of execution: tool calls against real applications, partial output streaming back, file writes, network requests. A handoff is what happens at the edge between those two systems, and there is more than one kind. The three that matter in practice are: speaking again while the desktop run is still going and wanting it queued, speaking again and wanting the current run cancelled, and stopping the run without saying anything new.

Why is this different from sending a follow-up prompt in a chat box?

Because the chat-box mental model assumes the agent is idle between turns. A computer-use agent is rarely idle. A real workflow run can spend thirty seconds opening four apps, reading the screen via accessibility APIs, and chaining tool calls. If the user dictates a new instruction during that window, the runtime has to decide what to do with it. Most chat UIs disable the input until the model is done, which is a third option this guide does not endorse: it strands the user in a watch-it-execute mode and removes the ability to course correct.

What does the queue path do, in code?

ChatProvider.enqueueMessage(_:sessionKey:) at Desktop/Sources/Providers/ChatProvider.swift around line 2718 appends a tuple of (text, sessionKey, userMessageAdded=false) to pendingMessages and returns. No bridge interrupt is sent, no in-flight tool call is touched. When the current sendMessage call completes, the natural drain path inside the provider pulls the next item off pendingMessages, posts a chatProviderDidDequeue notification with the right sessionKey, and starts the next send on that session. The new message is keyed to the session it was queued on, not the globally active one, so a second pop-out window queueing a message does not bleed into the first one's execution.

What does the interrupt-and-replace path do, in code?

ChatProvider.interruptAndSend(_:sessionKey:) at Desktop/Sources/Providers/ChatProvider.swift around line 2734 first removes any existing queued copy of the same text on the same session (so a user who tapped Send Now on an already-queued chip does not double-send). Then it checks whether the target session is actually busy. If the target session is idle, it sends the message directly as a follow-up. If the target session is running, it appends a user message to the UI tagged to that session, inserts the new message at the front of pendingMessages with userMessageAdded=true, and calls acpBridge.interrupt(sessionKey:) on that one session. The interrupt cancels the current ACP tool call, the partial response stays on screen, and the drain path picks up the front-of-queue message immediately on the same session.

What does stop without replace do?

ChatProvider.stopAgent() at Desktop/Sources/Providers/ChatProvider.swift around line 2652 sets isStopping=true, snapshots the current pendingMessages count into pendingCountAtStop, and calls acpBridge.interrupt(). The result flows back through the bridge with whatever partial text the model had already streamed. There is also a session-scoped variant, stopAgent(sessionKey:) around line 2664, that interrupts only the named session and leaves concurrent runs in other windows untouched. The point is to cancel the run while keeping the partial output on screen, so the user can read what got done before deciding the next move.

How does the voice side know which path to take?

The voice side itself does not decide. Push to talk in PushToTalkManager.swift just produces a final transcript and posts it to the floating control bar. The bar then asks ChatProvider what state the target session is in. If the session is idle, the transcript is sent directly. If the session is busy and the UI shows a queue chip with Send Now, that chip is wired to interruptAndSend. If the user holds the same modifier and the bar is configured for queue mode, the transcript goes through enqueueMessage. The decision is in the UI layer, not the audio layer, which is why the four-state PTT machine in PushToTalkManager.swift can be reused across all three paths.
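The routing decision described above can be sketched as a pure function. The enums and names here are invented for illustration; only the three destinations correspond to the real methods.

```swift
import Foundation

enum HandoffPath { case sendDirect, enqueue, interruptAndSend, ignore }
enum BarMode { case queue, interrupt }

/// UI-layer routing: the PTT machine only hands over a transcript;
/// the bar picks the path from session state and its configured mode.
func route(transcript: String, sessionBusy: Bool, mode: BarMode) -> HandoffPath {
    guard !transcript.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty
    else { return .ignore }                         // empty transcript: nothing to do
    guard sessionBusy else { return .sendDirect }   // idle session: no seam to cross
    switch mode {
    case .queue:     return .enqueue                // follow up, keep the run going
    case .interrupt: return .interruptAndSend       // correction, replace the run
    }
}
```

Keeping this a pure function of (transcript, busy, mode) is what lets the same four-state PTT machine feed all three paths without knowing any of them exist.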

Does an interrupt kill the bridge process?

No. acpBridge.interrupt(sessionKey:) sends a cancel signal over the Agent Client Protocol channel for the named session. The bridge process keeps running, the other sessions on it keep streaming, the MCP servers it spawned (macos-use, playwright, fazm_tools, whatsapp, google-workspace) keep their state. Only the in-flight tool call on the targeted session is unwound. This is the difference between cancelling a request and crashing a worker. A killed bridge would lose every concurrent session in the app at once, which is the opposite of what a handoff is supposed to do.

What state actually survives an interrupt?

The partial assistant text on screen, the message store rows already persisted to the backend, every other session running in another floating bar or pop-out, the MCP server processes and their accumulated state (open browser tab, accessibility tree cache, scheduled cron jobs), the upstream session ID (the cancel does not roll it forward), and the session chain stored in UserDefaults. The only thing thrown away is the tool call that was in progress at the moment of interrupt. If the model was three steps into a five-step run, you keep the first three steps and the partial output of the fourth.

What does the push to talk side of the handoff look like?

PushToTalkManager.swift at Desktop/Sources/FloatingControlBar/PushToTalkManager.swift runs a four-state machine: idle, listening, lockedListening, finalizing. Hold Option transitions idle to listening. Release goes listening to finalizing then back to idle. Tap Option twice within doubleTapThreshold = 0.4 seconds (line 40) and you are in lockedListening; the third tap goes lockedListening to finalizing. There is a 5-minute maxPTTDuration safety cap (line 65) so a stuck modifier cannot record forever, and a 0.5-second pttDebounceInterval (line 71) on activation so a fluttering modifier cannot cycle the audio subsystem. The whole thing runs on the main actor and posts the final transcript through a single chokepoint.

Where can I read the source for myself?

Two files tell the whole story. Desktop/Sources/Providers/ChatProvider.swift in the Fazm repo carries the three handoff methods: enqueueMessage around line 2718, interruptAndSend around line 2734, stopAgent around line 2652, plus the session-scoped variants. Desktop/Sources/FloatingControlBar/PushToTalkManager.swift carries the voice-side state machine, with the constants doubleTapThreshold and maxPTTDuration on lines 40 and 65. Both files are MIT-licensed at github.com/mediar-ai/fazm. Clone the repo and grep for pendingMessages and acpBridge.interrupt to follow every call site.