Building a Native Swift Voice Control App for macOS - Open Source
The idea is simple. You talk to your computer and it does things. Not "it shows you a chat response" - it actually opens apps, fills forms, sends messages, organizes files. A voice interface for your entire desktop.
Executing on that idea without building something slow, privacy-compromising, or unreliable requires specific choices. Here is what we chose and why.
Why Native Swift and Not Electron
Electron would have been faster to prototype. We went with native Swift anyway because desktop agents need low-latency access to system APIs that web wrappers cannot provide efficiently.
The specific APIs that matter:
- Accessibility APIs for reading and interacting with any on-screen element
- Core Audio for microphone access without browser permission dialogs or latency overhead
- AppKit for window management, system menu interaction, and launch-at-login
- AVFoundation for audio processing and session management
These are not nice-to-haves. They are the core functionality. A Chrome-based shell adds 100-200ms of interprocess communication overhead to every API call and cannot access the accessibility tree at the same depth as a native process.
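To make that concrete, here is a minimal sketch of two of those calls: prompting for Accessibility permission and registering launch-at-login via SMAppService (macOS 13+). The wrapper function names are illustrative; the APIs are Apple's.

import ApplicationServices
import ServiceManagement

// Prompt for Accessibility permission if the process is not yet trusted.
func ensureAccessibilityPermission() -> Bool {
    let promptKey = kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String
    return AXIsProcessTrustedWithOptions([promptKey: true] as CFDictionary)
}

// Register the app to launch at login (macOS 13+).
func enableLaunchAtLogin() throws {
    try SMAppService.mainApp.register()
}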
SwiftUI handles the interface - a minimal overlay that shows transcription status and the action the agent is about to take. The UI stays out of the way because the whole point is that you should not need to look at a screen to control your computer.
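A stripped-down sketch of that overlay, with illustrative view and property names rather than the project's actual code, is just a SwiftUI view over the transcription state:

import SwiftUI

// Minimal status overlay: live transcription plus the action the agent is
// about to take. Names are illustrative.
struct VoiceOverlayView: View {
    var transcription: String
    var pendingAction: String?

    var body: some View {
        VStack(alignment: .leading, spacing: 4) {
            Text(transcription.isEmpty ? "Listening…" : transcription)
                .font(.system(.body, design: .monospaced))
            if let pendingAction {
                Text(pendingAction)
                    .font(.caption)
                    .foregroundStyle(.secondary)
            }
        }
        .padding(12)
        .background(.ultraThinMaterial, in: RoundedRectangle(cornerRadius: 12))
    }
}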
WhisperKit: Local Transcription with Measured Latency
Speech-to-text runs locally using WhisperKit, Argmax's Swift implementation of OpenAI's Whisper models optimized for Apple Neural Engine.
The headline numbers from WhisperKit benchmarks:
- Hypothesis text latency (how quickly you start seeing partial results): approximately 0.45 seconds
- Confirmed text latency (final, committed transcription): approximately 1.7 seconds
- Word error rate: 2.2% - competitive with cloud transcription services
- Real-time factor on Apple Silicon: the tiny model processes audio at 27x real-time, meaning 1 second of audio is transcribed in roughly 37 ms
For practical voice control, the 0.45s hypothesis latency is what matters. You hear your own voice, there is a brief pause, and the transcription appears. This feels close to real-time. Cloud services that require network round trips cannot consistently match this on anything other than an excellent connection.
import AVFoundation
import Combine
import WhisperKit

enum TranscriptionError: Error {
    case notInitialized
}

class TranscriptionManager: ObservableObject {
    private var whisperKit: WhisperKit?
    @Published var transcriptionResult: String = ""

    func setup() async throws {
        // Uses the base English model by default - fast enough for voice commands
        whisperKit = try await WhisperKit(model: "openai_whisper-base.en")
    }

    func transcribe(audioBuffer: AVAudioPCMBuffer) async throws -> String {
        guard let whisperKit else { throw TranscriptionError.notInitialized }
        // Convert AVAudioPCMBuffer to a plain float array (helper shown below)
        let floatArray = audioBuffer.toFloatArray()
        let results = try await whisperKit.transcribe(
            audioArray: floatArray,
            decodeOptions: DecodingOptions(
                task: .transcribe,
                language: "en",
                temperature: 0.0 // deterministic for command recognition
            )
        )
        return results.first?.text.trimmingCharacters(in: .whitespaces) ?? ""
    }
}
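One note on the listing above: toFloatArray() is not an AVFoundation method. It is a small helper on AVAudioPCMBuffer; a minimal version, assuming the capture pipeline already delivers the 16 kHz mono Float32 buffers Whisper expects, looks like this:

import AVFoundation

extension AVAudioPCMBuffer {
    // Flattens the first channel of a Float32 PCM buffer into a plain [Float].
    // Assumes the buffer is already 16 kHz mono, which is what Whisper expects.
    func toFloatArray() -> [Float] {
        guard let channelData = floatChannelData else { return [] }
        return Array(UnsafeBufferPointer(start: channelData[0], count: Int(frameLength)))
    }
}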
No audio leaves the machine. Transcription works offline. When a cloud provider changes their pricing or deprecates a model version, nothing breaks.
Accessibility APIs for Full Desktop Control
macOS accessibility APIs let the agent read what is on screen - button labels, text field contents, menu items, table rows - and interact with those elements programmatically. This is the same API that screen readers use, which means it works across virtually every macOS application without needing app-specific integrations.
The control flow for a voice command like "send the weekly report to the team Slack channel":
- Voice captured via Core Audio
- WhisperKit transcribes: "send the weekly report to the team Slack channel"
- LLM interprets intent: find the weekly report file, open Slack, find the team channel, compose and send a message with the file attached
- Accessibility API identifies the Slack window, navigates to the correct channel, inserts text, clicks send
- Accessibility observer confirms the message was sent (or reports failure)
Each step is auditable. The agent does not guess - it reads actual UI state before acting. If Slack is not open, step 4 detects that and either opens it or reports that the precondition is not met.
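That kind of precondition check is straightforward with AppKit. A hypothetical version might look like the sketch below (ensureAppRunning and PreflightError are illustrative names, not the project's API), with element lookup and interaction handled by the DesktopController that follows:

import AppKit

enum PreflightError: Error { case appNotInstalled(String) }

// Verify the target app is running before driving its UI; launch it if needed.
func ensureAppRunning(bundleID: String) async throws -> NSRunningApplication {
    if let running = NSRunningApplication
        .runningApplications(withBundleIdentifier: bundleID).first {
        return running
    }
    guard let url = NSWorkspace.shared
        .urlForApplication(withBundleIdentifier: bundleID) else {
        throw PreflightError.appNotInstalled(bundleID)
    }
    return try await NSWorkspace.shared.openApplication(
        at: url, configuration: NSWorkspace.OpenConfiguration())
}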
import AppKit
import ApplicationServices

enum ControlError: Error {
    case appNotRunning(String)
    case actionFailed(Int32)
    case elementNotFound(role: String, label: String)
}

class DesktopController {
    func findElement(role: String, label: String, inApp bundleID: String) throws -> AXUIElement {
        guard let app = NSRunningApplication.runningApplications(
            withBundleIdentifier: bundleID).first else {
            throw ControlError.appNotRunning(bundleID)
        }
        let appElement = AXUIElementCreateApplication(app.processIdentifier)
        // searchTree walks the accessibility hierarchy (sketch below)
        return try searchTree(appElement, role: role, label: label)
    }

    func click(_ element: AXUIElement) throws {
        let result = AXUIElementPerformAction(element, kAXPressAction as CFString)
        guard result == .success else {
            throw ControlError.actionFailed(result.rawValue)
        }
    }

    func type(text: String, into element: AXUIElement) throws {
        // Focus the target element before synthesizing keystrokes
        let focusResult = AXUIElementSetAttributeValue(
            element, kAXFocusedAttribute as CFString, kCFBooleanTrue)
        guard focusResult == .success else {
            throw ControlError.actionFailed(focusResult.rawValue)
        }
        // Use CGEvent for reliable text input across all apps
        // (BMP characters only; surrogate pairs omitted for brevity)
        let source = CGEventSource(stateID: .hidSystemState)
        for scalar in text.unicodeScalars {
            let char = UniChar(scalar.value)
            let down = CGEvent(keyboardEventSource: source, virtualKey: 0, keyDown: true)
            let up = CGEvent(keyboardEventSource: source, virtualKey: 0, keyDown: false)
            down?.keyboardSetUnicodeString(stringLength: 1, unicodeString: [char])
            up?.keyboardSetUnicodeString(stringLength: 1, unicodeString: [char])
            down?.post(tap: .cghidEventTap)
            up?.post(tap: .cghidEventTap)
        }
    }
}
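The searchTree call in findElement is a helper, not an Accessibility API. A minimal recursive version, shown here as an illustrative sketch that matches only role and title with no caching or fuzzy matching, is a depth-first walk of each element's children:

import ApplicationServices

extension DesktopController {
    // Depth-first search of the accessibility tree for the first element whose
    // role and title match. Real code would also check descriptions and values.
    func searchTree(_ element: AXUIElement, role: String, label: String) throws -> AXUIElement {
        var roleValue: CFTypeRef?
        var titleValue: CFTypeRef?
        AXUIElementCopyAttributeValue(element, kAXRoleAttribute as CFString, &roleValue)
        AXUIElementCopyAttributeValue(element, kAXTitleAttribute as CFString, &titleValue)
        if roleValue as? String == role, titleValue as? String == label {
            return element
        }
        var childrenValue: CFTypeRef?
        AXUIElementCopyAttributeValue(element, kAXChildrenAttribute as CFString, &childrenValue)
        if let children = childrenValue as? [AXUIElement] {
            for child in children {
                if let match = try? searchTree(child, role: role, label: label) {
                    return match
                }
            }
        }
        throw ControlError.elementNotFound(role: role, label: label)
    }
}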
The combination of voice input, local transcription, and accessibility-based control creates something that feels genuinely different from typing commands into a chat box. There is no intermediate UI. You say the thing and watch it happen.
The Privacy Case for Local Processing
Every cloud voice assistant routes your audio to a server. The provider's privacy policy determines what happens to it. Even with end-to-end encryption and retention policies, audio describing your work conversations, document names, and personal tasks is transmitted to infrastructure you do not control.
Local transcription changes this. WhisperKit runs entirely on your Mac's Neural Engine. No packets leave your network. The audio is processed and immediately discarded. The only data that leaves the machine is what you explicitly authorize the agent to send - the Slack message you dictated, not the audio used to interpret it.
This matters especially for professional use. Dictating a draft email about a confidential product launch or a client negotiation should not involve routing audio through a third-party cloud.
What "Open Source" Means Here
The entire project is on GitHub under an open license because desktop agents that can control your entire computer should not be black boxes. You should be able to read exactly how the agent interprets voice commands, what accessibility permissions it uses, and what code path executes when you say "delete this file."
Transparency is a feature, not a positioning statement. If your voice agent is closed source, you have no way to verify what it is doing with microphone access or what data it retains.
This post was inspired by a discussion on r/MacOS.
Fazm is an open source macOS AI agent. The code is on GitHub.