Built a Free Superwhisper Alternative Using Claude Code
Superwhisper is a well-designed app, but it costs money and can send your audio to cloud servers for transcription. If you care about privacy or just want to avoid a recurring subscription, you can build the same functionality locally using whisper.cpp and Claude Code in an afternoon. The result is faster than you might expect.
The Performance Case for Local Whisper
The argument for cloud transcription used to be convincing: cloud models were trained on more data and ran on faster hardware, so they won on both accuracy and speed. That gap has largely closed on Apple Silicon.
Benchmarks on the M4 chip show the whisper.cpp tiny model transcribing 10 seconds of audio in 0.37 seconds - that is 27x faster than real-time. The base model takes 0.54 seconds for the same audio (18x real-time). For typical voice input utterances of 5 to 15 seconds, the total round-trip including transcription and text injection is well under one second.
For comparison, the same workload sent to a cloud API introduces network round-trip latency on top of server processing time. In practice, cloud solutions often deliver worse perceived latency for short utterances because the network overhead dominates.
whisper.cpp is also optimized specifically for Apple Silicon: Metal for the GPU and Core ML for the Apple Neural Engine. With Core ML enabled, encoder inference runs more than 3x faster than CPU-only execution. For the smaller models used for most voice input, this pushes the real-time factor well above 10x on M2 or later chips.
The Privacy Case
Voice input captures everything you say near a microphone. Half-formed thoughts, confidential client names, API keys you accidentally read aloud, conversations happening in the background. Sending that audio to a cloud service means trusting their retention policies, their security posture, and their data handling practices.
Local transcription means the audio never leaves your machine. No audio sent to servers. No transcript stored in cloud logs. No uncertainty about what happens to the data after transcription.
For developers who dictate code, architecture notes, or conversations about unreleased products, the privacy argument alone justifies the setup effort.
The Architecture
The implementation has four components:
- Global hotkey handler - Listens for your trigger key combination (Command+Shift+Space is a natural choice) and activates recording; sketched below
- Audio capture - Records from the default microphone using AVFoundation
- whisper.cpp integration - Transcribes the recorded audio chunk
- Text injection - Places the transcribed text at the cursor position using the macOS accessibility API
The accessibility API injection is the piece that makes this feel like Superwhisper instead of a developer tool. You can dictate into any application - a text editor, a browser, a terminal - and the text appears exactly where your cursor was.
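Of the four components, the hotkey handler has the least obvious API. Here is a minimal sketch, assuming an AppKit global event monitor; the HotkeyListener name and its callback are illustrative, not from any library:

import AppKit

// Command+Shift+Space toggles recording; keyCode 49 is the space bar.
// Global monitors only deliver events once Accessibility access is granted.
final class HotkeyListener {
    private var monitor: Any?

    func start(onToggle: @escaping () -> Void) {
        monitor = NSEvent.addGlobalMonitorForEvents(matching: .keyDown) { event in
            if event.keyCode == 49,
               event.modifierFlags.contains([.command, .shift]) {
                onToggle()
            }
        }
    }

    func stop() {
        if let monitor {
            NSEvent.removeMonitor(monitor)
            self.monitor = nil
        }
    }
}

Note that global monitors do not observe events delivered to your own app, which is fine here since dictation targets other applications.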
Setting Up whisper.cpp
# Clone and build whisper.cpp
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
make -j
# Download the base model (142MB, best balance of speed/accuracy)
bash ./models/download-ggml-model.sh base.en
# Test it
./main -m models/ggml-base.en.bin -f samples/jfk.wav
With Core ML acceleration:
# Generate the Core ML encoder model for Neural Engine acceleration
# (the script compiles it with coremlc; requires Python with coremltools)
./models/generate-coreml-model.sh base.en
# Rebuild with Core ML support
make clean
WHISPER_COREML=1 make -j
On an M2 Mac, this typically reduces transcription time for a 10-second utterance from about 300ms to under 100ms.
The Swift Menu Bar App
Claude Code can scaffold the entire Swift app from a description. Here is the core transcription loop, which records with AVFoundation and shells out to the whisper.cpp CLI:
import AVFoundation
import AppKit

class VoiceTranscriber {
    private var audioRecorder: AVAudioRecorder?
    private let tempAudioPath = URL(fileURLWithPath: "/tmp/voice_input.wav")

    func startRecording() {
        // Whisper expects 16kHz mono PCM, so record in that format directly
        let settings: [String: Any] = [
            AVFormatIDKey: Int(kAudioFormatLinearPCM),
            AVSampleRateKey: 16000,
            AVNumberOfChannelsKey: 1,
            AVLinearPCMBitDepthKey: 16,
            AVLinearPCMIsFloatKey: false
        ]
        audioRecorder = try? AVAudioRecorder(url: tempAudioPath, settings: settings)
        audioRecorder?.record()
    }

    func stopAndTranscribe() -> String? {
        audioRecorder?.stop()

        // Run whisper.cpp as a subprocess and capture its stdout
        let process = Process()
        process.executableURL = URL(fileURLWithPath: "/usr/local/bin/whisper-main")
        process.arguments = [
            "-m", "\(NSHomeDirectory())/.whisper/ggml-base.en.bin",
            "-f", tempAudioPath.path,
            "--no-timestamps"   // plain text on stdout, no timestamp markers
        ]

        let pipe = Pipe()
        process.standardOutput = pipe
        try? process.run()
        process.waitUntilExit()

        let data = pipe.fileHandleForReading.readDataToEndOfFile()
        return String(data: data, encoding: .utf8)?
            .trimmingCharacters(in: .whitespacesAndNewlines)
    }
}
And the text injection using the accessibility API:
func injectText(_ text: String) {
    // Find the focused UI element in the frontmost application
    guard let frontApp = NSWorkspace.shared.frontmostApplication else { return }
    let axApp = AXUIElementCreateApplication(frontApp.processIdentifier)

    var focusedElement: CFTypeRef?
    let copied = AXUIElementCopyAttributeValue(
        axApp, kAXFocusedUIElementAttribute as CFString, &focusedElement
    )
    guard copied == .success, let element = focusedElement else {
        simulateKeyboardInput(text)   // no focused element; fall back to typing
        return
    }

    // Writing kAXSelectedTextAttribute inserts at the cursor (replacing any
    // selection); writing kAXValueAttribute would overwrite the entire field
    let result = AXUIElementSetAttributeValue(
        element as! AXUIElement,
        kAXSelectedTextAttribute as CFString,
        text as CFTypeRef
    )

    if result != .success {
        // Fallback for apps that reject accessibility writes: simulate typing
        simulateKeyboardInput(text)
    }
}
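Wiring the components together is a toggle on the hotkey: the first press starts recording, the second stops, transcribes, and injects. A minimal sketch, reusing the hypothetical HotkeyListener from earlier:

let transcriber = VoiceTranscriber()
let hotkey = HotkeyListener()
var isRecording = false

hotkey.start {
    if isRecording {
        // Second press: stop recording, transcribe, type result at the cursor
        if let text = transcriber.stopAndTranscribe() {
            injectText(text)
        }
    } else {
        // First press: start capturing from the default microphone
        transcriber.startRecording()
    }
    isRecording.toggle()
}

One caveat: stopAndTranscribe() blocks on waitUntilExit(), so in a real app you would dispatch it off the main thread to keep the menu bar responsive.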
Model Selection
Choosing the right model involves a tradeoff between speed and accuracy:
- ggml-tiny.en (75MB) - 27x real-time on M4. Use when latency is critical and the content is everyday English. Word error rate around 5% on clean audio.
- ggml-base.en (142MB) - 18x real-time on M4. Better handling of accents, domain terminology, and background noise. The recommended starting point for most users.
- ggml-small.en (466MB) - 8x real-time on M4. Noticeably better with technical vocabulary - programming terms, product names, jargon. Worth the speed tradeoff if you dictate code or technical content.
- ggml-medium.en (1.5GB) - 3-4x real-time on M4. Near-OpenAI-API accuracy on difficult content. Use only if the smaller models produce too many errors.
For most voice input use cases - dictating notes, writing emails, giving instructions to an AI agent - the base model is the right choice.
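If you want to switch models without rebuilding, a small enum mapping names to paths is enough. This is a hypothetical sketch; the ~/.whisper directory matches the path used in the transcriber above:

import Foundation

// Trade latency for accuracy per use case by swapping the model at runtime
enum WhisperModel: String {
    case tiny = "ggml-tiny.en.bin"
    case base = "ggml-base.en.bin"
    case small = "ggml-small.en.bin"
    case medium = "ggml-medium.en.bin"

    var path: String {
        "\(NSHomeDirectory())/.whisper/\(rawValue)"
    }
}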
What Claude Code Contributed
Claude Code made the implementation significantly faster. The most time-consuming part without AI assistance would have been reading whisper.cpp documentation, learning AVFoundation's audio recording APIs, and figuring out the accessibility API injection patterns.
With Claude Code, you describe the architecture ("a menu bar app that records audio on hotkey, transcribes locally, and types the result into whatever is focused") and the agent handles the boilerplate, the API calls, and the permission scaffolding. The interesting engineering decisions - model selection, the fallback injection strategy, the Core ML acceleration setup - still require your judgment. But the implementation time drops from a weekend to an afternoon.
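For reference, the permission scaffolding amounts to two checks: the Accessibility trust prompt (needed for the event monitor and text injection) and microphone access. A sketch of what that looks like:

import AVFoundation
import ApplicationServices

// Prompt the user to grant Accessibility trust (System Settings > Privacy)
func ensureAccessibilityPermission() -> Bool {
    let options = [kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String: true]
    return AXIsProcessTrustedWithOptions(options as CFDictionary)
}

// Request microphone access before the first recording
func ensureMicrophonePermission(_ completion: @escaping (Bool) -> Void) {
    AVCaptureDevice.requestAccess(for: .audio, completionHandler: completion)
}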
Accuracy Limitations
The honest tradeoff: specialized vocabulary. Superwhisper and cloud services fine-tune on massive datasets and can handle domain-specific terms, unusual proper nouns, and heavy accents better than base whisper.cpp models.
For everyday dictation and giving instructions to an AI agent, the base model is more than sufficient. For transcribing technical interviews, specialized terminology, or non-native speakers with strong accents, the medium model is worth trying before concluding that local transcription is insufficient.
Fazm is an open-source macOS AI agent that uses local voice input for hands-free automation. The source is available on GitHub.