
How LLMs Can Control Your Computer - Voice-Driven, Local, No API Keys

Fazm Team · 3 min read
Tags: llm, desktop-agent, voice-control, local-first, open-source


Most people interact with LLMs through chat interfaces. Type a question, get an answer. But there is a much more interesting use case: letting an LLM actually control your computer.

Not generating code for you to run. Not suggesting what to click. Actually moving the mouse, typing in text fields, navigating between apps, and completing multi-step workflows autonomously.

The Architecture

A desktop agent powered by an LLM needs three things:

  1. Perception - the ability to see what is on the screen and understand the current state of the UI
  2. Planning - the ability to break a high-level instruction ("update the CRM with call notes") into a sequence of concrete actions
  3. Execution - the ability to actually perform those actions (click buttons, type text, switch apps)
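In code, these three layers form a loop: perceive, plan, execute, repeat until the planner reports the goal is done. Here is a minimal sketch with hypothetical interfaces (`perceive`, `plan`, and `execute` stand in for the real native-API layers):

```python
def run_agent(instruction, perceive, plan, execute, max_steps=10):
    """Perceive -> plan -> execute until the planner emits an empty plan."""
    for _ in range(max_steps):
        screen = perceive()                  # 1. capture current UI state
        actions = plan(instruction, screen)  # 2. LLM turns goal into actions
        if not actions:                      # empty plan signals completion
            return True
        for action in actions:
            execute(action)                  # 3. click / type / switch apps
    return False

# Toy run: the "screen" is just a counter; the planner stops after 3 steps.
log = []
done = run_agent(
    "demo",
    perceive=lambda: len(log),
    plan=lambda goal, screen: [] if screen >= 3 else [f"step-{screen}"],
    execute=log.append,
)
```

The `max_steps` cap is a safety valve: an agent that can move your mouse should never loop unbounded on a plan that is not converging.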

The LLM handles the planning step. It takes the current screen state as input and outputs a structured action plan. The perception and execution layers are handled by native APIs - ScreenCaptureKit for screen capture and accessibility APIs for UI interaction. We cover the technical implementation of these APIs in our post on building a macOS AI agent in Swift.
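A structured action plan is easiest to execute safely if it is constrained to a small vocabulary the execution layer understands. A sketch of validating such a plan, assuming a hypothetical three-verb vocabulary (the JSON shape and field names are illustrative, not Fazm's actual schema):

```python
import json

# Hypothetical action vocabulary the execution layer understands.
ALLOWED_ACTIONS = {"click", "type", "switch_app"}

def parse_plan(llm_output: str) -> list[dict]:
    """Validate the LLM's JSON action plan before executing anything."""
    plan = json.loads(llm_output)
    for step in plan:
        if step.get("action") not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {step.get('action')}")
    return plan

# What the model might emit for "update the CRM with call notes":
raw = """[
  {"action": "switch_app", "target": "CRM"},
  {"action": "click", "target": "Notes field"},
  {"action": "type", "text": "Call notes from today's standup"}
]"""
plan = parse_plan(raw)
```

Rejecting any action outside the whitelist is the cheap insurance here: a hallucinated verb fails loudly instead of doing something unexpected on screen.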

Why Voice Changes Everything

Typing instructions to an LLM-powered agent defeats the purpose. If you are already at your keyboard, you might as well just do the task yourself.

Voice input changes the equation. You can tell the agent what to do while walking to the kitchen, while on a phone call, or while working on something else entirely. The agent becomes ambient - always available, never requiring you to switch contexts.

Push-to-talk is the right interaction model. An always-listening mode creates privacy concerns and false activations. Hold a single keyboard shortcut while you speak, release it to execute, and you stay in control.
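The press-and-hold flow is a tiny state machine: start buffering audio on key-down, hand the utterance off on key-up. A sketch with hypothetical names, assuming the host app delivers key and audio events:

```python
class PushToTalk:
    """Record while the hotkey is held; hand off the audio on release."""

    def __init__(self, on_utterance):
        self.on_utterance = on_utterance  # callback receiving raw audio bytes
        self.recording = False
        self._buffer = []

    def key_down(self):
        if not self.recording:
            self.recording = True
            self._buffer = []

    def feed(self, chunk: bytes):
        # Audio chunks arriving while the key is up are ignored.
        if self.recording:
            self._buffer.append(chunk)

    def key_up(self):
        if self.recording:
            self.recording = False
            self.on_utterance(b"".join(self._buffer))

captured = []
ptt = PushToTalk(captured.append)
ptt.key_down()
ptt.feed(b"send Sarah")
ptt.feed(b" the notes")
ptt.key_up()
```

Nothing is recorded outside the key-down/key-up window, which is exactly the privacy property that always-listening designs cannot offer.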

Local vs Cloud

Running the LLM locally means:

  • No API keys. Download the app, open it, start using it. No account creation, no billing setup, no rate limits.
  • No network latency. The roundtrip to a cloud API adds 500ms-2s per action. For a multi-step workflow, that adds up to a noticeably sluggish experience.
  • No privacy concerns. Your screen content, voice recordings, and file contents never leave your machine.

With Ollama and models like Qwen running on Apple Silicon, local inference is fast enough for practical desktop automation. You trade some accuracy for complete independence from cloud services. Our post on on-device AI on Apple Silicon goes deeper into what models run well locally and the latency tradeoffs.
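Talking to a local model is just an HTTP call to Ollama's API on localhost. A sketch that only builds the request payload, so it runs without a server; the model name, prompt wording, and helper are illustrative:

```python
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_request(instruction: str, screen_summary: str,
                  model: str = "qwen2.5:7b") -> bytes:
    """Package the user's instruction plus screen state as one prompt."""
    prompt = (
        "Current screen:\n" + screen_summary +
        "\n\nInstruction: " + instruction +
        "\nRespond with a JSON list of actions."
    )
    # stream=False asks Ollama for a single complete response
    return json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()

body = build_request("update the CRM with call notes",
                     "CRM app open, Contacts tab focused")
```

POSTing `body` to `OLLAMA_URL` (with any HTTP client) returns the model's completion; no API key or account is involved, which is the whole point.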

That said, Fazm also supports Claude and other cloud models for users who want maximum accuracy and do not mind the cloud dependency. The choice is yours.

What It Actually Looks Like

Here is a typical workflow:

  1. You press the hotkey and say "send Sarah the meeting notes from today's standup"
  2. The agent reads the current screen to understand context
  3. It opens your email client, finds Sarah's contact, drafts the email with the meeting notes it observed earlier, and sends it
  4. Total time: 15 seconds instead of 2 minutes of manual app-switching and typing

The boring tasks - CRM updates, form filling, file organization, email triage - are where this shines. Not because the AI is smarter than you, but because these tasks do not deserve your attention in the first place. We compiled a list of the most satisfying tasks to automate based on real user feedback.


Fazm is an open source macOS agent that does all of this. Check it out on GitHub or visit fazm.ai. Discussed in r/LLMDevs.
