Building a $17 Local Voice Assistant with ESP32 for AI Agent Input

Matthew Diakonov

A $17 Voice Bridge

You do not need a $200 smart speaker to add voice input to your AI agent workflow. An ESP32 microcontroller with a MEMS microphone costs under $17 and can serve as a dedicated voice bridge between you and your desktop agent.

The concept is simple: the ESP32 captures audio, streams it to your local machine over WiFi, and your agent processes the speech. No cloud services required for the hardware layer.

The Hardware Stack

  • ESP32 dev board - around $8. Any variant with WiFi works.
  • INMP441 MEMS microphone - around $3. I2S interface, surprisingly good audio quality.
  • Wiring and USB cable - around $6.

Total: about $17 for a dedicated voice input device.

The ESP32 captures audio at 16 kHz, 16-bit mono via I2S and streams raw PCM data over a WebSocket connection to your local machine. No speech processing happens on the device - the ESP32 is just a microphone with WiFi.
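To get a feel for what that stream costs on the network, the arithmetic can be sketched in a few lines of Python. The sample rate, bit depth, and channel count come from the paragraph above; the 20 ms chunk size is only an assumption for illustration, not something the firmware is known to use.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz, as stated above
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 1              # mono


def pcm_bytes(duration_s: float) -> int:
    """Raw PCM bytes produced in duration_s seconds of audio."""
    return round(SAMPLE_RATE_HZ * duration_s) * BYTES_PER_SAMPLE * CHANNELS


bytes_per_second = pcm_bytes(1.0)
kilobits_per_second = bytes_per_second * 8 / 1000

# Hypothetical chunk size: 20 ms of audio per WebSocket message.
chunk_bytes = pcm_bytes(0.020)

print(bytes_per_second, kilobits_per_second, chunk_bytes)
```

At 32,000 bytes per second (256 kbit/s) the stream is small compared with typical home WiFi throughput, which is why sending uncompressed PCM is a reasonable starting point.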

Software Architecture

On the ESP32 side, the firmware is minimal: initialize I2S, connect to WiFi, open a WebSocket, and stream audio buffers. Around 200 lines of Arduino code.

On the desktop side, a small server receives the audio stream and feeds it to a speech-to-text engine. Whisper running locally handles transcription. The transcribed text becomes input for your AI agent.
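Before the audio reaches a speech-to-text engine, the raw bytes usually need a little packaging. The sketch below uses only the Python standard library and assumes the ESP32 sends little-endian signed 16-bit samples; the function names are hypothetical. It shows two common shapes: normalized floats, and an in-memory WAV file that most transcription tools can read.

```python
import array
import io
import sys
import wave

SAMPLE_RATE_HZ = 16_000  # must match the rate the ESP32 captures at


def pcm_to_floats(pcm: bytes) -> list[float]:
    """Convert little-endian signed 16-bit PCM to floats in [-1.0, 1.0).

    Raises ValueError if pcm has an odd number of bytes.
    """
    samples = array.array("h")
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()  # the incoming data is assumed little-endian
    return [s / 32768.0 for s in samples]


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw PCM in a WAV container (16 kHz, 16-bit, mono)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)
    return buf.getvalue()
```

Which form to use depends on the Whisper implementation you run; check its documentation for the input it accepts before wiring this in.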

The full pipeline: speak into the ESP32, audio streams over WiFi, Whisper transcribes locally, your agent receives the text command.
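The desktop half of that pipeline can be sketched as a loop that buffers incoming chunks, transcribes them, and forwards the text. Everything here is an assumption for illustration: `run_pipeline`, `transcribe`, and `send_to_agent` are hypothetical names, the transcriber and agent are passed in as plain callables, and the fixed 3-second window is a crude stand-in for real utterance or wake word detection.

```python
from typing import Callable, Iterable

# Hypothetical window: 3 seconds of 16 kHz, 16-bit mono audio.
WINDOW_BYTES = 16_000 * 2 * 3


def _flush(
    buffer: bytearray,
    transcribe: Callable[[bytes], str],
    send_to_agent: Callable[[str], None],
) -> None:
    """Transcribe the buffered audio, clear it, and forward non-empty text."""
    text = transcribe(bytes(buffer)).strip()
    buffer.clear()
    if text:
        send_to_agent(text)


def run_pipeline(
    chunks: Iterable[bytes],
    transcribe: Callable[[bytes], str],
    send_to_agent: Callable[[str], None],
    window_bytes: int = WINDOW_BYTES,
) -> None:
    """Buffer PCM chunks, transcribe each full window, pass text to the agent."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) >= window_bytes:
            _flush(buffer, transcribe, send_to_agent)
    if buffer:  # transcribe whatever is left when the stream ends
        _flush(buffer, transcribe, send_to_agent)
```

In a real setup `chunks` would come from the WebSocket connection and `transcribe` would call your local Whisper model; keeping both as parameters makes the loop easy to test without hardware.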

Why Not Just Use Your Laptop Microphone

Three reasons:

  1. Placement. The ESP32 can sit on your desk pointed at you. Your laptop microphone picks up fan noise and keyboard sounds.
  2. Always listening. A dedicated device can run a wake word detector without keeping your laptop microphone active.
  3. Multiple rooms. Put an ESP32 in each room. One desktop agent, multiple voice input points.

Integration with Desktop Agents

The voice bridge becomes particularly powerful when paired with a desktop automation agent. Say "open my email and summarize unread messages" and the agent takes over - clicking through your mail client, reading content, and speaking a summary back through your speakers.

Voice input turns an AI agent from something you type at into something you talk to.

Fazm is an open source macOS AI agent, available on GitHub.
