
Building a Voice-Controlled macOS Agent with Gemini

What if you could control your entire Mac just by talking? We built Gemini Mac Pilot for the Gemini Live Agent Challenge — a voice agent that sees your screen, understands your apps, and takes action.

March 2026 · 8 min read

Every AI assistant today lives in a text box. You type a question, get an answer, maybe copy-paste something into another app. But the promise of AI has always been bigger than that — an assistant that actually does things on your computer, not just talks about them.

Gemini Mac Pilot is a voice-controlled macOS agent that can open your apps, navigate your browser, read your screen, type messages, run commands, and complete multi-step workflows — all from natural speech. No keyboard required.

Say "Open WhatsApp and message Daniel that I'll be late" and Mac Pilot opens WhatsApp, finds Daniel's conversation, types the message, and sends it. Say "Play Rosalia on YouTube" and it opens Chrome, searches YouTube, and plays the video. The interaction feels like having a skilled assistant sitting next to you, operating your Mac while you talk.

The Problem: AI assistants that can't actually assist

Current AI assistants are fundamentally disconnected from where you actually work. They live in their own window, isolated from your desktop, your apps, your browser tabs. When you ask an AI to "check your email," it tells you how to check your email. When you ask it to "schedule a meeting," it gives you instructions.

The gap between AI capability and AI usefulness is the last-mile problem: getting the AI to actually interact with your real environment.

We wanted to bridge that gap completely. Not a chatbot that gives instructions, but an agent that executes. Not text-only, but voice-first — because if you are going to hand control to an AI, you need to be able to talk to it naturally, interrupt it, correct it, and guide it in real time.

The Gemini Live Agent Challenge gave us the perfect excuse to build this. Gemini's Live API provides something few foundation models offer: true bidirectional, native audio streaming — speech in and speech out over a single low-latency connection.

Architecture: Two brains, one agent

The key architectural insight is separating voice from reasoning. Trying to do both in a single model creates a bottleneck — voice requires low-latency streaming, while tool-calling workflows need deliberate multi-step planning. So we split the agent into two layers.

Voice Layer — Gemini Live API

The voice layer uses the Gemini Live API with native audio for bidirectional speech. The user speaks naturally, and the model streams audio responses back in real time. When the user requests an action, the voice layer calls an execute_task function, handing the request to the brain layer.
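To make the handoff concrete, here is a minimal sketch of how the `execute_task` bridge tool could be declared for a Live API session. The field names follow the standard Gemini function-declaration schema; the description text and the exact config Mac Pilot uses are assumptions.

```python
# Hypothetical sketch: the voice layer exposes a single execute_task tool
# that forwards the user's request to the brain layer. Only the schema is
# shown here; the real session wiring in Mac Pilot may differ.
EXECUTE_TASK_DECL = {
    "name": "execute_task",
    "description": (
        "Hand a user request to the brain layer for multi-step execution "
        "on the Mac. Returns a summary of what was done."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "task": {
                "type": "string",
                "description": "Natural-language description of the task.",
            }
        },
        "required": ["task"],
    },
}

# Live API session config: audio responses, one tool the model can call.
LIVE_CONFIG = {
    "response_modalities": ["AUDIO"],
    "tools": [{"function_declarations": [EXECUTE_TASK_DECL]}],
}
```

Keeping the voice layer's tool surface this small is the point: the model streaming audio never has to plan, only to notice that the user asked for an action and forward it.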

Brain Layer — Gemini 3 Flash Preview

The brain layer uses Gemini 3 Flash Preview with native function calling and parallel function call support. It receives a task description, reads the current macOS accessibility tree to understand what is on screen, plans a sequence of actions, and executes them through tool calls. This is where the actual reasoning happens across 24 tools.
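The brain's core is a plan-act loop: ask the model for tool calls, execute them, feed results back, and stop when the model has nothing left to call. The sketch below shows that loop with a stubbed tool registry — the tool names and the `model_step` callback are illustrative, not Mac Pilot's actual implementation.

```python
from typing import Any, Callable

# Hypothetical tool registry; click_element stands in for the 24 real tools.
TOOLS: dict[str, Callable[..., Any]] = {}

def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so the brain can dispatch model tool calls to it."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def click_element(element_id: int) -> str:
    # A real implementation would press the AX element; stubbed here.
    return f"clicked element {element_id}"

def run_task(task: str, model_step: Callable[[str, list], Any]) -> list[str]:
    """Minimal agent loop: request tool calls from the model, execute each
    one, append the result to the transcript, repeat until no calls remain."""
    transcript: list = [{"role": "user", "text": task}]
    results: list[str] = []
    while True:
        calls = model_step(task, transcript)  # e.g. Gemini function calls
        if not calls:
            return results
        for name, args in calls:
            out = TOOLS[name](**args)
            results.append(out)
            transcript.append({"role": "tool", "name": name, "result": out})
```

Because every tool result goes back into the transcript, the model can react to what actually happened on screen — a click that landed on the wrong element shows up in the next planning step.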

Reading any app's UI with the Accessibility API

The macOS Accessibility API (AX API) is the backbone of the native app control. Every macOS application exposes its UI as an accessibility tree — a hierarchy of elements with roles, labels, values, and positions. We traverse this tree recursively, assigning each element a numeric ID, and present it to Gemini as a structured text representation.

This approach works with any native macOS app without any app-specific integration. WhatsApp, Notes, Finder, System Settings — if it has an accessibility tree, Mac Pilot can read and control it.
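The tree-to-text step can be sketched as follows. On macOS the tree itself would come from the AX API (e.g. `AXUIElementCopyAttributeValue` via pyobjc); here an already-fetched tree is modeled as nested dicts so the ID-assignment pass is visible on its own.

```python
# Sketch of the representation step, assuming the AX tree has already been
# fetched into plain dicts with "role", "label"/"value", and "children".
def render_ax_tree(node: dict, lines=None, ids=None, depth: int = 0):
    """Walk the accessibility tree depth-first, give every element a
    numeric ID, and emit an indented text outline for the model."""
    if lines is None:
        lines, ids = [], {}
    eid = len(ids) + 1
    ids[eid] = node  # the brain resolves "click element 2" through this map
    label = node.get("label") or node.get("value") or ""
    lines.append(f"{'  ' * depth}[{eid}] {node['role']} {label}".rstrip())
    for child in node.get("children", []):
        render_ax_tree(child, lines, ids, depth + 1)
    return "\n".join(lines), ids
```

The numeric IDs are what make tool calls unambiguous: the model says "click element 2" and the executor looks the element back up in the ID map instead of matching by label.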

Browser automation with Chrome DevTools Protocol

For web interactions, the Accessibility API is not enough — web content inside Chrome is opaque to AX. So we connect directly to the user's real Chrome browser via the Chrome DevTools Protocol (CDP). The brain can navigate to URLs, read page text, click by text or CSS selector, type into inputs, and execute arbitrary JavaScript — all inside the user's actual browsing session.
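CDP is a JSON-over-WebSocket protocol, so the browser tools reduce to building correctly framed messages. The sketch below shows only that framing — a real session would send these frames over Chrome's debug endpoint (Chrome launched with `--remote-debugging-port`); the class name and method choices are ours, not the project's.

```python
import itertools
import json

class CDPSession:
    """Minimal Chrome DevTools Protocol message builder. Each command needs
    a unique integer id so responses can be matched back to requests."""

    def __init__(self):
        self._ids = itertools.count(1)

    def message(self, method: str, **params) -> str:
        return json.dumps(
            {"id": next(self._ids), "method": method, "params": params}
        )

    def navigate(self, url: str) -> str:
        return self.message("Page.navigate", url=url)

    def evaluate(self, expression: str) -> str:
        # Runs arbitrary JavaScript in the page, e.g. to click by selector.
        return self.message("Runtime.evaluate", expression=expression)
```

`Page.navigate` and `Runtime.evaluate` are real CDP methods; click-by-text and typing can both be expressed as `Runtime.evaluate` calls against the live DOM.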

Google Workspace integration

Beyond the desktop and browser, Mac Pilot integrates directly with Google Workspace through CLI tools. Read and send Gmail, manage Google Calendar events, browse Google Drive, and edit Google Docs — all through voice commands. This brings the total to 24 tools across native macOS, browser, and cloud productivity.
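Wrapping a CLI as a brain tool is mostly a subprocess call with error handling. The sketch below is a generic wrapper under that assumption — the actual binaries and flags Mac Pilot invokes for Gmail, Calendar, Drive, and Docs are not shown in this post, so no specific command is named.

```python
import subprocess

def run_cli(argv: list[str]) -> str:
    """Run a Workspace CLI tool and hand its stdout back to the brain layer.
    Errors come back as text too, so the model can react instead of crashing."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    if proc.returncode != 0:
        return f"error: {proc.stderr.strip()}"
    return proc.stdout.strip()
```

Returning errors as strings rather than raising matters in an agent loop: a failed `gmail send` becomes a tool result the model can read and retry, not an exception that kills the task.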

Challenges

Voice session time limits: The Gemini Live API has a 15-minute session limit. We implemented automatic session reconnection — when the session approaches the limit, the voice layer cleanly reconnects and resumes listening.

Keeping the UI responsive: Brain tasks can take 10-30 seconds. We built an event bus that streams status updates from the brain and voice layers to the PyWebView overlay via WebSocket.
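A minimal version of such an event bus, assuming an asyncio app (Mac Pilot's exact implementation may differ): each subscriber gets its own queue, and publishing never blocks, so a slow overlay can't stall the brain.

```python
import asyncio
from typing import Any

class EventBus:
    """Tiny pub/sub bus. In this architecture the subscriber would be a
    WebSocket handler pushing status JSON to the PyWebView overlay."""

    def __init__(self):
        self._subscribers: list[asyncio.Queue] = []

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self._subscribers.append(q)
        return q

    def publish(self, event: dict[str, Any]) -> None:
        # put_nowait: publishing never blocks the voice or brain layer.
        for q in self._subscribers:
            q.put_nowait(event)
```

The brain layer then just calls `bus.publish({"layer": "brain", "status": "..."})` after each tool call, and the overlay drains its queue independently.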

Accessibility permissions: macOS requires explicit user permission for accessibility control. We added clear setup instructions and runtime error messages that guide users through the permission flow.

Try it yourself

Gemini Mac Pilot is open source on GitHub. If you have a Mac, a Google Cloud project, and a microphone, you can be voice-controlling your desktop in minutes. See how the project went viral on LinkedIn.

Toni Soriano
Principal AI Engineer at Cloudstudio. 18+ years building production systems. Creator of Ollama Laravel (87K+ downloads).
LinkedIn →

