# Building a Hands-Free AI Assistant: Speech Recognition Meets LLMs

*Posted on May 20, 2025 by David H Sells*

![A robot with headphones and a microphone](images/headphones.png)

## TL;DR
I built a hands-free AI assistant that lets you talk to an LLM without touching your keyboard. Speak, wait for silence, and the AI answers in a synthesized voice. It's all built with JavaScript, WebSockets, and the Web Speech API. Code included!

## The "Why?"

Ever had that moment when you're elbow-deep in cookie dough and suddenly need to convert tablespoons to milliliters? Or maybe you're changing a tire and need to remember the proper torque settings? 

I found myself constantly wanting to talk to AI assistants **without having to touch anything**. Sure, there are commercial solutions like Alexa and Google Assistant, but I wanted something:

1. That I could customize completely
2. That would use my choice of language model
3. That wouldn't constantly listen and send audio to the cloud
4. That I could host on my own hardware

So I built this hands-free LLM interface that uses speech recognition to understand you, sends your question to any LLM, and then speaks the response back to you.

## The Magic Ingredients

Our speech-powered AI assistant requires four main components:

1. **Speech Recognition** - To understand what you're saying
2. **LiteLLM Proxy** - A unified API gateway that interfaces with multiple LLM providers
3. **LLM API Communication** - To get intelligent responses (via GROQ's Llama3-70B)
4. **Speech Synthesis** - To speak those responses back to you

Let's dive into how we built each part!

## The Architecture: A Three-Tier Symphony

Our application consists of three main components working in harmony:

1. **Client (index.html)** - Handles speech recognition and synthesis in the browser
2. **Node.js Server (index.js)** - WebSocket server that manages client connections
3. **LiteLLM Proxy** - API gateway that communicates with GROQ's Llama3-70B model

![Client-Server Architecture Diagram](images/clientserver.png)

The LiteLLM proxy acts as a unified interface to various LLM providers, allowing us to easily switch between different models and providers without changing our application code. In our setup, it's configured to use GROQ's powerful Llama3-70B model for fast, high-quality responses.

## The Client: Teaching Your Browser to Listen and Speak

Our client code (in `index.html`) does two critical things:
- Listens for your voice input until you stop talking
- Speaks the AI's response back to you

### Speech Recognition: It's All About the Silence

The challenge with speech recognition isn't getting the words; it's knowing when you're done talking! Our solution uses a **silence detection** approach that automatically stops listening once you've been quiet for a configurable period (15 seconds in the configuration shown later).

```javascript
function handleSpeechResult(event) {
    // Join every segment recognized so far into one transcript
    const result = Array.from(event.results)
        .map((segment) => segment[0].transcript)
        .join('');

    // Any new speech resets the silence timer (STATE and CONFIG are defined in the client listing)
    if (STATE.silenceTimer) clearTimeout(STATE.silenceTimer);

    // Start a new silence timer - if you stop talking, this will trigger
    STATE.silenceTimer = setTimeout(() => {
        // You've been quiet long enough, stop listening
        event.target.stop();
        processFinalSpeech(result);
    }, CONFIG.silenceTimeout);
}
```

This is genius in its simplicity. Every time you say something, we reset the timer. When you stop talking, the timer counts down and then triggers our processing function.
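
The handler still has to be attached to a recognizer. Here is a minimal sketch of that wiring, assuming continuous recognition with interim results so that events keep arriving while you talk. The helper names `getRecognitionCtor` and `configureRecognition` are hypothetical and not taken from the project code:

```javascript
// Hypothetical helpers: names are illustrative, not from the project source.

// Some browsers only expose the constructor with a webkit prefix.
function getRecognitionCtor(globalObj) {
    return globalObj.SpeechRecognition || globalObj.webkitSpeechRecognition || null;
}

// Apply the settings that silence detection relies on.
function configureRecognition(recognition, config, onResult) {
    recognition.continuous = true;      // keep listening across short pauses
    recognition.interimResults = true;  // fire events while the user is still speaking
    recognition.lang = config.language;
    recognition.onresult = onResult;
    return recognition;
}

// Browser usage (skipped when run outside a browser):
if (typeof window !== 'undefined') {
    const Ctor = getRecognitionCtor(window);
    if (Ctor) {
        const recognition = configureRecognition(
            new Ctor(),
            { language: 'en-US' },
            (event) => console.log(event.results[event.results.length - 1][0].transcript)
        );
        // Call recognition.start() from the "Start Listening" click handler.
        console.log('Recognizer ready for language', recognition.lang);
    } else {
        console.warn('Speech recognition is not available in this browser.');
    }
}
```

Returning `null` when no constructor exists gives the page one place to show a "browser not supported" message instead of failing later.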

### Speech Synthesis: Making Your Computer Talk Back

Once we get the AI's response, we use the browser's built-in speech synthesis to read it aloud:

```javascript
function speakText(text) {
    const utterance = new SpeechSynthesisUtterance(text);
    speechSynthesis.speak(utterance);
}
```

Browser speech synthesis might not sound like Morgan Freeman, but it's surprisingly good these days. And unlike recorded audio, it can say literally anything our AI responds with!
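
A slightly fuller sketch, assuming you want to cancel any speech still in progress and to know when the assistant has finished talking. The `synth`, `Utterance`, `lang`, and `onDone` options are hypothetical additions that also let the function run outside a browser:

```javascript
// Sketch only: every option besides `text` is an illustrative assumption.
function speakText(text, { synth, Utterance, lang = 'en-US', onDone = () => {} }) {
    synth.cancel();                  // drop anything still queued or playing
    const utterance = new Utterance(text);
    utterance.lang = lang;
    utterance.onend = onDone;        // fires when playback finishes
    utterance.onerror = onDone;      // treat errors as "done" so the UI never gets stuck
    synth.speak(utterance);
    return utterance;
}

// Browser usage (skipped when run outside a browser):
if (typeof window !== 'undefined' && 'speechSynthesis' in window) {
    speakText('Hello from your assistant', {
        synth: window.speechSynthesis,
        Utterance: window.SpeechSynthesisUtterance,
        onDone: () => console.log('Finished speaking'),
    });
}
```

The `onDone` callback is a natural place to flip a `speaking` flag back to `false`, so the recognizer is not restarted while the assistant is still talking.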

## The Server: WebSocket Orchestrator

The Node.js server (in `index.js`) acts as the communication hub between your voice and the AI's brain. It:
1. Hosts the HTML interface
2. Handles WebSocket connections for real-time communication  
3. Forwards your spoken text to the LiteLLM proxy
4. Relays the AI responses back to your browser

The most interesting part is how we communicate with LiteLLM:

```javascript
async function queryLLM(ws, message) {
    try {
        const got = (await import('got')).default;
        const response = await got(process.env.LLM_API_ENDPOINT, {
            method: 'POST',
            headers: {
                'content-type': 'application/json',
                'Authorization': `Bearer ${API_KEY}`,
            },
            json: {
                model: LLM_MODEL,
                messages: [
                    {
                        "role": "system",
                        "content": "Keep response to 4 lines of text."
                    },
                    { role: 'user', content: message }
                ],
                max_tokens: 1000
            }
        });

        const data = JSON.parse(response.body);
        const content = data?.choices?.[0]?.message?.content;
        if (content) {
            ws.send(content);
        } else {
            // Tell the client rather than leaving it waiting in silence
            ws.send('Sorry, I did not get a response.');
        }
    } catch (error) {
        console.error('Error querying LLM:', error);
        ws.send('Sorry, there was an error processing your request.');
    }
}
```

This function takes what you said, packages it for the LiteLLM API, and sends the response back to your browser via WebSockets. Notice how we include a system message to keep responses concise - perfect for voice interaction! The beauty of using LiteLLM is that it provides a unified interface to dozens of different LLM providers.
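
The request and response follow the OpenAI-style chat completions shape that LiteLLM exposes. The helpers below are a hypothetical refactoring (the names `buildChatRequest` and `extractReply` are mine, not the project's) that isolates those two shapes so they can be checked without a network call:

```javascript
// Hypothetical helpers, not part of the original index.js.

// Build the JSON body sent to the chat completions endpoint.
function buildChatRequest(model, userText, systemPrompt = 'Keep response to 4 lines of text.') {
    return {
        model,
        messages: [
            { role: 'system', content: systemPrompt },
            { role: 'user', content: userText },
        ],
        max_tokens: 1000,
    };
}

// Pull the assistant's text out of a response, or return null if it is missing.
function extractReply(data) {
    const content = data?.choices?.[0]?.message?.content;
    return typeof content === 'string' && content.trim() !== '' ? content : null;
}

// Example with a trimmed-down response body:
const sample = {
    choices: [{ message: { role: 'assistant', content: '1 tablespoon is about 15 ml.' } }],
};
console.log(extractReply(sample)); // prints "1 tablespoon is about 15 ml."
console.log(extractReply({}));     // prints null
```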

## The Whole Conversation Flow

![Voice Chat Webapp](images/VoiceChatWebapp.jpg)


Here's what happens when you use this application:

1. You click "Start Listening"
2. Your browser asks for microphone permission
3. You speak your question or command
4. You stop talking and wait (for about 15 seconds)
5. The browser detects silence and sends your speech text via WebSocket to the Node.js server
6. The Node.js server forwards your text to the LiteLLM proxy
7. LiteLLM routes the request to GROQ's Llama3-70B model
8. The AI model generates a response (limited to 4 lines for voice-friendly delivery)
9. The response travels back through LiteLLM → Node.js server → your browser
10. Your browser speaks the response aloud using speech synthesis

It's like a digital game of telephone with three stops, except nothing gets lost in translation and it's blazingly fast thanks to GROQ's inference speed!

## The Code: A Masterpiece of Modular Design

After some refactoring (because my first version looked like it was written during a caffeine overdose), both files now follow clean code principles:

### The Server (index.js)

```javascript
const http = require('http');

// Configuration: the API key comes from the environment (see the .env file below)
const PORT = 9898;
const API_KEY = process.env.API_KEY;
const LLM_MODEL = 'llama3-70b';
const HTML_FILE = 'index.html';

// Single-purpose functions with clear names
function createHttpServer() {
    return http.createServer(handleHttpRequest);
}

function handleWebSocketConnection(ws) {
    console.log('Client connected');
    ws.on('message', (message) => handleIncomingMessage(ws, message));
    ws.on('close', () => console.log('Client disconnected'));
}

// Uses environment variable for LiteLLM endpoint
// process.env.LLM_API_ENDPOINT points to LiteLLM proxy
```
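
The listing references `handleIncomingMessage` without showing it. Here is one plausible shape, assuming the `ws` package (recent versions deliver messages as `Buffer` objects) and a `query` function with the same signature as `queryLLM`. The third parameter is my addition for testability, and the real implementation in the repository may differ:

```javascript
// Sketch: `query` is injected so the function can run without a live LLM.
async function handleIncomingMessage(ws, message, query) {
    const text = message.toString().trim(); // Buffer or string -> trimmed string
    if (text === '') {
        return false;                        // ignore empty transcripts
    }
    console.log('Received:', text);
    await query(ws, text);
    return true;
}

// Stand-ins for a socket and for queryLLM, to show the call pattern:
const fakeSocket = { sent: [], send(msg) { this.sent.push(msg); } };
const echoQuery = async (ws, text) => ws.send(`You said: ${text}`);

handleIncomingMessage(fakeSocket, Buffer.from('  hello  '), echoQuery)
    .then(() => console.log(fakeSocket.sent[0])); // prints "You said: hello"
```

Skipping empty messages avoids spending an LLM call when the recognizer fires without capturing any words.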

### The Client (index.html)

```javascript
// Organized into configuration, DOM elements, and state
const CONFIG = {
    silenceTimeout: 15000,
    wsEndpoint: 'wss://openui.davidsells.today',
    language: 'en-US'
};

const DOM = {
    startButton: document.getElementById('startButton'),
    status: document.getElementById('status'),
    output: document.getElementById('output'),
    finalResult: document.getElementById('finalResult')
};

const STATE = {
    silenceTimer: null,
    finalSpeechResult: '',
    speaking: false
};

// Clear initialization flow
function initApp() {
    checkBrowserSupport();
    setupWebSocket();
    setupSpeechRecognition();
}
```
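
`setupWebSocket` is not shown above. Here is a minimal sketch with automatic reconnection, assuming a fixed retry delay. The `reconnectDelayMs` option and the injected `WebSocketCtor` and `setTimer` parameters are hypothetical and exist mainly to keep the example testable:

```javascript
// Sketch only: option names are illustrative assumptions.
function setupWebSocket({ url, onMessage, WebSocketCtor, reconnectDelayMs = 2000, setTimer = setTimeout }) {
    const connection = { socket: null, closedByUser: false };

    function connect() {
        const socket = new WebSocketCtor(url);
        connection.socket = socket;
        socket.onmessage = (event) => onMessage(event.data);
        socket.onclose = () => {
            // Retry unless the socket was closed on purpose.
            if (!connection.closedByUser) setTimer(connect, reconnectDelayMs);
        };
    }

    connection.close = () => {
        connection.closedByUser = true;
        connection.socket.close();
    };

    connect();
    return connection;
}

// Browser usage (skipped when run outside a browser):
if (typeof window !== 'undefined') {
    setupWebSocket({
        url: 'ws://localhost:9898',
        onMessage: (text) => console.log('AI says:', text),
        WebSocketCtor: window.WebSocket,
    });
}
```

A fixed delay keeps the sketch short; a production client would more likely back off between attempts.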

## The LiteLLM Magic: One API to Rule Them All

One of the coolest parts of this setup is LiteLLM, which acts as a universal translator for different LLM APIs. Instead of writing separate code for OpenAI, Anthropic, GROQ, or dozens of other providers, LiteLLM provides a single, consistent interface.

Our `config.yaml` file tells LiteLLM how to route requests:

```yaml
model_list:
  - model_name: 'llama3-70b'
    litellm_params:
      model: 'groq/llama3-70b-8192'
      api_key: your_groq_api_key_here
```

This configuration maps our friendly model name `llama3-70b` to the GROQ-hosted model `groq/llama3-70b-8192`. Want to switch to OpenAI's GPT-4? Just change the configuration file - no code changes needed!

## Running the Application: Docker-Powered Deployment

The modern way to run this application is with Docker Compose, which orchestrates both our Node.js application and the LiteLLM proxy. Here's how to get started:

### Prerequisites

1. Install Docker and Docker Compose
2. Get a GROQ API key from [groq.com](https://groq.com)
3. Clone the repository: `git clone [repository-url]`

### Quick Start with Docker Compose

1. **Create your environment file** (`.env`):
   ```bash
   API_KEY=your_litellm_master_key_here
   GROQ_API_KEY=your_groq_api_key_here
   ```

2. **Update the config.yaml** with your GROQ API key (replace the placeholder)

3. **Launch everything with one command**:
   ```bash
   docker-compose up -d
   ```

   That's it! Docker Compose will:
   - Build the Node.js application container
   - Pull and configure the LiteLLM container
   - Set up networking between the containers
   - Expose the web interface on port 9898

4. **Access your voice assistant**:
   - Open your browser to `http://localhost:9898`
   - Click "Start Listening" and start talking!

### Manual Setup (if you prefer the old-school way)

If you want to run things manually without Docker:

1. **Set up LiteLLM**:
   ```bash
   pip install 'litellm[proxy]'
   litellm --config config.yaml --port 4000
   ```

2. **Set up the Node.js app**:
   ```bash
   npm install
   export LLM_API_ENDPOINT=http://localhost:4000/v1/chat/completions
   node index.js
   ```

3. **Access the application** at `http://localhost:9898`
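
If the page loads but the assistant stays silent, it can help to check the proxy on its own. The script below is a rough sketch for Node 18 or later (which ships a global `fetch`); the function name `checkProxy` and the injectable `fetchImpl` parameter are my own, and it only runs the live check when `LLM_API_ENDPOINT` is set:

```javascript
// Hypothetical smoke test, not part of the project: sends one tiny request to the proxy.
async function checkProxy({ endpoint, apiKey, model, fetchImpl = fetch }) {
    const response = await fetchImpl(endpoint, {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            Authorization: `Bearer ${apiKey}`,
        },
        body: JSON.stringify({
            model,
            messages: [{ role: 'user', content: 'Say "ready".' }],
            max_tokens: 10,
        }),
    });
    if (!response.ok) {
        throw new Error(`Proxy returned HTTP ${response.status}`);
    }
    const data = await response.json();
    return data.choices[0].message.content;
}

// Only contact the proxy when an endpoint has been exported.
if (process.env.LLM_API_ENDPOINT) {
    checkProxy({
        endpoint: process.env.LLM_API_ENDPOINT,
        apiKey: process.env.API_KEY,
        model: 'llama3-70b',
    })
        .then((reply) => console.log('Proxy replied:', reply))
        .catch((err) => console.error('Proxy check failed:', err.message));
}
```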

## Customization Ideas

The beauty of this modular architecture is how easily you can customize it:

### LLM Provider Changes
- **Switch to OpenAI**: Update `config.yaml` to use `openai/gpt-4`
- **Try Claude**: Change to `anthropic/claude-3-sonnet-20240229`
- **Use local models**: Point to Ollama, LM Studio, or other local endpoints
- **Multiple models**: Configure different models for different purposes

### Application Tweaks
- Adjust the silence timeout (currently 15 seconds) in the client code
- Modify the system prompt to change AI personality or response style
- Add conversation history and context memory
- Implement voice authentication
- Add wake word detection
- Create custom UI themes
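
As one example, conversation history could be kept per WebSocket connection and sent as the `messages` array on every request. The sketch below is hypothetical (`createConversation` and its `maxMessages` cap are not in the project); it keeps the system prompt fixed and drops the oldest entries once the cap is reached:

```javascript
// Hypothetical helper: system prompt plus a rolling window of recent messages.
function createConversation(systemPrompt, maxMessages = 10) {
    const history = [];
    return {
        add(role, content) {
            history.push({ role, content });
            // Drop the oldest entries once the cap is exceeded.
            while (history.length > maxMessages) history.shift();
        },
        messages() {
            return [{ role: 'system', content: systemPrompt }, ...history];
        },
    };
}

const chat = createConversation('Keep response to 4 lines of text.', 4);
chat.add('user', 'How many ml in a tablespoon?');
chat.add('assistant', 'About 15 ml.');
chat.add('user', 'And in a teaspoon?');
console.log(chat.messages().length); // prints 4 (system prompt + 3 history entries)
```

Capping the history keeps request size predictable, which matters when every extra token adds latency before the assistant starts speaking.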

### Docker Deployment Options
- **Production deployment**: Use Docker Swarm or Kubernetes
- **HTTPS/SSL**: Add Nginx reverse proxy for secure connections
- **Scaling**: Run multiple app instances behind a load balancer
- **Monitoring**: Add health checks and logging containers

## The Technical Challenges I Faced

Building this wasn't all sunshine and JavaScript. Here are some hurdles I overcame:

1. **Browser Compatibility**: The Web Speech API isn't universally supported (I'm looking at you, Firefox)
2. **Silence Detection**: Finding the right timeout value that doesn't cut you off mid-sentence but also doesn't wait forever
3. **WebSocket Stability**: Ensuring connections remain stable and reconnect if broken
4. **Container Networking**: Getting the Node.js app to communicate with LiteLLM inside Docker
5. **API Response Formatting**: Ensuring voice-friendly responses that aren't too long or technical
6. **Environment Configuration**: Managing API keys and endpoints across development and production
7. **Voice Synthesis Quality**: Working with the limitations of browser-based speech synthesis

## Why This Matters: The Future of Human-Computer Interaction

Voice interfaces are becoming increasingly important. They're not just convenient—they're essential for:

- Accessibility for those with mobility impairments
- Hands-free operation in industrial, medical, or culinary settings
- Reducing screen time while maintaining productivity
- Creating more natural human-computer interactions

## Conclusion: Talk Is No Longer Cheap—It's Valuable!

This project demonstrates how modern web technologies, containerization, and AI APIs can work together to create a sophisticated hands-free AI assistant. The combination of speech recognition, LiteLLM's universal API gateway, GROQ's lightning-fast inference, and speech synthesis creates an entirely new way to interact with artificial intelligence.

By containerizing the application with Docker, we've made it incredibly easy to deploy and scale. The LiteLLM proxy adds flexibility that would have required significant engineering effort to build from scratch. And with GROQ's blazing-fast Llama3-70B, responses come back so quickly you'll forget you're talking to a machine.

The three-tier architecture (Client → Node.js → LiteLLM, with GROQ as the upstream model provider) might seem complex, but each component has a clear responsibility, making the system both maintainable and extensible.

So next time you're up to your elbows in engine grease, bread dough, or finger paint, just run `docker-compose up -d` and remember that your AI assistant is just a few spoken words away!

---

## Code Download

Full code is available in my Git repository: [https://home.davidhsells.ca/Public/voicechatai.git/](https://home.davidhsells.ca/Public/voicechatai.git/)

---

*Have you built something similar or have ideas for improvements? Let me know in the comments below!*

*Tags: #JavaScript #AI #SpeechRecognition #LLM #WebDevelopment #Accessibility*