How to set up Whisper Speech Detector on Windows

This guide demonstrates how to set up a local Whisper speech recognition server on Windows and connect it to Ozeki Voice Keyboard. You will learn how to install the required dependencies, start the Whisper server using agent-cli, and configure Ozeki Voice Keyboard to use it for speech-to-text transcription.

What is Whisper?

Whisper is an open-source speech recognition model developed by OpenAI. In this setup, it is run locally on your Windows machine using the faster-whisper backend via agent-cli, which exposes an OpenAI-compatible /v1/audio/transcriptions endpoint. This allows Ozeki Voice Keyboard to send recorded audio to the local server and receive transcribed text in response.
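As a sketch of what that endpoint looks like, a transcription request can be sent with curl. The port (10301) matches the server output shown later in this guide, and test.wav stands in for any recorded audio file; both are assumptions to adjust to your own setup:

```shell
# Sketch: send a recording to the local Whisper server's
# OpenAI-compatible endpoint. Port 10301 and the file name test.wav
# are assumptions; adjust them to match your own setup.
RESPONSE=$(curl -s http://localhost:10301/v1/audio/transcriptions \
  -F "file=@test.wav" \
  -F "model=small" \
  || echo "Whisper server not reachable")
echo "$RESPONSE"
```

If the server is running, the response is a JSON object containing the transcribed text.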

Steps to follow

Before proceeding, make sure Anaconda is installed on your system. FFmpeg can be installed via winget, which ships preinstalled on Windows 11 and on up-to-date Windows 10 systems. The agent-cli package will be installed with pip during the setup process.

  1. Open a terminal window
  2. Install FFmpeg
  3. Set up the Anaconda environment
  4. Install agent-cli with faster-whisper
  5. Start the Whisper server
  6. Connect Whisper to Ozeki Voice Keyboard

Quick reference commands

# Install FFmpeg audio processing library
winget install ffmpeg

# Create a Python 3.11 Conda environment
conda create -n whisper-ai python=3.11 -y

# Activate the environment
conda activate whisper-ai

# Install agent-cli with the faster-whisper backend
pip install "agent-cli[faster-whisper]"

# Start the Whisper server using the small model
agent-cli server whisper --model small

How to set up and run Whisper on Windows video

The following video shows how to set up and run the Whisper speech recognition server on Windows step-by-step. The video covers installing FFmpeg, creating the Conda environment, installing agent-cli, and starting the server.

Step 1 - Open a terminal window

Open a terminal window on your system. All setup commands in this guide are run from the terminal. You can use Windows PowerShell or the standard Command Prompt (Figure 1).

Open terminal window
Figure 1 - Open a terminal window

Step 2 - Install FFmpeg

Install FFmpeg using the winget package manager. FFmpeg is required by the Whisper backend to handle audio file processing before transcription (Figure 2).

winget install ffmpeg

Install FFmpeg using winget
Figure 2 - Install FFmpeg using winget
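To confirm the installation, you can check the FFmpeg version from the terminal (you may need to open a new terminal window so the updated PATH takes effect). The guard below only matters on systems where FFmpeg is missing:

```shell
# Optional check: confirm FFmpeg is available on the PATH.
if command -v ffmpeg >/dev/null 2>&1; then
  FFMPEG_CHECK=$(ffmpeg -version | head -n 1)
else
  FFMPEG_CHECK="ffmpeg not found on PATH"
fi
echo "$FFMPEG_CHECK"
```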

Step 3 - Set up the Anaconda environment

Create a dedicated Conda environment with Python 3.11 for the Whisper server. Using a separate environment keeps its dependencies isolated from other Python projects on your system (Figure 3).

conda create -n whisper-ai python=3.11 -y

Create Anaconda environment
Figure 3 - Create the Anaconda environment

Activate the newly created environment. Your terminal prompt will update to show the active environment name (Figure 4).

conda activate whisper-ai

Activate environment
Figure 4 - Activate the Anaconda environment
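You can optionally confirm that the environment was created and is active by listing your Conda environments; the active one is marked with an asterisk:

```shell
# Optional check: list Conda environments; the active one is starred.
if command -v conda >/dev/null 2>&1; then
  CONDA_ENVS=$(conda env list)
else
  CONDA_ENVS="conda not found on PATH"
fi
echo "$CONDA_ENVS"
```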

Step 4 - Install agent-cli with faster-whisper

With the environment active, install agent-cli together with the faster-whisper backend using pip. The faster-whisper extra installs all dependencies needed to run the Whisper speech recognition model locally (Figure 5).

pip install "agent-cli[faster-whisper]"

Use pip to install agent-cli faster-whisper
Figure 5 - Install agent-cli with the faster-whisper backend
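To verify the installation, you can ask pip for the package details; if the package is missing, pip show prints nothing:

```shell
# Optional check: confirm agent-cli is installed in the active environment.
PKG_INFO=$(pip show agent-cli 2>/dev/null || true)
if [ -z "$PKG_INFO" ]; then
  PKG_INFO="agent-cli not installed"
fi
echo "$PKG_INFO"
```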

Step 5 - Start the Whisper server

Start the Whisper server using agent-cli with the small model. Whisper is available in several sizes (such as tiny, base, small, medium and large): smaller models transcribe faster, while larger ones are more accurate but need more memory. Choose a size that matches how powerful your computer is (Figure 6).

agent-cli server whisper --model small

Start Whisper server
Figure 6 - Start the Whisper server

On the first run, agent-cli will automatically download and install the Wyoming protocol dependency. Wait for this process to complete before proceeding (Figure 7).

Wait for agent-cli to install Wyoming
Figure 7 - Wait for agent-cli to install the Wyoming dependency

Once the dependency installation finishes, run the server start command again. This is only required on the first run (Figure 8).

agent-cli server whisper --model small

Rerun server
Figure 8 - Re-run the server start command

The Whisper server is now running and listening for transcription requests. Keep this terminal open for the duration of your session (Figure 9).

Server started
Figure 9 - The Whisper server is running

Step 6 - Connect Whisper to Ozeki Voice Keyboard

The following video shows how to connect the Whisper server to Ozeki Voice Keyboard and verify that transcription is working correctly. The video covers copying the API URL, configuring the Voice settings, and confirming the connection through the log viewer.

Copy the API URL from the terminal output. This is the endpoint you will enter in Ozeki Voice Keyboard to point it at the local Whisper server (Figure 10).

http://localhost:10301/v1

Copy API URL from terminal
Figure 10 - Copy the API URL from the terminal

Open Ozeki Voice Keyboard and locate its icon in the Windows system tray in the bottom right corner of your taskbar (Figure 11).

Open Ozeki Voice Keyboard and locate tray icon
Figure 11 - Locate the Ozeki Voice Keyboard tray icon

Before configuring the Voice settings, enable HTTP logging so you can verify that requests are reaching the Whisper server. Right-click the tray icon and navigate to Logs from the context menu (Figure 12).

Navigate to logs from context menu
Figure 12 - Navigate to Logs from the context menu

In the Logs window, enable HTTP logging and close the window. This will allow you to monitor the requests sent to the Whisper server after configuration (Figure 13).

Enable HTTP logging and close window
Figure 13 - Enable HTTP logging and close the window

Right-click the tray icon again and open the Voice settings from the context menu (Figure 14).

Open Voice settings from context menu
Figure 14 - Open Voice settings from the context menu

Enter the API URL you copied from the terminal and append "/audio/transcriptions" to the end of it. Specify the model name and leave the API key field empty, since the local server does not require authentication. Click OK to save the settings (Figure 15).

Enter API URL, model and key
Figure 15 - Enter the API URL, model and API key
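The value entered in the settings is simply the base URL from the terminal with the transcription path appended. A quick sketch, assuming the default port from this guide:

```shell
# Sketch: the full endpoint URL is the base URL plus the
# transcription path. Port 10301 is the default from this guide.
BASE_URL="http://localhost:10301/v1"
FULL_URL="$BASE_URL/audio/transcriptions"
echo "$FULL_URL"
```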

Test the setup by placing your cursor in any input field and using the voice recording hotkey to dictate some text. The audio will be sent to the local Whisper server and the transcription will be pasted into the active field (Figure 16).

Use speech-to-text transcription
Figure 16 - Use speech-to-text transcription

Open the Logs window to verify the request. You should see an HTTP request to the /v1/audio/transcriptions endpoint, confirming that Ozeki Voice Keyboard is successfully communicating with the local Whisper server (Figure 17).

View request in logs
Figure 17 - View the transcription request in the logs

To sum it up

You have successfully set up a local Whisper speech recognition server on Windows and connected it to Ozeki Voice Keyboard. The application will now use your local Whisper instance for all voice transcriptions, giving you a fast, private, and fully offline speech-to-text pipeline that does not rely on any external cloud service.


More information