How to set up Whisper Speech Detector on Windows
This guide demonstrates how to set up a local Whisper speech recognition server on Windows and connect it to Ozeki Voice Keyboard. You will learn how to install the required dependencies, start the Whisper server using agent-cli, and configure Ozeki Voice Keyboard to use it for speech-to-text transcription.
What is Whisper?
Whisper is an open-source speech recognition model developed by OpenAI. In this setup,
it is run locally on your Windows machine using the faster-whisper backend via
agent-cli, which exposes an OpenAI-compatible /v1/audio/transcriptions endpoint.
This allows Ozeki Voice Keyboard to send recorded audio to the local server and receive transcribed text in response.
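Under the hood, such a request is a standard multipart/form-data POST carrying a "file" and a "model" field, as defined by the OpenAI audio transcription API. The following sketch shows what building one of these requests by hand could look like, using only the Python standard library (the audio bytes and the base URL here are placeholders for illustration; in practice Ozeki Voice Keyboard builds the request for you):

```python
import io
import urllib.request
import uuid

def build_transcription_request(base_url: str, wav_bytes: bytes, model: str = "small"):
    """Build a multipart/form-data POST for an OpenAI-compatible
    /audio/transcriptions endpoint. Field names follow the OpenAI
    API convention: 'file' for the audio, 'model' for the model name."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    # 'model' form field
    body.write(f"--{boundary}\r\n".encode())
    body.write(b'Content-Disposition: form-data; name="model"\r\n\r\n')
    body.write(model.encode() + b"\r\n")
    # 'file' form field carrying the recorded audio
    body.write(f"--{boundary}\r\n".encode())
    body.write(b'Content-Disposition: form-data; name="file"; filename="audio.wav"\r\n')
    body.write(b"Content-Type: audio/wav\r\n\r\n")
    body.write(wav_bytes + b"\r\n")
    body.write(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        base_url.rstrip("/") + "/audio/transcriptions",
        data=body.getvalue(),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )

req = build_transcription_request("http://localhost:10301/v1", b"RIFF...")
print(req.full_url)  # http://localhost:10301/v1/audio/transcriptions
```

Sending the request with urllib.request.urlopen(req) would return the transcribed text from the server, provided the audio bytes are a valid recording.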
Steps to follow
Before proceeding, make sure Anaconda is installed on your system. FFmpeg will be installed via the winget package manager, which is built into Windows 10 and 11. The agent-cli package will be installed via pip during the setup process.
- Open a terminal window
- Install FFmpeg
- Set up the Anaconda environment
- Install agent-cli with faster-whisper
- Start the Whisper server
- Connect Whisper to Ozeki Voice Keyboard
Quick reference commands
# Install FFmpeg audio processing library
winget install ffmpeg

# Create a Python 3.11 Conda environment
conda create -n whisper-ai python=3.11 -y

# Activate the environment
conda activate whisper-ai

# Install agent-cli with the faster-whisper backend
pip install "agent-cli[faster-whisper]"

# Start the Whisper server using the small model
agent-cli server whisper --model small
How to set up and run Whisper on Windows video
The following video shows how to set up and run the Whisper speech recognition server on Windows step-by-step. The video covers installing FFmpeg, creating the Conda environment, installing agent-cli, and starting the server.
Step 1 - Open a terminal window
Open a terminal window on your system. All setup commands in this guide are run from the terminal. You can use Windows PowerShell or the standard Command Prompt (Figure 1).
Step 2 - Install FFmpeg
Install FFmpeg using the winget package manager. FFmpeg is required by the Whisper backend to handle audio file processing before transcription (Figure 2).
winget install ffmpeg
Step 3 - Set up the Anaconda environment
Create a dedicated Conda environment with Python 3.11 for the Whisper server. Using a separate environment keeps its dependencies isolated from other Python projects on your system (Figure 3).
conda create -n whisper-ai python=3.11 -y
Activate the newly created environment. Your terminal prompt will update to show the active environment name (Figure 4).
conda activate whisper-ai
Step 4 - Install agent-cli with faster-whisper
With the environment active, install agent-cli together with the faster-whisper backend
using pip. The faster-whisper extra installs all dependencies needed to run
the Whisper speech recognition model locally (Figure 5).
pip install "agent-cli[faster-whisper]"
Step 5 - Start the Whisper server
Start the Whisper server using agent-cli with the small model. You can choose from several model sizes, such as tiny, base, small, medium, or large, depending on how powerful your computer is (Figure 6).
agent-cli server whisper --model small
On the first run, agent-cli will automatically download and install the Wyoming protocol dependency. Wait for this process to complete before proceeding (Figure 7).
Once the dependency installation finishes, run the server start command again. This is only required on the first run (Figure 8).
agent-cli server whisper --model small
The Whisper server is now running and listening for transcription requests. Keep this terminal open for the duration of your session (Figure 9).
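If you want to confirm programmatically that the server is accepting connections before pointing a client at it, a small check like the following can help. This is a rough sketch: it only verifies that something is listening on the host and port of the API URL (the port 10301 is taken from the agent-cli terminal output), not that transcription itself works.

```python
import socket
from urllib.parse import urlparse

def server_listening(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host:port succeeds."""
    parsed = urlparse(base_url)
    try:
        with socket.create_connection((parsed.hostname, parsed.port), timeout=timeout):
            return True
    except OSError:
        return False

# Check the local Whisper server; prints True while the server terminal is open
print(server_listening("http://localhost:10301/v1"))
```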
Step 6 - Connect Whisper to Ozeki Voice Keyboard
The following video shows how to connect the Whisper server to Ozeki Voice Keyboard and verify that transcription is working correctly. The video covers copying the API URL, configuring the Voice settings, and confirming the connection through the log viewer.
Copy the API URL from the terminal output. This is the endpoint you will enter in Ozeki Voice Keyboard to point it at the local Whisper server (Figure 10).
http://localhost:10301/v1
Open Ozeki Voice Keyboard and locate its icon in the Windows system tray in the bottom right corner of your taskbar (Figure 11).
Before configuring the Voice settings, enable HTTP logging so you can verify that requests are reaching the Whisper server. Right-click the tray icon and navigate to Logs from the context menu (Figure 12).
In the Logs window, enable HTTP logging and close the window. This will allow you to monitor the requests sent to the Whisper server after configuration (Figure 13).
Right-click the tray icon again and open the Voice settings from the context menu (Figure 14).
Enter the API URL you copied from the terminal, append "/audio/transcriptions" to the end of the URL, specify the model name, and leave the API key field empty since the local server does not require authentication. Click OK to save the settings (Figure 15).
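To avoid typos when combining the copied base URL with the path, the final value entered in the settings can be sanity-checked like this (a trivial sketch; the base URL is the one shown in the terminal output above):

```python
base_url = "http://localhost:10301/v1"         # copied from the terminal output
endpoint = base_url + "/audio/transcriptions"  # value to enter in the Voice settings
print(endpoint)  # http://localhost:10301/v1/audio/transcriptions
```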
Test the setup by placing your cursor in any input field and using the voice recording hotkey to dictate some text. The audio will be sent to the local Whisper server and the transcription will be pasted into the active field (Figure 16).
Open the Logs window to verify the request. You should see an HTTP request to the
/v1/audio/transcriptions endpoint, confirming that Ozeki Voice Keyboard
is successfully communicating with the local Whisper server (Figure 17).
To sum it up
You have successfully set up a local Whisper speech recognition server on Windows and connected it to Ozeki Voice Keyboard. The application will now use your local Whisper instance for all voice transcriptions, giving you a fast, private, and fully offline speech-to-text pipeline that does not rely on any external cloud service.