How to set up Whisper Speech Detector on Windows
This guide demonstrates how to set up a local Whisper speech recognition server on Windows and connect it to Ozeki Voice Keyboard. You will learn how to install the required dependencies, start the Whisper server using agent-cli, and configure Ozeki Voice Keyboard to use it for speech-to-text transcription.
What is Whisper?
Whisper is an open-source speech recognition model developed by OpenAI. In this setup,
it is run locally on your Windows machine using the faster-whisper backend via
agent-cli, which exposes an OpenAI-compatible /v1/audio/transcriptions endpoint.
This allows Ozeki Voice Keyboard to send recorded audio to the local server and receive transcribed text in response.
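Under the hood, such a request is a standard multipart/form-data POST carrying a "file" and a "model" field, as defined by the OpenAI audio transcription API. The following sketch shows what building one of these requests by hand could look like, using only the Python standard library (the audio bytes and the base URL here are placeholders for illustration; in practice Ozeki Voice Keyboard builds the request for you):

```python
import io
import urllib.request
import uuid

def build_transcription_request(base_url: str, wav_bytes: bytes, model: str = "small"):
    """Build a multipart/form-data POST for an OpenAI-compatible
    /audio/transcriptions endpoint. Field names follow the OpenAI
    API convention: 'file' for the audio, 'model' for the model name."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    # 'model' form field
    body.write(f"--{boundary}\r\n".encode())
    body.write(b'Content-Disposition: form-data; name="model"\r\n\r\n')
    body.write(model.encode() + b"\r\n")
    # 'file' form field carrying the recorded audio
    body.write(f"--{boundary}\r\n".encode())
    body.write(b'Content-Disposition: form-data; name="file"; filename="audio.wav"\r\n')
    body.write(b"Content-Type: audio/wav\r\n\r\n")
    body.write(wav_bytes + b"\r\n")
    body.write(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        base_url.rstrip("/") + "/audio/transcriptions",
        data=body.getvalue(),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )

req = build_transcription_request("http://localhost:10301/v1", b"RIFF...")
print(req.full_url)  # http://localhost:10301/v1/audio/transcriptions
```

Sending the request with urllib.request.urlopen(req) would return the transcribed text from the server, provided the audio bytes are a valid recording.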
Steps to follow
Before proceeding, make sure Anaconda is installed on your system. FFmpeg will be installed via the winget package manager, which is built into Windows 10 and 11. The agent-cli package will be installed via pip during the setup process.
- Open a terminal window
- Install FFmpeg
- Set up the Anaconda environment
- Install agent-cli with faster-whisper
- Start the Whisper server
- Connect Whisper to Ozeki Voice Keyboard
Quick reference commands
# Install FFmpeg audio processing library
winget install ffmpeg

# Create a Python 3.11 Conda environment
conda create -n whisper-ai python=3.11 -y

# Activate the environment
conda activate whisper-ai

# Install agent-cli with the faster-whisper backend
pip install "agent-cli[faster-whisper]"

# Start the Whisper server using the small model
agent-cli server whisper --model small
How to set up and run Whisper on Windows video
The following video shows how to set up and run the Whisper speech recognition server on Windows step-by-step. The video covers installing FFmpeg, creating the Conda environment, installing agent-cli, and starting the server.
Step 1 - Open a terminal window
Open a terminal window on your system. All setup commands in this guide are run from the terminal. You can use Windows PowerShell or the standard Command Prompt (Figure 1).
Step 2 - Install FFmpeg
Install FFmpeg using the winget package manager. FFmpeg is required by the Whisper backend to handle audio file processing before transcription (Figure 2).
winget install ffmpeg
Step 3 - Set up the Anaconda environment
Create a dedicated Conda environment with Python 3.11 for the Whisper server. Using a separate environment keeps its dependencies isolated from other Python projects on your system (Figure 3).
conda create -n whisper-ai python=3.11 -y
Activate the newly created environment. Your terminal prompt will update to show the active environment name (Figure 4).
conda activate whisper-ai
Step 4 - Install agent-cli with faster-whisper
With the environment active, install agent-cli together with the faster-whisper backend
using pip. The faster-whisper extra installs all dependencies needed to run
the Whisper speech recognition model locally (Figure 5).
pip install "agent-cli[faster-whisper]"
Step 5 - Start the Whisper server
Start the Whisper server using agent-cli with the small model. You can choose from several model sizes, such as tiny, base, small, medium, or large, depending on how powerful your computer is (Figure 6).
agent-cli server whisper --model small
On the first run, agent-cli will automatically download and install the Wyoming protocol dependency. Wait for this process to complete before proceeding (Figure 7).
Once the dependency installation finishes, run the server start command again. This is only required on the first run (Figure 8).
agent-cli server whisper --model small
The Whisper server is now running and listening for transcription requests. Keep this terminal open for the duration of your session (Figure 9).
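If you want to confirm programmatically that the server is accepting connections before pointing a client at it, a small check like the following can help. This is a rough sketch: it only verifies that something is listening on the host and port of the API URL (the port 10301 is taken from the agent-cli terminal output), not that transcription itself works.

```python
import socket
from urllib.parse import urlparse

def server_listening(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host:port succeeds."""
    parsed = urlparse(base_url)
    try:
        with socket.create_connection((parsed.hostname, parsed.port), timeout=timeout):
            return True
    except OSError:
        return False

# Check the local Whisper server; prints True while the server terminal is open
print(server_listening("http://localhost:10301/v1"))
```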
Step 6 - Connect Whisper to Ozeki Voice Keyboard
The following video shows how to connect the Whisper server to Ozeki Voice Keyboard and verify that transcription is working correctly. The video covers copying the API URL, configuring the Voice settings, and confirming the connection through the log viewer.
Copy the API URL from the terminal output. This is the endpoint you will enter in Ozeki Voice Keyboard to point it at the local Whisper server (Figure 10).
http://localhost:10301/v1
Open Ozeki Voice Keyboard and locate its icon in the Windows system tray in the bottom right corner of your taskbar (Figure 11).
Before configuring the Voice settings, enable HTTP logging so you can verify that requests are reaching the Whisper server. Right-click the tray icon and navigate to Logs from the context menu (Figure 12).
In the Logs window, enable HTTP logging and close the window. This will allow you to monitor the requests sent to the Whisper server after configuration (Figure 13).
Right-click the tray icon again and open the Voice settings from the context menu (Figure 14).
Enter the API URL you copied from the terminal, append "/audio/transcriptions" to the end of the URL, specify the model name, and leave the API key field empty since the local server does not require authentication. Click OK to save the settings (Figure 15).
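To avoid typos when combining the copied base URL with the path, the final value entered in the settings can be sanity-checked like this (a trivial sketch; the base URL is the one shown in the terminal output above):

```python
base_url = "http://localhost:10301/v1"         # copied from the terminal output
endpoint = base_url + "/audio/transcriptions"  # value to enter in the Voice settings
print(endpoint)  # http://localhost:10301/v1/audio/transcriptions
```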
Test the setup by placing your cursor in any input field and using the voice recording hotkey to dictate some text. The audio will be sent to the local Whisper server and the transcription will be pasted into the active field (Figure 16).
Open the Logs window to verify the request. You should see an HTTP request to the
/v1/audio/transcriptions endpoint, confirming that Ozeki Voice Keyboard
is successfully communicating with the local Whisper server (Figure 17).
To sum it up
You have successfully set up a local Whisper speech recognition server on Windows and connected it to Ozeki Voice Keyboard. The application will now use your local Whisper instance for all voice transcriptions, giving you a fast, private, and fully offline speech-to-text pipeline that does not rely on any external cloud service.