How to use vLLM as the LLM backend for Ozeki Voice Keyboard on Ubuntu

This guide demonstrates how to configure a vLLM server on Ubuntu as the LLM backend for Ozeki Voice Keyboard on Windows. By running vLLM on a dedicated Ubuntu machine, the AI assistant feature can leverage GPU-accelerated inference over the network: your speech is transcribed by the configured voice model, the resulting text is forwarded to vLLM as a prompt, and the generated response is automatically inserted into the active input field on your Windows machine.

How it works

The diagram below illustrates the full pipeline of the AI assistant feature across the two machines.

sequenceDiagram
    participant Win as Windows Machine (192.168.95.26)
    participant Whisper as Voice Transcription Model
    participant VLLM as vLLM Server (192.168.95.22:8000)
    Win->>Whisper: Send recorded audio
    Whisper-->>Win: Return transcribed text
    Win->>VLLM: POST /v1/chat/completions (transcribed text)
    VLLM-->>Win: Return generated response
    Win->>Win: Paste response into active input field

Steps to follow

Before proceeding, make sure Anaconda is installed on your Ubuntu machine. A CUDA-compatible NVIDIA GPU is required for GPU-accelerated inference. The vllm package will be installed via pip during the setup process.

  1. Set up the Conda environment
  2. Install vLLM
  3. Start the vLLM server
  4. Connect vLLM to Ozeki Voice Keyboard

Quick reference commands

# Conda environment
conda create -n vllm python=3.12
conda activate vllm

# Install vLLM
pip install vllm

# Start the vLLM server
vllm serve Qwen/Qwen3.5-9B \
  --enforce-eager \
  --max-num-seqs 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95

How to set up vLLM on Ubuntu video

The following video shows how to set up and run the vLLM server on Ubuntu step-by-step. The video covers creating the Conda environment, installing vLLM, and starting the server.

Step 1 - Set up the Conda environment

Open a terminal on your Ubuntu machine and create a dedicated Conda environment with Python 3.12. Using a separate environment keeps vLLM's dependencies isolated from other projects on your system (Figure 1).

conda create -n vllm python=3.12

Open terminal and create Conda environment
Figure 1 - Open a terminal and create the Conda environment

Activate the environment. Your terminal prompt will update to reflect the active environment name (Figure 2).

conda activate vllm

Activate Conda environment
Figure 2 - Activate the Conda environment

Step 2 - Install vLLM

With the environment active, install vLLM using pip. Pip will automatically download vLLM along with all the dependencies required for serving large language models with GPU acceleration (Figure 3).

pip install vllm

Install vLLM using pip
Figure 3 - Install vLLM using pip

Step 3 - Start the vLLM server

Start the vLLM server with the Qwen 3.5 9B model. On the first run, vLLM automatically downloads the model from Hugging Face, which may take several minutes depending on your connection speed. The flags tune the server for this single-user use case: --enforce-eager disables CUDA graph capture for faster startup and lower memory overhead, --max-num-seqs 1 limits the server to one request at a time, --max-model-len 8192 caps the context length at 8192 tokens, and --gpu-memory-utilization 0.95 lets vLLM use up to 95% of the available GPU memory (Figure 4).

vllm serve Qwen/Qwen3.5-9B \
  --enforce-eager \
  --max-num-seqs 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95

Start vLLM server
Figure 4 - Start the vLLM server

The vLLM server is now running and listening for requests on port 8000. Keep this terminal open for the duration of your session. The endpoint is accessible to other machines on your local network at http://{your-ubuntu-ip}:8000/v1 (Figure 5).

vLLM server started
Figure 5 - The vLLM server is running and ready
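Before configuring the Windows client, you can confirm the endpoint is reachable from another machine on your network. The sketch below sends a minimal OpenAI-compatible chat completion request using only the Python standard library; the IP address and model name are the example values used in this guide, so substitute your own.

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> bytes:
    """Build an OpenAI-compatible chat completion request body."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return json.dumps(payload).encode("utf-8")

def query_vllm(base_url: str, model: str, prompt: str) -> str:
    """POST a prompt to the vLLM server and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=build_chat_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    return data["choices"][0]["message"]["content"]

# Example (run from any machine on your LAN; adjust the IP):
#   query_vllm("http://192.168.95.22:8000/v1", "Qwen/Qwen3.5-9B", "Hello!")
```

This mirrors the same /v1/chat/completions request that Ozeki Voice Keyboard will send in Step 4, so a successful response here means the Windows client only needs the correct URL and model name.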

Step 4 - Connect vLLM to Ozeki Voice Keyboard

The following video shows how to connect the Ubuntu vLLM server to Ozeki Voice Keyboard on Windows and verify that the AI assistant is working correctly. The video covers locating the tray icon, enabling HTTP logging, configuring the LLM settings, and confirming the connection through the log viewer.

On your Windows machine, open Ozeki Voice Keyboard and locate its icon in the system tray in the bottom right corner of your taskbar (Figure 6).

Open Ozeki Voice Keyboard
Figure 6 - Open Ozeki Voice Keyboard

Before configuring the LLM settings, enable HTTP logging so you can verify that requests are reaching the Ubuntu server. Right-click the tray icon and navigate to Logs from the context menu (Figure 7).

Navigate to logs from context menu
Figure 7 - Navigate to Logs from the context menu

Enable HTTP logging and close the window. Outgoing requests to the vLLM server will now be recorded and visible in the log viewer (Figure 8).

Enable HTTP logging and close window
Figure 8 - Enable HTTP logging and close the window

Right-click the tray icon again and open the LLM settings from the context menu (Figure 9).

Open LLM settings from context menu
Figure 9 - Open LLM settings from the context menu

Enter the API URL of the Ubuntu machine and specify the model name exactly as it was passed to vllm serve. You can leave the API key field empty, since vLLM does not require authentication unless the server was started with the --api-key option. Click OK to save the settings (Figure 10).

http://{ubuntu-machine-ip}:8000/v1

Enter API URL, model and key
Figure 10 - Enter the API URL, model name and API key
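The model name you enter must match one of the IDs the server reports on its /v1/models endpoint. The sketch below queries that endpoint with the Python standard library; the base URL is the example address from this guide.

```python
import json
import urllib.request

def extract_model_ids(models_response: dict) -> list:
    """Pull the model IDs out of a /v1/models response payload."""
    return [entry["id"] for entry in models_response["data"]]

def list_served_models(base_url: str) -> list:
    """Fetch and return the model IDs the vLLM server is serving."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return extract_model_ids(json.loads(resp.read()))

# Example (replace with your Ubuntu machine's IP):
#   list_served_models("http://192.168.95.22:8000/v1")
```

If the name in the LLM settings does not match an ID returned here, the server will reject requests with a model-not-found error, which is a common cause of a silent AI assistant.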

To test the AI assistant, place your cursor in any input field, then press and hold Ctrl + Space and speak your question into the microphone. Once you release the keys, the recording is transcribed and the resulting text is forwarded to the vLLM server on the Ubuntu machine as a prompt (Figure 11).

Use AI assistant
Figure 11 - Use the AI assistant hotkey to ask a question

Open the Logs window to verify the request. You should see an HTTP request to the vLLM server's /v1/chat/completions endpoint on the Ubuntu machine, confirming that Ozeki Voice Keyboard is successfully communicating with the remote LLM backend (Figure 12).

View request in logs
Figure 12 - View the LLM request in the logs

Conclusion

You have successfully configured a vLLM server on Ubuntu and connected it to Ozeki Voice Keyboard on Windows. The AI assistant will now use your Ubuntu machine's GPU to generate responses, giving you a high-performance, fully local LLM backend that operates entirely within your own network without relying on any external cloud service.


More information