How to use LLama.cpp as the LLM backend for Ozeki Voice Keyboard on Ubuntu

This guide demonstrates how to build and configure a LLama.cpp server on Ubuntu as the LLM backend for Ozeki Voice Keyboard on Windows. By running LLama.cpp on a dedicated Ubuntu machine, the AI assistant feature can leverage GPU-accelerated inference over the network: your speech is transcribed by the configured voice model, the resulting text is forwarded to LLama.cpp as a prompt, and the generated response is inserted into the active input field on your Windows machine.

How it works

The diagram below illustrates the full pipeline of the AI assistant feature across the two machines.

sequenceDiagram
    participant Win as Windows Machine (192.168.95.26)
    participant Whisper as Voice Transcription Model
    participant LLama as LLama.cpp Server (192.168.95.22:8123)
    Win->>Whisper: Send recorded audio
    Whisper-->>Win: Return transcribed text
    Win->>LLama: POST /v1/chat/completions (transcribed text)
    LLama-->>Win: Return generated response
    Win->>Win: Paste response into active input field
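The request in step 3 of the diagram can be reproduced by hand with curl, which is useful for testing the backend independently of the keyboard. This is a sketch that assumes the server address from this guide (192.168.95.22:8123) and uses the model filename as the model identifier; llama-server accepts any model name here since it serves a single model.

```shell
# Build the chat completion payload that Ozeki Voice Keyboard would
# send after transcription (the question text is just an example).
cat > payload.json <<'EOF'
{
  "model": "Qwen_Qwen3.5-9B-Q4_K_M",
  "messages": [
    {"role": "user", "content": "What is the capital of France?"}
  ]
}
EOF

# llama-server exposes an OpenAI-compatible chat completions endpoint.
curl -s --max-time 5 http://192.168.95.22:8123/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @payload.json || echo "server unreachable"
```

If the server is running, the response is a JSON object whose choices[0].message.content field holds the generated answer.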

Steps to follow

Before proceeding, make sure Anaconda, Git and the NVIDIA CUDA toolkit are installed on your Ubuntu machine; the build script below expects the toolkit under /usr/local/cuda. A CUDA-compatible NVIDIA GPU is required for GPU-accelerated inference.
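You can verify the prerequisites with a quick shell loop before starting; nvcc comes with the CUDA toolkit and nvidia-smi with the NVIDIA driver.

```shell
# Print which of the required tools are on the PATH.
for tool in git conda nvcc nvidia-smi; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: NOT FOUND - install it before continuing"
  fi
done
```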

  1. Set up the Conda environment
  2. Clone LLama.cpp
  3. Build LLama.cpp with CUDA support
  4. Download the LLM model
  5. Start the LLama.cpp server
  6. Connect LLama.cpp to Ozeki Voice Keyboard

Quick reference commands

# Conda environment
conda create -n qwen35 python=3.12
conda activate qwen35

# Working directory and clone
mkdir qwen35
cd qwen35
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Create the CUDA-enabled build_llama script
nano build_llama.sh

export PATH=/usr/local/cuda/bin:$PATH
export CUDACXX=/usr/local/cuda/bin/nvcc
export CUDA_HOME=/usr/local/cuda
rm -rf build
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=86 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

# Running the build script
chmod +x build_llama.sh
./build_llama.sh

# Download model
pip install huggingface_hub
hf download bartowski/Qwen_Qwen3.5-9B-GGUF Qwen_Qwen3.5-9B-Q4_K_M.gguf --local-dir .

# Create the server runner script
nano run.sh

./llama.cpp/build/bin/llama-server \
  -m Qwen_Qwen3.5-9B-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8123 \
  -ngl 99 \
  -c 8192 \
  -fa auto

# Running the script
chmod +x run.sh
./run.sh

How to set up LLama.cpp on Ubuntu video

The following video shows how to build and run the LLama.cpp server on Ubuntu step-by-step. The video covers creating the Conda environment, cloning and building LLama.cpp with CUDA support, downloading the model, and starting the server.

Step 1 - Set up the Conda environment

Open a terminal on your Ubuntu machine and create a dedicated Conda environment with Python 3.12. Using a separate environment keeps LLama.cpp's dependencies isolated from other projects on your system (Figure 1).

conda create -n qwen35 python=3.12

Open terminal and create Conda environment
Figure 1 - Open a terminal and create the Conda environment

Activate the environment. Your terminal prompt will update to reflect the active environment name (Figure 2).

conda activate qwen35

Activate Conda environment
Figure 2 - Activate the Conda environment

Create a working directory for this setup and navigate into it (Figure 3).

mkdir qwen35
cd qwen35

Create and navigate to qwen directory
Figure 3 - Create and navigate to the working directory

Step 2 - Clone LLama.cpp

Clone the LLama.cpp repository from GitHub into the working directory (Figure 4).

git clone https://github.com/ggerganov/llama.cpp

Clone LLama.cpp using Git
Figure 4 - Clone the LLama.cpp repository

Navigate into the cloned LLama.cpp directory (Figure 5).

cd llama.cpp

Navigate to the cloned LLama folder
Figure 5 - Navigate to the cloned LLama.cpp folder

Step 3 - Build LLama.cpp with CUDA support

Create a build script using nano. This script sets the required CUDA environment variables and runs the CMake build with GPU support enabled (Figure 6).

nano build_llama.sh

Create build LLama shell script using nano
Figure 6 - Create the build script using nano

Paste the following build script content into nano, then save and exit (Ctrl+X, then Y, then Enter). The script configures CMake to build with CUDA enabled and targets CUDA architecture 86, which corresponds to compute capability 8.6 on NVIDIA Ampere GPUs such as the RTX 30 series; adjust this value if your GPU has a different compute capability (Figure 7).

export PATH=/usr/local/cuda/bin:$PATH
export CUDACXX=/usr/local/cuda/bin/nvcc
export CUDA_HOME=/usr/local/cuda

rm -rf build
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=86 \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build --config Release -j$(nproc)

Paste and save CUDA enabled build script
Figure 7 - Paste and save the CUDA-enabled build script
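If you are unsure which value to use for -DCMAKE_CUDA_ARCHITECTURES, you can query your GPU's compute capability before building. This assumes a driver recent enough to support the compute_cap query field.

```shell
# Prints e.g. "NVIDIA GeForce RTX 3090, 8.6" - drop the dot to get
# the value for -DCMAKE_CUDA_ARCHITECTURES (8.6 -> 86).
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader \
  || echo "nvidia-smi not available - check your NVIDIA driver install"
```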

Make the build script executable (Figure 8).

chmod +x build_llama.sh

Make script executable
Figure 8 - Make the build script executable

Run the build script. The compilation process may take several minutes depending on your hardware. Once complete, the llama-server binary will be available in the build/bin directory (Figure 9).

./build_llama.sh

Run build LLama script
Figure 9 - Run the build script
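Before moving on, it is worth confirming that the build actually produced the server binary. The --version flag below is a hedged check: recent llama.cpp builds print their build number and commit with it, but the executability test alone is enough to confirm the build succeeded.

```shell
# Confirm the CUDA-enabled build produced the server binary.
if [ -x build/bin/llama-server ]; then
  echo "llama-server built successfully"
  build/bin/llama-server --version || true
else
  echo "build failed - re-run ./build_llama.sh and check the output"
fi
```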

Step 4 - Download the LLM model

Navigate back to the working directory and install the huggingface_hub package, which provides the command-line tool used to download the model (Figure 10).

cd ..
pip install huggingface_hub

Install huggingface hub in qwen directory
Figure 10 - Install huggingface_hub in the working directory

Download the quantized Qwen 3.5 9B model in GGUF format from Hugging Face. The Q4_K_M quantization offers a good balance between model quality and memory usage (Figure 11).

hf download bartowski/Qwen_Qwen3.5-9B-GGUF Qwen_Qwen3.5-9B-Q4_K_M.gguf --local-dir .

Download LLM model using Hugging Face
Figure 11 - Download the model using Hugging Face
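A quick way to sanity-check the multi-gigabyte download is to inspect the file header: every valid GGUF file starts with the 4-byte magic string "GGUF", whereas a truncated download or a saved HTML error page will not.

```shell
# A valid GGUF file starts with the 4-byte magic "GGUF".
MODEL=Qwen_Qwen3.5-9B-Q4_K_M.gguf
if [ -f "$MODEL" ] && [ "$(head -c 4 "$MODEL")" = "GGUF" ]; then
  echo "model file looks valid"
  ls -lh "$MODEL"
else
  echo "model file missing or corrupt - re-run the hf download command"
fi
```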

Step 5 - Start the LLama.cpp server

Create a runner script using nano to avoid retyping the server command on each launch (Figure 12).

nano run.sh

Create runner shell script using nano
Figure 12 - Create the runner script using nano

Paste the server launch command into the script, then save and exit. The server is configured to listen on all network interfaces on port 8123, load all layers onto the GPU, use a context size of 8192 tokens, and enable flash attention (Figure 13).

./llama.cpp/build/bin/llama-server \
  -m Qwen_Qwen3.5-9B-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8123 \
  -ngl 99 \
  -c 8192 \
  -fa auto

Paste and save LLama server run command
Figure 13 - Paste and save the server launch command

Make the runner script executable (Figure 14).

chmod +x run.sh

Make runner script executable
Figure 14 - Make the runner script executable

Run the script to start the LLama.cpp server. Keep this terminal open for the duration of your session (Figure 15).

./run.sh

Start LLM runner script
Figure 15 - Start the LLama.cpp server

The LLama.cpp server is now running and listening for requests on port 8123. It is accessible to other machines on your local network at http://{your-ip-address}:8123/v1 (Figure 16).

LLama.cpp server started
Figure 16 - The LLama.cpp server is running
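Before switching to the Windows machine, you can verify the server from any computer on the LAN. This sketch assumes the IP address used throughout this guide; llama-server answers on /health once the model has finished loading, and the OpenAI-compatible /v1/models endpoint reports the loaded model's id, which is the model name to enter later in the Ozeki Voice Keyboard LLM settings.

```shell
# Replace the IP with your Ubuntu machine's address.
curl -s --max-time 5 http://192.168.95.22:8123/health || echo "server unreachable"

# Lists the loaded model; the "id" field is the model name.
curl -s --max-time 5 http://192.168.95.22:8123/v1/models || echo "server unreachable"
```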

Step 6 - Connect LLama.cpp to Ozeki Voice Keyboard

The following video shows how to connect the Ubuntu LLama.cpp server to Ozeki Voice Keyboard on Windows and verify that the AI assistant is working correctly. The video covers locating the tray icon, enabling HTTP logging, configuring the LLM settings, and confirming the connection through the log viewer.

Open Ozeki Voice Keyboard and locate its icon in the system tray in the bottom right corner of your taskbar (Figure 17).

Open Ozeki Voice Keyboard
Figure 17 - Open Ozeki Voice Keyboard

Before configuring the LLM settings, enable HTTP logging so you can verify that requests are reaching the Ubuntu server. Right-click the tray icon and navigate to Logs from the context menu (Figure 18).

Navigate to logs from context menu
Figure 18 - Navigate to Logs from the context menu

Enable HTTP logging and close the window. Outgoing requests to the LLama.cpp server will now be recorded and visible in the log viewer (Figure 19).

Enable HTTP logging and close window
Figure 19 - Enable HTTP logging and close the window

Right-click the tray icon again and open the LLM settings from the context menu (Figure 20).

Open LLM settings from context menu
Figure 20 - Open LLM settings from the context menu

Enter the API URL of the Ubuntu machine and specify the model name. You can leave the API key field empty since LLama.cpp does not require authentication by default. Click OK to save the settings (Figure 21).

http://{ubuntu-machine-ip}:8123/v1

Enter API URL, model and key
Figure 21 - Enter the API URL, model name and API key

To test the AI assistant, place your cursor in any input field, then press and hold Ctrl + Space and speak your question into the microphone. Once you release the keys, the recording is transcribed and the resulting text is forwarded to the LLama.cpp server on the Ubuntu machine as a prompt (Figure 22).

Use AI assistant
Figure 22 - Use the AI assistant hotkey to ask a question

Open the Logs window to verify the request. You should see an HTTP request to the LLama.cpp server's /v1/chat/completions endpoint on the Ubuntu machine, confirming that Ozeki Voice Keyboard is successfully communicating with the remote LLM backend (Figure 23).

View request in logs
Figure 23 - View the LLM request in the logs

To sum it up

You have successfully built and configured a LLama.cpp server on Ubuntu and connected it to Ozeki Voice Keyboard on Windows. The AI assistant will now use your Ubuntu machine's GPU to generate responses, giving you a high-performance, fully local LLM backend that operates entirely within your own network without relying on any external cloud service.


More information