Building a Production-Ready Qwen3 Model Service Platform from Scratch

System Requirements

This guide covers deploying Qwen3 models on an Ubuntu 22.04 cloud instance equipped with an NVIDIA A10 GPU (24GB VRAM). The setup requires network connectivity for downloading container images and model files.

Environment Verification

Confirm the GPU is visible on the PCI bus and that a compiler toolchain is present (required for building the driver's kernel modules):

lspci | grep -i nvidia
gcc --version

NVIDIA Driver Installation

Install kernel headers matching the current kernel:

sudo apt-get update && sudo apt-get install linux-headers-$(uname -r)

Add the CUDA repository and install the driver:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update && sudo apt-get install nvidia-open -y
sudo reboot

Verify the installation:

nvidia-smi

Time estimate: 3 minutes

Container Runtime Setup

Docker Engine

Remove conflicting packages (apt may report some as not installed; that is expected):

for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do
    sudo apt-get remove -y $pkg
done

Configure the repository:

sudo apt-get update
sudo apt-get install ca-certificates curl -y
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

Install Docker:

sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y
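Before moving on, it is worth confirming the engine actually works. The standard smoke test from Docker's own documentation pulls and runs a tiny image from Docker Hub:

```shell
# Verify the Docker daemon is running and can pull and run images.
# On success it prints "Hello from Docker!" and exits.
sudo docker run --rm hello-world
```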

Time estimate: 1 minute

NVIDIA Container Toolkit

Configure the repository:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Install and configure:

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker

Modify /etc/docker/daemon.json to set the cgroup driver to cgroupfs. This works around a known issue where containers lose GPU access under the systemd cgroup driver (typically surfacing as "Failed to initialize NVML" errors):

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}

Restart Docker:

sudo systemctl restart docker
docker info | grep -i runtime
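As a final sanity check that containers can reach the GPU, run nvidia-smi inside a CUDA base image. The image tag below is an assumption for illustration; any current cuda:*-base-ubuntu22.04 tag from Docker Hub should work:

```shell
# Run nvidia-smi inside a container through the NVIDIA runtime.
# The output should match the host's nvidia-smi (same driver, same GPU).
sudo docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```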

Time estimate: 1.5 minutes

GPUStack Deployment

GPUStack is an open-source model serving platform that supports heterogeneous GPU clusters spanning NVIDIA, AMD, Apple Silicon, and Chinese domestic accelerators. It supports the vLLM, MindIE, and llama-box inference engines and exposes built-in OpenAI-compatible APIs.

Deploy via Docker:

docker run -d --name gpustack \
    --restart=unless-stopped \
    --gpus all \
    --network=host \
    --ipc=host \
    -v gpustack-data:/var/lib/gpustack \
    swr.cn-north-9.myhuaweicloud.com/gpustack/gpustack:v0.6.0

Monitor startup:

docker logs -f gpustack

Retrieve the initial admin password:

docker exec -it gpustack cat /var/lib/gpustack/initial_admin_password

Access the web interface at http://YOUR_HOST_IP and log in with the admin credentials. The dashboard displays detected GPU resources.

Time estimate: 21 minutes (primarily image download)

Qwen3 Model Deployment

Navigate to Models > Deploy Model > ModelScope and search for the official Qwen3 repository. For a single A10 GPU, select Qwen3-4B, a compact model that the Qwen team reports delivers performance comparable to Qwen2.5-72B-Instruct.

Choose the vLLM backend for production-grade inference and initiate deployment.

Model download time: 14 minutes

After the model reaches Running status, test generation capabilities through the playground interface.
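Beyond the playground, the model can also be queried through GPUStack's OpenAI-compatible API. The sketch below is illustrative: it assumes an API key has been created in the web UI and that the deployed model is named qwen3-4b; the endpoint path has varied across GPUStack versions (commonly /v1-openai), so check your instance's API documentation and substitute your own host, key, and model name:

```shell
# Hypothetical values: replace YOUR_HOST_IP, YOUR_API_KEY, and the model
# name with those from your deployment; verify the API base path as well.
curl http://YOUR_HOST_IP/v1-openai/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -d '{
        "model": "qwen3-4b",
        "messages": [{"role": "user", "content": "Introduce yourself briefly."}]
    }'
```

The same endpoint and key can be plugged into any OpenAI-compatible client, which is how downstream tools such as Dify or RAGFlow connect.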

Summary

Total setup time: 43 minutes

  • Container image download: ~20 minutes
  • Model file download: ~14 minutes
  • Driver and runtime configuration: ~9 minutes

GPUStack v0.6 supports Qwen3 through both vLLM and llama-box backends. For larger models like Qwen3-235B-A22B exceeding single-GPU memory, distributed multi-node deployment is required—refer to the distributed inference documentation for configuration details.

Supported Configurations

Platforms: Linux, Windows, macOS
GPUs: NVIDIA, AMD, Apple Silicon, Huawei Ascend, Hygon, Moore Threads
Models: LLM, Multimodal, Embedding, Reranker, Image Generation, Speech-to-Text, Text-to-Speech
Inference Engines: vLLM, MindIE, llama-box (llama.cpp, stable-diffusion.cpp)
Enterprise Features: Auto-scheduling, fault recovery, distributed inference, load balancing, monitoring, user management
Integration: OpenAI-compatible API; works with Dify, RAGFlow, FastGPT, MaxKB

Tags: Qwen3 GPUStack vLLM Model Serving docker

Posted on Thu, 14 May 2026 21:11:23 +0000 by phyzar