Deploying and Testing Gemma 4 Locally with GPUStack: A Multimodal Agent Capability Guide

The recent release of Gemma 4 introduces models that compete with Qwen 3.5, offering stronger reasoning, native multimodal understanding, and agentic features such as tool calling and structured output. The model family supports text, image, video, and audio inputs (audio only on the E2B and E4B variants), with a context window of 128K to 256K tokens depending on the variant.

This walkthrough covers a complete local deployment pipeline using GPUStack, an open-source platform for GPU cluster orchestration. We test core capabilities including text generation, visual and audio processing, thinking mode, and tool calling.

Setting Up the GPUStack Environment

GPUStack provides a control plane for managing multi-GPU clusters and supports pluggable inference backends such as vLLM and SGLang. Begin by confirming that a container runtime such as Docker is installed and running:

docker info

Launch the GPUStack server container, which does not require a GPU and can run on any node:

sudo docker run -d --name gpustack \
  --restart unless-stopped \
  -p 80:80 \
  --volume gpustack-data:/var/lib/gpustack \
  swr.cn-south-1.myhuaweicloud.com/gpustack/gpustack:v2.1.1 \
  --debug --bootstrap-password GPUStack@123

Key parameters:

  • -p 80:80—Exposes the web console on port 80; adjust mapping as needed.
  • --volume—Persists platform data such as model configurations and API keys.
  • --bootstrap-password—Sets the initial admin password.
  • --debug—Enables verbose logging for troubleshooting.

Verify the server is operational:

docker logs -f gpustack
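
If you would rather script the readiness check than watch the logs, a minimal sketch is to poll the console URL until it answers. It assumes the web console returns HTTP 200 on its root path once the server is up, which is an assumption and not documented GPUStack behavior; http_ok and wait_until_ready are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(probe, attempts: int = 30, delay: float = 2.0) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A call such as wait_until_ready(lambda: http_ok("http://<Server-IP>:80")) would then block for up to about a minute before giving up.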

Navigate to http://<Server-IP>:80 and log in as admin with the bootstrap password set above. Create a Docker-based cluster, then register a GPU worker node.

Before adding a GPU worker, confirm the NVIDIA driver version is at least 575:

nvidia-smi
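
The same check can be automated. The sketch below reads the driver version through nvidia-smi's --query-gpu option and compares its major component with the 575 minimum stated above. The helper names are hypothetical, and the comparison assumes the usual dotted version format (for example 575.57.08).

```python
import subprocess

MIN_DRIVER_MAJOR = 575  # minimum stated in this guide


def driver_major(version: str) -> int:
    """Extract the major component from a driver string such as '575.57.08'."""
    return int(version.strip().split(".")[0])


def driver_is_supported(version: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """True when the driver's major version meets the minimum."""
    return driver_major(version) >= minimum


def installed_driver_versions() -> list:
    """Ask nvidia-smi for the driver version of each GPU (one line per GPU)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

On a worker node, all(driver_is_supported(v) for v in installed_driver_versions()) gives a single pass/fail answer.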

Ensure the NVIDIA Container Toolkit is configured for Docker:

sudo docker info 2>/dev/null | grep -q "Runtime.*nvidia" && echo "Nvidia Container Toolkit OK" || (echo "Nvidia Container Toolkit not configured"; exit 1)

Follow the GPUStack console instructions to add the worker. The provided command starts a worker container and automatically registers it with the server. Confirm the node shows a Ready status.

Adding Custom Inference Backend Versions

GPUStack's architecture decouples the platform from specific inference engines, allowing custom images or versions. To run Gemma 4, we set up custom vLLM backends.

Gemma 4 requires transformers >= 5.5.0, and audio input needs vLLM's audio extras, so build a tailored vLLM image from the following Dockerfile:

FROM vllm/vllm-openai:v0.19.0
RUN uv pip install --system "vllm[audio]" \
  && uv pip install --system transformers==5.5.0

Build and tag the image:

docker build -t vllm/vllm-openai:v0.19.0-gemma4 .

In the GPUStack console under Inference Backends, edit the vLLM configuration and switch to YAML mode. Import the following definitions, which include the official image and the custom Gemma 4 image:

backend_name: vLLM
version_configs:
  0.19.0-custom:
    image_name: vllm/vllm-openai:v0.19.0
    entrypoint: vllm serve
    run_command: >-
      {{model_path}} --host {{worker_ip}} --port {{port}} --served-model-name
      {{model_name}}
    env: {}
    custom_framework: cuda
  0.19.0-gemma4-custom:
    image_name: vllm/vllm-openai:v0.19.0-gemma4
    entrypoint: vllm serve
    run_command: >-
      {{model_path}} --host {{worker_ip}} --port {{port}} --served-model-name
      {{model_name}}
    env: {}
    custom_framework: cuda

Ensure the custom image is present on the worker node before deployment.
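
To see what the run_command template turns into, the sketch below expands the {{...}} placeholders with sample values. It only illustrates the substitution; GPUStack performs the real rendering internally, and the render helper and every value in the example dictionary are hypothetical.

```python
import re

RUN_COMMAND = (
    "{{model_path}} --host {{worker_ip}} --port {{port}} "
    "--served-model-name {{model_name}}"
)


def render(template: str, values: dict) -> str:
    """Replace each {{name}} placeholder; fail loudly on an unknown name."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value for placeholder {key!r}")
        return values[key]
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


# Hypothetical values, for illustration only.
example = {
    "model_path": "/models/gemma-4-31B-it",
    "worker_ip": "10.0.0.12",
    "port": "40000",
    "model_name": "gemma-4-31b-it",
}
print("vllm serve " + render(RUN_COMMAND, example))
```

Any extra backend parameters entered at deployment time are additional arguments to the same vllm serve command.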

Deploying a Gemma 4 Model

We deploy google/gemma-4-31B-it on a node with two 4090 GPUs (48GB each). For environments with internet access, use the Deploy → ModelScope option in the console. In offline setups, add model files from a local path beforehand.

Configuration:

  • Backend: vLLM
  • Version: 0.19.0-gemma4-custom (the custom version added above)

Use the following backend parameters. Tensor parallelism is set to 2; adjust it to match the number of GPUs available:

--tensor-parallel-size 2 --max-model-len 32768 --gpu-memory-utilization 0.9 --enable-auto-tool-choice --reasoning-parser gemma4 --tool-call-parser gemma4 --async-scheduling --limit-mm-per-prompt '{"image": 4, "video": 1, "audio": 1}'
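
A quoting mistake in the --limit-mm-per-prompt JSON is easy to make when pasting this string. As a rough pre-flight check, the sketch below splits the parameters the way a shell would and confirms that the JSON value parses. parse_flags is a hypothetical helper and knows nothing about which flags vLLM actually accepts.

```python
import json
import shlex

BACKEND_PARAMS = (
    "--tensor-parallel-size 2 --max-model-len 32768 "
    "--gpu-memory-utilization 0.9 --enable-auto-tool-choice "
    "--reasoning-parser gemma4 --tool-call-parser gemma4 --async-scheduling "
    """--limit-mm-per-prompt '{"image": 4, "video": 1, "audio": 1}'"""
)


def parse_flags(params: str) -> dict:
    """Split a flag string shell-style; flags without a value map to True."""
    tokens = shlex.split(params)
    flags = {}
    i = 0
    while i < len(tokens):
        name = tokens[i]
        if not name.startswith("--"):
            raise ValueError(f"unexpected token {name!r}")
        if i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
            flags[name] = tokens[i + 1]
            i += 2
        else:
            flags[name] = True
            i += 1
    return flags


flags = parse_flags(BACKEND_PARAMS)
# Raises json.JSONDecodeError if the quoting was damaged.
limits = json.loads(flags["--limit-mm-per-prompt"])
print(limits)
```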

Memory guidelines:

  • 31B and 26B models: at least 80GB VRAM recommended.
  • E2B and E4B models: 24GB VRAM is sufficient.
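
The 80GB figure for the larger models can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes BF16 weights (2 bytes per parameter) and a nominal 31 billion parameters. It ignores activations and any vision or audio encoder, so treat the result as a rough lower bound, not a measured value.

```python
def weight_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB (10^9 bytes); 2 bytes assumes BF16."""
    return params_billion * bytes_per_param


def vram_budget_gb(gpus: int, gb_per_gpu: float, utilization: float) -> float:
    """Memory vLLM may claim across all GPUs at a given --gpu-memory-utilization."""
    return gpus * gb_per_gpu * utilization


weights = weight_gb(31)                # BF16 weights for a 31B model
budget = vram_budget_gb(2, 48, 0.9)    # two 48GB cards at 0.9 utilization
headroom = budget - weights            # what remains for KV cache and overhead
print(f"weights={weights:.1f} GB budget={budget:.1f} GB headroom={headroom:.1f} GB")
```

Under these assumptions the weights alone take about 62GB, which leaves roughly 24GB of the 86.4GB budget for the KV cache at a 32768-token context.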

If your NVIDIA driver exceeds 575, you may see a compatibility warning because the container uses CUDA 12.9. This is safe to ignore; click Submit to proceed. Monitor logs via the View Logs operation until the instance reaches the Running state.

Testing Model Capabilities

Audio Transcription

Only gemma-4-e4b-it and gemma-4-e2b-it support audio, so this test requires one of them to be deployed. Use the following curl example (replace the endpoint, token, and model as needed):

curl http://<host-ip>/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <your-api-key>" \
-d '{
  "seed": null,
  "stop": null,
  "temperature": 1,
  "top_p": 1,
  "max_tokens": 4096,
  "frequency_penalty": 0,
  "presence_penalty": 0,
  "model": "gemma-4-e4b-it",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "<base64-encoded-audio>",
            "format": "mp3"
          }
        },
        {
          "type": "text",
          "text": "Provide a verbatim, word-for-word transcription of the audio."
        }
      ]
    }
  ]
}' | jq
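
Producing the <base64-encoded-audio> value by hand is awkward for anything longer than a few seconds. The sketch below builds the request body from a local file. audio_message and audio_request are hypothetical helpers, and the sketch puts the audio and the instruction in a single user message, which is one reasonable layout and not the only one.

```python
import base64
import json
from pathlib import Path


def audio_message(path: str, prompt: str, audio_format: str = "mp3") -> dict:
    """Build one user message carrying base64 audio plus a text instruction."""
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "input_audio",
             "input_audio": {"data": encoded, "format": audio_format}},
            {"type": "text", "text": prompt},
        ],
    }


def audio_request(model: str, path: str, prompt: str) -> str:
    """Serialize a chat-completions request body as JSON."""
    body = {"model": model, "max_tokens": 4096,
            "messages": [audio_message(path, prompt)]}
    return json.dumps(body)
```

Writing the result of audio_request to a file such as request.json lets you send it with curl using -d @request.json instead of an inline body.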

Video Understanding

Install the OpenAI Python library:

pip install openai

Create a script named video_understanding.py:

from openai import OpenAI

client = OpenAI(
    base_url="http://<host-ip>/v1",
    api_key="<your-api-key>"
)

response = client.chat.completions.create(
    model="gemma-4-31b-it",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {"url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/concert.mp4"}
                },
                {
                    "type": "text",
                    "text": "Summarize what happens in this video."
                }
            ]
        }
    ],
    max_tokens=1024
)

print(response.choices[0].message.content)

Execute it with python video_understanding.py.

Thinking Mode

Enable thinking by passing chat_template_kwargs in the request:

curl http://<host-ip>/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <your-api-key>" \
-d '{
  "max_tokens": 16384,
  "model": "gemma-4-31b-it",
  "messages": [
    {
      "role": "user",
      "content": "A classic problem: there are 40 heads and 130 legs. How many chickens and rabbits are there?"
    }
  ],
  "chat_template_kwargs": {"enable_thinking": true}
}' | jq
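
The same request can be made from the OpenAI Python client used elsewhere in this guide. chat_template_kwargs is not a standard OpenAI parameter, so the sketch passes it through the client's extra_body argument, which is merged into the JSON body. thinking_request is a hypothetical helper that only assembles the keyword arguments.

```python
def thinking_request(model: str, prompt: str, max_tokens: int = 16384) -> dict:
    """Keyword arguments for client.chat.completions.create with thinking on."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        # extra_body carries fields the OpenAI client does not know about,
        # such as vLLM's chat_template_kwargs.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": True}},
    }
```

With a client configured as in the video example, the call would be client.chat.completions.create(**thinking_request("gemma-4-31b-it", "...")).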

Note that Gemma 4 outputs its chain-of-thought inside a thought block rather than a separate reasoning field.
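
To judge whether the model's reasoning lands on the right answer, it helps to have the solution at hand. The puzzle reduces to two linear equations, which the short sketch below solves directly. solve_heads_and_legs is a hypothetical helper written for this check.

```python
def solve_heads_and_legs(heads: int, legs: int) -> tuple:
    """Return (chickens, rabbits) for the classic puzzle, or raise if impossible.

    chickens + rabbits = heads
    2*chickens + 4*rabbits = legs
    Subtracting twice the first equation from the second gives
    2*rabbits = legs - 2*heads.
    """
    doubled_rabbits = legs - 2 * heads
    if doubled_rabbits < 0 or doubled_rabbits % 2:
        raise ValueError("no whole-number solution")
    rabbits = doubled_rabbits // 2
    chickens = heads - rabbits
    if chickens < 0:
        raise ValueError("no whole-number solution")
    return chickens, rabbits


print(solve_heads_and_legs(40, 130))  # (15, 25)
```

For the prompt above the expected answer is 15 chickens and 25 rabbits, since 15 × 2 + 25 × 4 = 130.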

Tool Calling

Create a script named toolcalling.py that simulates a weather check:

from openai import OpenAI
import json

client = OpenAI(
    base_url="http://<host-ip>/v1",
    api_key="<your-api-key>"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Step 1: Initiate tool call
response = client.chat.completions.create(
    model="gemma-4-31b-it",
    messages=[
        {"role": "user", "content": "What is the weather in Tokyo today?"}
    ],
    tools=tools,
    max_tokens=1024
)

message = response.choices[0].message

if message.tool_calls:
    tool_call = message.tool_calls[0]
    print(f"Tool: {tool_call.function.name}")
    print(f"Args: {tool_call.function.arguments}")

    # Step 2: Supply a synthetic result
    response = client.chat.completions.create(
        model="gemma-4-31b-it",
        messages=[
            {"role": "user", "content": "What is the weather in Tokyo today?"},
            message,
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps({"temperature": 22, "condition": "Partly cloudy", "unit": "celsius"})
            }
        ],
        tools=tools,
        max_tokens=1024
    )

    print(f"\nFinal answer: {response.choices[0].message.content}")

Run python toolcalling.py to observe the tool execution flow.
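
The script above hard-codes the tool result. In an agent loop you would normally route each tool call to a local function. The sketch below shows one way to do that with a name-to-function registry. get_weather here is a stand-in that returns canned data, and TOOL_REGISTRY and run_tool_call are hypothetical names.

```python
import json
from types import SimpleNamespace


def get_weather(location: str, unit: str = "celsius") -> dict:
    """Stand-in implementation; returns canned data instead of calling an API."""
    return {"location": location, "temperature": 22,
            "condition": "Partly cloudy", "unit": unit}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call) -> dict:
    """Turn one tool call from the model into a role='tool' message."""
    name = tool_call.function.name
    if name not in TOOL_REGISTRY:
        result = {"error": f"unknown tool {name!r}"}
    else:
        try:
            args = json.loads(tool_call.function.arguments)
            result = TOOL_REGISTRY[name](**args)
        except (json.JSONDecodeError, TypeError) as exc:
            # Malformed JSON or arguments the function does not accept.
            result = {"error": str(exc)}
    return {"role": "tool", "tool_call_id": tool_call.id,
            "content": json.dumps(result)}


# A hand-built object shaped like an OpenAI client tool call, for a dry run.
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather",
                             arguments='{"location": "Tokyo"}'),
)
print(run_tool_call(fake_call))
```

Returning errors as tool content instead of raising lets the model see what went wrong and retry with corrected arguments.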

Posted on Sat, 13 Jun 2026 17:26:06 +0000 by teongkia