Deploying and Testing Gemma 4 Locally with GPUStack: A Multimodal Agent Capability Guide

The recent release of Gemma 4 introduces models that compete with Qwen3.5, offering enhanced reasoning, native multimodal understanding, and agentic features like tool calling and structured output. The model family supports text, image, video, and audio inputs with a 128K-256K context window, depending on the variant. This walkthrough covers ...

Posted on Sat, 13 Jun 2026 17:26:06 +0000 by teongkia

Multi-Node Distributed Deployment of Qwen3.5-397B-A17B on Ascend 910B

While vLLM commonly relies on Ray for distributed multi-node inference, it is possible to achieve cross-node coordination without an external scheduler by combining data parallelism (DP) and tensor parallelism (TP). This article walks through a concrete deployment on two Atlas 800I A2 servers (each with 8× Ascend 910B 64 GB) using the quantized ...

Posted on Thu, 11 Jun 2026 18:34:50 +0000 by djp120

Building a Production-Ready Qwen3 Model Service Platform from Scratch

System Requirements: This guide covers deploying Qwen3 models on an Ubuntu 22.04 cloud instance equipped with an NVIDIA A10 GPU (24 GB VRAM). The setup requires network connectivity for downloading container images and model files. Environment Verification: Confirm GPU availability: lspci | grep -i nvidia; gcc --version. NVIDIA Driver Installation: ...

Posted on Thu, 14 May 2026 21:11:23 +0000 by phyzar

Accelerated Multi-node Inference with Ascend: Simplified Deployment of Large-scale Models Using GPUStack

Deploying large-scale models on Ascend NPUs is often difficult because configuring distributed inference with the standard MindIE engine is complex. Although its performance is acceptable, the setup process involves intricate steps such as environment preparation, initialization, and parameter tuning. Even mi ...

Posted on Fri, 08 May 2026 09:15:03 +0000 by spasme