Deploying and Testing Gemma 4 Locally with GPUStack: A Multimodal Agent Capability Guide

The recent release of Gemma 4 introduces models that compete with Qwen 3.5, offering enhanced reasoning, native multimodal understanding, and agentic features such as tool calling and structured output. The model family supports text, image, video, and audio inputs with a context window of 128K to 256K tokens, depending on the variant. This walkthrough covers ...

Posted on Sat, 13 Jun 2026 17:26:06 +0000 by teongkia

Multi-Node Distributed Deployment of Qwen3.5-397B-A17B on Ascend 910B

While vLLM commonly relies on Ray for distributed multi-node inference, it is possible to achieve cross-node coordination without an external scheduler by combining data parallelism (DP) and tensor parallelism (TP). This article walks through a concrete deployment on two Atlas 800I A2 servers (each with 8× Ascend 910B 64 GB) using the quantized ...

Posted on Thu, 11 Jun 2026 18:34:50 +0000 by djp120

Evaluating vLLM's Performance Mode Flag: Throughput and Latency Optimization with Qwen 3.5

The --performance-mode argument (balanced, interactivity, throughput) shifts inference configuration from manual parameter sweeping to objective-driven tuning. Under the hood, interactivity refines CUDA graph capture granularity to minimize per-request latency, while throughput expands batch capacity limits to maximize aggregate token generatio ...

Posted on Sat, 06 Jun 2026 16:42:54 +0000 by louie35

Aligning vLLM and Hugging Face Inference for Long-Context Models

Offline Inference Configuration: The following benchmarks and alignment procedures were conducted in a specific environment designed for long-context processing. The hardware setup consisted of a single NVIDIA A6000 (48 GB) GPU. The software stack included Ubuntu, Python 3.10, PyTorch 2.3.0, Transformers 4.41.2, and vLLM 0.5.0.post1. The primary o ...

Posted on Mon, 25 May 2026 17:09:15 +0000 by gaogier

Implementing and Optimizing PagedAttention Kernels in vLLM

PagedAttention Memory Layout and Block Mapping: PagedAttention replaces traditional contiguous key-value cache allocations with a virtual-to-physical block mapping scheme. This approach mirrors operating system memory paging, allowing non-contiguous GPU memory segments to serve sequential generation tasks without fragmentation overhead. Each req ...
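
The virtual-to-physical mapping described in this excerpt can be illustrated with a small CPU-only sketch. It is not vLLM's implementation: the names BlockAllocator, Sequence, and locate, and the block size of 16 tokens, are assumptions made for illustration only. It shows how a per-request block table lets logically consecutive tokens live in non-adjacent physical blocks.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (assumed value, for illustration only)


@dataclass
class BlockAllocator:
    """Hands out physical block ids from a free pool (hypothetical helper)."""
    num_blocks: int
    free: list = field(default_factory=list)

    def __post_init__(self):
        # Blocks are popped from the end, so store ids in descending order
        # to hand out the lowest id first.
        self.free = list(range(self.num_blocks - 1, -1, -1))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


@dataclass
class Sequence:
    """Tracks one request's logical-to-physical block table."""
    block_table: list = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def locate(self, token_pos: int) -> tuple:
        # Translate a logical token position into (physical block, offset).
        if not 0 <= token_pos < self.num_tokens:
            raise IndexError(token_pos)
        return self.block_table[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE


# Two requests growing in lockstep end up with interleaved physical blocks.
allocator = BlockAllocator(num_blocks=8)
seq_a, seq_b = Sequence(), Sequence()
for _ in range(BLOCK_SIZE + 1):
    seq_a.append_token(allocator)
    seq_b.append_token(allocator)
print(seq_a.block_table, seq_b.block_table)
```

Because each request only holds a list of block ids, memory is claimed one block at a time as generation proceeds, and a finished request's blocks can be returned to the pool for any other request to reuse.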

Posted on Wed, 20 May 2026 06:06:02 +0000 by LarryK

Building a Production-Ready Qwen3 Model Service Platform from Scratch

System Requirements: This guide covers deploying Qwen3 models on an Ubuntu 22.04 cloud instance equipped with an NVIDIA A10 GPU (24 GB VRAM). The setup requires network connectivity for downloading container images and model files. Environment Verification: Confirm GPU availability with "lspci | grep -i nvidia" and check the compiler with "gcc --version". NVIDIA Driver Installation ...

Posted on Thu, 14 May 2026 21:11:23 +0000 by phyzar