LLM Inference - Freaks City - Where Weird Ideas Code Reality

LLM Inference

Evaluating vLLM's Performance Mode Flag: Throughput and Latency Optimization with Qwen 3.5

The --performance-mode argument (balanced, interactivity, throughput) shifts inference configuration from manual parameter sweeping to objective-driven tuning. Under the hood, interactivity refines CUDA graph capture granularity to minimize per-request latency, while throughput expands batch capacity limits to maximize aggregate token generatio ...

Posted on Sat, 06 Jun 2026 16:42:54 +0000 by louie35

Aligning vLLM and Hugging Face Inference for Long-Context Models

Offline Inference Configuration: The following benchmarks and alignment procedures were conducted in a specific environment designed for long-context processing. The hardware setup consisted of a single NVIDIA A6000 (48 GB) GPU. The software stack included Ubuntu, Python 3.10, PyTorch 2.3.0, Transformers 4.41.2, and vLLM 0.5.0.post1. The primary o ...

Posted on Mon, 25 May 2026 17:09:15 +0000 by gaogier
