Aligning vLLM and Hugging Face Inference for Long-Context Models

Offline Inference Configuration

The following benchmarks and alignment procedures were conducted in a specific environment designed for long-context processing. The hardware setup consisted of a single NVIDIA A6000 (48GB) GPU. The software stack included Ubuntu, Python 3.10, PyTorch 2.3.0, Transformers 4.41.2, and vLLM 0.5.0.post1. The primary objective was to achieve identical greedy decoding results between vLLM and Hugging Face for prompts exceeding 9,000 tokens.

Parameter Mapping and Consistency

To ensure output consistency, generation parameters must be strictly mapped between the two frameworks. While parameter names often correspond directly, there are critical behavioral differences.

vLLM Parameter	Hugging Face Parameter	Notes
`max_tokens`	`max_new_tokens`	Maximum number of tokens to generate.
`top_p`	`top_p`	Nucleus sampling probability.
`top_k`	`top_k`	vLLM defaults to 50, while Hugging Face defaults to -1 (entire vocabulary).
`temperature`	`temperature`	vLLM performs greedy decoding when `temperature=0`. Hugging Face does not allow `temperature=0` and requires `do_sample=False` for greedy decoding.
`repetition_penalty`	`repetition_penalty`	Penalty for repeated tokens.
N/A	`do_sample`	Enables sampling in Hugging Face. Must be `False` for greedy alignment.

For strict alignment, both frameworks should use greedy decoding. Note that top_p values can influence results even during greedy decoding attempts in some implementations, so they should be explicitly defined in both.

# Configuration for vLLM greedy decoding
vllm_config = {
    "max_tokens": 1150,
    "top_p": 0.9,
    "top_k": 50,
    "temperature": 0.0,
    "repetition_penalty": 1.0,
}

# Configuration for Hugging Face greedy decoding
hf_config = {
    "max_new_tokens": 1150,
    "top_p": 0.9,
    "top_k": 50,
    "temperature": 0.35,
    "repetition_penalty": 1.0,
    "do_sample": False
}

End-of-Sequence Token Alignment

Stopping criteria must be synchronized. Hugging Face uses eos_token_id, while vLLM uses stop_token_ids. For specific model architectures like Llama-3-instruct or Qwen2, these IDs must be manually set to match the tokenizer's expectations.

# Hugging Face EOS setup
if model_arch == 'llama':
    hf_config['eos_token_id'] = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|end_of_text|>")]
    hf_config['pad_token_id'] = tokenizer.eos_token_id
elif model_arch == 'qwen2':
    hf_config['eos_token_id'] = [151645, 151643]
    hf_config['pad_token_id'] = 151643

# vLLM Stop Token setup
if model_arch == 'llama':
    vllm_config['stop_token_ids'] = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|end_of_text|>")]
elif model_arch == 'qwen2':
    vllm_config['stop_token_ids'] = [151645, 151643]

Data Type and Model Initialization

The model loading data type must be consistent. If Hugging Face loads the model in float16, vLLM must also be initialized with dtype='float16' or dtype='auto' to ensure the same precision is used.

Batch Size Considerations for Long Sequences

Experiments with prompts averaging 9k tokens on an A6000 GPU revealed that vLLM's internal dynamic batching can introduce subtle differences compared to Hugging Face's sequential processing when the input batch is very large. To guarantee exact alignment in greedy decoding, it is necessary to manually control the batch size passed to the vLLM generate function. A batch size of 15 or fewer proved stable for exact reproducibility in this environment.

Inference Implementation

The following implementation handles long-context inference with Position Interpolation (PI) scaling. vLLM automatically detects rope_scaling in the model configuration to extend context windows (e.g., scaling an 8k model to 16k).

import os
import math
import copy
from typing import List, Union
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

class VLLMEngine:
    def __init__(
        self,
        model_path: str,
        model_arch: str,
        precision: str = 'float16',
        gpu_memory_util: float = 0.9,
        max_seq_capture: int = 9800
    ):
        self.model_arch = model_arch
        self.llm = LLM(
            model=model_path,
            dtype=precision,
            gpu_memory_utilization=gpu_memory_util,
            max_seq_len_to_capture=max_seq_capture,
            trust_remote_code=True
        )
        self.tokenizer = self.llm.get_tokenizer()

    def generate(self, prompts: Union[str, List[str]], config: dict, batch_size: int = 15):
        config = copy.deepcopy(config)
        if self.model_arch == 'llama':
            config['stop_token_ids'] = [self.tokenizer.eos_token_id, self.tokenizer.convert_tokens_to_ids("<|end_of_text|>")]
        elif self.model_arch == 'qwen2':
            config['stop_token_ids'] = [151645, 151643]
        
        sampling_params = SamplingParams(**config)
        
        if isinstance(prompts, str):
            prompts = [prompts]
            
        # Prepare input IDs
        input_ids_list = []
        for p in prompts:
            formatted_p = self._format_prompt(p)
            ids = self.tokenizer.encode(formatted_p, add_special_tokens=False)
            input_ids_list.append(ids)
            
        # Process in controlled batches
        all_outputs = []
        total_items = len(input_ids_list)
        
        for i in range(0, total_items, batch_size):
            batch = input_ids_list[i : i + batch_size]
            results = self.llm.generate(prompt_token_ids=batch, sampling_params=sampling_params)
            all_outputs.extend(results)
            
        return all_outputs

    def _format_prompt(self, text: str) -> str:
        # Helper to apply chat template logic if necessary
        return text.strip()

The Hugging Face implementation requires explicit configuration for padding and RoPE scaling to match the vLLM behavior.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig

class HFEngine:
    def __init__(self, model_path: str, model_max_length: int = 16384):
        self.config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
        
        # Handle RoPE Scaling for long context
        orig_ctx_len = getattr(self.config, "max_position_embeddings", None)
        if orig_ctx_len and model_max_length > orig_ctx_len:
            scaling_factor = float(math.ceil(model_max_length / orig_ctx_len))
            self.config.rope_scaling = {"type": "linear", "factor": scaling_factor}

        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            config=self.config,
            torch_dtype=torch.float16,
            device_map="auto"
        ).eval()
        
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_path, 
            trust_remote_code=True, 
            padding_side="left"
        )
        if self.tokenizer.pad_token is None:
             self.tokenizer.pad_token = self.tokenizer.eos_token

    def generate(self, prompts: List[str], config: dict):
        inputs = self.tokenizer(prompts, return_tensors="pt", padding=True).to(self.model.device)
        outputs = self.model.generate(**inputs, **config)
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)

Tags: vLLM Hugging Face LLM Inference pytorch Long Context

Posted on Mon, 25 May 2026 17:09:15 +0000 by gaogier

Freaks City