Offline Inference Configuration
The following benchmarks and alignment procedures were conducted in a specific environment designed for long-context processing. The hardware setup consisted of a single NVIDIA A6000 (48GB) GPU. The software stack included Ubuntu, Python 3.10, PyTorch 2.3.0, Transformers 4.41.2, and vLLM 0.5.0.post1. The primary objective was to achieve identical greedy decoding results between vLLM and Hugging Face for prompts exceeding 9,000 tokens.
Parameter Mapping and Consistency
To ensure output consistency, generation parameters must be strictly mapped between the two frameworks. While parameter names often correspond directly, there are critical behavioral differences.
| vLLM Parameter | Hugging Face Parameter | Notes |
|---|---|---|
max_tokens | max_new_tokens | Maximum number of tokens to generate. |
top_p | top_p | Nucleus sampling probability. |
top_k | top_k | vLLM defaults to 50, while Hugging Face defaults to -1 (entire vocabulary). |
temperature | temperature | vLLM performs greedy decoding when temperature=0. Hugging Face does not allow temperature=0 and requires do_sample=False for greedy decoding. |
repetition_penalty | repetition_penalty | Penalty for repeated tokens. |
| N/A | do_sample | Enables sampling in Hugging Face. Must be False for greedy alignment. |
For strict alignment, both frameworks should use greedy decoding. Note that top_p values can influence results even during greedy decoding attempts in some implementations, so they should be explicitly defined in both.
# Configuration for vLLM greedy decoding
vllm_config = {
"max_tokens": 1150,
"top_p": 0.9,
"top_k": 50,
"temperature": 0.0,
"repetition_penalty": 1.0,
}
# Configuration for Hugging Face greedy decoding
hf_config = {
"max_new_tokens": 1150,
"top_p": 0.9,
"top_k": 50,
"temperature": 0.35,
"repetition_penalty": 1.0,
"do_sample": False
}
End-of-Sequence Token Alignment
Stopping criteria must be synchronized. Hugging Face uses eos_token_id, while vLLM uses stop_token_ids. For specific model architectures like Llama-3-instruct or Qwen2, these IDs must be manually set to match the tokenizer's expectations.
# Hugging Face EOS setup
if model_arch == 'llama':
hf_config['eos_token_id'] = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|end_of_text|>")]
hf_config['pad_token_id'] = tokenizer.eos_token_id
elif model_arch == 'qwen2':
hf_config['eos_token_id'] = [151645, 151643]
hf_config['pad_token_id'] = 151643
# vLLM Stop Token setup
if model_arch == 'llama':
vllm_config['stop_token_ids'] = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|end_of_text|>")]
elif model_arch == 'qwen2':
vllm_config['stop_token_ids'] = [151645, 151643]
Data Type and Model Initialization
The model loading data type must be consistent. If Hugging Face loads the model in float16, vLLM must also be initialized with dtype='float16' or dtype='auto' to ensure the same precision is used.
Batch Size Considerations for Long Sequences
Experiments with prompts averaging 9k tokens on an A6000 GPU revealed that vLLM's internal dynamic batching can introduce subtle differences compared to Hugging Face's sequential processing when the input batch is very large. To guarantee exact alignment in greedy decoding, it is necessary to manually control the batch size passed to the vLLM generate function. A batch size of 15 or fewer proved stable for exact reproducibility in this environment.
Inference Implementation
The following implementation handles long-context inference with Position Interpolation (PI) scaling. vLLM automatically detects rope_scaling in the model configuration to extend context windows (e.g., scaling an 8k model to 16k).
import os
import math
import copy
from typing import List, Union
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
class VLLMEngine:
def __init__(
self,
model_path: str,
model_arch: str,
precision: str = 'float16',
gpu_memory_util: float = 0.9,
max_seq_capture: int = 9800
):
self.model_arch = model_arch
self.llm = LLM(
model=model_path,
dtype=precision,
gpu_memory_utilization=gpu_memory_util,
max_seq_len_to_capture=max_seq_capture,
trust_remote_code=True
)
self.tokenizer = self.llm.get_tokenizer()
def generate(self, prompts: Union[str, List[str]], config: dict, batch_size: int = 15):
config = copy.deepcopy(config)
if self.model_arch == 'llama':
config['stop_token_ids'] = [self.tokenizer.eos_token_id, self.tokenizer.convert_tokens_to_ids("<|end_of_text|>")]
elif self.model_arch == 'qwen2':
config['stop_token_ids'] = [151645, 151643]
sampling_params = SamplingParams(**config)
if isinstance(prompts, str):
prompts = [prompts]
# Prepare input IDs
input_ids_list = []
for p in prompts:
formatted_p = self._format_prompt(p)
ids = self.tokenizer.encode(formatted_p, add_special_tokens=False)
input_ids_list.append(ids)
# Process in controlled batches
all_outputs = []
total_items = len(input_ids_list)
for i in range(0, total_items, batch_size):
batch = input_ids_list[i : i + batch_size]
results = self.llm.generate(prompt_token_ids=batch, sampling_params=sampling_params)
all_outputs.extend(results)
return all_outputs
def _format_prompt(self, text: str) -> str:
# Helper to apply chat template logic if necessary
return text.strip()
The Hugging Face implementation requires explicit configuration for padding and RoPE scaling to match the vLLM behavior.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
class HFEngine:
def __init__(self, model_path: str, model_max_length: int = 16384):
self.config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
# Handle RoPE Scaling for long context
orig_ctx_len = getattr(self.config, "max_position_embeddings", None)
if orig_ctx_len and model_max_length > orig_ctx_len:
scaling_factor = float(math.ceil(model_max_length / orig_ctx_len))
self.config.rope_scaling = {"type": "linear", "factor": scaling_factor}
self.model = AutoModelForCausalLM.from_pretrained(
model_path,
config=self.config,
torch_dtype=torch.float16,
device_map="auto"
).eval()
self.tokenizer = AutoTokenizer.from_pretrained(
model_path,
trust_remote_code=True,
padding_side="left"
)
if self.tokenizer.pad_token is None:
self.tokenizer.pad_token = self.tokenizer.eos_token
def generate(self, prompts: List[str], config: dict):
inputs = self.tokenizer(prompts, return_tensors="pt", padding=True).to(self.model.device)
outputs = self.model.generate(**inputs, **config)
return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)