While vLLM commonly relies on Ray for distributed multi-node inference, it is possible to achieve cross-node coordination without an external scheduler by combining data parallelism (DP) and tensor parallelism (TP). This article walks through a concrete deployment on two Atlas 800I A2 servers (each with 8× Ascend 910B 64 GB) using the quantized Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp model.
Official vLLM-Ascend documentation: Qwen3.5-397B-A17B deployment guide
Workflow Overview
- Pull the latest vllm-ascend container image.
- Register the image as a custom vLLM backend in GPUStack.
- Collect network interface names and IP addresses on both nodes.
- Configure the primary node.
- Configure the secondary node.
- Start the service and run inference tests.
Registering the Custom Backend
In GPUStack, add a custom backend with these settings:
- Version name: Must match the image tag (e.g., 0.18.0rc1).
- Image URL: swr.cn-south-1.myhuaweicloud.com/gpustack/vllm-ascend:v0.18.0rc1 (China mirror; original: quay.io/ascend/vllm-ascend:v0.18.0rc1)
- Supported framework: CANN
- Entry command: vllm serve
- Execution command template: {{model_path}} --host {{worker_ip}} --port {{port}} --served-model-name {{model_name}} (double curly braces are GPUStack placeholders)
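To make the placeholder syntax concrete, here is a minimal sketch of how such a template could expand into the arguments passed to vllm serve. The render function and all sample values (model path, port) are hypothetical; GPUStack performs this substitution internally and its actual implementation may differ.

```python
import re

TEMPLATE = (
    "{{model_path}} --host {{worker_ip}} --port {{port}} "
    "--served-model-name {{model_name}}"
)


def render(template: str, values: dict) -> str:
    """Replace each {{name}} placeholder with its value; fail on unknown names."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value for placeholder {key!r}")
        return str(values[key])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Sample values are illustrative only, not taken from a real deployment.
example = render(TEMPLATE, {
    "model_path": "/models/Qwen3.5-397B-A17B-w8a8-mtp",
    "worker_ip": "192.168.13.33",
    "port": 40000,
    "model_name": "qwen3.5-397b-1",
})
print("vllm serve " + example)
```

Raising on an unknown placeholder, rather than leaving it in place, surfaces typos in the template before the backend is started.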
Gathering Network Information
Before multi-node deployment, identify the network interfaces on both machines. The chosen interface must be on the same subnet and communicate freely between nodes. In this demo:
- Primary node: interface enp67s0f0, IP 192.168.13.33
- Secondary node: interface enp67s0f0, IP 192.168.13.34
Note: The selected interface is for API communication only; high-speed interconnects are used for internal data transfers. To verify multi-node connectivity, follow the official guide.
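The same-subnet requirement can be checked before deployment with a short sketch like the one below. The /24 prefix length is an assumption, since the source does not state the netmask; substitute the value reported by the interface on each node. This only checks addressing and does not replace an actual reachability test between the machines.

```python
import ipaddress


def same_subnet(ip_a: str, ip_b: str, prefix_len: int) -> bool:
    """Return True if ip_b falls inside the network that ip_a belongs to."""
    network = ipaddress.ip_interface(f"{ip_a}/{prefix_len}").network
    return ipaddress.ip_address(ip_b) in network


PRIMARY_IP = "192.168.13.33"
SECONDARY_IP = "192.168.13.34"
PREFIX_LEN = 24  # ASSUMPTION: netmask 255.255.255.0, not stated in the article

print(same_subnet(PRIMARY_IP, SECONDARY_IP, PREFIX_LEN))
```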
Multi-Node Deployment with vLLM
Primary Node Configuration
- In GPUStack, navigate to the deployment menu and choose ModelScope.
- Search and select the model: Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp.
- Initially set the replica count to 0 (both nodes will be enabled together later).
- Set the inference backend to vLLM and the version to 0.18.0rc1.
- Change the scheduling mode to Manual and select all 8 NPUs on the primary node.
- Fill in the advanced parameters:
--data-parallel-size 2 --tensor-parallel-size 8 # DP=2, TP=8 (keep on one line)
--data-parallel-size-local=1 # DP replicas on this node
--api-server-count=2 # API server processes (defaults to DP count)
--data-parallel-address=192.168.13.33 # Primary node IP for DP
--data-parallel-rpc-port=13389 # RPC port for DP coordination
--seed=1024 # Random seed, must match across nodes
--enable-expert-parallel
--max-num-seqs=16 # Concurrency per iteration
--max-model-len=32768 # Maximum context length
--max-num-batched-tokens=8192 # Tokens per batch
--gpu-memory-utilization=0.90
--trust-remote-code
--async-scheduling # Overlap CPU scheduling with GPU
--no-enable-prefix-caching
--speculative_config '{"method":"qwen3_5_mtp","num_speculative_tokens":3,"enforce_eager":true}'
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
--additional-config '{"enable_cpu_binding":true,"multistream_overlap_shared_expert":true}'
--disable-access-log-for-endpoints /health,/metrics,/ping
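The parallelism flags above imply a specific device layout, which the following sketch works through. It assumes each DP replica is one complete TP group and that every node hosts the same number of local replicas; the ParallelLayout class is a hypothetical helper for the arithmetic, not part of vLLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    dp_size: int           # --data-parallel-size
    tp_size: int           # --tensor-parallel-size
    dp_size_local: int     # --data-parallel-size-local
    devices_per_node: int  # NPUs available on each server

    @property
    def total_devices(self) -> int:
        # Each DP replica spans one full TP group.
        return self.dp_size * self.tp_size

    @property
    def devices_used_per_node(self) -> int:
        return self.dp_size_local * self.tp_size

    @property
    def nodes_required(self) -> int:
        # Ceiling division: replicas are spread dp_size_local per node.
        return -(-self.dp_size // self.dp_size_local)

    def validate(self) -> None:
        if self.devices_used_per_node > self.devices_per_node:
            raise ValueError(
                f"node needs {self.devices_used_per_node} devices "
                f"but only {self.devices_per_node} are available"
            )


layout = ParallelLayout(dp_size=2, tp_size=8, dp_size_local=1, devices_per_node=8)
layout.validate()
print(layout.total_devices, layout.devices_used_per_node, layout.nodes_required)
```

For the values in this article, the layout uses 16 NPUs in total, 8 on each of the 2 nodes, which matches the two Atlas 800I A2 servers.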
Environment variables for the primary node:
PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
HCCL_IF_IP=192.168.13.33
GLOO_SOCKET_IFNAME=enp67s0f0
TP_SOCKET_IFNAME=enp67s0f0
HCCL_SOCKET_IFNAME=enp67s0f0
OMP_PROC_BIND=false
OMP_NUM_THREADS=1
HCCL_BUFFSIZE=1024
TASK_QUEUE_ENABLE=1
VLLM_ENGINE_READY_TIMEOUT_S=1500
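Only the node IP and the interface name vary between nodes in this environment; everything else is shared. A minimal sketch of that split follows, where node_env is a hypothetical helper used purely to illustrate which variables are node-specific.

```python
# Variables that are identical on every node (values from the list above).
SHARED_ENV = {
    "PYTORCH_NPU_ALLOC_CONF": "expandable_segments:True",
    "OMP_PROC_BIND": "false",
    "OMP_NUM_THREADS": "1",
    "HCCL_BUFFSIZE": "1024",
    "TASK_QUEUE_ENABLE": "1",
    "VLLM_ENGINE_READY_TIMEOUT_S": "1500",
}

# Variables that all take the name of the chosen network interface.
IFNAME_KEYS = ("GLOO_SOCKET_IFNAME", "TP_SOCKET_IFNAME", "HCCL_SOCKET_IFNAME")


def node_env(ip: str, ifname: str) -> dict:
    """Build the full environment for one node from its IP and interface name."""
    env = dict(SHARED_ENV)
    env["HCCL_IF_IP"] = ip
    for key in IFNAME_KEYS:
        env[key] = ifname
    return env


primary_env = node_env("192.168.13.33", "enp67s0f0")
for key, value in primary_env.items():
    print(f"{key}={value}")
```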
Secondary Node Configuration
- Clone the primary node’s model configuration.
- In the scheduling options, select all 8 NPUs on the secondary node.
- Modify the advanced parameters: replace --api-server-count=2 with --data-parallel-start-rank=1 and add --headless.
- Update the environment variable HCCL_IF_IP to 192.168.13.34. If the network interface name differs, adjust it accordingly.
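The parameter change for the secondary node can be expressed as a small transformation of the primary node's argument list. The sketch below is illustrative only: secondary_args is a hypothetical helper, and it assumes the flag is written in the --api-server-count=N form used in this article.

```python
from typing import List


def secondary_args(primary_args: List[str], start_rank: int = 1) -> List[str]:
    """Derive the secondary node's arguments from the primary node's.

    Swaps --api-server-count=N for --data-parallel-start-rank and --headless,
    leaving every other argument untouched.
    """
    result = []
    replaced = False
    for arg in primary_args:
        if arg.startswith("--api-server-count="):
            result.append(f"--data-parallel-start-rank={start_rank}")
            result.append("--headless")
            replaced = True
        else:
            result.append(arg)
    if not replaced:
        raise ValueError("--api-server-count=N not found in primary arguments")
    return result


# A shortened argument list; the full list is in the primary node section.
primary = [
    "--data-parallel-size", "2",
    "--tensor-parallel-size", "8",
    "--data-parallel-size-local=1",
    "--api-server-count=2",
    "--data-parallel-address=192.168.13.33",
    "--data-parallel-rpc-port=13389",
]
print(" ".join(secondary_args(primary)))
```

Note that --data-parallel-address keeps the primary node's IP on both nodes, since it identifies the DP coordinator rather than the local machine.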
Starting and Testing the Model
Set both the primary and secondary node replica counts from 0 to 1. The order does not matter; start them within a short interval. Once the primary node shows a Running status, the model is ready for inference.
Note: The secondary node will remain in Starting state – this is expected.
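Because loading a model of this size can take a long time, it may help to poll for readiness instead of watching the UI. The following is a minimal sketch of such a wait loop; both helper functions are hypothetical, and the URL to poll is an assumption: use any endpoint in your setup that returns HTTP 200 only once the model is serving.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe() repeatedly until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def make_http_probe(url: str, api_key: str) -> Callable[[], bool]:
    """Return a probe that reports True when a GET to url answers HTTP 200."""
    def probe() -> bool:
        request = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {api_key}"}
        )
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or a non-2xx status: not ready yet.
            return False

    return probe
```

A call such as wait_until_ready(make_http_probe(url, api_key), timeout_s=1500) would mirror the VLLM_ENGINE_READY_TIMEOUT_S value set earlier, though the two timeouts are independent.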
To disable the thinking stage in responses, pass enable_thinking: false in the request body:
curl http://your-gpustack-server/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${YOUR_GPUSTACK_API_KEY}" \
-d '{
"model": "qwen3.5-397b-1",
"messages": [
{
"role": "user",
"content": "Hello"
}
],
"chat_template_kwargs": {
"enable_thinking": false
}
}'
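For clients that build the request programmatically, the same call can be sketched with the Python standard library. This sketch places chat_template_kwargs at the top level of the JSON body, which is where vLLM's OpenAI-compatible server reads it in a raw HTTP request (extra_body is the OpenAI Python client's mechanism for merging such fields into the body). It also assumes GPUStack forwards the field to the backend unchanged, which is worth verifying. The helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, enable_thinking: bool) -> dict:
    """Build a chat completion body with the thinking stage toggled."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def build_request(base_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Wrap the payload in a POST request to the chat completions endpoint."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


payload = build_chat_payload("qwen3.5-397b-1", "Hello", enable_thinking=False)
request = build_request("http://your-gpustack-server", "YOUR_GPUSTACK_API_KEY", payload)
print(request.full_url)
# To send it: urllib.request.urlopen(request, timeout=60)
```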
Performance Considerations
With multi-node DP, requests are distributed across nodes using dynamic load balancing rather than simple round‑robin. This reduces head‑of‑line blocking from long requests and improves overall throughput under high concurrency.
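The difference between the two strategies can be illustrated with a toy model. This is not vLLM's actual balancing algorithm: it simply assigns requests with made-up costs to two replicas, once in strict rotation and once to whichever replica currently has the least accumulated work.

```python
from typing import List


def assign_round_robin(costs: List[int], n_replicas: int) -> List[int]:
    """Assign request i to replica i % n and return the total load per replica."""
    load = [0] * n_replicas
    for index, cost in enumerate(costs):
        load[index % n_replicas] += cost
    return load


def assign_least_loaded(costs: List[int], n_replicas: int) -> List[int]:
    """Assign each request to the replica with the least load so far."""
    load = [0] * n_replicas
    for cost in costs:
        target = min(range(n_replicas), key=load.__getitem__)
        load[target] += cost
    return load


# Hypothetical request costs: long and short requests alternate,
# which is the worst case for strict rotation across two replicas.
costs = [100, 1, 100, 1, 100, 1]
rr = assign_round_robin(costs, 2)
ll = assign_least_loaded(costs, 2)
print("round-robin :", rr)
print("least-loaded:", ll)
```

In this contrived workload, rotation sends every long request to the same replica, while load-aware assignment spreads them and lowers the peak load on any single replica.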
Community Resources
For further discussion on AI infrastructure, large model deployment, and engine optimization, join the GPUStack community. If the QR code or group link expires, visit the repository for updated links.