Multi-Node Distributed Deployment of Qwen3.5-397B-A17B on Ascend 910B

vLLM commonly relies on Ray for multi-node inference, but it can also coordinate across nodes without Ray by combining data parallelism (DP) across nodes with tensor parallelism (TP) inside each node. This article walks through a concrete deployment on two Atlas 800I A2 servers (each with 8× Ascend 910B 64 GB) using the quantized Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp model.

Official vLLM-Ascend documentation: Qwen3.5-397B-A17B deployment guide

Workflow Overview

Pull the latest vllm-ascend container image.
Register the image as a custom vLLM backend in GPUStack.
Collect network interface names and IP addresses on both nodes.
Configure the primary node.
Configure the secondary node.
Start the service and run inference tests.

Registering the Custom Backend

In GPUStack, add a custom backend with these settings:

Version name: Must match the image tag without its leading v (e.g., 0.18.0rc1 for the tag v0.18.0rc1).
Image URL: swr.cn-south-1.myhuaweicloud.com/gpustack/vllm-ascend:v0.18.0rc1 (China mirror; original: quay.io/ascend/vllm-ascend:v0.18.0rc1)
Supported framework: CANN
Entry command: vllm serve
Execution command template: {{model_path}} --host {{worker_ip}} --port {{port}} --served-model-name {{model_name}} (double curly braces are GPUStack placeholders)
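
To make the placeholder syntax concrete, here is a minimal Python sketch of how the template could expand into a full command. GPUStack performs the real substitution itself; the render helper and every value in example (model path, port, model name) are hypothetical and only illustrate the shape of the result.

```python
import re

# Template copied from the backend settings above.
TEMPLATE = "{{model_path}} --host {{worker_ip}} --port {{port}} --served-model-name {{model_name}}"

def render(template, values):
    """Replace each {{name}} placeholder with its value; unknown names raise KeyError."""
    def substitute(match):
        return str(values[match.group(1)])
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)

# Hypothetical values; GPUStack supplies the real ones for each deployment.
example = {
    "model_path": "/models/Qwen3.5-397B-A17B-w8a8-mtp",
    "worker_ip": "192.168.13.33",
    "port": 40000,
    "model_name": "qwen3.5-397b-1",
}

# The entry command is prepended to the rendered template.
command = "vllm serve " + render(TEMPLATE, example)
print(command)
```

The advanced parameters configured later are appended to this command line.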

Gathering Network Information

Before multi-node deployment, identify the network interfaces on both machines. The chosen interfaces must be on the same subnet and able to reach each other without restriction. In this demo:

Primary node: interface enp67s0f0, IP 192.168.13.33
Secondary node: interface enp67s0f0, IP 192.168.13.34

Note: The selected interface is for API communication only; high-speed interconnects are used for internal data transfers. Verify multi-node connectivity: official guide.
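
The same-subnet requirement can be sanity-checked before deployment. The sketch below uses Python's standard ipaddress module; the /24 prefix length is an assumption, since the article does not state the netmask, so substitute the one reported on your hosts.

```python
import ipaddress

def same_subnet(ip_a, ip_b, prefix_len=24):
    """Return True if both IPv4 addresses fall in the same network.

    prefix_len is an assumption (the netmask is not given in the article).
    """
    net_a = ipaddress.ip_interface(f"{ip_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{prefix_len}").network
    return net_a == net_b

# Addresses of the primary and secondary nodes from this demo.
print(same_subnet("192.168.13.33", "192.168.13.34"))
```

This only checks addressing; it does not replace an actual reachability test between the nodes.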

Multi-Node Deployment with vLLM

Primary Node Configuration

In GPUStack, navigate to the deployment menu and choose ModelScope.
Search and select the model: Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp.
Initially set the replica count to 0 (both nodes will be enabled together later).
Set the inference backend to vLLM and the version to 0.18.0rc1.
Change the scheduling mode to Manual and select all 8 NPUs on the primary node.
Fill in the advanced parameters:

--data-parallel-size 2 --tensor-parallel-size 8 # DP=2, TP=8 (keep on one line)
--data-parallel-size-local=1                       # DP replicas on this node
--api-server-count=2                               # API server processes (defaults to DP count)
--data-parallel-address=192.168.13.33              # Primary node IP for DP
--data-parallel-rpc-port=13389                     # RPC port for DP coordination
--seed=1024                                        # Random seed, must match across nodes
--enable-expert-parallel
--max-num-seqs=16                                  # Concurrency per iteration
--max-model-len=32768                              # Maximum context length
--max-num-batched-tokens=8192                      # Tokens per batch
--gpu-memory-utilization=0.90
--trust-remote-code
--async-scheduling                                 # Overlap CPU scheduling with NPU execution
--no-enable-prefix-caching
--speculative_config '{"method":"qwen3_5_mtp","num_speculative_tokens":3,"enforce_eager":true}'
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
--additional-config '{"enable_cpu_binding":true,"multistream_overlap_shared_expert":true}'
--disable-access-log-for-endpoints /health,/metrics,/ping
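
The parallelism flags have to add up to the hardware that is actually available. As a rough check, the sketch below validates the layout used here (DP=2, TP=8, one local DP replica per node, two nodes with 8 NPUs each). check_layout is a hypothetical helper, not part of vLLM, and it assumes every node hosts the same number of local DP replicas.

```python
def check_layout(dp_size, tp_size, dp_size_local, nodes, npus_per_node):
    """Validate that a DP x TP layout exactly fills the available NPUs.

    Assumes the same number of local DP replicas on every node.
    """
    total_npus = nodes * npus_per_node
    world_size = dp_size * tp_size
    if world_size != total_npus:
        raise ValueError(f"DP*TP={world_size} does not match {total_npus} NPUs")
    if dp_size_local * nodes != dp_size:
        raise ValueError("local DP replicas per node do not add up to the DP size")
    if dp_size_local * tp_size > npus_per_node:
        raise ValueError("one node cannot host its local replicas")
    return {"world_size": world_size, "npus_used_per_node": dp_size_local * tp_size}

# Values from the advanced parameters above.
layout = check_layout(dp_size=2, tp_size=8, dp_size_local=1, nodes=2, npus_per_node=8)
print(layout)
```

With these values each node runs one DP replica whose TP group spans all 8 local NPUs, so tensor-parallel traffic stays inside a node.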

Environment variables for the primary node:

PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
HCCL_IF_IP=192.168.13.33
GLOO_SOCKET_IFNAME=enp67s0f0
TP_SOCKET_IFNAME=enp67s0f0
HCCL_SOCKET_IFNAME=enp67s0f0
OMP_PROC_BIND=false
OMP_NUM_THREADS=1
HCCL_BUFFSIZE=1024
TASK_QUEUE_ENABLE=1
VLLM_ENGINE_READY_TIMEOUT_S=1500

Secondary Node Configuration

Clone the primary node’s model configuration.
In the scheduling options, select all 8 NPUs on the secondary node.
Modify the advanced parameters: replace --api-server-count=2 with --data-parallel-start-rank=1 and add --headless.
Update the environment variable HCCL_IF_IP to 192.168.13.34. If the network interface name differs, adjust it accordingly.
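
The differences between the two nodes are small enough to express as a transformation of the primary node's settings. The following sketch applies exactly the changes listed above; derive_secondary is a hypothetical helper, the argument list is abbreviated to the DP-related flags, and it assumes flags are written in the --flag=value form used in this article.

```python
def derive_secondary(primary_args, primary_env, secondary_ip):
    """Return (args, env) for the secondary node from the primary node's settings."""
    # Drop the API server flag: the secondary node runs headless.
    # Assumes the flag is written as --api-server-count=N (single token).
    args = [a for a in primary_args if not a.startswith("--api-server-count")]
    args += ["--data-parallel-start-rank=1", "--headless"]
    # Only the node's own HCCL address changes; the DP address still points at the primary.
    env = dict(primary_env, HCCL_IF_IP=secondary_ip)
    return args, env

# Abbreviated primary settings taken from the previous section.
primary_args = [
    "--data-parallel-size", "2", "--tensor-parallel-size", "8",
    "--data-parallel-size-local=1",
    "--api-server-count=2",
    "--data-parallel-address=192.168.13.33",
    "--data-parallel-rpc-port=13389",
]
primary_env = {"HCCL_IF_IP": "192.168.13.33", "HCCL_SOCKET_IFNAME": "enp67s0f0"}

secondary_args, secondary_env = derive_secondary(primary_args, primary_env, "192.168.13.34")
print(secondary_args)
print(secondary_env)
```

Note that --data-parallel-address keeps the primary node's IP on both nodes; only HCCL_IF_IP is node-specific.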

Starting and Testing the Model

Set the replica count of both the primary and secondary node deployments from 0 to 1. The order does not matter, but start them within a short interval of each other. Once the primary node shows a Running status, the model is ready for inference.

Note: The secondary node will remain in Starting state – this is expected.

To disable the thinking stage in responses, pass enable_thinking: false in the request body:

curl http://your-gpustack-server/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${YOUR_GPUSTACK_API_KEY}" \
-d '{
  "model": "qwen3.5-397b-1",
  "messages": [
    {
      "role": "user",
      "content": "Hello"
    }
  ],
  "chat_template_kwargs": {
    "enable_thinking": false
  }
}'
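
For scripted tests, the same request can be built in Python with only the standard library. This is a sketch under stated assumptions: build_request is a hypothetical helper, the server URL and API key are the placeholders from the curl command, and chat_template_kwargs is placed at the top level of the JSON body, which is where vLLM's OpenAI-compatible server reads it on raw HTTP requests (the OpenAI Python client reaches the same field through its extra_body argument).

```python
import json
import urllib.request

def build_request(base_url, api_key, model, prompt, enable_thinking=False):
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Top-level field in the raw JSON body.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

# Placeholder server address and key, as in the curl example.
req = build_request("http://your-gpustack-server", "YOUR_GPUSTACK_API_KEY",
                    "qwen3.5-397b-1", "Hello")
print(req.full_url)
# To send it: urllib.request.urlopen(req), then json.load() the response.
```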

Performance Considerations

With multi-node DP, requests are distributed across nodes using dynamic load balancing rather than simple round‑robin. This reduces head‑of‑line blocking from long requests and improves overall throughput under high concurrency.
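
The intuition behind this claim can be shown with a toy model. The sketch below is not vLLM's load balancer; it compares plain round-robin with a simple least-loaded policy on a hypothetical workload in which each request has a fixed cost and completed work is never subtracted.

```python
def round_robin(costs, replicas=2):
    """Assign request i to replica i % replicas; return the per-replica load."""
    load = [0] * replicas
    for i, cost in enumerate(costs):
        load[i % replicas] += cost
    return load

def least_loaded(costs, replicas=2):
    """Assign each request to the replica with the smallest load so far."""
    load = [0] * replicas
    for cost in costs:
        target = load.index(min(load))
        load[target] += cost
    return load

# Hypothetical workload: one long request followed by five short ones.
costs = [8, 1, 1, 1, 1, 1]
print(round_robin(costs))   # [10, 3]
print(least_loaded(costs))  # [8, 5]
```

Under round-robin, short requests keep landing behind the long one; the load-aware policy routes them to the idle replica, so the busiest replica finishes sooner.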

Community Resources

For further discussion on AI infrastructure, large model deployment, and engine optimization, join the GPUStack community. If a group link has expired, visit the repository for updated links.

Tags: vLLM Ascend 910B Qwen3.5 GPUStack distributed inference

Posted on Thu, 11 Jun 2026 18:34:50 +0000 by djp120
