Accelerated Multi-node Inference with Ascend: Simplified Deployment of Large-scale Models Using GPUStack

Deploying large-scale models on Ascend NPUs is often challenging because configuring distributed inference with the standard MindIE engine is complex. Although the engine performs well, setup involves intricate steps such as environment preparation, initialization, and parameter tuning, and even minor misconfigurations can cause deployment failures that are difficult to troubleshoot.

GPUStack, an open-source Model-as-a-Service platform, offers high-performance inference capabilities along with robust model management features. It supports multiple hardware platforms including NVIDIA, AMD, Apple Silicon, Ascend, Hygon, Moore Threads, TianShu, Cambricon, and Muxi, enabling seamless heterogeneous GPU cluster setups. It integrates support for inference engines such as vLLM, MindIE, and llama-box.

To simplify deployment, GPUStack encapsulates and streamlines the distributed inference workflow of MindIE. Users can now deploy complex multi-node configurations through a few UI settings instead of manually adjusting numerous parameters, reducing errors and enhancing operational efficiency.

This guide demonstrates how to quickly deploy and run DeepSeek R1 671B using GPUStack's simplified interface on Ascend hardware with MindIE’s distributed inference capabilities.

Prerequisites

  1. Multiple Atlas 800T A2 servers (each with 8x 910B NPUs), interconnected via HCCN for RoCE networking

In a dual-server configuration, each server connects to its peer via 200 Gbps optical modules. For larger clusters, RoCE switches are used to establish high-speed interconnectivity between NPUs.

  2. Installed NPU drivers and firmware (https://www.hiascend.com/hardware/firmware-drivers/community?product=4&model=26&cann=8.2.RC1&driver=Ascend+HDK+25.2.0)

The GPUStack v0.7.1 image includes CANN version 8.2.RC1, which requires driver version 25.2 or higher. Verify current driver versions with:

npu-smi info

Ensure compatibility between the installed CANN version and driver when upgrading.
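Besides npu-smi, the installed driver version can be read directly from the driver package's metadata file. The path below assumes a default Ascend HDK install layout; adjust it if your installation differs:

```shell
# Read the driver version from the package metadata.
# DRIVER_INFO path is an assumption based on a default install layout.
DRIVER_INFO=${DRIVER_INFO:-/usr/local/Ascend/driver/version.info}
grep -i '^Version=' "$DRIVER_INFO"
```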

  3. Configure network settings using hccn_tool (/usr/local/Ascend/driver/tools/hccn_tool):
  • Assign IPs to RoCE NICs of NPUs
  • Set gateways if cross-L3 communication is required
  • Define network detection targets (e.g., peer NPU IPs for direct connections or any peer node IP for RoCE networks)
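As a sketch, the set-side hccn_tool commands corresponding to the bullets above look like the following. All addresses are placeholders for illustration, and the -s flag syntax should be double-checked against the hccn_tool documentation for your driver version:

```shell
# All values below are placeholders for illustration.
# Assign an IP to the RoCE NIC of NPU 0:
hccn_tool -i 0 -ip -s address 192.168.1.8 netmask 255.255.255.0

# Optional: set a gateway for cross-L3 communication:
hccn_tool -i 0 -gateway -s gateway 192.168.1.1

# Point network detection at a peer NPU IP:
hccn_tool -i 0 -netdetect -s address 192.168.1.9
```

Repeat for devices 1 through 7, giving each NIC its own address.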

On each node, verify and optimize RoCE settings with the following commands:

# Check physical links
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done

# Check link status
for i in {0..7}; do hccn_tool -i $i -link -g ; done

# Verify network health
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done

# Confirm network detection IPs
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done

# Check gateway configuration (optional)
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done

# Ensure TLS consistency across nodes (recommended: disable)
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch

# Disable TLS globally to prevent HCCL errors
for i in {0..7}; do hccn_tool -i $i -tls -s enable 0; done

Retrieve the NPU IP addresses:

for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done

Validate connectivity between nodes (replace the example addresses below with your actual peer NPU IPs):

hccn_tool -i 0 -ping -g address 192.168.1.9
hccn_tool -i 1 -ping -g address 192.168.1.10
hccn_tool -i 2 -ping -g address 192.168.1.11
hccn_tool -i 3 -ping -g address 192.168.1.12
hccn_tool -i 4 -ping -g address 192.168.1.13
hccn_tool -i 5 -ping -g address 192.168.1.14
hccn_tool -i 6 -ping -g address 192.168.1.15
hccn_tool -i 7 -ping -g address 192.168.1.16
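The eight per-device pings above can also be driven by a loop. The peer addresses here are the same illustrative ones and must be replaced with your own:

```shell
# Peer NPU IPs in device order (placeholders; substitute real addresses).
PEER_IPS="192.168.1.9 192.168.1.10 192.168.1.11 192.168.1.12 \
          192.168.1.13 192.168.1.14 192.168.1.15 192.168.1.16"
i=0
for ip in $PEER_IPS; do
    hccn_tool -i "$i" -ping -g address "$ip"
    i=$((i + 1))
done
```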

Reference documentation: https://support.huawei.com/enterprise/zh/doc/EDOC1100493980?idPath=23710424|251366513|22892968|252309113|250702818

  4. Download model weights. For quantized models, use community-prepared weights or apply msModelSlim from Ascend for quantization (https://gitcode.com/Ascend/msit/tree/master/msmodelslim)

This example uses the BF16 precision version of DeepSeek-R1. Running it requires four Atlas 800T A2 servers (8x 910B 64G each). The model can be obtained from: https://huggingface.co/unsloth/DeepSeek-R1-BF16. For a two-server deployment, apply W8A8 quantization instead.

  5. GPUStack Ascend 910B NPU image, preloaded with MindIE 2.1RC1 and vLLM Ascend v0.9.1

Pull the GPUStack image via Docker:


docker pull --platform=linux/arm64 crpi-thyzhdzt86bexebt.cn-hangzhou.personal.cr.aliyuncs.com/gpustack_ai/gpustack:v0.7.1-npu-vllm-v0.9.1

Installing GPUStack

Follow the tutorial at: https://docs.gpustack.ai/latest/tutorials/running-deepseek-r1-671b-with-distributed-ascend-mindie/

  1. On the first node, start the server and worker:
docker run -d --name gpustack \
    --restart=unless-stopped \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
    -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware:ro \
    -v /etc/hccn.conf:/etc/hccn.conf:ro \
    -v /etc/ascend_install.info:/etc/ascend_install.info:ro \
    -v gpustack-data:/var/lib/gpustack \
    -v /data/models:/data/models \
    --shm-size=1g \
    --network=host \
    --ipc=host \
    crpi-thyzhdzt86bexebt.cn-hangzhou.personal.cr.aliyuncs.com/gpustack_ai/gpustack:v0.7.1-npu-vllm-v0.9.1 \
    --cache-dir /data/models

Check container logs for startup confirmation:

docker logs -f gpustack

Retrieve the initial admin password and the worker registration token:

docker exec -it gpustack cat /var/lib/gpustack/initial_admin_password
docker exec gpustack cat /var/lib/gpustack/token
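For convenience, the token can be captured into a shell variable on the first node and then copied to each worker node (the variable name is illustrative, not part of GPUStack):

```shell
# Capture the registration token printed by the first node's container.
TOKEN=$(docker exec gpustack cat /var/lib/gpustack/token)
echo "$TOKEN"
```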
  2. On the other nodes, launch a worker and register it with the first node:
docker run -d --name gpustack \
    --restart=unless-stopped \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
    -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware:ro \
    -v /etc/hccn.conf:/etc/hccn.conf:ro \
    -v /etc/ascend_install.info:/etc/ascend_install.info:ro \
    -v gpustack-data:/var/lib/gpustack \
    -v /data/models:/data/models \
    --shm-size=1g \
    --network=host \
    --ipc=host \
    crpi-thyzhdzt86bexebt.cn-hangzhou.personal.cr.aliyuncs.com/gpustack_ai/gpustack:v0.7.1-npu-vllm-v0.9.1 \
    --cache-dir /data/models \
    --server-url http://<first_node_ip>:80 \
    --token <retrieved_token>

Access the GPUStack console in a browser (http://HOST_IP) and log in as admin with the password retrieved above. After login, open the Resources section to confirm that all Ascend nodes and their NPUs are detected.

Deploying DeepSeek R1 Model in Multi-node Setup

Navigate to Deploy > Deploy Model - Local Path:

  • Enter a custom name in the Name field
  • Specify the absolute path of the downloaded DeepSeek R1 model under Model Path
  • Select Ascend MindIE as the backend
  • Expand Advanced Settings and set the following parameters:
--data-parallel-size=4
--tensor-parallel-size=8
--moe-tensor-parallel-size=1
--moe-expert-parallel-size=32
--npu-memory-fraction=0.95
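These values must multiply out to the total NPU count: with four servers of eight NPUs each there are 32 devices, so data-parallel 4 × tensor-parallel 8 = 32, and MoE expert-parallel 32 × MoE tensor-parallel 1 = 32. A quick sanity check of that arithmetic (the node counts are this example's, not fixed requirements):

```shell
# Sanity-check the parallelism layout against the cluster size.
nodes=4; npus_per_node=8
dp=4; tp=8; moe_tp=1; moe_ep=32
total=$((nodes * npus_per_node))
[ $((dp * tp)) -eq "$total" ] && echo "dense parallelism covers $total NPUs"
[ $((moe_ep * moe_tp)) -eq "$total" ] && echo "MoE parallelism covers $total NPUs"
```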

After passing compatibility checks, save the deployment.

GPUStack will automatically configure the distributed inference environment, generate the necessary files (such as config.json and the ranktable), and launch the MindIE Service Daemon processes across nodes. Hover over Distributed Across Workers to inspect resource allocation. Logs for the primary node can be viewed in the action panel. Initial model loading may take several minutes.

If startup fails without visible errors in the main node's logs, examine the logs on the worker nodes:

cd /var/lib/gpustack/log/serve/
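Within that directory (inside the worker's container, e.g. via docker exec -it gpustack bash), listing the most recently modified files and tailing the newest one usually surfaces the failing rank. The exact file names depend on the deployment and are not fixed:

```shell
# Inspect the most recent serve logs on a worker.
cd /var/lib/gpustack/log/serve/
ls -lt | head -n 5                    # newest log files first
tail -n 100 "$(ls -t | head -n 1)"    # last lines of the newest log
```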

Successful deployments show full utilization of NPU memory (~95%) across all nodes. Once deployed, test the model in the Experiment Playground:

Go to Experiment > Chat:

  • If only one model is deployed, it is selected by default
  • Otherwise, choose DeepSeek-R1 from the dropdown menu

Enter prompts to interact with the model and assess generation quality and performance.

Use the Multi-model Comparison feature to conduct concurrent inference tests across multiple windows.

This tutorial enables efficient deployment of large-scale models like DeepSeek R1 671B using GPUStack and Ascend MindIE. The same approach applies to other models exceeding single-node capacity. Refer to the official MindIE model list for supported architectures: https://www.hiascend.com/software/mindie/modellist

Compared to native MindIE setups, GPUStack dramatically reduces manual effort and configuration risks, offering a more stable and manageable experience for deploying large models on Ascend hardware.

GPUStack provides enterprises with a reliable infrastructure for scalable inference on Ascend platforms, improving both productivity and user satisfaction.

Join Our Community

For more information, visit our GitHub repository: https://github.com/gpustack/gpustack. Submit issues or contribute to the project. Give us a star ⭐️ on GitHub to stay updated!

Join our Discord community: https://discord.gg/VXYJzuaqwD

Or scan QR codes to join our WeChat group for technical support and discussions.

Tags: Ascend MindIE GPUStack distributed-inference DeepSeek-R1

Posted on Fri, 08 May 2026 09:15:03 +0000 by spasme