Configuring GPU Resource Scheduling in Kubernetes Clusters

Prerequisites

Ensure the NVIDIA drivers are installed on every GPU node before proceeding.

Step 1: Install NVIDIA Container Runtime

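Install the nvidia-container-runtime package on each GPU node. The example uses yum and assumes the NVIDIA package repository is already configured on the node: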

yum install nvidia-container-runtime

Step 2: Configure Docker

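Edit /etc/docker/daemon.json so that Docker uses the NVIDIA runtime by default. The default matters because Kubernetes does not select a Docker runtime per container, so GPU pods only get the NVIDIA runtime if it is the default: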

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
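
A malformed daemon.json stops Docker from starting, so it can be worth checking the file before restarting the service. The following is a minimal sketch in Python, not part of the original procedure; the function name check_daemon_config is hypothetical, and it only checks the two settings shown above.

```python
import json


def check_daemon_config(text):
    """Return a list of problems found in a daemon.json document (empty if none)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]

    problems = []

    # The device plugin setup in this guide relies on nvidia being the default runtime.
    default = config.get("default-runtime")
    if default != "nvidia":
        problems.append(f"default-runtime is {default!r}, expected 'nvidia'")

    # The nvidia runtime entry must point at the runtime binary.
    runtimes = config.get("runtimes")
    if not isinstance(runtimes, dict):
        runtimes = {}
    nvidia = runtimes.get("nvidia")
    if not isinstance(nvidia, dict) or not nvidia.get("path"):
        problems.append("runtimes.nvidia.path is missing")

    return problems
```

To use it, read /etc/docker/daemon.json into a string and pass it to the function; an empty list means both settings are present.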

Reload the systemd configuration and restart Docker so that the new runtime settings take effect:

systemctl daemon-reload
systemctl restart docker

Step 3: Deploy NVIDIA Device Plugin

Create a DaemonSet manifest named nvidia-device-plugin.yaml:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:1.11
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

Apply the configuration:

kubectl create -f nvidia-device-plugin.yaml

Verification

After deployment, verify the DaemonSet is running:

kubectl get daemonset -n kube-system -l name=nvidia-device-plugin-ds
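
In the command output, the DaemonSet counts as running when its READY count matches its DESIRED count. For scripted checks, the same comparison can be made on the JSON form of the output. This is a rough sketch, assuming the text comes from the command above with -o json appended (which returns a list object); the function name daemonsets_not_ready is hypothetical.

```python
import json


def daemonsets_not_ready(list_json):
    """Return names of DaemonSets whose ready pod count is below the desired count.

    Expects the List object printed by `kubectl get daemonset ... -o json`.
    """
    not_ready = []
    for ds in json.loads(list_json).get("items", []):
        status = ds.get("status", {})
        desired = status.get("desiredNumberScheduled", 0)
        ready = status.get("numberReady", 0)
        if ready < desired:
            not_ready.append(ds["metadata"]["name"])
    return not_ready
```

An empty result means every matching DaemonSet has all of its pods ready. A DaemonSet with a desired count of zero also passes this check, so an empty result on a cluster with no schedulable GPU nodes does not prove the plugin is working.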

Nodes with GPUs now expose the nvidia.com/gpu resource, so a container can request GPUs through the resource limits in its spec:

resources:
  limits:
    nvidia.com/gpu: 1
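
The fragment above belongs inside a container entry of a pod spec. As an illustration of where it sits, here is a small Python sketch that builds a complete Pod manifest and prints it as JSON, which kubectl accepts as well as YAML. The helper gpu_pod_manifest, the pod name gpu-test, and the image example.com/cuda-app:latest are placeholders, not values from this guide.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a Pod manifest whose single container requests `gpus` whole GPUs."""
    if not isinstance(gpus, int) or gpus < 1:
        # GPUs are requested in whole units; fractions are not allowed.
        raise ValueError("gpus must be a positive integer")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Same structure as the resources fragment shown above.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Placeholder name and image; replace them with a real GPU workload.
print(json.dumps(gpu_pod_manifest("gpu-test", "example.com/cuda-app:latest"), indent=2))
```

The printed document can be saved to a file and submitted with kubectl create -f, in the same way as the DaemonSet manifest.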

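The scheduler places each pod only on a node that still has at least the requested number of unallocated GPUs. GPUs are requested in whole units and are not shared between containers, so a pod stays Pending if no node can satisfy its request.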
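
Where a GPU pod can land depends on how many GPUs each node advertises. The sketch below, written under the assumption that its input is the text printed by kubectl get nodes -o json, lists the allocatable nvidia.com/gpu count per node; the function name gpus_per_node is hypothetical.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"


def gpus_per_node(nodes_json):
    """Map node name -> allocatable GPU count from `kubectl get nodes -o json` output."""
    counts = {}
    for node in json.loads(nodes_json).get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Resource quantities are reported as strings; nodes without GPUs omit the key.
        counts[node["metadata"]["name"]] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts
```

A node that reports zero here either has no GPU or is not yet running the device plugin successfully.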

Tags: kubernetes gpu NVIDIA device-plugin scheduling

Posted on Sun, 31 May 2026 22:14:46 +0000 by thor erik