Installing HAMi on K8S to Manage GPUs

Max
Published 2026-05-12

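The previous chapter, "Installing K8S 1.34 on Ubuntu 24.04", set up a k8s cluster with kubeadm. This chapter uses that cluster to schedule GPUs. The components stack up as follows: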

Physical GPU
   ↓
NVIDIA Driver
   ↓
NVIDIA Container Toolkit
   ↓
GPU Operator / nvidia-device-plugin
   ↓
Kubernetes node resource (nvidia.com/gpu)
   ↓
HAMi Device Plugin
   ↓
HAMi Scheduler
   ↓
AI Pod (training / inference / sharing)

Installation environment:

                                +-----------------------------+
                                | Single-node K8s + GPU node  |
                                +-------------+---------------+
                                              |
     +----------------------------------------------------------------------------------+
     | Ubuntu 24.04.3 LTS                                                               |
     | Kernel 6.8.0-90                                                                  |
     | containerd 2.2.1                                                                 |
     | kubelet / kubeadm / kubectl v1.34.7                                              |
     +----------------------------------------------------------------------------------+
                                              |
                                              v
     +----------------------------------------------------------------------------------+
     | Kubernetes                                                                       |
     |                                                                                  |
     |  Control Plane:                                                                  |
     |   - kube-apiserver                                                               |
     |   - kube-controller-manager                                                      |
     |   - kube-scheduler                                                               |
     |   - etcd                                                                         |
     |                                                                                  |
     |  CNI:                                                                            |
     |   - flannel / calico (whichever your environment uses)                           |
     +----------------------------------------------------------------------------------+
                                              |
                                              v
     +----------------------------------------------------------------------------------+
     | NVIDIA GPU Stack                                                                 |
     |   - NVIDIA Driver 580.126.09                                                     |
     |   - NVIDIA Container Toolkit                                                     |
     |   - GPU Operator                                                                 |
     |   - nvidia-device-plugin                                                         |
     |   - gpu-feature-discovery                                                        |
     |                                                                                  |
     |   Node Resource:                                                                 |
     |   - nvidia.com/gpu: 2                                                            |
     +----------------------------------------------------------------------------------+
                                              |
                                              v
     +----------------------------------------------------------------------------------+
     | HAMi                                                                             |
     |   - hami-scheduler                                                               |
     |   - hami-device-plugin                                                           |
     |   - Node label: gpu=on                                                           |
     |                                                                                  |
     |   Purpose: GPU sharing / fine-grained scheduling / multi-tenant management       |
     +----------------------------------------------------------------------------------+
                                              |
                                              v
                                +-----------------------------+
                                | AI Pod / GPU workload       |
                                | Inference/training/sharing  |
                                +-----------------------------+

1. Install Helm

curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

2. Install the driver and the NVIDIA Container Toolkit

Install the NVIDIA driver (to be added)

Install the NVIDIA Container Toolkit

Configure the apt source

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Install the toolkit

sudo apt update
sudo apt install -y nvidia-container-toolkit

Configure containerd

If your container runtime is containerd, run:

sudo nvidia-ctk runtime configure --runtime=containerd

Then restart both services:

sudo systemctl restart containerd

sudo systemctl restart kubelet

Verify the containerd configuration. The drop-in file written by nvidia-ctk should be listed:

ls /etc/containerd/conf.d/

99-nvidia.toml
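
Beyond listing the file, it can help to confirm that it really registers an nvidia runtime. The helper below is only a sketch: the name `has_nvidia_runtime` is made up for this article, and it assumes the generated TOML contains a `runtimes.nvidia` table, which may vary between toolkit and containerd versions.

```shell
# Hypothetical helper: succeed if a containerd config file mentions an
# "nvidia" runtime table. The exact TOML layout is an assumption.
has_nvidia_runtime() {
  conf_file="${1:-/etc/containerd/conf.d/99-nvidia.toml}"
  [ -f "$conf_file" ] && grep -q 'runtimes\.nvidia' "$conf_file"
}

# Usage on the GPU node:
#   has_nvidia_runtime && echo "nvidia runtime configured"
```

If the check fails, rerun the nvidia-ctk command above and restart containerd before moving on.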

3. Verify Helm

Helm was already installed in step 1. Confirm that it works before adding chart repositories:

helm version

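4. Install the official NVIDIA GPU Operator

Installing the nvidia device plugin directly runs into configuration problems: the GPU may not be detected, nodes have to be labeled by hand, and so on. The GPU Operator handles these for you.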

Add the Helm repository

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

List the available versions:

helm search repo nvidia/gpu-operator -l | head


Create the namespace

kubectl create namespace gpu-operator --dry-run=client -o yaml | kubectl apply -f -


Install the GPU Operator

Because the driver is already installed on the host, disable the Operator's own driver installation:

helm install --wait --generate-name \
  -n gpu-operator \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=true \
  --set cdi.enabled=false

Check the node resources

kubectl describe node ubuntu2401 | grep -A20 -E "Capacity:|Allocatable:"

The goal is to see the GPU count under both Capacity and Allocatable:

nvidia.com/gpu: 2
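
For scripting, the count can be pulled out of the describe output instead of being read by eye. This is a minimal sketch; `gpu_count` is a hypothetical helper name, and it simply takes the first `nvidia.com/gpu:` entry it sees (the Capacity one).

```shell
# Hypothetical helper: read `kubectl describe node` output on stdin and
# print the first nvidia.com/gpu count it finds (the Capacity entry).
gpu_count() {
  awk '$1 == "nvidia.com/gpu:" { print $2; exit }'
}

# Usage, with the node name used in this article:
#   kubectl describe node ubuntu2401 | gpu_count
```

An empty result means the device plugin has not registered the resource yet, so check the pods in the gpu-operator namespace.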

Test Pod after installation

Once the node reports nvidia.com/gpu: 2, create a test manifest:

cat > gpu-test.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

kubectl apply -f gpu-test.yaml

kubectl logs -f gpu-test
    
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5060 Ti     Off |   00000000:0A:00.0 Off |                  N/A |
|  0%   21C    P8              5W /  180W |       0MiB /  16311MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

If the log shows the GPU information, then:

  • The Operator is working

  • The device plugin is working

  • K8s has detected the GPU

  • You can go on to install HAMi
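
The same check can be scripted by extracting the driver version from the Pod log and comparing it with the host driver (580.126.09 in this environment). The sketch below assumes the usual nvidia-smi banner format; `driver_version` is a hypothetical helper name.

```shell
# Hypothetical helper: pull the driver version out of nvidia-smi text on stdin.
# Assumes the standard "Driver Version: X.Y.Z" banner line.
driver_version() {
  sed -n 's/.*Driver Version: \([0-9.]*\).*/\1/p' | head -n 1
}

# Usage: wait for the test Pod to finish, then read its log.
#   kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=120s
#   kubectl logs gpu-test | driver_version
```

No output means nvidia-smi did not run inside the container; `kubectl describe pod gpu-test` is the place to look next.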

5. Install HAMi

Add the HAMi Helm repository

helm repo add hami-charts https://project-hami.github.io/HAMi/

helm repo update

List the charts:

helm search repo hami-charts -l


Create the namespace

kubectl create namespace hami-system --dry-run=client -o yaml | kubectl apply -f -


Install HAMi

Start by installing the default version:

helm install hami hami-charts/hami -n hami-system

For a more cautious approach, dump the chart's default values first:

helm show values hami-charts/hami > hami-values.yaml

Then edit the file and install with your customized values.
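
One way to do the customized install is sketched below. It uses Helm's standard `-f` flag to pass the values file; the wrapper name `install_hami` and the existence check are additions for illustration, not part of HAMi's own instructions.

```shell
# Hypothetical wrapper: install HAMi from an edited values file, refusing to
# run if the file is missing so a typo does not silently install defaults.
install_hami() {
  values_file="${1:-hami-values.yaml}"
  if [ ! -f "$values_file" ]; then
    echo "values file not found: $values_file" >&2
    return 1
  fi
  helm install hami hami-charts/hami -n hami-system -f "$values_file"
}

# Usage:
#   install_hami hami-values.yaml
```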


Check the HAMi component status

kubectl get pods -n hami-system -o wide

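The HAMi daemonset also runs only on GPU nodes, selected by the gpu=on node label. Until a node carries that label, the daemonset schedules no Pods (DESIRED is 0), so the node has to be labeled by hand: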

kubectl get ds -n hami-system
NAME                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
hami-device-plugin   0         0         0       0            0           gpu=on          14h
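
A quick way to spot this condition is to filter the daemonset listing for rows whose DESIRED count is 0. The sketch below relies on the default column layout shown above; `unscheduled_daemonsets` is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get ds` output on stdin and print each
# daemonset whose DESIRED count is 0, together with its node selector.
unscheduled_daemonsets() {
  awk 'NR > 1 && $2 == 0 { print $1, $7 }'
}

# Usage:
#   kubectl get ds -n hami-system | unscheduled_daemonsets
```

Each printed selector is a label that no node currently carries.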

Add the gpu=on label to the node (substitute your own node name, as listed by kubectl get nodes)

kubectl label node nf5588m4.goodix.com gpu=on --overwrite


Check that the label took effect

kubectl get node nf5588m4.goodix.com --show-labels

Or:

kubectl get nodes --show-labels | grep nf5588m4.goodix.com
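
The LABELS column is long, so an exact match on one key=value pair is easier to trust than scanning it by eye. The sketch below parses the `--show-labels` output; `node_has_label` is a hypothetical helper name. A simpler alternative is `kubectl get nodes -l gpu=on`, which lists only the nodes carrying the label.

```shell
# Hypothetical helper: read `kubectl get node NAME --show-labels` output on
# stdin and succeed if the LABELS column contains the given key=value pair.
node_has_label() {
  want="$1"
  awk -v want="$want" '
    NR > 1 {
      # LABELS is the last column: a comma-separated list of key=value pairs.
      n = split($NF, labels, ",")
      for (i = 1; i <= n; i++) if (labels[i] == want) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}

# Usage (substitute your own node name):
#   kubectl get node ubuntu2401 --show-labels | node_has_label gpu=on && echo labeled
```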


To remove the label later (again with your own node name):

kubectl label node ubuntu2401 gpu-


Once the node is labeled, the HAMi components should all be Running:

kubectl get pods -n hami-system
NAME                             READY   STATUS    RESTARTS   AGE
hami-device-plugin-v7mtc         2/2     Running   0          13h
hami-scheduler-97b7b6c9b-dqjpv   2/2     Running   0          14h
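
To turn that visual check into a pass/fail result, the STATUS column can be tested for every row. This is a sketch that assumes the default `kubectl get pods` column layout; `all_pods_running` is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and succeed
# only if at least one Pod is listed and every Pod's STATUS is Running.
all_pods_running() {
  awk 'NR > 1 { seen = 1; if ($3 != "Running") bad = 1 }
       END { exit ((seen && !bad) ? 0 : 1) }'
}

# Usage:
#   kubectl get pods -n hami-system | all_pods_running && echo "HAMi is up"
```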

Test Pod

cat hami-test.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: hami-test
spec:
  restartPolicy: Never
  containers:
  - name: test
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["bash", "-c", "nvidia-smi && sleep 3600"]
    resources:
      limits:
        nvidia.com/gpu: 1
        nvidia.com/gpumem: 4000

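The Pod can now request a specific amount of GPU memory: nvidia.com/gpumem: 4000 asks for roughly 4000 MiB of the card's memory rather than the whole GPU.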
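
To see the limit from inside the container, apply the manifest and read the memory total that nvidia-smi reports. The sketch below assumes HAMi's limit is reflected in the nvidia-smi output inside the Pod, in which case the total should be close to the requested 4000 MiB instead of the card's full 16311 MiB. `gpu_mem_total_mib` is a hypothetical helper name.

```shell
# Hypothetical helper: read nvidia-smi text on stdin and print the total
# memory (in MiB) from the first "used MiB / total MiB" field.
gpu_mem_total_mib() {
  sed -n 's/.*MiB \/ *\([0-9][0-9]*\)MiB.*/\1/p' | head -n 1
}

# Usage:
#   kubectl apply -f hami-test.yaml
#   kubectl logs hami-test | gpu_mem_total_mib
```

If the value still matches the full card, check that the Pod was actually placed by the HAMi scheduler, for example with `kubectl describe pod hami-test`.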

