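The previous chapter, "Installing K8S 1.34 on Ubuntu 24.04", set up a Kubernetes cluster with kubeadm. This chapter uses that cluster to schedule GPUs. The stack, from hardware up to workloads, looks like this: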
Physical GPU
↓
NVIDIA Driver
↓
NVIDIA Container Toolkit
↓
GPU Operator / nvidia-device-plugin
↓
Kubernetes node resource (nvidia.com/gpu)
↓
HAMi Device Plugin
↓
HAMi Scheduler
↓
AI Pod (training / inference / sharing)
Installation environment:
                          +-----------------------------+
                          |                             |
                          | Single-node K8s + GPU node  |
                          +-------------+---------------+
                                        |
+----------------------------------------------------------------------------------+
| Ubuntu 24.04.3 LTS                                                               |
| Kernel 6.8.0-90                                                                  |
| containerd 2.2.1                                                                 |
| kubelet / kubeadm / kubectl v1.34.7                                              |
+----------------------------------------------------------------------------------+
                                        |
                                        v
+----------------------------------------------------------------------------------+
| Kubernetes                                                                       |
|                                                                                  |
| Control Plane:                                                                   |
|   - kube-apiserver                                                               |
|   - kube-controller-manager                                                      |
|   - kube-scheduler                                                               |
|   - etcd                                                                         |
|                                                                                  |
| CNI:                                                                             |
|   - flannel / calico (whichever your environment uses)                           |
+----------------------------------------------------------------------------------+
                                        |
                                        v
+----------------------------------------------------------------------------------+
| NVIDIA GPU Stack                                                                 |
|   - NVIDIA Driver 580.126.09                                                     |
|   - NVIDIA Container Toolkit                                                     |
|   - GPU Operator                                                                 |
|   - nvidia-device-plugin                                                         |
|   - gpu-feature-discovery                                                        |
|                                                                                  |
| Node Resource:                                                                   |
|   - nvidia.com/gpu: 2                                                            |
+----------------------------------------------------------------------------------+
                                        |
                                        v
+----------------------------------------------------------------------------------+
| HAMi                                                                             |
|   - hami-scheduler                                                               |
|   - hami-device-plugin                                                           |
|   - Node label: gpu=on                                                           |
|                                                                                  |
| Purpose: GPU sharing / fine-grained scheduling / multi-tenant management         |
+----------------------------------------------------------------------------------+
                                        |
                                        v
                          +-----------------------------+
                          |    AI Pod / GPU workload    |
                          | Inference/training/sharing  |
                          +-----------------------------+
1. Install Helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
2. Install the driver and the NVIDIA Container Toolkit
Install the NVIDIA driver (to be added)
Install the NVIDIA Container Toolkit
Configure the apt repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
Install the toolkit
sudo apt update
sudo apt install -y nvidia-container-toolkit
Configure containerd
If your container runtime is containerd, run:
sudo nvidia-ctk runtime configure --runtime=containerd
Then restart containerd and kubelet:
sudo systemctl restart containerd
sudo systemctl restart kubelet
Verify the containerd configuration
ls /etc/containerd/conf.d/
99-nvidia.toml
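Listing the directory only shows that the drop-in file exists. A minimal sketch of a content check follows, assuming the file generated by nvidia-ctk declares the runtime in a table whose name contains runtimes.nvidia. The helper has_nvidia_runtime and the CONF_DIR variable are hypothetical names used for illustration.

```shell
#!/usr/bin/env bash
# has_nvidia_runtime DIR: succeed if any *.toml file under DIR mentions a
# containerd runtime table named "nvidia". Hypothetical helper, not part of nvidia-ctk.
has_nvidia_runtime() {
  local dir="$1"
  grep -rqsF --include='*.toml' 'runtimes.nvidia' "$dir"
}

# CONF_DIR is an illustrative override; the default is the directory listed above.
conf_dir="${CONF_DIR:-/etc/containerd/conf.d}"
if has_nvidia_runtime "$conf_dir"; then
  echo "nvidia runtime found in $conf_dir"
else
  echo "nvidia runtime NOT found in $conf_dir"
fi
```

If the check fails, rerunning sudo nvidia-ctk runtime configure --runtime=containerd and restarting containerd is the first thing to try.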
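3. Verify Helm
Helm was already installed in step 1; confirm that it works:
helm version
4. Install the official NVIDIA GPU Operator
Installing the nvidia-device-plugin directly runs into configuration problems: the GPUs may not be detected, and the nodes have to be labeled by hand. The GPU Operator handles these details.
Add the Helm repository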
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
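helm repo update
List the available versions:
helm search repo nvidia/gpu-operator -l | head
Create the namespace
kubectl create namespace gpu-operator --dry-run=client -o yaml | kubectl apply -f -
Install the GPU Operator
Because the driver is already installed on the host, disable the Operator's own driver installation: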
helm install --wait --generate-name \
-n gpu-operator \
nvidia/gpu-operator \
--set driver.enabled=false \
--set toolkit.enabled=true \
--set cdi.enabled=false
Check the node resources
kubectl describe node ubuntu2401 | grep -A20 -E "Capacity:|Allocatable:"
The goal is to see the GPU count:
nvidia.com/gpu: 2
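To read the count from a script instead of scanning the output by eye, a small filter is enough. This is a sketch only: gpu_count is a hypothetical helper, and it assumes the describe output contains a line of the form nvidia.com/gpu: N as shown above.

```shell
#!/usr/bin/env bash
# gpu_count: read `kubectl describe node` output on stdin and print the first
# nvidia.com/gpu value (the Capacity entry). Prints nothing if the resource is absent.
gpu_count() {
  awk '$1 == "nvidia.com/gpu:" { print $2; exit }'
}

# Only talk to the cluster when a node name is passed as the first argument.
if [ "$#" -gt 0 ]; then
  kubectl describe node "$1" | gpu_count
fi
```

Saved as a script and called with a node name (for example ubuntu2401), it prints the GPU count. An empty result usually means the device plugin has not registered the resource yet.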
Test Pod after installation
Once the node reports nvidia.com/gpu: 2, create the test manifest:
cat > gpu-test.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
kubectl apply -f gpu-test.yaml
kubectl logs -f gpu-test
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5060 Ti     Off |   00000000:0A:00.0 Off |                  N/A |
|  0%   21C    P8              5W /  180W |       0MiB /  16311MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
If the log shows the GPU information, it means:
The Operator is working
The device plugin is working
Kubernetes has recognized the GPU
You can go on to install HAMi
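The same check can be scripted. The sketch below pulls the driver version out of the nvidia-smi banner in the Pod log; driver_version is a hypothetical helper, and it assumes the banner contains the text "Driver Version: x.y.z" as in the output above.

```shell
#!/usr/bin/env bash
# driver_version: read nvidia-smi output on stdin and print the driver version,
# or nothing if the banner line is missing. Hypothetical helper.
driver_version() {
  sed -n 's/.*Driver Version: \([0-9.]*\).*/\1/p' | head -n 1
}

# Only query the cluster when a Pod name is passed as the first argument.
if [ "$#" -gt 0 ]; then
  version="$(kubectl logs "$1" | driver_version)"
  if [ -n "$version" ]; then
    echo "GPU visible in Pod $1, driver $version"
  else
    echo "no nvidia-smi banner found in the logs of $1" >&2
    exit 1
  fi
fi
```

Called with gpu-test as the argument, it exits non-zero when the Pod did not see a GPU, which makes it usable as a gate before the HAMi installation.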
5. Install HAMi
Add the HAMi Helm repository
helm repo add hami-charts https://project-hami.github.io/HAMi/
helm repo update
List the chart versions:
helm search repo hami-charts -l
Create the namespace
kubectl create namespace hami-system --dry-run=client -o yaml | kubectl apply -f -
Install HAMi
Start by installing the default version:
helm install hami hami-charts/hami -n hami-system
For a more controlled install, export the default values first:
helm show values hami-charts/hami > hami-values.yaml
Then edit the file and install with the customized values.
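A minimal sketch of that customized install, assuming hami-values.yaml is the exported file after editing. The run wrapper and the DRY_RUN variable are illustrative names, not Helm features: by default the command is only printed, and DRY_RUN=0 executes it.

```shell
#!/usr/bin/env bash
# run: print the command when DRY_RUN is 1 (the default here), execute it otherwise.
# Both names are illustrative and not part of Helm.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Install HAMi using the edited values file instead of the chart defaults.
run helm install hami hami-charts/hami -n hami-system -f hami-values.yaml
```

Printing first gives a chance to review the exact command before it changes the cluster.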
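Check the HAMi component status
kubectl get pods -n hami-system -o wide
The HAMi DaemonSet also runs only on GPU nodes, selected by the gpu=on label, so the node has to be labeled manually. Until then the DaemonSet schedules nothing:
kubectl get ds -n hami-system
NAME                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
hami-device-plugin   0         0         0       0            0           gpu=on          14h
Add the gpu=on label to the node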
kubectl label node nf5588m4.goodix.com gpu=on --overwrite
Check that the label took effect
kubectl get node nf5588m4.goodix.com --show-labels
Or:
kubectl get nodes --show-labels | grep nf5588m4.goodix.com
To remove the label later (use the name of the node you labeled):
kubectl label node ubuntu2401 gpu-
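With more than one GPU node, typing the label command per node gets tedious. The sketch below generates the commands from the allocatable nvidia.com/gpu count of each node. label_cmds is a hypothetical helper, and it assumes kubectl prints a placeholder such as <none> for nodes that do not report the resource.

```shell
#!/usr/bin/env bash
# label_cmds: read "<node> <gpu-count>" lines on stdin and print one
# `kubectl label` command for every node that reports at least one GPU.
label_cmds() {
  awk '$2 ~ /^[0-9]+$/ && $2 + 0 > 0 {
    printf "kubectl label node %s gpu=on --overwrite\n", $1
  }'
}

# Only query the cluster when called with the argument "nodes".
if [ "${1:-}" = "nodes" ]; then
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
    label_cmds
fi
```

The script only prints the commands; review them, then pipe the output to bash to apply the labels.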
Once the node is labeled, the HAMi components should all be Running.
kubectl get pods -n hami-system
NAME                             READY   STATUS    RESTARTS   AGE
hami-device-plugin-v7mtc         2/2     Running   0          13h
hami-scheduler-97b7b6c9b-dqjpv   2/2     Running   0          14h
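A scripted version of this readiness check could look like the sketch below. not_ready_pods is a hypothetical helper that compares the two halves of the READY column. Note that Pods in Completed state show 0/1 and would be listed as well.

```shell
#!/usr/bin/env bash
# not_ready_pods: read `kubectl get pods --no-headers` output on stdin and print
# the names of Pods whose READY column is not n/n. Hypothetical helper.
not_ready_pods() {
  awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Only query the cluster when a namespace is passed as the first argument.
if [ "$#" -gt 0 ]; then
  kubectl get pods -n "$1" --no-headers | not_ready_pods
fi
```

Called with hami-system as the argument, empty output means every container in the namespace is ready.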
Test Pod
cat hami-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: hami-test
spec:
  restartPolicy: Never
  containers:
    - name: test
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["bash", "-c", "nvidia-smi && sleep 3600"]
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 4000
With HAMi in place, a Pod can now request a specific amount of GPU memory through nvidia.com/gpumem.
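One way to confirm the limit is to compare the memory that nvidia-smi reports inside the container with the requested value. This is a sketch under two assumptions that the text above does not confirm: that HAMi makes nvidia-smi inside the container report the limited amount, and that nvidia.com/gpumem uses the same unit (MiB) as nvidia-smi. within_limit is a hypothetical helper.

```shell
#!/usr/bin/env bash
# within_limit TOTAL LIMIT: succeed if the memory visible inside the container
# does not exceed the requested nvidia.com/gpumem value (both assumed to be MiB).
within_limit() {
  [ "$1" -le "$2" ]
}

# Only query the cluster when a Pod name is passed as the first argument.
if [ "$#" -gt 0 ]; then
  total="$(kubectl exec "$1" -- nvidia-smi --query-gpu=memory.total \
    --format=csv,noheader,nounits | head -n 1 | tr -d '[:space:]')"
  if within_limit "$total" 4000; then
    echo "memory limit applied: ${total} MiB visible"
  else
    echo "memory limit not applied: ${total} MiB visible"
  fi
fi
```

Called with hami-test as the argument while the Pod is still sleeping, a result near 16311 MiB (the full card) would suggest the Pod was not scheduled through HAMi.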