Missing NVIDIA GPU plugin error

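Model deployment fails with an insufficient-GPU error even though GPU nodes are available. This happens when the NVIDIA device plugin is not installed in the HyperPod cluster.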

Error message:


0/15 nodes are available: 10 node(s) didn't match Pod's node affinity/selector, 
5 Insufficient nvidia.com/gpu. preemption: 0/15 nodes are available: 
10 Preemption is not helpful for scheduling, 5 No preemption victims found for incoming pod.
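
To confirm that a pending pod is blocked by this specific reason, the scheduler message can be checked for the `Insufficient nvidia.com/gpu` count. The following is a minimal sketch, not part of the original guidance: `count_gpu_short_nodes` is a hypothetical helper name, and it assumes the message has the same shape as the one shown above.

```shell
# Hypothetical helper: reads a FailedScheduling message on stdin and prints
# how many nodes were rejected for "Insufficient nvidia.com/gpu".
# Prints 0 when that reason does not appear in the input.
count_gpu_short_nodes() {
  awk '
    match($0, /[0-9]+ Insufficient nvidia\.com\/gpu/) {
      # The matched text looks like "5 Insufficient nvidia.com/gpu";
      # the first space-separated field is the node count.
      split(substr($0, RSTART, RLENGTH), parts, " ")
      total += parts[1]
    }
    END { print total + 0 }
  '
}
```

One possible use is to pipe the pod description into it, for example `kubectl describe pod <pod-name> -n <namespace> | count_gpu_short_nodes`, where the pod name and namespace are placeholders for your own deployment.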

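Root cause:

Without the NVIDIA device plugin, Kubernetes cannot detect GPU resources, so nodes do not advertise any allocatable nvidia.com/gpu
As a result, GPU workloads fail to schedule

Resolution:

Install the NVIDIA GPU plugin by running the following command: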


kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/refs/tags/v0.17.1/deployments/static/nvidia-device-plugin.yml
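
After the manifest is created, the plugin pods take a short time to start on each node. The sketch below waits for that rollout to finish. It assumes the static manifest creates a DaemonSet named `nvidia-device-plugin-daemonset` in the `kube-system` namespace; check the manifest you installed and pass a different name if yours differs. The function name `wait_for_device_plugin` and the default timeout are illustrative choices.

```shell
# Hypothetical helper: waits until the device plugin DaemonSet is rolled out.
# Assumption: the DaemonSet is named nvidia-device-plugin-daemonset and lives
# in kube-system. Override the name with $1 and the timeout with $2.
wait_for_device_plugin() {
  ds_name="${1:-nvidia-device-plugin-daemonset}"
  timeout="${2:-120s}"
  kubectl rollout status "daemonset/${ds_name}" -n kube-system --timeout="${timeout}"
}
```

Calling `wait_for_device_plugin` with no arguments uses the assumed defaults; it returns a non-zero status if the rollout does not complete within the timeout.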

Verification steps:

Check the plugin deployment status:


kubectl get pods -n kube-system | grep nvidia-device-plugin

Verify that GPU resources are now visible:


kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\\.com/gpu
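
On a large cluster it can help to list only the nodes that still report no GPUs. The following sketch filters the two-column output of the command above. `nodes_without_gpu` is a hypothetical helper name, and it assumes a node with no advertised GPUs shows `<none>` or `0` in the GPU column. Non-GPU nodes will also be listed, which is expected.

```shell
# Hypothetical helper: reads the NAME/GPU table on stdin and prints the names
# of nodes whose GPU column is "<none>" or "0". The header row is skipped.
nodes_without_gpu() {
  awk 'NR > 1 && ($2 == "<none>" || $2 == "0") { print $1 }'
}
```

For example: `kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\\.com/gpu | nodes_without_gpu`. If a GPU node appears in the result after the plugin pods are running, the NVIDIA driver on that node is worth checking.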

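Retry the model deployment

Note

Make sure the NVIDIA drivers are installed on the GPU nodes. Plugin installation is a one-time setup per cluster. Installing the plugin may require cluster administrator privileges.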
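
Before attempting the installation, you can check whether your current credentials are likely to be sufficient. This is a rough sketch that assumes the plugin is installed as a DaemonSet in `kube-system`; the manifest may create other resources that this check does not cover. `can_install_device_plugin` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds only if the current user may create
# DaemonSets in kube-system, according to `kubectl auth can-i`.
can_install_device_plugin() {
  [ "$(kubectl auth can-i create daemonsets --namespace kube-system)" = "yes" ]
}
```

For example: `can_install_device_plugin || echo "Ask a cluster administrator to install the plugin"`.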
