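Release notes
See the following release notes to track the latest updates for SageMaker HyperPod checkpointless training.
SageMaker HyperPod checkpointless training v1.0.0
Date: December 3, 2025
SageMaker HyperPod checkpointless training features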
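- Collective communication initialization improvements: Offers novel initialization methods, Rootless and TCPStoreless, for NCCL and Gloo.
- Memory-mapped (MMAP) dataloader: Caches (persists) prefetched batches so that they remain available even when a fault causes the training job to restart.
- Checkpointless: Enables faster recovery from cluster training faults in large-scale distributed training environments through framework-level optimizations.
- Built on NVIDIA NeMo and PyTorch Lightning: Leverages these frameworks for efficient and flexible model training.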
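The idea behind the memory-mapped dataloader can be illustrated with a minimal sketch. This is not the HyperPod implementation: the file layout, the fixed batch size, and the function names `write_batches` and `read_batches` are assumptions made only for illustration. The sketch shows that batches written through a file-backed memory map can be read back by a fresh reader, which is the property that lets prefetched batches survive a restart.

```python
import mmap
import os
import struct
import tempfile

# Illustrative layout (an assumption, not the HyperPod format):
# a 4-byte batch count followed by fixed-size float32 records.
BATCH_FLOATS = 4
RECORD = struct.Struct(f"<{BATCH_FLOATS}f")
HEADER = struct.Struct("<I")


def write_batches(path, batches):
    """Persist prefetched batches into a file-backed memory map."""
    size = HEADER.size + RECORD.size * len(batches)
    with open(path, "wb") as f:
        f.truncate(size)  # reserve space before mapping
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), size) as mm:
        HEADER.pack_into(mm, 0, len(batches))
        for i, batch in enumerate(batches):
            RECORD.pack_into(mm, HEADER.size + i * RECORD.size, *batch)
        mm.flush()  # push dirty pages to the backing file


def read_batches(path):
    """Re-open the cache, as a restarted job would, and return the batches."""
    with open(path, "rb") as f, mmap.mmap(
        f.fileno(), 0, access=mmap.ACCESS_READ
    ) as mm:
        (count,) = HEADER.unpack_from(mm, 0)
        return [
            list(RECORD.unpack_from(mm, HEADER.size + i * RECORD.size))
            for i in range(count)
        ]


if __name__ == "__main__":
    cache = os.path.join(tempfile.mkdtemp(), "batches.bin")
    write_batches(cache, [[1.0, 2.0, 3.0, 4.0], [0.5, 0.25, 8.0, 16.0]])
    # A separate call stands in for the restarted training process.
    print(read_batches(cache))
```

In this sketch the reader shares no state with the writer beyond the file path, so the cached batches do not depend on the original process staying alive.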
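SageMaker HyperPod checkpointless training Docker container
Checkpointless training on HyperPod is built on NVIDIA NeMo.
Availability
Currently, images are only available in the following Regions: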
eu-north-1 ap-south-1 us-east-2 eu-west-1 eu-central-1 sa-east-1 us-east-1 eu-west-2 ap-northeast-1 us-west-2 us-west-1 ap-southeast-1 ap-southeast-2
They are not available in the following 3 opt-in Regions:
ap-southeast-3 ap-southeast-4 eu-south-2
Container details
Checkpointless training Docker container for PyTorch v2.6.0 with CUDA v12.9
963403601044.dkr.ecr.eu-north-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.0
423350936952.dkr.ecr.ap-south-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.0
556809692997.dkr.ecr.us-east-2.amazonaws.com/hyperpod-checkpointless-training:v1.0.0
942446708630.dkr.ecr.eu-west-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.0
391061375763.dkr.ecr.eu-central-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.0
311136344257.dkr.ecr.sa-east-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.0
327873000638.dkr.ecr.us-east-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.0
016839105697.dkr.ecr.eu-west-2.amazonaws.com/hyperpod-checkpointless-training:v1.0.0
356859066553.dkr.ecr.ap-northeast-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.0
920498770698.dkr.ecr.us-west-2.amazonaws.com/hyperpod-checkpointless-training:v1.0.0
827510180725.dkr.ecr.us-west-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.0
885852567298.dkr.ecr.ap-southeast-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.0
304708117039.dkr.ecr.ap-southeast-2.amazonaws.com/hyperpod-checkpointless-training:v1.0.0
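Because the registry account ID differs per Region, it can help to look up the image URI programmatically. The following is a minimal sketch: the account IDs, repository name, and tag are copied from the image list above, while the helper name `checkpointless_image_uri` and the error message are hypothetical.

```python
# Account IDs taken from the v1.0.0 image list above, keyed by Region.
ECR_ACCOUNTS = {
    "eu-north-1": "963403601044",
    "ap-south-1": "423350936952",
    "us-east-2": "556809692997",
    "eu-west-1": "942446708630",
    "eu-central-1": "391061375763",
    "sa-east-1": "311136344257",
    "us-east-1": "327873000638",
    "eu-west-2": "016839105697",
    "ap-northeast-1": "356859066553",
    "us-west-2": "920498770698",
    "us-west-1": "827510180725",
    "ap-southeast-1": "885852567298",
    "ap-southeast-2": "304708117039",
}

REPOSITORY = "hyperpod-checkpointless-training"
TAG = "v1.0.0"


def checkpointless_image_uri(region: str) -> str:
    """Return the v1.0.0 image URI for a Region (hypothetical helper)."""
    try:
        account = ECR_ACCOUNTS[region]
    except KeyError:
        # Covers the opt-in Regions where the image is not published.
        raise ValueError(f"no v1.0.0 image is listed for Region {region!r}") from None
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{REPOSITORY}:{TAG}"
```

Calling the helper with a Region that is not in the list, such as `eu-south-2`, raises `ValueError` instead of producing a URI that does not exist.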
Pre-installed packages
PyTorch: v2.6.0
CUDA: v12.9
NCCL: v2.27.5
EFA: v1.43.0
AWS-OFI-NCCL: v1.16.0
Libfabric: v2.1
Megatron: v0.15.0
NeMo: v2.6.0rc0