


# Checkpointless training in Amazon SageMaker HyperPod
<a name="sagemaker-eks-checkpointless"></a>

Checkpointless training on Amazon SageMaker HyperPod enables faster recovery from training infrastructure failures. The following documentation helps you get started using checkpointless training and fine-tuning with supported NeMo models.

Checkpointless training has the following prerequisites:
+ [Getting started with Amazon EKS support in SageMaker HyperPod](sagemaker-hyperpod-eks-prerequisites.md)
+ [Install the training operator](sagemaker-eks-operator-install.md). You must install v1.2.0 or later.

SageMaker HyperPod checkpointless training builds on the [NVIDIA NeMo Framework user guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/core/exp_manager.html#experiment-manager). You can use pre-built SageMaker HyperPod recipes for checkpointless training. If you are familiar with NeMo, the process of using a checkpointless training recipe is similar. With only minor changes, you can start training models with checkpointless training features that enable rapid recovery from training errors.

The following HyperPod recipes are preconfigured with checkpointless training optimizations. You can specify the data path as part of the recipe and use the associated launch script to run training (see the quickstart guides below):


| Model | Method | Size | Nodes | Instance | Accelerator | Recipe | Script | Tutorial | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| GPT OSS | Full fine-tuning example | 120b | 16 | p5.48xlarge | H100 GPU | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/gpt_oss/checkpointless_gpt_oss_120b_full_fine_tuning.yaml) | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_full_fine_tuning.sh) | [link](https://docs.amazonaws.cn/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-finetune.html) | 
| GPT OSS | LoRA example | 120b | 2 | p5.48xlarge | H100 GPU | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/gpt_oss/checkpointless_gpt_oss_120b_lora.yaml) | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_lora.sh) | [link](https://docs.amazonaws.cn/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-peft.html) | 
| Llama3 | Pretraining example | 70b | 16 | p5.48xlarge | H100 GPU | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/training/llama/checkpointless_llama3_70b_pretrain.yaml) | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/launcher_scripts/llama/run_checkpointless_llama3_70b_pretrain.sh) | [link](https://docs.amazonaws.cn/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-pretraining-llama3.html) | 
| Llama3 | LoRA example | 70b | 2 | p5.48xlarge | H100 GPU | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/llama/checkpointless_llama3_70b_lora.yaml) | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/launcher_scripts/llama/run_checkpointless_llama3_70b_lora.sh) | [link](https://docs.amazonaws.cn/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-peft-llama.html) | 
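
For orientation, the fragment below sketches the kind of fields you would typically adjust in one of these recipe YAML files before launching: the run name, the results directory, and the training data paths. The exact key names vary by recipe; treat the ones shown here (`run.results_dir`, `model.data.train_dir`) as illustrative assumptions and consult the linked recipe YAML for the authoritative schema.

```yaml
# Illustrative recipe overrides -- key names are assumptions; check the
# linked recipe (e.g. checkpointless_llama3_70b_pretrain.yaml) for the
# actual schema used by your recipe.
run:
  name: checkpointless-llama3-70b-pretrain   # experiment name
  results_dir: /fsx/results                  # where logs and outputs are written
model:
  data:
    train_dir: /fsx/data/train               # path to your training dataset
    val_dir: /fsx/data/val                   # path to your validation dataset
```

Depending on the launcher script, values like these may also be supplied as environment variables or command-line overrides rather than edited in the YAML directly; see the script linked in the table for the parameters it accepts.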

The following quickstart guides provide tutorials for using the checkpointless training recipes:

**Getting started examples**
+ [Tutorial - Amazon SageMaker HyperPod checkpointless full fine-tuning of GPT OSS 120b](sagemaker-eks-checkpointless-recipes-finetune.md)
+ [Tutorial - Amazon SageMaker HyperPod checkpointless PEFT LoRA for GPT OSS 120b](sagemaker-eks-checkpointless-recipes-peft.md)
+ [Tutorial - Amazon SageMaker HyperPod checkpointless pretraining of Llama 3 70b](sagemaker-eks-checkpointless-recipes-pretraining-llama3.md)
+ [Tutorial - Amazon SageMaker HyperPod checkpointless PEFT LoRA for Llama 3 70b](sagemaker-eks-checkpointless-recipes-peft-llama.md)

If you want to pretrain or fine-tune a custom model, see [Tutorial - Amazon SageMaker HyperPod checkpointless pretraining or fine-tuning of a custom model](sagemaker-eks-checkpointless-recipes-custom.md).

To learn more about integrating specific checkpointless training components, see [HyperPod checkpointless training features](sagemaker-eks-checkpointless-features.md).