Learn how to activate SageMaker Training Compiler using the SageMaker Python SDK, the SageMaker Python SDK with extended SageMaker AI Framework Deep Learning Containers, or the SageMaker AI CreateTrainingJob API operation.

Run TensorFlow Training Jobs with SageMaker Training Compiler

You can use any of the SageMaker AI interfaces to run a training job with SageMaker Training Compiler: Amazon SageMaker Studio Classic, Amazon SageMaker notebook instances, Amazon SDK for Python (Boto3), and Amazon Command Line Interface.

Topics

Using the SageMaker Python SDK
Using the SageMaker AI Python SDK and Extending SageMaker AI Framework Deep Learning Containers
Enable SageMaker Training Compiler Using the SageMaker AI CreateTrainingJob API Operation

Using the SageMaker Python SDK

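To turn on SageMaker Training Compiler, add the compiler_config parameter to the SageMaker AI TensorFlow or Hugging Face estimator. Import the TrainingCompilerConfig class and pass an instance of it to the compiler_config parameter. The following code examples show the structure of the SageMaker AI estimator classes with SageMaker Training Compiler turned on.

Tip

To get started with prebuilt models provided by the TensorFlow and Transformers libraries, try using the batch sizes provided in the reference table at Tested Models.

Note

SageMaker Training Compiler for TensorFlow is available through the SageMaker AI TensorFlow and Hugging Face framework estimators.

For information that fits your use case, see one of the following options.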

TensorFlow


from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64    

# update the global learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

tensorflow_estimator=TensorFlow(
    entry_point='train.py',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.9.1',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

tensorflow_estimator.fit()

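To prepare your training script, see the following pages.

For single GPU training of a model constructed using TensorFlow Keras (tf.keras.*).
For single GPU training of a model constructed using TensorFlow modules (tf.*, excluding the TensorFlow Keras modules).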
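The estimator example above scales the learning rate in proportion to the batch size once the compiler allows a larger batch to fit in GPU memory. The following is a minimal sketch of that arithmetic on its own, using the same values as the example. The helper name scale_learning_rate is hypothetical and is not part of the SageMaker Python SDK.

```python
def scale_learning_rate(lr_native, batch_size_native, batch_size):
    """Scale the learning rate linearly with the batch size.

    Hypothetical helper: it mirrors the expression
    learning_rate_native / batch_size_native * batch_size
    used in the estimator example.
    """
    if batch_size_native <= 0 or batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return lr_native / batch_size_native * batch_size


# Values taken from the estimator example.
learning_rate = scale_learning_rate(
    lr_native=5e-5,        # learning rate tuned without the compiler
    batch_size_native=12,  # max batch size without the compiler
    batch_size=64,         # max batch size with the compiler
)

hyperparameters = {
    "n_gpus": 1,
    "batch_size": 64,
    "learning_rate": learning_rate,
}
print(hyperparameters)
```

Linear scaling is only a starting point. The batch size that fits in memory and the learning rate that converges depend on your model, so treat both as values to tune.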

Hugging Face Estimator with TensorFlow


from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64

# update the global learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

tensorflow_huggingface_estimator=HuggingFace(
    entry_point='train.py',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    transformers_version='4.21.1',
    tensorflow_version='2.6.3',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

tensorflow_huggingface_estimator.fit()

To prepare your training script, see the following pages.

For single GPU training of a Keras model with Hugging Face Transformers
For single GPU training of a TensorFlow model with Hugging Face Transformers

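The following list is the minimal set of parameters required to run a SageMaker training job with the compiler.

Note

When using the SageMaker AI Hugging Face estimator, you must specify the transformers_version, tensorflow_version, hyperparameters, and compiler_config parameters to enable SageMaker Training Compiler. You cannot use image_uri to manually specify the Training Compiler integrated Deep Learning Containers that are listed at Supported Frameworks.

entry_point (str) – Required. Specify the file name of your training script.
instance_count (int) – Required. Specify the number of instances.
instance_type (str) – Required. Specify the instance type.
transformers_version (str) – Required only when using the SageMaker AI Hugging Face estimator. Specify the Hugging Face Transformers library version supported by SageMaker Training Compiler. To find available versions, see Supported Frameworks.
framework_version or tensorflow_version (str) – Required. Specify the TensorFlow version supported by SageMaker Training Compiler. To find available versions, see Supported Frameworks.

Note
When using the SageMaker AI TensorFlow estimator, you must specify framework_version.
When using the SageMaker AI Hugging Face estimator, you must specify both transformers_version and tensorflow_version.
hyperparameters (dict) – Optional. Specify hyperparameters for the training job, such as n_gpus, batch_size, and learning_rate. When you enable SageMaker Training Compiler, try larger batch sizes and adjust the learning rate accordingly. To find case studies of using the compiler and adjusted batch sizes to improve training speed, see Tested Models and SageMaker Training Compiler Example Notebooks and Blogs.
compiler_config (TrainingCompilerConfig object) – Required. Include this parameter to turn on SageMaker Training Compiler. The following are parameters for the TrainingCompilerConfig class.
- enabled (bool) – Optional. Specify True or False to turn SageMaker Training Compiler on or off. The default value is True.
- debug (bool) – Optional. To receive more detailed training logs from your compiler-accelerated training jobs, change it to True. However, the additional logging might add overhead and slow down the compiled training job. The default value is False.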
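As a quick way to apply the parameter list above, the following sketch checks a dictionary of estimator keyword arguments for the entries the list marks as required before you construct an estimator. The function and constant names are hypothetical and not part of the SageMaker Python SDK, and the string used for compiler_config is only a stand-in for a TrainingCompilerConfig() instance so that the sketch runs without the SDK installed.

```python
# Parameters marked "Required" for both estimators in the list above.
REQUIRED_COMMON = (
    "entry_point",
    "instance_count",
    "instance_type",
    "compiler_config",
)

# Version parameters that depend on which estimator you use.
REQUIRED_BY_ESTIMATOR = {
    "tensorflow": ("framework_version",),
    "huggingface": ("transformers_version", "tensorflow_version"),
}


def missing_required_params(estimator_kind, estimator_kwargs):
    """Return the required parameter names that are absent or None.

    Hypothetical helper for illustration only.
    """
    if estimator_kind not in REQUIRED_BY_ESTIMATOR:
        raise ValueError(f"unknown estimator kind: {estimator_kind!r}")
    required = REQUIRED_COMMON + REQUIRED_BY_ESTIMATOR[estimator_kind]
    return [name for name in required if estimator_kwargs.get(name) is None]


huggingface_kwargs = {
    "entry_point": "train.py",
    "instance_count": 1,
    "instance_type": "ml.p3.2xlarge",
    "transformers_version": "4.21.1",
    # tensorflow_version is intentionally left out to show the check.
    "compiler_config": "placeholder for TrainingCompilerConfig()",
}

print(missing_required_params("huggingface", huggingface_kwargs))
```

If the returned list is empty, the keyword arguments cover the minimal set and can be passed to the estimator with `**` unpacking after you replace the placeholder with a real TrainingCompilerConfig instance.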

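Warning

If you turn on SageMaker Debugger, it might impact the performance of SageMaker Training Compiler. We recommend that you turn off Debugger when running SageMaker Training Compiler to make sure there's no impact on performance. For more information, see Considerations. To turn the Debugger functionalities off, add the following two arguments to the estimator: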


disable_profiler=True,
debugger_hook_config=False

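If you successfully launch the training job with the compiler, you receive the following logs during the job initialization phase:

With TrainingCompilerConfig(debug=False)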


Found configuration for Training Compiler
Configuring SM Training Compiler...

With TrainingCompilerConfig(debug=True)


Found configuration for Training Compiler
Configuring SM Training Compiler...
Training Compiler set to debug mode
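If you save the job logs as text, you can search them for the messages listed above. The following is a minimal sketch that assumes the log text has already been retrieved as a string. The compiler_status helper is hypothetical, and it uses substring matching because real log lines might carry timestamps or other prefixes.

```python
# Messages quoted from the initialization logs shown above.
CONFIGURED_MESSAGE = "Configuring SM Training Compiler..."
DEBUG_MESSAGE = "Training Compiler set to debug mode"


def compiler_status(log_text):
    """Report whether the compiler messages appear in the log text.

    Hypothetical helper: it only looks for the two messages and does
    not call any SageMaker or CloudWatch API.
    """
    return {
        "configured": CONFIGURED_MESSAGE in log_text,
        "debug_mode": DEBUG_MESSAGE in log_text,
    }


sample_log = "\n".join(
    [
        "Found configuration for Training Compiler",
        "Configuring SM Training Compiler...",
        "Training Compiler set to debug mode",
    ]
)

print(compiler_status(sample_log))
```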

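Using the SageMaker AI Python SDK and Extending SageMaker AI Framework Deep Learning Containers

Amazon Deep Learning Containers (DLC) for TensorFlow use adapted versions of TensorFlow that include changes on top of the open source TensorFlow framework. The SageMaker AI Framework Deep Learning Containers are optimized for the underlying Amazon infrastructure and Amazon SageMaker AI. On top of the advantages of using the DLCs, the SageMaker Training Compiler integration adds further performance improvements over native TensorFlow. Furthermore, you can create a custom training container by extending the DLC image.

Note

This Docker customization feature is currently available only for TensorFlow.

To extend and customize the SageMaker AI TensorFlow DLCs for your use case, use the following instructions.

Create a Dockerfile

Use the following Dockerfile template to extend the SageMaker AI TensorFlow DLC. You must use the SageMaker AI TensorFlow DLC image as the base image of your Docker container. To find the SageMaker AI TensorFlow DLC image URIs, see Supported Frameworks.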


# SageMaker AI TensorFlow Deep Learning Container image
FROM 763104351884.dkr.ecr.<aws-region>.amazonaws.com/tensorflow-training:<image-tag>

ENV PATH="/opt/ml/code:${PATH}"

# This environment variable is used by the SageMaker AI container 
# to determine user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

# Add more code lines to customize for your use-case
...

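For more information, see Step 2: Create and upload the Dockerfile and Python training scripts.

Consider the following pitfalls when extending SageMaker AI Framework DLCs:

Do not explicitly uninstall or change the version of TensorFlow packages in SageMaker AI containers. Doing so causes the Amazon optimized TensorFlow packages to be overwritten by open source TensorFlow packages, which might result in performance degradation.
Watch out for packages that have a particular TensorFlow version or flavor as a dependency. These packages might implicitly uninstall the Amazon optimized TensorFlow and install open source TensorFlow packages.

For example, there is a known issue that the tensorflow/models and tensorflow/text libraries always attempt to reinstall open source TensorFlow. If you need to install these libraries to choose a specific version for your use case, we recommend that you look into the SageMaker AI TensorFlow DLC Dockerfiles for v2.9 or later. The paths to the Dockerfiles are typically in the following format: tensorflow/training/docker/<tensorflow-version>/py3/<cuda-version>/Dockerfile.gpu. In the Dockerfiles, you should find the code lines that reinstall, in order, the Amazon managed TensorFlow binary (specified to the TF_URL environment variable) and other dependencies. The reinstallation section should look like the following example: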


# tf-models does not respect existing installations of TensorFlow 
# and always installs open source TensorFlow

RUN pip3 install --no-cache-dir -U \
    tf-models-official==x.y.z

RUN pip3 uninstall -y tensorflow tensorflow-gpu \
  ; pip3 install --no-cache-dir -U \
    ${TF_URL} \
    tensorflow-io==x.y.z \
    tensorflow-datasets==x.y.z

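Build and push to ECR

To build and push your Docker container to Amazon ECR, follow the instructions in the following links:

Run using the SageMaker Python SDK estimator

Use the SageMaker AI TensorFlow framework estimator as usual. You must specify image_uri to use the new container you hosted in Amazon ECR.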


import sagemaker, boto3
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'tf-custom-container-test'
tag = ':latest'

region = boto3.session.Session().region_name

uri_suffix = 'amazonaws.com'

byoc_image_uri = '{}.dkr.ecr.{}.{}/{}'.format(
    account_id, region, uri_suffix, ecr_repository + tag
)

byoc_image_uri
# This should return something like
# 111122223333.dkr.ecr.us-east-2.amazonaws.com/tf-custom-container-test:latest

estimator = TensorFlow(
    image_uri=byoc_image_uri,  # the custom container URI built above
    role=get_execution_role(),
    base_job_name='tf-custom-container-test-job',
    instance_count=1,
    instance_type='ml.p3.8xlarge',
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

# Start training
estimator.fit()

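Enable SageMaker Training Compiler Using the SageMaker AI CreateTrainingJob API Operation

SageMaker Training Compiler configuration options must be specified through the AlgorithmSpecification and HyperParameters fields in the request syntax for the CreateTrainingJob API operation.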


"AlgorithmSpecification": {
    "TrainingImage": "<sagemaker-training-compiler-enabled-dlc-image>"
},

"HyperParameters": {
    "sagemaker_training_compiler_enabled": "true",
    "sagemaker_training_compiler_debug_mode": "false"
}

To find a complete list of deep learning container image URIs that have SageMaker Training Compiler implemented, see Supported Frameworks.
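To show where the two fields sit in a full request, the following sketch assembles a CreateTrainingJob request body as a Python dictionary. The helper name build_training_job_request is hypothetical, and the job name, image URI, role ARN, S3 path, volume size, and runtime limit are placeholder assumptions. The sketch omits input data channels and any settings that tell the container where your training script lives, so a real job likely needs more than this.

```python
def build_training_job_request(
    job_name,
    training_image,
    role_arn,
    output_s3_uri,
    instance_type="ml.p3.2xlarge",
    debug=False,
):
    """Assemble a CreateTrainingJob request with the compiler enabled.

    Hypothetical helper; all argument values are supplied by the caller.
    """
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            # Must be a Training Compiler enabled DLC image URI.
            "TrainingImage": training_image,
            "TrainingInputMode": "File",
        },
        "HyperParameters": {
            # Hyperparameter values are passed as strings.
            "sagemaker_training_compiler_enabled": "true",
            "sagemaker_training_compiler_debug_mode": "true" if debug else "false",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,  # placeholder size
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},  # placeholder limit
    }


request = build_training_job_request(
    job_name="tf-compiler-example-job",  # placeholder
    training_image="<sagemaker-training-compiler-enabled-dlc-image>",
    role_arn="arn:aws:iam::111122223333:role/ExampleSageMakerRole",  # placeholder
    output_s3_uri="s3://amzn-s3-demo-bucket/output",  # placeholder
)

# To submit the request with Boto3 (not executed here):
#   import boto3
#   boto3.client("sagemaker").create_training_job(**request)
print(request["HyperParameters"])
```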
