Step 2: Launch and debug training jobs using the SageMaker Python SDK

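To configure a SageMaker estimator with SageMaker Debugger, use the Amazon SageMaker Python SDK and specify Debugger-specific parameters. To use the full debugging functionality, configure three parameters: debugger_hook_config, tensorboard_output_config, and rules.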

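Important

Before you construct the estimator and call its fit method to launch a training job, make sure that you have adapted your training script by following the instructions in Step 1: Adapt your training script to register a hook.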

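Constructing a SageMaker estimator with Debugger-specific parameters

The code examples in this section show how to construct a SageMaker estimator with the Debugger-specific parameters.

Note

The following code examples are templates for constructing SageMaker framework estimators and are not directly executable. The DebuggerHookConfig(...) and built_in_rule() entries are placeholders. Continue to the next sections to configure the Debugger-specific parameters.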

PyTorch


# An example of constructing a SageMaker PyTorch estimator
import boto3
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

session=boto3.session.Session()
region=session.region_name

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=PyTorch(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12.0",
    py_version="py38",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

TensorFlow


# An example of constructing a SageMaker TensorFlow estimator
import boto3
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, ProfilerRule, Rule, rule_configs

session=boto3.session.Session()
region=session.region_name

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule()),
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=TensorFlow(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

MXNet


# An example of constructing a SageMaker MXNet estimator
import sagemaker
from sagemaker.mxnet import MXNet
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=MXNet(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.7.0",
    py_version="py37",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

XGBoost


# An example of constructing a SageMaker XGBoost estimator
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=XGBoost(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.5-1",

    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

Generic estimator


# An example of constructing a SageMaker generic estimator using the XGBoost algorithm base image
import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

region=boto3.Session().region_name
xgboost_container=sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

estimator=Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=xgboost_container,
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

Configure the following parameters to activate SageMaker Debugger:

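debugger_hook_config (an object of DebuggerHookConfig) – Required to activate the hook that you registered in the training script in Step 1: Adapt your training script to register a hook. It configures the SageMaker training launcher (the estimator) to collect output tensors from your training job and save them to your secured S3 bucket or to a local machine. To learn how to configure the debugger_hook_config parameter, see Configuring SageMaker Debugger to save tensors.
rules (a list of Rule objects) – Configure this parameter to activate the SageMaker Debugger built-in rules that you want to run in real time. The built-in rules are logic that automatically debugs the training progress of your model and finds training issues by analyzing the output tensors saved in your secured S3 bucket. To learn how to configure the rules parameter, see Configuring Debugger built-in rules. For the complete list of built-in rules for debugging output tensors, see Debugger rules. To create your own logic for detecting training issues, see Creating Debugger custom rules for training job analysis.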

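Note
The built-in rules are available only through SageMaker training instances. You cannot use them in local mode.
tensorboard_output_config (an object of TensorBoardOutputConfig) – Configures SageMaker Debugger to collect output tensors in a TensorBoard-compatible format and save them to the S3 output path specified in the TensorBoardOutputConfig object. To learn more, see Visualizing Amazon SageMaker Debugger output tensors in TensorBoard.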

Note
tensorboard_output_config must be configured together with the debugger_hook_config parameter, which in turn requires that you adapt your training script by adding the sagemaker-debugger hook.
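As one possible way to fill in the placeholders in the templates above, the following sketch configures all three parameters. The bucket name, the prefixes, the "losses" collection, the save interval, and the two built-in rules are illustrative choices, not values taken from this guide; replace them with settings that suit your training job.

```python
from sagemaker.debugger import (
    CollectionConfig,
    DebuggerHookConfig,
    Rule,
    TensorBoardOutputConfig,
    rule_configs,
)

# Hypothetical bucket and prefix; replace with a bucket that you own.
bucket_prefix = "s3://amzn-s3-demo-bucket/debugger-demo"

# Save the "losses" collection every 100 steps (assumed values).
debugger_hook_config = DebuggerHookConfig(
    s3_output_path=f"{bucket_prefix}/debug-output",
    collection_configs=[
        CollectionConfig(name="losses", parameters={"save_interval": "100"}),
    ],
)

# Two built-in rules picked for illustration.
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
]

# Only meaningful together with debugger_hook_config.
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=f"{bucket_prefix}/tensorboard-output",
)
```

These objects would then be passed to the estimator constructor as debugger_hook_config=debugger_hook_config, rules=rules, and tensorboard_output_config=tensorboard_output_config.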

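Note

SageMaker Debugger securely saves output tensors in subfolders of your S3 bucket. For example, the default S3 bucket URI in your account has the format s3://sagemaker-<region>-<12digit_account_id>/<base-job-name>/<debugger-subfolders>/. SageMaker Debugger creates two subfolders: debug-output and rule-output. If you add the tensorboard_output_config parameter, you will also find a tensorboard-output folder.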
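To make the URI format above concrete, here is a minimal sketch that assembles the expected subfolder locations from their parts. The helper name default_debugger_output_uris and its arguments are hypothetical and not part of the SageMaker SDK. It only mirrors the pattern described in the note, and the job folder that SageMaker actually creates may carry an additional suffix on the base job name.

```python
def default_debugger_output_uris(region, account_id, base_job_name,
                                 with_tensorboard=False):
    """Build the S3 URIs described in the note above (illustrative helper only)."""
    root = f"s3://sagemaker-{region}-{account_id}/{base_job_name}"
    # Debugger creates these two subfolders by default.
    subfolders = ["debug-output", "rule-output"]
    # This folder appears only when tensorboard_output_config is set.
    if with_tensorboard:
        subfolders.append("tensorboard-output")
    return {name: f"{root}/{name}/" for name in subfolders}


# Example with placeholder region and account ID.
uris = default_debugger_output_uris(
    "us-west-2", "111122223333", "debugger-demo", with_tensorboard=True
)
for name, uri in uris.items():
    print(name, uri)
```

A helper like this can be handy for locating output when you inspect the bucket by hand, but the authoritative paths are the ones reported for the training job itself.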

See the following topics for more examples and detailed instructions on configuring the Debugger-specific parameters.

Topics

