Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).
Step 2: Launch and Debug Training Jobs Using SageMaker Python SDK
To configure a SageMaker estimator with SageMaker Debugger, use the Amazon SageMaker Python SDK and specify Debugger-specific parameters. To fully utilize the debugging functionality, there are three parameters you need to configure: debugger_hook_config, tensorboard_output_config, and rules.
Construct a SageMaker Estimator with Debugger-specific parameters
The code examples in this section show how to construct a SageMaker estimator with the Debugger-specific parameters. The following code examples are templates for constructing the SageMaker framework estimators and are not directly executable. Proceed to the next sections to configure the Debugger-specific parameters.
- PyTorch

# An example of constructing a SageMaker PyTorch estimator
import boto3
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

session = boto3.session.Session()
region = session.region_name

debugger_hook_config = DebuggerHookConfig(...)
rules = [
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator = PyTorch(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12.0",
    py_version="py37",
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
- TensorFlow

# An example of constructing a SageMaker TensorFlow estimator
import boto3
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, ProfilerRule, rule_configs

session = boto3.session.Session()
region = session.region_name

debugger_hook_config = DebuggerHookConfig(...)
rules = [
    Rule.sagemaker(rule_configs.built_in_rule()),
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator = TensorFlow(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
- MXNet

# An example of constructing a SageMaker MXNet estimator
import sagemaker
from sagemaker.mxnet import MXNet
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config = DebuggerHookConfig(...)
rules = [
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator = MXNet(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.7.0",
    py_version="py37",
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
- XGBoost

# An example of constructing a SageMaker XGBoost estimator
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config = DebuggerHookConfig(...)
rules = [
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator = XGBoost(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.5-1",
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
- Generic estimator

# An example of constructing a SageMaker generic estimator using the XGBoost algorithm base image
import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config = DebuggerHookConfig(...)
rules = [
    Rule.sagemaker(rule_configs.built_in_rule())
]

region = boto3.Session().region_name
xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

estimator = Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=xgboost_container,
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
Configure the following parameters to activate SageMaker Debugger:
- debugger_hook_config (an object of DebuggerHookConfig) – Required to activate the hook in the training script that you adapted during Step 1: Adapt Your Training Script to Register a Hook, configure the SageMaker training launcher (estimator) to collect output tensors from your training job, and save the tensors to your secured S3 bucket or local machine. To learn how to configure the debugger_hook_config parameter, see Configure SageMaker Debugger to Save Tensors.
- rules (a list of Rule objects) – Configure this parameter to activate the SageMaker Debugger built-in rules that you want to run in real time. The built-in rules are logic that automatically debugs the training progress of your model and finds training issues by analyzing the output tensors saved in your secured S3 bucket. To learn how to configure the rules parameter, see Configure Debugger Built-in Rules. For a complete list of built-in rules for debugging output tensors, see Debugger Rule. If you want to create your own logic to detect training issues, see Create Debugger Custom Rules for Training Job Analysis. The built-in rules are available only through SageMaker training instances; you cannot use them in local mode.
- tensorboard_output_config (an object of TensorBoardOutputConfig) – Configures SageMaker Debugger to collect output tensors in the TensorBoard-compatible format and save them to the S3 output path specified in the TensorBoardOutputConfig object. To learn more, see Visualize Amazon SageMaker Debugger Output Tensors in TensorBoard. The tensorboard_output_config parameter must be configured together with the debugger_hook_config parameter, which also requires you to adapt your training script by adding the sagemaker-debugger hook.
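As a sketch of how the three parameters above fit together, the following example configures the hook to save the built-in gradients collection, attaches two of the real built-in rules, and sets a TensorBoard output path. The bucket names and the save interval are illustrative assumptions, not SageMaker defaults; substitute your own values before passing these objects to an estimator as shown in the templates above.

```python
# A sketch of the three Debugger-specific parameters.
# The S3 bucket names and save_interval below are made-up example values.
from sagemaker.debugger import (
    CollectionConfig,
    DebuggerHookConfig,
    Rule,
    TensorBoardOutputConfig,
    rule_configs,
)

# Save the built-in "gradients" tensor collection every 100 steps
debugger_hook_config = DebuggerHookConfig(
    s3_output_path="s3://your-bucket/debugger-output",  # hypothetical bucket
    collection_configs=[
        CollectionConfig(name="gradients", parameters={"save_interval": "100"})
    ],
)

# Two built-in rules that analyze the saved tensors in real time
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
]

# Save output tensors in the TensorBoard-compatible format as well
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path="s3://your-bucket/tensorboard-output"  # hypothetical bucket
)
```

These three objects map directly onto the debugger_hook_config, rules, and tensorboard_output_config keyword arguments of the estimator constructors shown earlier.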
SageMaker Debugger securely saves output tensors in subfolders of your S3 bucket. For example, the default S3 bucket URI in your account has the format s3://sagemaker-<region>-<12digit_account_id>/<base-job-name>/<debugger-subfolders>/. SageMaker Debugger creates two subfolders: debug-output and rule-output. If you add the tensorboard_output_config parameter, you'll also find a tensorboard-output folder.
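To illustrate the layout described above, the following standalone sketch composes the default prefixes from their parts. The Region, account ID, and job name are made-up example values; actual job names typically carry a generated suffix appended to the base job name.

```python
# Compose the default Debugger S3 prefixes for a training job.
# Region, account ID, and job name are made-up example values.
region = "us-west-2"
account_id = "111122223333"
job_name = "debugger-demo"

bucket = f"s3://sagemaker-{region}-{account_id}"
debug_output = f"{bucket}/{job_name}/debug-output/"
rule_output = f"{bucket}/{job_name}/rule-output/"
tensorboard_output = f"{bucket}/{job_name}/tensorboard-output/"

print(debug_output)
# s3://sagemaker-us-west-2-111122223333/debugger-demo/debug-output/
```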
See the following topics for detailed examples of how to configure the Debugger-specific parameters.