
Step 2: Launch and Debug Training Jobs Using SageMaker Python SDK

To configure a SageMaker estimator with SageMaker Debugger, use the Amazon SageMaker Python SDK and specify the Debugger-specific parameters. To use the full debugging functionality, you need to configure three parameters: debugger_hook_config, tensorboard_output_config, and rules.

Important

Before you construct an estimator and call its fit method to launch a training job, make sure that you adapt your training script by following the instructions at Step 1: Adapt Your Training Script to Register a Hook.
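
As a quick recap of Step 1, the following is a minimal sketch of registering the sagemaker-debugger (smdebug) hook inside a PyTorch training script. It assumes the smdebug library is available in the training container; the model, criterion, and training-loop names are illustrative placeholders.

# Minimal sketch of hook registration in a PyTorch training script (see Step 1)
import smdebug.pytorch as smd

def train(model, criterion, optimizer, dataloader):
    # Create the hook from the JSON configuration that SageMaker injects
    # when debugger_hook_config is set on the estimator
    hook = smd.Hook.create_from_json_file()
    hook.register_module(model)    # collect module tensors
    hook.register_loss(criterion)  # collect loss values
    # ... training loop ...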

Construct a SageMaker Estimator with Debugger-specific parameters

The code examples in this section show how to construct a SageMaker estimator with the Debugger-specific parameters.

Note

The following code examples are templates for constructing the SageMaker framework estimators and are not directly executable. You need to proceed to the next sections and configure the Debugger-specific parameters.

PyTorch
# An example of constructing a SageMaker PyTorch estimator
import boto3
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

session=boto3.session.Session()
region=session.region_name

debugger_hook_config=DebuggerHookConfig(...)

rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=PyTorch(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12.0",
    py_version="py37",

    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
TensorFlow
# An example of constructing a SageMaker TensorFlow estimator
import boto3
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, ProfilerRule, rule_configs

session=boto3.session.Session()
region=session.region_name

debugger_hook_config=DebuggerHookConfig(...)

rules=[
    Rule.sagemaker(rule_configs.built_in_rule()),
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=TensorFlow(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",

    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
MXNet
# An example of constructing a SageMaker MXNet estimator
import sagemaker
from sagemaker.mxnet import MXNet
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)

rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=MXNet(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.7.0",
    py_version="py37",

    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
XGBoost
# An example of constructing a SageMaker XGBoost estimator
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)

rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=XGBoost(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.5-1",

    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
Generic estimator
# An example of constructing a SageMaker generic estimator using the XGBoost algorithm base image
import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)

rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

region=boto3.Session().region_name
xgboost_container=sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

estimator=Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=xgboost_container,
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.m5.2xlarge",

    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

Configure the following parameters to activate SageMaker Debugger:

  • debugger_hook_config (an object of DebuggerHookConfig) – Required to activate the hook that you registered in your training script in Step 1: Adapt Your Training Script to Register a Hook. This parameter configures the SageMaker training launcher (estimator) to collect output tensors from your training job and save them to your secured S3 bucket or local machine. To learn how to configure the debugger_hook_config parameter, see Configure SageMaker Debugger to Save Tensors. A combined configuration sketch follows this list.

  • rules (a list of Rule objects) – Configure this parameter to activate the SageMaker Debugger built-in rules that you want to run in real time. The built-in rules provide logic that automatically debugs the training progress of your model and finds training issues by analyzing the output tensors saved in your secured S3 bucket. To learn how to configure the rules parameter, see Configure Debugger Built-in Rules. For a complete list of built-in rules for debugging output tensors, see Debugger Rule. If you want to create your own logic to detect training issues, see Create Debugger Custom Rules for Training Job Analysis.

    Note

    The built-in rules are available only through SageMaker training instances. You cannot use them in local mode.

  • tensorboard_output_config (an object of TensorBoardOutputConfig) – Configure SageMaker Debugger to collect output tensors in the TensorBoard-compatible format and save them to the S3 output path specified in the TensorBoardOutputConfig object. To learn more, see Visualize Amazon SageMaker Debugger Output Tensors in TensorBoard.

    Note

    The tensorboard_output_config parameter must be configured together with the debugger_hook_config parameter, which also requires you to adapt your training script by registering the sagemaker-debugger hook.
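
The following is a minimal sketch of one way to fill in these three parameters. The S3 bucket path, the losses collection with a save interval of 50 steps, and the choice of built-in rules (loss_not_decreasing and vanishing_gradient) are illustrative assumptions; replace them with values that fit your training job.

from sagemaker.debugger import (
    CollectionConfig,
    DebuggerHookConfig,
    TensorBoardOutputConfig,
    Rule,
    rule_configs
)

# Assumption: an S3 bucket and prefix that you own
s3_bucket="s3://your-bucket/debugger-demo"

# Save the built-in "losses" collection every 50 steps (illustrative values)
debugger_hook_config=DebuggerHookConfig(
    s3_output_path=s3_bucket + "/debug-output",
    collection_configs=[
        CollectionConfig(name="losses", parameters={"save_interval": "50"})
    ]
)

# Two example built-in rules; see Debugger Rule for the complete list
rules=[
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.vanishing_gradient())
]

# Save TensorBoard-compatible output next to the debug output
tensorboard_output_config=TensorBoardOutputConfig(
    s3_output_path=s3_bucket + "/tensorboard-output"
)

You can then pass debugger_hook_config, rules, and tensorboard_output_config to any of the estimator templates shown earlier in this section.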

Note

SageMaker Debugger securely saves output tensors in subfolders of your S3 bucket. For example, the format of the default S3 bucket URI in your account is s3://sagemaker-<region>-<12digit_account_id>/<base-job-name>/<debugger-subfolders>/. SageMaker Debugger creates two subfolders: debug-output and rule-output. If you add the tensorboard_output_config parameter, you'll also find a tensorboard-output folder.
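
After you call the estimator fit method, you can check where these artifacts are stored and how the built-in rules are progressing. The following sketch assumes an estimator constructed as shown earlier in this section and uses the artifact-path and rule-summary helpers of the SageMaker Python SDK estimator.

# Query the S3 locations of the Debugger artifacts for the latest training job
print(estimator.latest_job_debugger_artifacts_path())       # .../debug-output
print(estimator.latest_job_tensorboard_artifacts_path())    # .../tensorboard-output (if configured)

# Check the evaluation status of the built-in rules attached to the job
for summary in estimator.latest_training_job.rule_job_summary():
    print(summary["RuleConfigurationName"], summary["RuleEvaluationStatus"])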

See the following topics for detailed examples of how to configure the Debugger-specific parameters.