Best Practices for Amazon SageMaker Debugger - Amazon SageMaker
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Best Practices for Amazon SageMaker Debugger

Use the following guidelines when you run training jobs with Debugger.

Choose a Machine Learning Framework

You can choose a machine learning framework and use SageMaker pre-built training containers or your own containers. Use Debugger to detect training and performance issues, and analyze training progress of your training job in SageMaker. SageMaker provides you options to use pre-built containers that are prepared for a number of machine learning framework environments to train your model on Amazon EC2. Any training job can be adapted to run in Amazon Deep Learning Containers, SageMaker training containers, and custom containers.

Use Studio Debugger Insights Dashboard

With Studio Debugger insights dashboard, you are in control of your training jobs. Use the Studio Debugger dashboards to keep your model performance on Amazon EC2 instances in control and optimized. For any SageMaker training jobs running on Amazon EC2 instance, Debugger monitors resource utilization and basic model output data (loss and accuracy values). Through the Studio Debugger dashboards, gain insights into your training jobs and improve your model training performance. To learn more, see Amazon SageMaker Debugger UI in Amazon SageMaker Studio Classic Experiments.

Download Debugger Reports and Gain More Insights

You can view aggregated results and gain insights in Debugger reports. Debugger aggregates training and profiling results collected from the built-in rule analysis into a report per training job. You can find more detailed information about your training results through the Debugger reports. To learn more, see SageMaker Debugger interactive report.

Capture Data from Your Training Job and Save Data to Amazon S3

You can use a Debugger hook to save output tensors. After you choose a container and a framework that fit your training script, use a Debugger hook to configure which tensors to save and to which directory to save them, such as a Amazon S3 bucket. A Debugger hook helps you to build the configuration and to keep it in your account to use in subsequent analyses, where it is secured for use with the most privacy-sensitive applications. To learn more, see Configure SageMaker Debugger to Save Tensors.

Analyze the Data with a Fleet of Debugger Built-in Rules

You can use Debugger built-in rules to inspect tensors in parallel with a training job. To analyze the training performance data, Debugger provides built-in rules that watch for abnormal training process behaviors. For example, a Debugger rule detects issues when the training process suffers from system bottleneck issues or training issues, such as vanishing gradients, exploding tensors, overfitting, or overtraining. If necessary, you can also build customized rules by creating a rule definition with your own criteria to define a training issue. To learn more about the Debugger rules, see Configure Debugger Built-in Rules for detailed instructions of using the Amazon SageMaker Python SDK. For a full list of the Debugger built-in rules, see List of Debugger Built-in Rules. If you want to create a custom rule, see Create Debugger Custom Rules for Training Job Analysis.

Take Actions Based on the Built-in Rule Status

You can use Debugger with Amazon CloudWatch Events and Amazon Lambda. You can automate actions based on the rule status, such as stopping training jobs early and setting up notifications through email or text. When the Debugger rules detect problems and triggers an "IssuesFound" evaluation status, CloudWatch Events detects the rule status changes and invokes the Lambda function to take actions. To configure automated actions to your training issues, see Create Actions on Rules Using Amazon CloudWatch and Amazon Lambda.

Dive Deep into the Data Using the SMDebug Client Library

You can use the SMDebug tools to access and analyze training data collected by Debugger. The TrainingJob and create_trial classes load the metrics and tensors saved by Debugger. These classes provide extended class methods to analyze the data in real time or after the training has finished. The SMDebug library also provides visualization tools: merge timelines of framework metrics to aggregate different profiling, line charts and heatmap to track the system utilization, and histograms to find step duration outliers. To learn more about the SMDebug library tools, see Analyze data using the Debugger Python client library.

Monitor and Analyze Training Job Metrics

Amazon CloudWatch supports high-resolution custom metrics, and its finest resolution is 1 second. However, the finer the resolution, the shorter the lifespan of the CloudWatch metrics. For the 1-second frequency resolution, the CloudWatch metrics are available for 3 hours. For more information about the resolution and the lifespan of the CloudWatch metrics, see GetMetricStatistics in the Amazon CloudWatch API Reference.

If you want to profile your training job with a finer resolution down to 100-millisecond (0.1 second) granularity and store the training metrics indefinitely in Amazon S3 for custom analysis at any time, consider using Amazon SageMaker Debugger. SageMaker Debugger provides built-in rules to automatically detect common training issues; it detects hardware resource utilization issues (such as CPU, GPU, and I/O bottlenecks) and non-converging model issues (such as overfit, vanishing gradients, and exploding tensors).

SageMaker Debugger also provides visualizations through Studio Classic and its profiling report. Unlike CloudWatch metrics, which accumulates resource utilization rates of CPU and GPU cores and averages those out across multiple instances, Debugger tracks the utilization rate of each core. This enables you to identify unbalanced usage of hardware resources as you scale up to larger compute clusters. To explore the Debugger visualizations, see SageMaker Debugger Insights Dashboard Walkthrough, Debugger Profiling Report Walkthrough, and Analyze Data Using the SMDebug Client Library.

Monitoring System Utilization and Detect Bottlenecks

With Amazon SageMaker Debugger monitoring, you can measure hardware system resource utilization of Amazon EC2 instances. Monitoring is available for any SageMaker training job constructed with the SageMaker framework estimators (TensorFlow, PyTorch, and MXNet) and the generic SageMaker estimator (SageMaker built-in algorithms and your own custom containers). Debugger built-in rules for monitoring detect system bottleneck issues and notify you when they detect the bottleneck issues.

To learn how to enable Debugger system monitoring, see Configure an estimator with parameters for basic profiling using the Amazon SageMaker Debugger Python modules and then Configure settings for basic profiling of system resource utilization.

For a full list of available built-in rules for monitoring, see Debugger built-in rules for profiling hardware system resource utilization (system metrics).

Profiling Framework Operations

With Amazon SageMaker Debugger profiling you can profile deep learning frameworks operations. You can profile your model training with the SageMaker TensorFlow training containers, the SageMaker PyTorch framework containers, and your own training containers. Using the profiling feature of Debugger, you can drill down into the Python operators and functions that are executed to perform the training job. Debugger supports detailed profiling, Python profiling, data loader profiling, and Horovod distributed training profiling. You can merge the profiled timelines to correlate with the system bottlenecks. Debugger built-in rules for profiling watch framework operation related issues, including excessive training initialization time due to data downloading before training starts and step duration outliers in training loops.

To learn how to configure Debugger for framework profiling, see Configure an estimator with parameters for basic profiling using the Amazon SageMaker Debugger Python modules and then Configure for framework profiling.

For a complete list of available built-in rules for profiling, see Debugger built-in rules for profiling framework metrics.

Debugging Model Output Tensors

Debugging is available for deep learning frameworks using Amazon Deep Learning Containers and the SageMaker training containers. For fully supported framework versions (see the versions at Supported Frameworks and Algorithms), Debugger automatically registers hooks to collect output tensors, and you can directly run your training script. For the versions with one asterisk sign, you need to manually register the hooks to collect tensors. Debugger provides preconfigured tensor collections with generalized names that you can utilize across the different frameworks. If you want to customize output tensor configuration, you can also use the CollectionConfig and DebuggerHookConfig API operations and the Amazon SageMaker Python SDK to configure your own tensor collections. Debugger built-in rules for debugging analyze the output tensors and identifies model optimization problems that blocks your model from minimizing the loss function. For example, the rules identify overfitting, overtraining, loss not decreasing, exploding tensors, and vanishing gradients.

To learn how to configure Debugger for debugging output tensors, see Step 2: Launch and Debug Training Jobs Using SageMaker Python SDK and then Configure SageMaker Debugger to Save Tensors.

For a full list of available built-in rules for debugging, see Debugger built-in rules for debugging model training data (output tensors).