Best Practices for Amazon SageMaker Debugger
Use the following guidelines when you run training jobs with Debugger.
Topics
- Choose a Machine Learning Framework
- Use Studio Debugger Insights Dashboard
- Download Debugger Reports and Gain More Insights
- Capture Data from Your Training Job and Save Data to Amazon S3
- Analyze the Data with a Fleet of Debugger Built-in Rules
- Take Actions Based on the Built-in Rule Status
- Dive Deep into the Data Using the SMDebug Client Library
- Monitor and Analyze Training Job Metrics
- Monitor System Utilization and Detect Bottlenecks
- Profile Framework Operations
- Debug Model Output Tensors
Choose a Machine Learning Framework
You can choose a machine learning framework and use SageMaker pre-built training containers or your own containers. Use Debugger to detect training and performance issues, and to analyze the training progress of your training jobs in SageMaker. SageMaker provides options to use pre-built containers that are prepared for a number of machine learning framework environments to train your model on Amazon EC2. Any training job can be adapted to run in Amazon Deep Learning Containers, SageMaker training containers, and custom containers.
Use Studio Debugger Insights Dashboard
The Studio Debugger insights dashboard puts you in control of your training jobs. Use the Studio Debugger dashboards to keep your model performance on Amazon EC2 instances under control and optimized. For any SageMaker training job running on an Amazon EC2 instance, Debugger monitors resource utilization and basic model output data (loss and accuracy values). Through the Studio Debugger dashboards, you can gain insights into your training jobs and improve your model training performance. To learn more, see Amazon SageMaker Debugger UI in Amazon SageMaker Studio Experiments.
Download Debugger Reports and Gain More Insights
You can view aggregated results and gain insights in Debugger reports. Debugger aggregates the training and profiling results collected by the built-in rule analysis into a report for each training job. The Debugger reports provide more detailed information about your training results. To learn more, see SageMaker Debugger Interactive Report.
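As a minimal sketch, the following shows one way to locate the rule report files for a completed training job. The job name is a placeholder, and the path layout assumes the default rule-output prefix under the estimator's output path.

```python
# A minimal sketch of locating Debugger rule reports in Amazon S3. The job
# name is hypothetical; the layout assumes the default rule-output prefix.
from sagemaker.estimator import Estimator

estimator = Estimator.attach("your-training-job-name")  # hypothetical name
rule_output_path = (
    estimator.output_path
    + "/"
    + estimator.latest_training_job.job_name
    + "/rule-output"
)
print(rule_output_path)
# Download the report files, for example with the AWS CLI:
#   aws s3 cp <rule_output_path> ./rule-output --recursive
```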
Capture Data from Your Training Job and Save Data to Amazon S3
You can use a Debugger hook to save output tensors. After you choose a container and a framework that fit your training script, use a Debugger hook to configure which tensors to save and which directory to save them to, such as an Amazon S3 bucket. A Debugger hook helps you build the configuration and keep it in your account for use in subsequent analyses, where it is secured for use with even the most privacy-sensitive applications. To learn more, see Configure SageMaker Debugger to Save Tensors.
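The following is a minimal sketch of configuring a Debugger hook through the SageMaker Python SDK, assuming a PyTorch training job. The bucket name, IAM role, and training script are placeholder assumptions.

```python
# A minimal sketch of saving output tensors to Amazon S3 with a Debugger
# hook; bucket, role, and script names are placeholders.
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

hook_config = DebuggerHookConfig(
    s3_output_path="s3://your-bucket/debugger-output",  # hypothetical bucket
    collection_configs=[
        CollectionConfig(name="losses", parameters={"save_interval": "50"})
    ],
)

estimator = PyTorch(
    entry_point="train.py",             # your training script
    role="YourSageMakerExecutionRole",  # hypothetical IAM role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12",
    py_version="py38",
    debugger_hook_config=hook_config,
)
estimator.fit()
```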
Analyze the Data with a Fleet of Debugger Built-in Rules
You can use Debugger built-in rules to inspect tensors in parallel with a training job. To analyze the training performance data, Debugger provides built-in rules that watch for abnormal training process behaviors. For example, a Debugger rule detects issues when the training process suffers from system bottlenecks or from training issues such as vanishing gradients, exploding tensors, overfitting, or overtraining. If necessary, you can also build customized rules by creating a rule definition with your own criteria for defining a training issue. To learn more about the Debugger rules, see Configure Debugger Built-in Rules for detailed instructions on using the Amazon SageMaker Python SDK.
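As a minimal sketch, the following attaches two of the built-in rules to an estimator; the estimator arguments follow the placeholder pattern from the previous example.

```python
# A minimal sketch of running built-in rules in parallel with a training
# job; script, role, and instance settings are placeholders.
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import Rule, rule_configs

rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
]

estimator = PyTorch(
    entry_point="train.py",             # your training script
    role="YourSageMakerExecutionRole",  # hypothetical IAM role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12",
    py_version="py38",
    rules=rules,
)
estimator.fit(wait=False)
```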
Take Actions Based on the Built-in Rule Status
You can use Debugger with Amazon CloudWatch Events and Amazon Lambda. You can automate actions based on the rule status, such as stopping training jobs early and setting up notifications through email or text. When the Debugger rules detect problems and trigger an "IssuesFound" evaluation status, CloudWatch Events detects the rule status change and invokes the Lambda function to take action. To configure automated actions for your training issues, see Create Actions on Rules Using Amazon CloudWatch and Amazon Lambda.
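The following is a minimal sketch of a Lambda handler for this pattern. It assumes a CloudWatch Events rule that forwards SageMaker training job state-change events, whose event detail includes the Debugger rule evaluation statuses.

```python
# A minimal sketch of a Lambda function that stops a training job when a
# Debugger rule reports "IssuesFound"; assumes a CloudWatch Events rule for
# SageMaker training job state changes.
import boto3

def lambda_handler(event, context):
    detail = event["detail"]
    job_name = detail["TrainingJobName"]
    for rule_status in detail.get("DebugRuleEvaluationStatuses", []):
        if rule_status["RuleEvaluationStatus"] == "IssuesFound":
            # Stop the job early; you could also publish a notification here.
            boto3.client("sagemaker").stop_training_job(
                TrainingJobName=job_name
            )
            break
```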
Dive Deep into the Data Using the SMDebug Client Library
You can use the SMDebug tools to access and analyze training data collected by Debugger. The TrainingJob class and the create_trial function load the metrics and tensors saved by Debugger, and they provide extended methods to analyze the data in real time or after the training has finished. The SMDebug library also provides visualization tools: merged timelines of framework metrics that aggregate the different profiling results, line charts and heatmaps that track system utilization, and histograms that find step duration outliers. To learn more about the SMDebug library tools, see Analyze Data Using the SMDebug Client Library.
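As a minimal sketch, the following loads a trial with create_trial and reads back a saved loss tensor. The S3 path is a placeholder for your hook's output path, and tensor names vary by framework and collection.

```python
# A minimal sketch of loading saved tensors with the SMDebug client library;
# the S3 path is a placeholder for your hook's s3_output_path.
from smdebug.trials import create_trial

trial = create_trial("s3://your-bucket/debugger-output")  # hypothetical path
print(trial.tensor_names())  # names of all tensors saved by the hook

# Tensor names depend on the framework and the collections you saved.
loss = trial.tensor("CrossEntropyLoss_output_0")  # hypothetical name
for step in trial.steps():
    print(step, loss.value(step))
```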
Monitor and Analyze Training Job Metrics
Amazon CloudWatch supports high-resolution custom metrics, and its finest resolution is 1 second. However, the finer the resolution, the shorter the lifespan of the CloudWatch metrics. For the 1-second frequency resolution, the CloudWatch metrics are available for 3 hours. For more information about the resolution and the lifespan of the CloudWatch metrics, see GetMetricStatistics.
If you want to profile your training job with a finer resolution down to 100-millisecond (0.1 second) granularity and store the training metrics indefinitely in Amazon S3 for custom analysis at any time, consider using Amazon SageMaker Debugger. SageMaker Debugger provides built-in rules to automatically detect common training issues; it detects hardware resource utilization issues (such as CPU, GPU, and I/O bottlenecks) and non-converging model issues (such as overfitting, vanishing gradients, and exploding tensors).
SageMaker Debugger also provides visualizations through Studio and its profiling report. Unlike CloudWatch metrics, which accumulate the resource utilization rates of CPU and GPU cores and average them out across multiple instances, Debugger tracks the utilization rate of each core. This enables you to identify unbalanced usage of hardware resources as you scale up to larger compute clusters. To explore the Debugger visualizations, see SageMaker Debugger Insights Dashboard Walkthrough, Debugger Profiling Report Walkthrough, and Analyze Data Using the SMDebug Client Library.
Monitor System Utilization and Detect Bottlenecks
With Amazon SageMaker Debugger monitoring, you can measure the hardware system resource utilization of Amazon EC2 instances. Monitoring is available for any SageMaker training job constructed with the SageMaker framework estimators (TensorFlow, PyTorch, and MXNet) or the generic SageMaker estimator (SageMaker built-in algorithms and your own custom containers). Debugger built-in rules for monitoring detect system bottlenecks and notify you when they find them.
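As a minimal sketch, system monitoring is tuned through the profiler configuration; the 100 ms sampling interval below matches the finest granularity discussed earlier.

```python
# A minimal sketch of enabling Debugger system monitoring with a 100 ms
# sampling interval; pass the config to any SageMaker estimator.
from sagemaker.debugger import ProfilerConfig

profiler_config = ProfilerConfig(system_monitor_interval_millis=100)
# estimator = PyTorch(..., profiler_config=profiler_config)
```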
To learn how to enable Debugger system monitoring, see Configure Debugger Using Amazon SageMaker Python SDK and then Configure Debugger for Monitoring Resource Utilization.
For a full list of available built-in rules for monitoring, see Debugger built-in rules for profiling hardware system resource utilization (system metrics).
Profile Framework Operations
With Amazon SageMaker Debugger profiling, you can profile deep learning framework operations. You can profile your model training with the SageMaker TensorFlow training containers, the SageMaker PyTorch framework containers, and your own training containers. Using the profiling feature of Debugger, you can drill down into the Python operators and functions that are executed to perform the training job. Debugger supports detailed profiling, Python profiling, data loader profiling, and Horovod distributed training profiling. You can merge the profiled timelines to correlate them with system bottlenecks. Debugger built-in rules for profiling watch for issues related to framework operations, including excessive training initialization time caused by data downloading before training starts, and step duration outliers in training loops.
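As a minimal sketch, framework profiling is enabled by adding framework profile parameters to the profiler configuration; the step range below is an arbitrary example.

```python
# A minimal sketch of enabling detailed framework profiling for a short
# range of training steps; the step range is an arbitrary example.
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(start_step=5, num_steps=10),
)
# estimator = PyTorch(..., profiler_config=profiler_config)
```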
To learn how to configure Debugger for framework profiling, see Configure Debugger Using Amazon SageMaker Python SDK and then Configure Debugger for Framework Profiling.
For a complete list of available built-in rules for profiling, see Debugger built-in rules for profiling framework metrics.
Debug Model Output Tensors
Debugging is available for deep learning frameworks using Amazon Deep Learning Containers and the SageMaker training containers. For fully supported framework versions (see the versions at Supported Frameworks and Algorithms), Debugger automatically registers hooks to collect output tensors, and you can run your training script without changing it. For the versions marked with an asterisk, you need to manually register the hooks to collect tensors. Debugger provides preconfigured tensor collections with generalized names that you can use across the different frameworks. If you want to customize the output tensor configuration, you can also use the CollectionConfig and DebuggerHookConfig API operations with the Amazon SageMaker Python SDK.
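As a minimal sketch, a custom collection can select tensors by regular expression; the collection name, pattern, and bucket below are placeholders.

```python
# A minimal sketch of a customized output tensor configuration using a
# regular expression; the collection name, pattern, and bucket are
# placeholders.
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

hook_config = DebuggerHookConfig(
    s3_output_path="s3://your-bucket/debugger-output",  # hypothetical bucket
    collection_configs=[
        CollectionConfig(
            name="custom_activations",  # hypothetical collection name
            parameters={
                "include_regex": ".*relu_output",  # hypothetical pattern
                "save_interval": "100",
            },
        )
    ],
)
```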
To learn how to configure Debugger for debugging output tensors, see Step 2: Launch and Debug Training Jobs Using SageMaker Python SDK and then Configure SageMaker Debugger to Save Tensors.
For a full list of available built-in rules for debugging, see Debugger built-in rules for debugging model training data (output tensors).