Profile Training Jobs Using Amazon SageMaker Debugger

To profile compute resource utilization and framework operations of your training job, use profiling tools offered by Amazon SageMaker Debugger.

For any training job you run in SageMaker using the SageMaker Python SDK, Debugger automatically profiles basic resource utilization metrics, such as CPU utilization, GPU utilization, GPU memory utilization, network, and I/O wait time. By default, it collects these metrics every 500 milliseconds. To see graphs of the resource utilization metrics for your training job, use the SageMaker Debugger UI in SageMaker Studio Experiments.

Deep learning operations and steps can complete within milliseconds. Amazon CloudWatch collects metrics at intervals of 1 second, whereas Debugger can collect resource utilization metrics at intervals as short as 100 milliseconds (0.1 second), so you can examine the metrics at the level of an individual operation or step.

To change the metric collection interval, add profiling parameters to your training job launcher. If you're using the SageMaker Python SDK, pass the profiler_config parameter when you create an estimator. To learn how to adjust the resource utilization metric collection interval, see Construct a SageMaker Estimator with SageMaker Debugger and then Configure Debugger for Monitoring Resource Utilization.
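
The following is a minimal sketch of how the collection interval might be set through the SageMaker Python SDK, not a definitive implementation. It assumes the SDK's ProfilerConfig class in sagemaker.debugger with its system_monitor_interval_millis argument, and it assumes the accepted intervals are 100, 200, 500, 1000, 5000, and 60000 milliseconds; verify both against your installed SDK version. The helper names profiler_kwargs and make_profiler_config are hypothetical and exist only for illustration.

```python
# Intervals (in milliseconds) assumed to be accepted for system monitoring.
# Verify this list against the ProfilerConfig reference for your SDK version.
ALLOWED_INTERVALS_MS = (100, 200, 500, 1000, 5000, 60000)


def profiler_kwargs(interval_ms=500, s3_output_path=None):
    """Validate the interval and return keyword arguments for ProfilerConfig."""
    if interval_ms not in ALLOWED_INTERVALS_MS:
        raise ValueError(
            f"interval_ms must be one of {ALLOWED_INTERVALS_MS}, got {interval_ms}"
        )
    kwargs = {"system_monitor_interval_millis": interval_ms}
    if s3_output_path is not None:
        kwargs["s3_output_path"] = s3_output_path
    return kwargs


def make_profiler_config(interval_ms=500, s3_output_path=None):
    """Build a ProfilerConfig object (hypothetical helper).

    The SDK import is done lazily so that the validation helper above
    can be used even where the SageMaker Python SDK is not installed.
    """
    from sagemaker.debugger import ProfilerConfig

    return ProfilerConfig(**profiler_kwargs(interval_ms, s3_output_path))


# Usage sketch (commented out: it needs AWS credentials and a training script).
# All values below are placeholders, not recommendations.
#
# from sagemaker.pytorch import PyTorch
# estimator = PyTorch(
#     entry_point="train.py",              # hypothetical training script
#     role="<your-iam-role-arn>",          # placeholder
#     instance_count=1,
#     instance_type="ml.p3.2xlarge",
#     framework_version="1.13",
#     py_version="py39",
#     profiler_config=make_profiler_config(interval_ms=100),
# )
# estimator.fit()
```

Validating the interval before constructing the estimator is a design choice made here so that an unsupported value fails locally rather than when the training job is submitted.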

You can also add built-in profiling rules, which are analysis tools provided by SageMaker Debugger. The built-in profiling rules analyze the resource utilization metrics and detect computational performance issues. For more information, see Configure Built-in Profiling Rules Managed by Amazon SageMaker Debugger. You can view rule analysis results in the SageMaker Debugger UI in SageMaker Studio Experiments or in the SageMaker Debugger Profiling Report. You can also create custom profiling rules using the SageMaker Python SDK.
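
As a rough illustration of attaching built-in profiling rules, the sketch below assumes the SageMaker Python SDK's ProfilerRule class and rule_configs module in sagemaker.debugger, and assumes that the rule names listed in KNOWN_PROFILER_RULES are available in your SDK version. The helper names check_rule_names and build_profiler_rules are hypothetical.

```python
# A subset of built-in profiling rule names assumed to exist in
# sagemaker.debugger.rule_configs; check them against your SDK version.
KNOWN_PROFILER_RULES = {
    "ProfilerReport",
    "LowGPUUtilization",
    "CPUBottleneck",
    "IOBottleneck",
    "StepOutlier",
}


def check_rule_names(names):
    """Reject unknown rule names and drop duplicates, preserving order."""
    unknown = [name for name in names if name not in KNOWN_PROFILER_RULES]
    if unknown:
        raise ValueError(f"Unknown profiling rule(s): {unknown}")
    return list(dict.fromkeys(names))


def build_profiler_rules(names=("ProfilerReport",)):
    """Return a list of ProfilerRule objects (hypothetical helper).

    The SDK import is lazy so that name checking works without the SDK.
    """
    from sagemaker.debugger import ProfilerRule, rule_configs

    return [
        ProfilerRule.sagemaker(getattr(rule_configs, name)())
        for name in check_rule_names(names)
    ]


# Usage sketch (commented out: it needs the SDK and AWS credentials).
# Pass the list through the estimator's rules parameter, for example:
#
# rules = build_profiler_rules(["ProfilerReport", "LowGPUUtilization"])
# estimator = PyTorch(..., rules=rules)   # other arguments omitted here
```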

Use the following topics to learn more about the profiling features provided by SageMaker Debugger.