Configure for framework profiling
Warning
SageMaker Debugger deprecates the framework profiling feature starting from TensorFlow 2.11 and PyTorch 2.0, in favor of Amazon SageMaker Profiler. You can still use the feature in the following earlier versions of the frameworks and SDKs:
- SageMaker Python SDK <= v2.130.0
- PyTorch >= v1.6.0, < v2.0
- TensorFlow >= v2.3.1, < v2.11
See also March 16, 2023.
To enable Debugger framework profiling, configure the framework_profile_params parameter when you construct an estimator. Debugger framework profiling collects framework metrics, such as data from the initialization stage, data loader processes, and Python operators of deep learning frameworks and training scripts, with detailed profiling within and between steps through the cProfile or Pyinstrument option. Using the FrameworkProfile class, you can configure custom framework profiling options.
Note
Before getting started with Debugger framework profiling, verify that the framework used to build your model is supported by Debugger for framework profiling. For more information, see Supported Frameworks and Algorithms.
Debugger saves the framework metrics in a default S3 bucket. The format of the default S3 bucket URI is s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/profiler-output/.
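Once constructed, the profiler_config object is passed to an estimator. The following is a minimal sketch, assuming a PyTorch training script named train.py; the IAM role ARN and instance settings are placeholders for your own values, and the profiler_config parameter works the same way for the TensorFlow estimator.

```python
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
from sagemaker.pytorch import PyTorch

profiler_config = ProfilerConfig(
    framework_profile_params=FrameworkProfile()
)

# "train.py", the role ARN, and the instance settings are placeholders.
estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # hypothetical role
    framework_version="1.13",  # must be < 2.0 for framework profiling (see the warning above)
    py_version="py39",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    profiler_config=profiler_config,
)
estimator.fit()
```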
Start a training job with the default framework profiling
The following example code shows the simplest profiler_config parameter setting for starting the default system monitoring and the default framework profiling.
The FrameworkProfile class in the following example code initiates the default framework profiling when a training job starts. Debugger framework profiling includes the following options: detailed profiling, data loader profiling, and Python profiling.
```python
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
    framework_profile_params=FrameworkProfile()
)
```
With this profiler_config parameter configuration, Debugger uses the default settings for monitoring and profiling. Debugger monitors system metrics every 500 milliseconds; profiles the fifth step with the detailed profiling option, the seventh step with the data loader profiling option, and the ninth, tenth, and eleventh steps with the Python profiling option.
To find the available profiling configuration options, the default parameter settings, and examples of how to configure them, see Start a training job with the default system monitoring and customized framework profiling with different profiling options and SageMaker Debugger APIs – FrameworkProfile.
If you want to change the system monitoring interval and enable the default framework profiling, specify the system_monitor_interval_millis parameter explicitly along with the framework_profile_params parameter. For example, to monitor every 1000 milliseconds and enable the default framework profiling, use the following example code.
```python
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=1000,
    framework_profile_params=FrameworkProfile()
)
```
For more information about the FrameworkProfile class, see SageMaker Debugger APIs – FrameworkProfile.
Start a training job with the default system monitoring and customized framework profiling for target steps or a target time range
If you want to specify target steps or target time intervals to profile your
training job, you need to specify parameters for the FrameworkProfile
class. The following code examples show how to specify the target ranges for
profiling along with system monitoring.
- For a target step range

  With the following example configuration, Debugger monitors the entire training job every 500 milliseconds (the default monitoring) and profiles a target step range from step 5 to step 15 (for 10 steps).

```python
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
    framework_profile_params=FrameworkProfile(start_step=5, num_steps=10)
)
```

  With the following example configuration, Debugger monitors the entire training job every 1000 milliseconds and profiles a target step range from step 5 to step 15 (for 10 steps).

```python
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=1000,
    framework_profile_params=FrameworkProfile(start_step=5, num_steps=10)
)
```
- For a target time range

  With the following example configuration, Debugger monitors the entire training job every 500 milliseconds (the default monitoring) and profiles a target time range starting from the current Unix time for 600 seconds.

```python
import time
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
    framework_profile_params=FrameworkProfile(
        start_unix_time=int(time.time()),
        duration=600
    )
)
```

  With the following example configuration, Debugger monitors the entire training job every 1000 milliseconds and profiles a target time range starting from the current Unix time for 600 seconds.

```python
import time
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=1000,
    framework_profile_params=FrameworkProfile(
        start_unix_time=int(time.time()),
        duration=600
    )
)
```

The framework profiling is performed for all of the profiling options at the target step or time range.
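The start_unix_time and duration pair simply defines a wall-clock window. The following small sketch (plain Python, independent of SageMaker) illustrates the window arithmetic, assuming a half-open [start, start + duration) interval:

```python
import time

def in_profiling_window(timestamp, start_unix_time, duration):
    """Return True if a wall-clock timestamp falls in the [start, start + duration) window."""
    return start_unix_time <= timestamp < start_unix_time + duration

start = int(time.time())  # as in the estimator examples above
duration = 600            # profile for 600 seconds

print(in_profiling_window(start + 10, start, duration))   # inside the window
print(in_profiling_window(start + 601, start, duration))  # window has closed
```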
For more information about the available profiling options, see SageMaker Debugger APIs – FrameworkProfile in the Amazon SageMaker Python SDK. The next section shows you how to configure the available profiling options.
Start a training job with the default system monitoring and customized framework profiling with different profiling options
You can use the following profiling configuration classes to manage the framework profiling options:
- DetailedProfilingConfig – Specify a target step or time range to profile framework operations using the native framework profilers (TensorFlow profiler and PyTorch profiler). For example, if using TensorFlow, the Debugger hooks enable the TensorFlow profiler to collect TensorFlow-specific framework metrics. Detailed profiling enables you to profile all framework operators at a pre-step (before the first step), within steps, and between steps of a training job.

  Note
  Detailed profiling might significantly increase GPU memory consumption. We do not recommend enabling detailed profiling for more than a couple of steps.
- DataloaderProfilingConfig – Specify a target step or time range to profile deep learning framework data loader processes. Debugger collects every data loader event of the frameworks.

  Note
  Data loader profiling might lower the training performance while collecting information from data loaders. We don't recommend enabling data loader profiling for more than a couple of steps.

  Debugger is preconfigured to annotate data loader processes only for the Amazon deep learning containers. Debugger cannot profile data loader processes from any other custom or external training containers.
- PythonProfilingConfig – Specify a target step or time range to profile Python functions. You can also choose between two Python profilers: cProfile and Pyinstrument.

  - cProfile – The standard Python profiler. cProfile collects information for every Python operator called during training. With cProfile, Debugger saves cumulative time and annotation for each function call, providing complete detail about Python functions. In deep learning, for example, the most frequently called functions might be the convolutional filters and backward pass operators, and cProfile profiles every one of them. For the cProfile option, you can further select a timer option: total time, CPU time, or off-CPU time. While you can profile every function call executing on processors (both CPU and GPU) in CPU time, you can also identify I/O or network bottlenecks with the off-CPU time option. The default is total time, with which Debugger profiles both CPU and off-CPU time. With cProfile, you can drill down to every single function when analyzing the profile data.
  - Pyinstrument – Pyinstrument is a low-overhead Python profiler that works based on sampling. With the Pyinstrument option, Debugger samples profiling events every millisecond. Because Pyinstrument measures elapsed wall-clock time instead of CPU time, the Pyinstrument option can be a better choice over the cProfile option for reducing profiling noise (filtering out irrelevant function calls that are cumulatively fast) and capturing operators that are actually compute intensive (cumulatively slow) for training your model. With Pyinstrument, you can see a tree of function calls and better understand the structure and root cause of slowness.

  Note
  Enabling Python profiling might slow down the overall training time. cProfile profiles the most frequently called Python operators at every call, so the profiling processing time increases with the number of calls. For Pyinstrument, the cumulative profiling time increases with time because of its sampling mechanism.
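To see the kind of per-call data the cProfile option collects, the following standalone sketch uses Python's built-in cProfile and pstats modules (independent of SageMaker, shown only to illustrate cumulative-time profiling; the convolve function is a made-up stand-in for a frequently called training operator):

```python
import cProfile
import io
import pstats

def convolve(signal, kernel):
    """Naive 1-D convolution; stands in for a frequently called training op."""
    out = []
    for i in range(len(signal) - len(kernel) + 1):
        out.append(sum(signal[i + j] * kernel[j] for j in range(len(kernel))))
    return out

profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):  # the "hot" operator is called many times
    convolve(list(range(200)), [1, 0, -1])
profiler.disable()

# Rank functions by cumulative time, as in the cProfile total-time discussion above.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print("convolve" in report)  # the hot function shows up in the ranked report
```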
The following example configuration shows the full structure when you use the different profiling options with specified values.
```python
from sagemaker.debugger import (
    ProfilerConfig,
    FrameworkProfile,
    DetailedProfilingConfig,
    DataloaderProfilingConfig,
    PythonProfilingConfig,
    PythonProfiler,
    cProfileTimer,
)

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(
        detailed_profiling_config=DetailedProfilingConfig(
            start_step=5,
            num_steps=1
        ),
        dataloader_profiling_config=DataloaderProfilingConfig(
            start_step=7,
            num_steps=1
        ),
        python_profiling_config=PythonProfilingConfig(
            start_step=9,
            num_steps=1,
            python_profiler=PythonProfiler.CPROFILE,
            cprofile_timer=cProfileTimer.TOTAL_TIME
        )
    )
)
```
For more information about the available profiling options, see DetailedProfilingConfig.