SDK for Python (Boto3)
Amazon SageMaker Debugger built-in rules can be configured for a training job using the create_training_job() function of the Boto3 SageMaker client. Specify the rule evaluator image URI in the RuleEvaluatorImage parameter. The following examples walk you through how to set up the request body for create_training_job().
The following code shows a complete example of how to configure Debugger in the create_training_job() request body and start a training job in us-west-2, assuming that a training script entry_point/train.py is prepared using TensorFlow. To find an end-to-end example notebook, see Profiling TensorFlow Multi GPU Multi Node Training Job with Amazon SageMaker Debugger (Boto3).
Note
Ensure that you use the correct Docker container images. To find available Amazon Deep Learning Container images, see Available Deep Learning Containers Images.
    import datetime
    import tarfile

    import boto3
    import sagemaker

    # Start setting up a SageMaker session and a Boto3 SageMaker client
    session = sagemaker.Session()
    region = session.boto_region_name
    bucket = session.default_bucket()

    # Upload a training script to a default Amazon S3 bucket of the current SageMaker session
    source = 'source.tar.gz'
    project = 'debugger-boto3-test'

    tar = tarfile.open(source, 'w:gz')
    tar.add('entry_point/train.py')  # Specify the directory and name of your training script
    tar.close()

    s3 = boto3.client('s3')
    s3.upload_file(source, bucket, project+'/'+source)

    # Set up a Boto3 session client for SageMaker
    sm = boto3.Session(region_name=region).client("sagemaker")

    # Start a training job
    sm.create_training_job(
        TrainingJobName='debugger-boto3-'+datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S'),
        HyperParameters={
            'sagemaker_submit_directory': 's3://'+bucket+'/'+project+'/'+source,
            # Training script file location and name under the sagemaker_submit_directory
            'sagemaker_program': '/entry_point/train.py'
        },
        AlgorithmSpecification={
            # Specify a training Docker container image URI
            # (Deep Learning Container or your own training container) to TrainingImage.
            'TrainingImage': '763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04',
            'TrainingInputMode': 'File',
            'EnableSageMakerMetricsTimeSeries': False
        },
        RoleArn='arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-20201014T161125',
        OutputDataConfig={'S3OutputPath': 's3://'+bucket+'/'+project+'/output'},
        ResourceConfig={
            'InstanceType': 'ml.p3.8xlarge',
            'InstanceCount': 1,
            'VolumeSizeInGB': 30
        },
        StoppingCondition={
            'MaxRuntimeInSeconds': 86400
        },
        DebugHookConfig={
            'S3OutputPath': 's3://'+bucket+'/'+project+'/debug-output',
            'CollectionConfigurations': [
                {
                    'CollectionName': 'losses',
                    'CollectionParameters': {
                        'train.save_interval': '500',
                        'eval.save_interval': '50'
                    }
                }
            ]
        },
        DebugRuleConfigurations=[
            {
                'RuleConfigurationName': 'LossNotDecreasing',
                'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
                'RuleParameters': {'rule_to_invoke': 'LossNotDecreasing'}
            }
        ],
        ProfilerConfig={
            'S3OutputPath': 's3://'+bucket+'/'+project+'/profiler-output',
            'ProfilingIntervalInMilliseconds': 500,
            'ProfilingParameters': {
                # Each value is a JSON document passed as a string
                'DataloaderProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "MetricsRegex": ".*"}',
                'DetailedProfilingConfig': '{"StartStep": 5, "NumSteps": 3}',
                'PythonProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
                'LocalPath': '/opt/ml/output/profiler/'  # Optional. Local path for profiling outputs
            }
        },
        ProfilerRuleConfigurations=[
            {
                'RuleConfigurationName': 'ProfilerReport',
                'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
                'RuleParameters': {'rule_to_invoke': 'ProfilerReport'}
            }
        ]
    )
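Once a job like the one above is running, the state of each rule evaluation can be read from the describe_training_job() response, which reports it under DebugRuleEvaluationStatuses and ProfilerRuleEvaluationStatuses. The following is a minimal sketch, not part of the original example: the helper names summarize_rule_statuses and fetch_rule_statuses and the sample response fragment are hypothetical, and the API call itself requires valid AWS credentials.

```python
def summarize_rule_statuses(description):
    """Map each rule configuration name to its evaluation status.

    `description` is a dict shaped like a describe_training_job() response.
    Missing status lists are treated as empty.
    """
    statuses = {}
    for key in ("DebugRuleEvaluationStatuses", "ProfilerRuleEvaluationStatuses"):
        for entry in description.get(key, []):
            statuses[entry["RuleConfigurationName"]] = entry["RuleEvaluationStatus"]
    return statuses


def fetch_rule_statuses(training_job_name, region_name="us-west-2"):
    """Query SageMaker for a job and summarize its rule statuses (needs AWS credentials)."""
    import boto3  # imported lazily so summarize_rule_statuses has no AWS dependency

    sm = boto3.Session(region_name=region_name).client("sagemaker")
    return summarize_rule_statuses(
        sm.describe_training_job(TrainingJobName=training_job_name)
    )


# Hypothetical response fragment, for illustration only.
sample_description = {
    "DebugRuleEvaluationStatuses": [
        {"RuleConfigurationName": "LossNotDecreasing", "RuleEvaluationStatus": "InProgress"}
    ],
    "ProfilerRuleEvaluationStatuses": [
        {"RuleConfigurationName": "ProfilerReport", "RuleEvaluationStatus": "NoIssuesFound"}
    ],
}
print(summarize_rule_statuses(sample_description))
```

Keeping the summarizing logic separate from the API call makes it possible to check the parsing against a saved response without contacting AWS.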
To configure a Debugger rule for debugging model parameters
The following code samples show how to configure a built-in VanishingGradient rule using this SageMaker API.
To enable Debugger to collect output tensors
Specify the Debugger hook configuration as follows:
    DebugHookConfig={
        'S3OutputPath': 's3://<default-bucket>/<training-job-name>/debug-output',
        'CollectionConfigurations': [
            {
                'CollectionName': 'gradients',
                'CollectionParameters': {
                    'train.save_interval': '500',
                    'eval.save_interval': '50'
                }
            }
        ]
    }
This configuration makes the training job save the gradients tensor collection every 500 steps during training (train.save_interval) and every 50 steps during evaluation (eval.save_interval). To find available CollectionName values, see Debugger Built-in Collections. To find available CollectionParameters parameter keys and values, see the sagemaker.debugger.CollectionConfig class.
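Because CollectionParameters is a string-to-string map, numeric intervals have to be passed as strings. The following is a minimal sketch of a helper that assembles a DebugHookConfig dictionary and performs that conversion. The function name build_debug_hook_config and the bucket path are hypothetical and used for illustration only.

```python
def build_debug_hook_config(s3_output_path, collections):
    """Build a DebugHookConfig dict for create_training_job().

    `collections` maps a collection name to a dict of collection parameters.
    Parameter values are converted to strings, as the API expects.
    """
    return {
        "S3OutputPath": s3_output_path,
        "CollectionConfigurations": [
            {
                "CollectionName": name,
                "CollectionParameters": {key: str(value) for key, value in params.items()},
            }
            for name, params in collections.items()
        ],
    }


# Hypothetical output path; replace with your own bucket and prefix.
hook_config = build_debug_hook_config(
    "s3://amzn-s3-demo-bucket/my-training-job/debug-output",
    {"gradients": {"train.save_interval": 500, "eval.save_interval": 50}},
)
print(hook_config)
```

The resulting dictionary can be passed as the DebugHookConfig argument of create_training_job().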
To enable Debugger rules for debugging the output tensors
The following DebugRuleConfigurations API example shows how to run the built-in VanishingGradient rule on the saved gradients collection.
    DebugRuleConfigurations=[
        {
            'RuleConfigurationName': 'VanishingGradient',
            'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
            'RuleParameters': {
                'rule_to_invoke': 'VanishingGradient',
                'threshold': '20.0'
            }
        }
    ]
With a configuration like the one in this sample, Debugger starts a rule evaluation job for your training job that applies the VanishingGradient rule to the gradients tensor collection. To find a complete list of available Docker images for using the Debugger rules, see Docker images for Debugger rules. To find the key-value pairs for RuleParameters, see List of Debugger built-in rules.
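When several rules are configured, it can help to generate each entry from a small function so that the rule_to_invoke key is always present and every parameter value is a string. The following is a minimal sketch under that assumption. The helper name build_rule_configuration is hypothetical, and the image URI is the us-west-2 URI used in the sample above.

```python
def build_rule_configuration(name, evaluator_image, rule_to_invoke, **rule_parameters):
    """Build one entry for DebugRuleConfigurations or ProfilerRuleConfigurations.

    Extra keyword arguments become RuleParameters entries and are
    converted to strings, because RuleParameters is a string-to-string map.
    """
    parameters = {"rule_to_invoke": rule_to_invoke}
    parameters.update({key: str(value) for key, value in rule_parameters.items()})
    return {
        "RuleConfigurationName": name,
        "RuleEvaluatorImage": evaluator_image,
        "RuleParameters": parameters,
    }


# Image URI for us-west-2, taken from the sample above.
RULES_IMAGE = "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest"

vanishing_gradient = build_rule_configuration(
    "VanishingGradient", RULES_IMAGE, "VanishingGradient", threshold=20.0
)
print(vanishing_gradient)
```

A list of such entries can be passed as the DebugRuleConfigurations argument of create_training_job().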
To configure a Debugger built-in rule for profiling system and framework metrics
The following example code shows how to specify the ProfilerConfig parameter to enable collecting system and framework metrics.
To enable Debugger profiling to collect system and framework metrics
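The values under ProfilingParameters are JSON documents passed as strings, as in the complete example earlier in this section. The following is a minimal sketch that generates those strings with json.dumps so that they are always well-formed JSON. The helper name build_profiler_config, its default values, and the bucket path are assumptions made for illustration.

```python
import json


def build_profiler_config(s3_output_path, interval_ms=500, start_step=5, num_steps=3):
    """Build a ProfilerConfig dict for create_training_job().

    System metrics are collected every `interval_ms` milliseconds, and
    framework profiling covers `num_steps` steps starting at `start_step`.
    """
    step_range = {"StartStep": start_step, "NumSteps": num_steps}
    return {
        "S3OutputPath": s3_output_path,
        "ProfilingIntervalInMilliseconds": interval_ms,
        "ProfilingParameters": {
            "DataloaderProfilingConfig": json.dumps({**step_range, "MetricsRegex": ".*"}),
            "DetailedProfilingConfig": json.dumps(step_range),
            "PythonProfilingConfig": json.dumps(
                {**step_range, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}
            ),
        },
    }


# Hypothetical output path; replace with your own bucket and prefix.
profiler_config = build_profiler_config(
    "s3://amzn-s3-demo-bucket/my-training-job/profiler-output"
)
print(profiler_config)
```

The resulting dictionary can be passed as the ProfilerConfig argument of create_training_job().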
To enable Debugger rules for profiling the metrics
The following example code shows how to configure the ProfilerReport rule.
    ProfilerRuleConfigurations=[
        {
            'RuleConfigurationName': 'ProfilerReport',
            'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
            'RuleParameters': {
                'rule_to_invoke': 'ProfilerReport',
                'CPUBottleneck_cpu_threshold': '90',
                'IOBottleneck_threshold': '90'
            }
        }
    ]
To find a complete list of available Docker images for using the Debugger rules, see Docker images for Debugger rules. To find the key-value pairs for RuleParameters, see List of Debugger built-in rules.
Update Debugger Profiling Configuration Using the UpdateTrainingJob API Operation
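Debugger profiling configuration can be updated while your training job is running by using the update_training_job() function of the Boto3 SageMaker client. Configure new ProfilerConfig and ProfilerRuleConfigurations objects, and specify the name of the running training job in the TrainingJobName parameter.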
    ProfilerConfig={
        'DisableProfiler': boolean,
        'ProfilingIntervalInMilliseconds': number,
        'ProfilingParameters': {
            'string': 'string'
        }
    },
    ProfilerRuleConfigurations=[
        {
            'RuleConfigurationName': 'string',
            'RuleEvaluatorImage': 'string',
            'RuleParameters': {
                'string': 'string'
            }
        }
    ],
    TrainingJobName='your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS'
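The request syntax above can be filled in for a concrete case. The following is a minimal sketch that builds the keyword arguments for update_training_job(), either to change the profiling interval or to turn profiling off. The helper name build_profiler_update and the job name are hypothetical, and sending the request requires valid AWS credentials and a running training job.

```python
def build_profiler_update(training_job_name, disable=False, interval_ms=None):
    """Build keyword arguments for update_training_job().

    With disable=True, profiling is turned off for the running job.
    Otherwise, interval_ms (if given) sets a new system metrics interval.
    """
    profiler_config = {"DisableProfiler": disable}
    if interval_ms is not None and not disable:
        profiler_config["ProfilingIntervalInMilliseconds"] = interval_ms
    return {
        "TrainingJobName": training_job_name,
        "ProfilerConfig": profiler_config,
    }


# Hypothetical job name, for illustration only.
request = build_profiler_update("debugger-boto3-2021-01-01-00-00-00", interval_ms=1000)
print(request)

# To send the request (requires AWS credentials and a running job):
#   import boto3
#   sm = boto3.Session(region_name="us-west-2").client("sagemaker")
#   sm.update_training_job(**request)
```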
Add Debugger Custom Rule Configuration to the CreateTrainingJob API Operation
A custom rule can be configured for a training job using the DebugHookConfig and DebugRuleConfigurations objects in the Boto3 SageMaker AI client's create_training_job() function. The following code sample shows how to configure a custom ImproperActivation rule written with the smdebug library using this SageMaker API operation. This example assumes that you've written the custom rule in the custom_rules.py file and uploaded it to an Amazon S3 bucket.
Pre-built Docker images are available for running your custom rules. These are listed at Amazon SageMaker Debugger image URIs for custom rule evaluators. Specify the registry URL of the pre-built Docker image in the RuleEvaluatorImage parameter.
    DebugHookConfig={
        'S3OutputPath': 's3://<default-bucket>/<training-job-name>/debug-output',
        'CollectionConfigurations': [
            {
                'CollectionName': 'relu_activations',
                'CollectionParameters': {
                    'include_regex': 'relu',
                    'save_interval': '500',
                    'end_step': '5000'
                }
            }
        ]
    },
    DebugRuleConfigurations=[
        {
            'RuleConfigurationName': 'improper_activation_job',
            'RuleEvaluatorImage': '552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest',
            'InstanceType': 'ml.c4.xlarge',
            'VolumeSizeInGB': 400,
            'RuleParameters': {
                'source_s3_uri': 's3://bucket/custom_rules.py',
                'rule_to_invoke': 'ImproperActivation',
                'collection_names': 'relu_activations'
            }
        }
    ]
To find a complete list of available Docker images for using the Debugger rules, see Docker images for Debugger rules. To find the key-value pairs for RuleParameters, see List of Debugger built-in rules.