SDK for Python (Boto3)
Amazon SageMaker Debugger built-in rules can be configured for a training job using the create_training_job() function of the Boto3 SageMaker client. Specify the rule evaluator image URI in the RuleEvaluatorImage
            parameter. The following examples walk you through how to set up the request body
            for the create_training_job() function.
The following code shows a complete example of how to configure Debugger for the
                create_training_job() request body and start a training job in
                us-west-2, assuming that a training script
                entry_point/train.py is prepared using TensorFlow. To find an
            end-to-end example notebook, see Profiling TensorFlow Multi GPU Multi Node Training Job with Amazon SageMaker
                Debugger (Boto3).
Note
Ensure that you use the correct Docker container images. To find available Amazon Deep Learning
            Container images, see Available Deep Learning Containers Images.
import datetime
import tarfile

import boto3
import sagemaker

# Start setting up a SageMaker session and a Boto3 SageMaker client
session = sagemaker.Session()
region = session.boto_region_name
bucket = session.default_bucket()

# Upload a training script to a default Amazon S3 bucket of the current SageMaker session
source = 'source.tar.gz'
project = 'debugger-boto3-test'

tar = tarfile.open(source, 'w:gz')
tar.add('entry_point/train.py')  # Specify the directory and name of your training script
tar.close()

s3 = boto3.client('s3')
s3.upload_file(source, bucket, project+'/'+source)

# Set up a Boto3 session client for SageMaker
sm = boto3.Session(region_name=region).client("sagemaker")

# Start a training job
sm.create_training_job(
    TrainingJobName='debugger-boto3-'+datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S'),
    HyperParameters={
        'sagemaker_submit_directory': 's3://'+bucket+'/'+project+'/'+source,
        # Training script file location and name under the sagemaker_submit_directory
        'sagemaker_program': '/entry_point/train.py'
    },
    AlgorithmSpecification={
        # Specify a training Docker container image URI
        # (Deep Learning Container or your own training container) to TrainingImage.
        'TrainingImage': '763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04',
        'TrainingInputMode': 'File',
        'EnableSageMakerMetricsTimeSeries': False
    },
    RoleArn='arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-20201014T161125',
    OutputDataConfig={'S3OutputPath': 's3://'+bucket+'/'+project+'/output'},
    ResourceConfig={
        'InstanceType': 'ml.p3.8xlarge',
        'InstanceCount': 1,
        'VolumeSizeInGB': 30
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 86400
    },
    DebugHookConfig={
        'S3OutputPath': 's3://'+bucket+'/'+project+'/debug-output',
        'CollectionConfigurations': [
            {
                'CollectionName': 'losses',
                'CollectionParameters': {
                    'train.save_interval': '500',
                    'eval.save_interval': '50'
                }
            }
        ]
    },
    DebugRuleConfigurations=[
        {
            'RuleConfigurationName': 'LossNotDecreasing',
            'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
            'RuleParameters': {'rule_to_invoke': 'LossNotDecreasing'}
        }
    ],
    ProfilerConfig={
        'S3OutputPath': 's3://'+bucket+'/'+project+'/profiler-output',
        'ProfilingIntervalInMilliseconds': 500,
        'ProfilingParameters': {
            # Each value is a JSON document passed as a string (no trailing commas)
            'DataloaderProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "MetricsRegex": ".*"}',
            'DetailedProfilingConfig': '{"StartStep": 5, "NumSteps": 3}',
            'PythonProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
            'LocalPath': '/opt/ml/output/profiler/'  # Optional. Local path for profiling outputs
        }
    },
    ProfilerRuleConfigurations=[
        {
            'RuleConfigurationName': 'ProfilerReport',
            'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
            'RuleParameters': {'rule_to_invoke': 'ProfilerReport'}
        }
    ]
)
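Once the job is running, the state of each rule evaluation can be read from the describe_training_job() response. The following is a minimal sketch rather than part of the original example: the helper name summarize_rule_statuses and the sample dictionary are hypothetical, and the helper only reads the DebugRuleEvaluationStatuses and ProfilerRuleEvaluationStatuses lists, so it can be tried on a hand-written dictionary before calling the API.

```python
def summarize_rule_statuses(description):
    """Return {rule configuration name: status} from a describe_training_job() response.

    `description` is the dictionary returned by
    sm.describe_training_job(TrainingJobName=...). Both status lists may be
    absent from a response, so missing keys are treated as empty lists.
    """
    statuses = {}
    for key in ("DebugRuleEvaluationStatuses", "ProfilerRuleEvaluationStatuses"):
        for entry in description.get(key, []):
            name = entry["RuleConfigurationName"]
            statuses[name] = entry.get("RuleEvaluationStatus", "Unknown")
    return statuses


# Hand-written stand-in for a real response (hypothetical values).
sample_description = {
    "TrainingJobStatus": "InProgress",
    "DebugRuleEvaluationStatuses": [
        {"RuleConfigurationName": "LossNotDecreasing", "RuleEvaluationStatus": "InProgress"}
    ],
    "ProfilerRuleEvaluationStatuses": [
        {"RuleConfigurationName": "ProfilerReport", "RuleEvaluationStatus": "InProgress"}
    ],
}

print(summarize_rule_statuses(sample_description))
```

With a real client, the same helper would be called as summarize_rule_statuses(sm.describe_training_job(TrainingJobName=job_name)), where job_name is the name passed to create_training_job().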
To configure a Debugger rule for debugging model parameters
The following code samples show how to configure a built-in
                    VanishingGradient rule using this SageMaker API. 
To enable Debugger to collect output tensors
Specify the Debugger hook configuration as follows:
DebugHookConfig={
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/debug-output',
    'CollectionConfigurations': [
        {
            'CollectionName': 'gradients',
            'CollectionParameters': {
                'train.save_interval': '500',
                'eval.save_interval': '50'
            }
        }
    ]
}
This makes the training job save the gradients tensor collection
                every 500 steps during training (train.save_interval) and every 50 steps during evaluation (eval.save_interval). To find available
                    CollectionName values, see Debugger Built-in Collections. To find available CollectionParameters
                parameter keys and values, see the sagemaker.debugger.CollectionConfig class.
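Because CollectionParameters is a string-to-string map, numeric settings such as save intervals have to be passed as strings. The sketch below shows one way to build the hook configuration from ordinary Python values. The helper name build_debug_hook_config and the bucket name are hypothetical and only serve as an illustration.

```python
def build_debug_hook_config(s3_output_path, collections):
    """Build a DebugHookConfig dictionary for create_training_job().

    `collections` maps a collection name to a dict of collection parameters.
    Parameter values are converted to strings because the API expects
    CollectionParameters to map strings to strings.
    """
    return {
        "S3OutputPath": s3_output_path,
        "CollectionConfigurations": [
            {
                "CollectionName": name,
                "CollectionParameters": {key: str(value) for key, value in params.items()},
            }
            for name, params in collections.items()
        ],
    }


# Hypothetical bucket and prefix, used only for illustration.
hook_config = build_debug_hook_config(
    "s3://amzn-s3-demo-bucket/my-training-job/debug-output",
    {"gradients": {"train.save_interval": 500, "eval.save_interval": 50}},
)
print(hook_config)
```

The resulting dictionary can be passed as the DebugHookConfig argument of create_training_job().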
To enable Debugger rules for debugging the output tensors
The following DebugRuleConfigurations API example shows how to run
                the built-in VanishingGradient rule on the saved gradients
                collection.
DebugRuleConfigurations=[
    {
        'RuleConfigurationName': 'VanishingGradient',
        'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
        'RuleParameters': {
            'rule_to_invoke': 'VanishingGradient',
            'threshold': '20.0'
        }
    }
]
With a configuration like the one in this sample, Debugger starts a rule evaluation
                job for your training job, running the VanishingGradient rule on the
                gradients tensor collection. To find a complete list of available
                Docker images for using the Debugger rules, see Docker images for Debugger rules. To find the key-value pairs for
                    RuleParameters, see List of Debugger built-in rules.
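When several built-in rules are configured, the entries differ only in the rule name and its parameters. The following sketch factors that out. The helper name build_builtin_rule_config is hypothetical, and the image URI is the us-west-2 one used in the sample above; other Regions use a different registry.

```python
def build_builtin_rule_config(rule_name, evaluator_image, **rule_parameters):
    """Build one entry for DebugRuleConfigurations.

    Extra keyword arguments become RuleParameters. Values are converted to
    strings because RuleParameters maps strings to strings.
    """
    parameters = {"rule_to_invoke": rule_name}
    parameters.update({key: str(value) for key, value in rule_parameters.items()})
    return {
        "RuleConfigurationName": rule_name,
        "RuleEvaluatorImage": evaluator_image,
        "RuleParameters": parameters,
    }


# Rule evaluator image for us-west-2, taken from the sample above.
RULES_IMAGE = "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest"

debug_rule_configurations = [
    build_builtin_rule_config("VanishingGradient", RULES_IMAGE, threshold=20.0),
    build_builtin_rule_config("LossNotDecreasing", RULES_IMAGE),
]
print(debug_rule_configurations)
```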
To configure a Debugger built-in rule for profiling system and framework metrics
The following example code shows how to specify the ProfilerConfig parameter to enable collecting system and framework metrics.
To enable Debugger profiling to collect system and framework metrics
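A ProfilerConfig block in the style of the complete example at the top of this page might be built as follows. This is a sketch under stated assumptions: the helper name build_profiler_config and the bucket name are hypothetical, and the profiling keys simply mirror those in the complete example. Generating each ProfilingParameters value with json.dumps keeps the embedded JSON well formed.

```python
import json


def build_profiler_config(s3_output_path, start_step, num_steps, interval_ms=500):
    """Build a ProfilerConfig dictionary that profiles a range of steps.

    Each ProfilingParameters value is a JSON document serialized to a string.
    The keys mirror the complete create_training_job() example.
    """
    step_range = {"StartStep": start_step, "NumSteps": num_steps}
    return {
        "S3OutputPath": s3_output_path,
        "ProfilingIntervalInMilliseconds": interval_ms,
        "ProfilingParameters": {
            "DataloaderProfilingConfig": json.dumps({**step_range, "MetricsRegex": ".*"}),
            "DetailedProfilingConfig": json.dumps(step_range),
            "PythonProfilingConfig": json.dumps(
                {**step_range, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}
            ),
        },
    }


# Hypothetical bucket and prefix, used only for illustration.
profiler_config = build_profiler_config(
    "s3://amzn-s3-demo-bucket/my-training-job/profiler-output", start_step=5, num_steps=3
)
print(json.dumps(profiler_config, indent=2))
```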
To enable Debugger rules for profiling the metrics
The following example code shows how to configure the ProfilerReport rule.
ProfilerRuleConfigurations=[
    {
        'RuleConfigurationName': 'ProfilerReport',
        'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
        'RuleParameters': {
            'rule_to_invoke': 'ProfilerReport',
            'CPUBottleneck_cpu_threshold': '90',
            'IOBottleneck_threshold': '90'
        }
    }
]
To find a complete list of available Docker images for using the Debugger rules, see
                    Docker images for Debugger rules. To find the key-value pairs for
                    RuleParameters, see List of Debugger built-in rules.
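Update Debugger Profiling Configuration Using the UpdateTrainingJob API Operation
            Debugger profiling configuration can be updated while your training job is running
                by using the update_training_job() function of the Boto3 SageMaker client. Configure new ProfilerConfig and ProfilerRuleConfigurations objects, and specify the training job name in the TrainingJobName parameter.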
ProfilerConfig={
    'DisableProfiler': boolean,
    'ProfilingIntervalInMilliseconds': number,
    'ProfilingParameters': {
        'string': 'string'
    }
},
ProfilerRuleConfigurations=[
    {
        'RuleConfigurationName': 'string',
        'RuleEvaluatorImage': 'string',
        'RuleParameters': {
            'string': 'string'
        }
    }
],
TrainingJobName='your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS'
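As a concrete instance of the request syntax above, the sketch below turns profiling off, or changes the sampling interval, for a running job. The helper name update_profiler is hypothetical. The client is passed in as an argument so that the function can be exercised with a stand-in object before it is pointed at a real boto3.client("sagemaker").

```python
def update_profiler(sm_client, training_job_name, disable=False, interval_ms=None):
    """Update the Debugger profiling configuration of a running training job.

    sm_client is expected to behave like boto3.client("sagemaker").
    Set disable=True to stop profiling, or pass interval_ms to change the
    system metrics sampling interval while profiling stays enabled.
    """
    profiler_config = {"DisableProfiler": disable}
    if interval_ms is not None and not disable:
        profiler_config["ProfilingIntervalInMilliseconds"] = interval_ms
    return sm_client.update_training_job(
        TrainingJobName=training_job_name,
        ProfilerConfig=profiler_config,
    )


# Example call (requires AWS credentials and a running training job):
# import boto3
# sm = boto3.client("sagemaker")
# update_profiler(sm, "your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS", disable=True)
```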
Add Debugger Custom Rule Configuration to the CreateTrainingJob API Operation
A custom rule can be configured for a training job using the 
                    DebugHookConfig and 
                    DebugRuleConfigurations objects of the Boto3 SageMaker AI client's
                    create_training_job() function. The following code sample shows how to configure a custom ImproperActivation rule written
                with the smdebug library using this SageMaker API
                operation. This example assumes that you’ve written the custom rule in a custom_rules.py file and uploaded it to an Amazon S3 bucket.
                Pre-built Docker images are available to run your custom
                rules. These are listed at Amazon SageMaker Debugger image URIs for custom rule
                evaluators. You specify the registry
                address of the pre-built Docker image in the RuleEvaluatorImage
                parameter.
DebugHookConfig={
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/debug-output',
    'CollectionConfigurations': [
        {
            'CollectionName': 'relu_activations',
            'CollectionParameters': {
                'include_regex': 'relu',
                'save_interval': '500',
                'end_step': '5000'
            }
        }
    ]
},
DebugRuleConfigurations=[
    {
        'RuleConfigurationName': 'improper_activation_job',
        'RuleEvaluatorImage': '552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest',
        'InstanceType': 'ml.c4.xlarge',
        'VolumeSizeInGB': 400,
        'RuleParameters': {
            'source_s3_uri': 's3://bucket/custom_rules.py',   # Location of the custom rule script
            'rule_to_invoke': 'ImproperActivation',           # Rule class defined in custom_rules.py
            'collection_names': 'relu_activations'            # Collection saved by DebugHookConfig
        }
    }
]
To find a complete list of available Docker images for using the Debugger rules, see
                    Docker images for Debugger rules. To find the key-value pairs for
                    RuleParameters, see List of Debugger built-in rules.
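A custom rule can only evaluate tensors that the hook actually saves, so it can help to check that every collection named in the rule's collection_names parameter also appears in DebugHookConfig before submitting the job. The following is a minimal sketch of such a check. The helper name missing_rule_collections is hypothetical, and treating collection_names as a comma-separated list is an assumption; the sample above passes only a single name.

```python
def missing_rule_collections(debug_hook_config, rule_config):
    """Return collection names a rule asks for that the hook does not save.

    Assumption: the 'collection_names' rule parameter holds one name or a
    comma-separated list of names.
    """
    saved = {
        collection["CollectionName"]
        for collection in debug_hook_config.get("CollectionConfigurations", [])
    }
    requested = rule_config.get("RuleParameters", {}).get("collection_names", "")
    wanted = [name.strip() for name in requested.split(",") if name.strip()]
    return [name for name in wanted if name not in saved]


# Trimmed-down versions of the configurations from the sample above.
hook = {
    "CollectionConfigurations": [
        {"CollectionName": "relu_activations", "CollectionParameters": {"include_regex": "relu"}}
    ]
}
rule = {
    "RuleConfigurationName": "improper_activation_job",
    "RuleParameters": {
        "rule_to_invoke": "ImproperActivation",
        "collection_names": "relu_activations",
    },
}

print(missing_rule_collections(hook, rule))
```

An empty result means every collection the rule refers to is configured in the hook; any names returned would need a matching entry in CollectionConfigurations.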