


# SDK for Python (Boto3)
<a name="debugger-built-in-rules-api.Boto3"></a>

You can configure Amazon SageMaker Debugger built-in rules for a training job by using the [`create_training_job()`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_training_job) function of the Boto3 SageMaker AI client. You need to specify the correct image URI in the `RuleEvaluatorImage` parameter. The following examples show how to set up the request body for the [`create_training_job()`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_training_job) function.

The following code is a complete example of how to configure Debugger in the `create_training_job()` request body and start a training job in `us-west-2`, assuming that the training script `entry_point/train.py` is written with TensorFlow. To find an end-to-end example notebook, see [Profiling TensorFlow Multi GPU Multi Node Training Job with Amazon SageMaker Debugger (Boto3)](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow_profiling/tf-resnet-profiling-multi-gpu-multi-node-boto3.html).

**Note**  
Make sure that you use the correct Docker container images. To find available Deep Learning Containers images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md). To find the complete list of Docker images available for Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules).

```
import sagemaker, boto3
import datetime, tarfile

# Start setting up a SageMaker session and a Boto3 SageMaker client
session = sagemaker.Session()
region = session.boto_region_name
bucket = session.default_bucket()

# Upload a training script to a default Amazon S3 bucket of the current SageMaker session
source = 'source.tar.gz'
project = '{{debugger-boto3-test}}'

# Package the training script; the context manager closes the archive even on error
with tarfile.open(source, 'w:gz') as tar:
    tar.add('{{entry_point/train.py}}')  # Specify the directory and name of your training script

s3 = boto3.client('s3')
s3.upload_file(source, bucket, project+'/'+source)

# Set up a Boto3 session client for SageMaker
sm = boto3.Session(region_name=region).client("sagemaker")

# Start a training job
sm.create_training_job(
    TrainingJobName='debugger-boto3-'+datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S'),
    HyperParameters={
        'sagemaker_submit_directory': 's3://'+bucket+'/'+project+'/'+source,
        'sagemaker_program': '{{/entry_point/train.py}}' # training script file location and name under the sagemaker_submit_directory
    },
    AlgorithmSpecification={
        # Specify a training Docker container image URI (Deep Learning Container or your own training container) to TrainingImage.
        'TrainingImage': '{{763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04}}',
        'TrainingInputMode': '{{File}}',
        'EnableSageMakerMetricsTimeSeries': {{False}}
    },
    RoleArn='arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-20201014T161125',
    OutputDataConfig={'S3OutputPath': 's3://'+bucket+'/'+project+'/output'},
    ResourceConfig={
        'InstanceType': '{{ml.p3.8xlarge}}',
        'InstanceCount': {{1}},
        'VolumeSizeInGB': 30
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 86400
    },
    DebugHookConfig={
        'S3OutputPath': 's3://'+bucket+'/'+project+'/debug-output',
        'CollectionConfigurations': [
            {
                'CollectionName': '{{losses}}',
                'CollectionParameters' : {
                    'train.save_interval': '{{500}}',
                    'eval.save_interval': '{{50}}'
                }
            }
        ]
    },
    DebugRuleConfigurations=[
        {
            'RuleConfigurationName': '{{LossNotDecreasing}}',
            'RuleEvaluatorImage': '{{895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest}}',
            'RuleParameters': {'rule_to_invoke': '{{LossNotDecreasing}}'}
        }
    ],
    ProfilerConfig={
        'S3OutputPath': 's3://'+bucket+'/'+project+'/profiler-output',
        'ProfilingIntervalInMilliseconds': 500,
        'ProfilingParameters': {
            'DataloaderProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "MetricsRegex": ".*"}',
            'DetailedProfilingConfig': '{"StartStep": 5, "NumSteps": 3}',
            'PythonProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
            'LocalPath': '/opt/ml/output/profiler/' # Optional. Local path for profiling outputs
        }
    },
    ProfilerRuleConfigurations=[
        {
            'RuleConfigurationName': 'ProfilerReport',
            'RuleEvaluatorImage': '{{895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest}}',
            'RuleParameters': {'rule_to_invoke': 'ProfilerReport'}
        }
    ]
)
```
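
After the job is submitted, the rule evaluation results can be read back from the `describe_training_job()` response. The following is a minimal sketch, assuming the response carries the `DebugRuleEvaluationStatuses` and `ProfilerRuleEvaluationStatuses` lists described in the Boto3 reference. The helper name `summarize_rule_statuses` and the sample response are hypothetical and only illustrate the shape of the data.

```python
def summarize_rule_statuses(description):
    """Map each rule configuration name to its evaluation status.

    `description` is the dict returned by describe_training_job().
    Hypothetical helper for illustration; it is not part of Boto3.
    """
    statuses = {}
    for key in ('DebugRuleEvaluationStatuses', 'ProfilerRuleEvaluationStatuses'):
        # Either list may be absent if no rules of that kind were configured
        for rule in description.get(key, []):
            statuses[rule['RuleConfigurationName']] = rule['RuleEvaluationStatus']
    return statuses


# Made-up response fragment. A real one would come from
# sm.describe_training_job(TrainingJobName=<your job name>)
sample_description = {
    'DebugRuleEvaluationStatuses': [
        {'RuleConfigurationName': 'LossNotDecreasing', 'RuleEvaluationStatus': 'InProgress'},
    ],
    'ProfilerRuleEvaluationStatuses': [
        {'RuleConfigurationName': 'ProfilerReport', 'RuleEvaluationStatus': 'NoIssuesFound'},
    ],
}
print(summarize_rule_statuses(sample_description))
```

Polling this summary at intervals is one way to wait for a rule to report `IssuesFound` before acting on the training job.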

## Configure Debugger rules for debugging model parameters
<a name="debugger-built-in-rules-api-debug.Boto3"></a>

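The following code examples show how to configure the built-in `VanishingGradient` rule using this SageMaker API.

**To enable Debugger to collect output tensors**

Specify the Debugger hook configuration as follows: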

```
DebugHookConfig={
    'S3OutputPath': '{{s3://<default-bucket>/<training-job-name>/debug-output}}',
    'CollectionConfigurations': [
        {
            'CollectionName': '{{gradients}}',
            'CollectionParameters' : {
                'train.save_interval': '{{500}}',
                'eval.save_interval': '{{50}}'
            }
        }
    ]
}
```

With this configuration, the training job saves the `gradients` tensor collection every 500 steps during training (`train.save_interval`) and every 50 steps during evaluation (`eval.save_interval`). To find available `CollectionName` values, see [Debugger Built-in Collections](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#built-in-collections) in the *SMDebug client library documentation*. To find available `CollectionParameters` keys and values, see the [`sagemaker.debugger.CollectionConfig`](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.CollectionConfig) class in the *SageMaker Python SDK documentation*.
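
Because `CollectionParameters` is a string-to-string map, numeric intervals must be passed as strings. The following is a minimal sketch of a helper that assembles the hook configuration and performs that conversion. The function name `build_debug_hook_config` and the bucket path are hypothetical and are not part of Boto3.

```python
def build_debug_hook_config(s3_output_path, collections):
    """Build a DebugHookConfig dict for create_training_job().

    `collections` maps a collection name to a dict of its parameters.
    Parameter values are converted to strings, as the API expects.
    Hypothetical helper for illustration only.
    """
    return {
        'S3OutputPath': s3_output_path,
        'CollectionConfigurations': [
            {
                'CollectionName': name,
                'CollectionParameters': {key: str(value) for key, value in params.items()},
            }
            for name, params in collections.items()
        ],
    }


# Example values only; replace the S3 path with your own bucket and prefix
hook_config = build_debug_hook_config(
    's3://example-bucket/example-job/debug-output',
    {'gradients': {'train.save_interval': 500, 'eval.save_interval': 50}},
)
print(hook_config)
```

The resulting dictionary can be passed as the `DebugHookConfig` argument of `create_training_job()`.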

**To enable Debugger rules for debugging the output tensors**

The following `DebugRuleConfigurations` API example shows how to run the built-in `VanishingGradient` rule on the saved `gradients` collection.

```
DebugRuleConfigurations=[
    {
        'RuleConfigurationName': '{{VanishingGradient}}',
        'RuleEvaluatorImage': '{{895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest}}',
        'RuleParameters': {
            'rule_to_invoke': '{{VanishingGradient}}',
            'threshold': '{{20.0}}'
        }
    }
]
```

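With a configuration like the one in this example, Debugger starts a rule evaluation job for your training job, applying the `VanishingGradient` rule to the `gradients` tensor collection. To find the complete list of Docker images available for Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules). To find the key-value pairs for `RuleParameters`, see [List of Debugger built-in rules](debugger-built-in-rules.md).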
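
The rule image URIs in these examples follow the pattern `<account-id>.dkr.ecr.<region>.amazonaws.com/sagemaker-debugger-rules:latest`. The following is a minimal sketch that composes the URI and a rule entry from their parts. Both helper names are hypothetical, the account ID must be looked up for your Region in the Docker image list, and the `amazonaws.com` domain suffix is an assumption that may not hold in every partition.

```python
def rule_evaluator_image(account_id, region, tag='latest'):
    """Compose a built-in rule evaluator image URI (assumes the amazonaws.com suffix)."""
    return f'{account_id}.dkr.ecr.{region}.amazonaws.com/sagemaker-debugger-rules:{tag}'


def build_debug_rule(rule_name, image_uri, **rule_parameters):
    """Build one DebugRuleConfigurations entry; parameter values are sent as strings."""
    parameters = {'rule_to_invoke': rule_name}
    parameters.update({key: str(value) for key, value in rule_parameters.items()})
    return {
        'RuleConfigurationName': rule_name,
        'RuleEvaluatorImage': image_uri,
        'RuleParameters': parameters,
    }


# Account ID and Region taken from the us-west-2 examples on this page
image = rule_evaluator_image('895741380848', 'us-west-2')
rule = build_debug_rule('VanishingGradient', image, threshold=20.0)
print(rule)
```

A list of such entries, for example `[rule]`, is what `DebugRuleConfigurations` expects.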

## Configure Debugger built-in rules for profiling system and framework metrics
<a name="debugger-built-in-rules-api-profile.Boto3"></a>

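The following example code shows how to specify the `ProfilerConfig` parameter to enable collection of system and framework metrics.

**To enable Debugger profiling to collect system and framework metrics**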

------
#### [ Target Step ]

```
ProfilerConfig={ 
    'S3OutputPath': '{{s3://<default-bucket>/<training-job-name>/profiler-output}}', # Optional. Path to an S3 bucket to save profiling outputs
    # Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    'ProfilingIntervalInMilliseconds': {{500}}, 
    'ProfilingParameters': {
        # Each profiling config value is a JSON string; triple quotes allow it to span lines
        'DataloaderProfilingConfig': '''{
            "StartStep": {{5}},
            "NumSteps": {{3}},
            "MetricsRegex": ".*"
        }''',
        'DetailedProfilingConfig': '''{
            "StartStep": {{5}},
            "NumSteps": {{3}}
        }''',
        # ProfilerName options: cprofile, pyinstrument
        # Include cProfileTimer only when using cprofile. Available options: cpu, off_cpu, total_time
        'PythonProfilingConfig': '''{
            "StartStep": {{5}},
            "NumSteps": {{3}},
            "ProfilerName": "{{cprofile}}",
            "cProfileTimer": "{{total_time}}"
        }''',
        'LocalPath': '/opt/ml/output/profiler/' # Optional. Local path for profiling outputs
    }
}
```

------
#### [ Target Time Duration ]

```
ProfilerConfig={ 
    'S3OutputPath': '{{s3://<default-bucket>/<training-job-name>/profiler-output}}', # Optional. Path to an S3 bucket to save profiling outputs
    # Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    'ProfilingIntervalInMilliseconds': {{500}},
    'ProfilingParameters': {
        # Each profiling config value is a JSON string; triple quotes allow it to span lines
        'DataloaderProfilingConfig': '''{
            "StartTimeInSecSinceEpoch": {{12345567789}},
            "DurationInSeconds": {{10}},
            "MetricsRegex": ".*"
        }''',
        'DetailedProfilingConfig': '''{
            "StartTimeInSecSinceEpoch": {{12345567789}},
            "DurationInSeconds": {{10}}
        }''',
        # ProfilerName options: cprofile, pyinstrument
        # Include cProfileTimer only when using cprofile. Available options: cpu, off_cpu, total_time
        'PythonProfilingConfig': '''{
            "StartTimeInSecSinceEpoch": {{12345567789}},
            "DurationInSeconds": {{10}},
            "ProfilerName": "{{cprofile}}",
            "cProfileTimer": "{{total_time}}"
        }''',
        'LocalPath': '/opt/ml/output/profiler/' # Optional. Local path for profiling outputs
    }
}
```

------
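
The values under `ProfilingParameters` are JSON documents encoded as strings, and hand-written JSON is easy to get wrong. The following is a minimal sketch that generates the step-based parameters with `json.dumps` instead. The function name `build_profiling_parameters` and its defaults are hypothetical; the keys mirror the ones used in the examples on this page.

```python
import json


def build_profiling_parameters(start_step, num_steps,
                               profiler_name='cprofile', cprofile_timer='total_time'):
    """Build the ProfilingParameters map for step-based profiling.

    Hypothetical helper for illustration; each config is serialized to a JSON string.
    """
    window = {'StartStep': start_step, 'NumSteps': num_steps}
    python_config = dict(window, ProfilerName=profiler_name)
    if profiler_name == 'cprofile':
        # cProfileTimer applies only to the cprofile profiler
        python_config['cProfileTimer'] = cprofile_timer
    return {
        'DataloaderProfilingConfig': json.dumps(dict(window, MetricsRegex='.*')),
        'DetailedProfilingConfig': json.dumps(window),
        'PythonProfilingConfig': json.dumps(python_config),
        'LocalPath': '/opt/ml/output/profiler/',
    }


profiling_parameters = build_profiling_parameters(start_step=5, num_steps=3)
print(profiling_parameters['PythonProfilingConfig'])
```

The returned dictionary can be assigned to the `'ProfilingParameters'` key of `ProfilerConfig`.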

**To enable Debugger rules for profiling the metrics**

The following example code shows how to configure the `ProfilerReport` rule.

```
ProfilerRuleConfigurations=[ 
    {
        'RuleConfigurationName': 'ProfilerReport',
        'RuleEvaluatorImage': '{{895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest}}',
        'RuleParameters': {
            'rule_to_invoke': 'ProfilerReport',
            'CPUBottleneck_cpu_threshold': '{{90}}',
            'IOBottleneck_threshold': '{{90}}'
        }
    }
]
```

To find the complete list of Docker images available for Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules). To find the key-value pairs for `RuleParameters`, see [List of Debugger built-in rules](debugger-built-in-rules.md).

## Update Debugger profiling configuration using the `UpdateTrainingJob` API operation
<a name="debugger-updatetrainingjob-api.Boto3"></a>

You can update the Debugger profiling configuration while a training job is running by using the [`update_training_job()`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.update_training_job) function of the Boto3 SageMaker AI client. Configure new [ProfilerConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_ProfilerConfig.html) and [ProfilerRuleConfiguration](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_ProfilerRuleConfiguration.html) objects, and specify the training job name in the `TrainingJobName` parameter.

```
ProfilerConfig={ 
    'DisableProfiler': {{boolean}},
    'ProfilingIntervalInMilliseconds': {{number}},
    'ProfilingParameters': { 
        '{{string}}' : '{{string}}' 
    }
},
ProfilerRuleConfigurations=[ 
    { 
        'RuleConfigurationName': '{{string}}',
        'RuleEvaluatorImage': '{{string}}',
        'RuleParameters': { 
            '{{string}}' : '{{string}}' 
        }
    }
],
TrainingJobName='{{your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS}}'
```
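
As one concrete use of this request shape, the following is a minimal sketch that turns profiling off for a running job by sending `DisableProfiler` set to `True`. The wrapper `disable_profiling` and the `_RecordingClient` stand-in are hypothetical; the stand-in only records the call so that the sketch runs without AWS credentials, and its return value is made up.

```python
def disable_profiling(sagemaker_client, training_job_name):
    """Ask SageMaker to stop profiling a running training job."""
    return sagemaker_client.update_training_job(
        TrainingJobName=training_job_name,
        ProfilerConfig={'DisableProfiler': True},
    )


class _RecordingClient:
    """Stand-in for a Boto3 SageMaker client that records the request."""

    def __init__(self):
        self.calls = []

    def update_training_job(self, **kwargs):
        self.calls.append(kwargs)
        return {'TrainingJobArn': 'example-arn'}  # made-up value for the sketch


stub = _RecordingClient()
disable_profiling(stub, 'example-training-job')
print(stub.calls[0])
```

With a real client, the call would be `disable_profiling(boto3.client('sagemaker'), job_name)`.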

## Add Debugger custom rule configuration to the CreateTrainingJob API operation
<a name="debugger-custom-rules-api.Boto3"></a>

You can configure a custom rule for a training job by using the [DebugHookConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_DebugHookConfig.html) and [DebugRuleConfiguration](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_DebugRuleConfiguration.html) objects with the [`create_training_job()`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_training_job) function of the Boto3 SageMaker AI client. The following code example shows how to use this SageMaker API operation to configure a custom `ImproperActivation` rule written with the *smdebug* library. This example assumes that you have written the custom rule in a *custom\_rules.py* file and uploaded it to an Amazon S3 bucket. Pre-built Docker images are available for running custom rules; they are listed in [Amazon SageMaker Debugger image URIs for custom rule evaluators](debugger-reference.md#debuger-custom-rule-registry-ids). Specify the registry URL of the pre-built Docker image in the `RuleEvaluatorImage` parameter.

```
DebugHookConfig={
    'S3OutputPath': '{{s3://<default-bucket>/<training-job-name>/debug-output}}',
    'CollectionConfigurations': [
        {
            'CollectionName': '{{relu_activations}}',
            'CollectionParameters': {
                'include_regex': '{{relu}}',
                'save_interval': '{{500}}',
                'end_step': '{{5000}}'
            }
        }
    ]
},
DebugRuleConfigurations=[
    {
        'RuleConfigurationName': '{{improper_activation_job}}',
        'RuleEvaluatorImage': '{{552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest}}',
        'InstanceType': '{{ml.c4.xlarge}}',
        'VolumeSizeInGB': {{400}},
        'RuleParameters': {
           'source_s3_uri': '{{s3://bucket/custom_rules.py}}',
           'rule_to_invoke': '{{ImproperActivation}}',
           'collection_names': '{{relu_activations}}'
        }
    }
]
```

To find the complete list of Docker images available for Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules). To find the key-value pairs for `RuleParameters`, see [List of Debugger built-in rules](debugger-built-in-rules.md).