JSON (Amazon CLI) - Amazon SageMaker

JSON (Amazon CLI)

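You can configure Amazon SageMaker Debugger built-in rules for a training job by using the DebugHookConfig, DebugRuleConfiguration, ProfilerConfig, and ProfilerRuleConfiguration objects through the SageMaker CreateTrainingJob API operation. You need to specify the correct image URI in the RuleEvaluatorImage parameter. The following examples walk you through how to set up the JSON strings for a CreateTrainingJob request.

The following code shows a complete JSON template for running a training job with the required settings and Debugger configurations. Save the template as a JSON file in your working directory, and then run the training job using the Amazon CLI. For example, save the following code as debugger-training-job-cli.json.

Note

Make sure that you use the correct Docker container images. To find Amazon Deep Learning Containers images, see Available Deep Learning Containers Images. To find a complete list of available Docker images for using the Debugger rules, see Use Debugger Docker Images for Built-In or Custom Rules.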

In this template, TrainingImage takes the URI of a training Docker container image (a Deep Learning Container or your own training container).

{
    "TrainingJobName": "debugger-aws-cli-test",
    "RoleArn": "arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-YYYYMMDDT123456",
    "AlgorithmSpecification": {
        "TrainingImage": "763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04",
        "TrainingInputMode": "File",
        "EnableSageMakerMetricsTimeSeries": false
    },
    "HyperParameters": {
        "sagemaker_program": "entry_point/tf-hvd-train.py",
        "sagemaker_submit_directory": "s3://sagemaker-us-west-2-111122223333/debugger-boto3-profiling-test/source.tar.gz"
    },
    "OutputDataConfig": {
        "S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/output"
    },
    "DebugHookConfig": {
        "S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/debug-output",
        "CollectionConfigurations": [
            {
                "CollectionName": "losses",
                "CollectionParameters": {
                    "train.save_interval": "50"
                }
            }
        ]
    },
    "DebugRuleConfigurations": [
        {
            "RuleConfigurationName": "LossNotDecreasing",
            "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
            "RuleParameters": {"rule_to_invoke": "LossNotDecreasing"}
        }
    ],
    "ProfilerConfig": {
        "S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/profiler-output",
        "ProfilingIntervalInMilliseconds": 500,
        "ProfilingParameters": {
            "DataloaderProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, \"MetricsRegex\": \".*\"}",
            "DetailedProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3}",
            "PythonProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, \"ProfilerName\": \"cprofile\", \"cProfileTimer\": \"total_time\"}",
            "LocalPath": "/opt/ml/output/profiler/"
        }
    },
    "ProfilerRuleConfigurations": [
        {
            "RuleConfigurationName": "ProfilerReport",
            "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
            "RuleParameters": {"rule_to_invoke": "ProfilerReport"}
        }
    ],
    "ResourceConfig": {
        "InstanceType": "ml.p3.8xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 30
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 86400
    }
}

After saving the JSON file, run the following command in your terminal. (If you use a Jupyter notebook, put ! at the beginning of the line.)

aws sagemaker create-training-job --cli-input-json file://debugger-training-job-cli.json
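Because several ProfilingParameters values are JSON documents stored inside JSON strings, a small pre-flight check can catch escaping mistakes before the request is sent. The following Python sketch is illustrative only: the function name check_request is hypothetical, and it assumes that both the request file and the embedded profiling configurations are meant to be strict JSON (no comments, no trailing commas).

```python
import json

# ProfilingParameters keys whose values are JSON documents encoded as strings.
# LocalPath is a plain path, so it is not listed here.
EMBEDDED_JSON_KEYS = (
    "DataloaderProfilingConfig",
    "DetailedProfilingConfig",
    "PythonProfilingConfig",
)


def check_request(text):
    """Return a list of problems found in a CreateTrainingJob request string."""
    try:
        request = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"request is not valid JSON: {err}"]

    problems = []
    params = request.get("ProfilerConfig", {}).get("ProfilingParameters", {})
    for key in EMBEDDED_JSON_KEYS:
        if key not in params:
            continue
        try:
            # The value must itself parse as JSON once the outer layer is decoded.
            json.loads(params[key])
        except (TypeError, json.JSONDecodeError) as err:
            problems.append(f"{key} is not a valid JSON string: {err}")
    return problems
```

To use it, read debugger-training-job-cli.json into a string, pass it to check_request, and run the CLI command only if the returned list is empty.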

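To configure a Debugger rule for debugging model parameters

The following code samples show how to configure the built-in VanishingGradient rule using this SageMaker API.

To enable Debugger to collect output tensors

Specify the Debugger hook configuration as follows: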

"DebugHookConfig": {
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/debug-output",
    "CollectionConfigurations": [
        {
            "CollectionName": "gradients",
            "CollectionParameters": {
                "save_interval": "500"
            }
        }
    ]
}

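This makes the training job save the gradients tensor collection every 500 steps, as set by save_interval. To find available CollectionName values, see Debugger Built-in Collections in the SMDebug client library documentation. To find available CollectionParameters keys and values, see the sagemaker.debugger.CollectionConfig class in the SageMaker Python SDK documentation.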
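The samples on this page write every CollectionParameters value as a string ("500", not 500). As a minimal sketch, assuming you assemble the request in Python before serializing it, the hypothetical helpers below build a DebugHookConfig and convert parameter values to strings so that the result matches that shape.

```python
import json


def collection(name, parameters=None):
    """Build one CollectionConfigurations entry with string-valued parameters."""
    entry = {"CollectionName": name}
    if parameters:
        # Mirror the samples, which quote every parameter value.
        entry["CollectionParameters"] = {
            key: str(value) for key, value in parameters.items()
        }
    return entry


def debug_hook_config(s3_output_path, collections):
    """Build a DebugHookConfig object from a list of collection entries."""
    return {
        "S3OutputPath": s3_output_path,
        "CollectionConfigurations": list(collections),
    }


hook = debug_hook_config(
    "s3://<default-bucket>/<training-job-name>/debug-output",
    [collection("gradients", {"save_interval": 500})],
)
print(json.dumps({"DebugHookConfig": hook}, indent=4))
```

Passing the parameters as a dictionary rather than as keyword arguments keeps keys such as train.save_interval usable, because they are not valid Python identifiers.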

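To enable Debugger rules for debugging the output tensors

The following DebugRuleConfigurations API example shows how to run the built-in VanishingGradient rule on the saved gradients collection.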

"DebugRuleConfigurations": [
    {
        "RuleConfigurationName": "VanishingGradient",
        "RuleEvaluatorImage": "503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rules:latest",
        "RuleParameters": {
            "rule_to_invoke": "VanishingGradient",
            "threshold": "20.0"
        }
    }
]

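With a configuration like the one in this sample, Debugger applies the VanishingGradient rule to the collected gradients tensors. To find a complete list of available Docker images for using the Debugger rules, see Use Debugger Docker Images for Built-In or Custom Rules. To find the key-value pairs for RuleParameters, see List of Debugger Built-in Rules.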
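The RuleEvaluatorImage URI depends on the Region, so it is easy to pair a rule with the wrong registry. The following sketch shows one way to keep the two together. The helper names are hypothetical, and the lookup table contains only the two account IDs that appear in this page's samples; for any other Region, take the value from the Docker image list referenced above rather than guessing.

```python
# Account IDs copied from the samples on this page. Other Regions use
# different accounts and are intentionally left out of this sketch.
RULE_IMAGE_ACCOUNTS = {
    "us-west-2": "895741380848",
    "us-east-1": "503895931360",
}


def builtin_rule_image(region):
    """Return the built-in rule image URI for a Region listed in the table."""
    account = RULE_IMAGE_ACCOUNTS[region]  # raises KeyError for unlisted Regions
    return f"{account}.dkr.ecr.{region}.amazonaws.com/sagemaker-debugger-rules:latest"


def builtin_rule(rule_name, region, **parameters):
    """Build one DebugRuleConfigurations entry with string-valued parameters."""
    rule_parameters = {"rule_to_invoke": rule_name}
    rule_parameters.update({key: str(value) for key, value in parameters.items()})
    return {
        "RuleConfigurationName": rule_name,
        "RuleEvaluatorImage": builtin_rule_image(region),
        "RuleParameters": rule_parameters,
    }


vanishing_gradient = builtin_rule("VanishingGradient", "us-east-1", threshold=20.0)
```

Failing with a KeyError for an unlisted Region is deliberate here: a missing entry is easier to diagnose than a training job that starts with an image URI pointing at the wrong registry.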

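To configure a Debugger built-in rule for profiling system and framework metrics

The following example code shows how to specify the ProfilerConfig object to enable collecting system and framework metrics.

To enable Debugger profiling to collect system and framework metrics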

Target Step

"ProfilerConfig": {
    // Optional. Path to an S3 bucket to save profiling outputs
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/profiler-output",
    // Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    "ProfilingIntervalInMilliseconds": 500,
    "ProfilingParameters": {
        "DataloaderProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3, \"MetricsRegex\": \".*\" }",
        "DetailedProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3 }",
        // For PythonProfilingConfig,
        // available ProfilerName options: cProfile, Pyinstrument
        // available cProfileTimer options only when using cProfile: cpu, off_cpu, total_time
        "PythonProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3, \"ProfilerName\": \"cProfile\", \"cProfileTimer\": \"total_time\" }",
        // Optional. Local path for profiling outputs
        "LocalPath": "/opt/ml/output/profiler/"
    }
}

Target Time Duration

"ProfilerConfig": {
    // Optional. Path to an S3 bucket to save profiling outputs
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/profiler-output",
    // Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    "ProfilingIntervalInMilliseconds": 500,
    "ProfilingParameters": {
        "DataloaderProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789, \"DurationInSeconds\": 10, \"MetricsRegex\": \".*\" }",
        "DetailedProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789, \"DurationInSeconds\": 10 }",
        // For PythonProfilingConfig,
        // available ProfilerName options: cProfile, Pyinstrument
        // available cProfileTimer options only when using cProfile: cpu, off_cpu, total_time
        "PythonProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789, \"DurationInSeconds\": 10, \"ProfilerName\": \"cProfile\", \"cProfileTimer\": \"total_time\" }",
        // Optional. Local path for profiling outputs
        "LocalPath": "/opt/ml/output/profiler/"
    }
}
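Hand-escaping the nested JSON strings in ProfilingParameters is error-prone. As an illustrative sketch (the function names are hypothetical, and the default profiler name and timer simply repeat the values shown in the samples), the following Python code builds either variant with json.dumps and rejects sampling intervals that are not in the list of available values given in the comments above.

```python
import json
import time

# Available ProfilingIntervalInMilliseconds values, as listed in the samples.
ALLOWED_INTERVALS_MS = (100, 200, 500, 1000, 5000, 60000)


def target_step(start_step, num_steps):
    """Profile num_steps training steps beginning at start_step."""
    return {"StartStep": start_step, "NumSteps": num_steps}


def target_duration(duration_seconds, start_time=None):
    """Profile for duration_seconds, starting now unless start_time is given."""
    start = int(time.time()) if start_time is None else int(start_time)
    return {"StartTimeInSecSinceEpoch": start, "DurationInSeconds": duration_seconds}


def profiler_config(s3_output_path, interval_ms, target,
                    profiler_name="cProfile", timer="total_time"):
    """Build a ProfilerConfig object whose nested configs are JSON strings."""
    if interval_ms not in ALLOWED_INTERVALS_MS:
        raise ValueError(f"interval must be one of {ALLOWED_INTERVALS_MS}")
    python_config = {**target, "ProfilerName": profiler_name, "cProfileTimer": timer}
    return {
        "S3OutputPath": s3_output_path,
        "ProfilingIntervalInMilliseconds": interval_ms,
        "ProfilingParameters": {
            "DataloaderProfilingConfig": json.dumps({**target, "MetricsRegex": ".*"}),
            "DetailedProfilingConfig": json.dumps(target),
            "PythonProfilingConfig": json.dumps(python_config),
            "LocalPath": "/opt/ml/output/profiler/",
        },
    }


by_step = profiler_config("s3://<default-bucket>/<training-job-name>/profiler-output",
                          500, target_step(5, 3))
by_time = profiler_config("s3://<default-bucket>/<training-job-name>/profiler-output",
                          500, target_duration(10))
```

Serializing either result with json.dumps produces the escaped inner strings automatically, so the output can be placed under the "ProfilerConfig" key of the request.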

To enable Debugger rules for profiling the metrics

The following example code shows how to configure the ProfilerReport rule.

"ProfilerRuleConfigurations": [
    {
        "RuleConfigurationName": "ProfilerReport",
        "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
        "RuleParameters": {
            "rule_to_invoke": "ProfilerReport",
            "CPUBottleneck_cpu_threshold": "90",
            "IOBottleneck_threshold": "90"
        }
    }
]

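To find a complete list of available Docker images for using the Debugger rules, see Use Debugger Docker Images for Built-In or Custom Rules. To find the key-value pairs for RuleParameters, see List of Debugger Built-in Rules.

Update Debugger Profiling Configuration Using the UpdateTrainingJob API Operation

You can update the Debugger profiling configuration while your training job is running by using the UpdateTrainingJob API operation. Configure new ProfilerConfig and ProfilerRuleConfiguration objects, and specify the training job name in the TrainingJobName parameter.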

{
    "ProfilerConfig": {
        "DisableProfiler": boolean,
        "ProfilingIntervalInMilliseconds": number,
        "ProfilingParameters": {
            "string" : "string"
        }
    },
    "ProfilerRuleConfigurations": [
        {
            "RuleConfigurationName": "string",
            "RuleEvaluatorImage": "string",
            "RuleParameters": {
                "string" : "string"
            }
        }
    ],
    "TrainingJobName": "your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS"
}
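The syntax above uses placeholders (boolean, number, "string") rather than real values. As a minimal sketch of a concrete request, the hypothetical helper below fills in only the fields you pass and leaves the rest out. The job name is the placeholder from the syntax above, and the one-second interval is just an example value.

```python
import json


def update_profiler_request(training_job_name, profiler_config=None,
                            rule_configurations=None):
    """Build an UpdateTrainingJob request body for a running training job."""
    request = {"TrainingJobName": training_job_name}
    if profiler_config is not None:
        request["ProfilerConfig"] = profiler_config
    if rule_configurations:
        request["ProfilerRuleConfigurations"] = list(rule_configurations)
    return request


# Example: switch the sampling interval of a running job to one second.
request = update_profiler_request(
    "your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS",
    profiler_config={"ProfilingIntervalInMilliseconds": 1000},
)
print(json.dumps(request, indent=4))
```

One way to send the result is to save the printed JSON to a file and pass it to aws sagemaker update-training-job with the same --cli-input-json file:// form used for create-training-job earlier on this page.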

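Add Debugger Custom Rule Configuration to the CreateTrainingJob API Operation

You can configure a custom rule for a training job by using the DebugHookConfig and DebugRuleConfiguration objects in the CreateTrainingJob API operation. The following code sample shows how to configure a custom ImproperActivation rule, written with the smdebug library, using this SageMaker API operation. This example assumes that you have written the custom rule in a custom_rules.py file and uploaded it to an Amazon S3 bucket. The example uses one of the pre-built Docker images that you can use to run custom rules. These images are listed at Amazon SageMaker Debugger Registry URLs for Custom Rule Evaluators. Specify the registry URL of the pre-built Docker image in the RuleEvaluatorImage parameter.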

"DebugHookConfig": {
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/debug-output",
    "CollectionConfigurations": [
        {
            "CollectionName": "relu_activations",
            "CollectionParameters": {
                "include_regex": "relu",
                "save_interval": "500",
                "end_step": "5000"
            }
        }
    ]
},
"DebugRuleConfigurations": [
    {
        "RuleConfigurationName": "improper_activation_job",
        "RuleEvaluatorImage": "552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest",
        "InstanceType": "ml.c4.xlarge",
        "VolumeSizeInGB": 400,
        "RuleParameters": {
            "source_s3_uri": "s3://bucket/custom_rules.py",
            "rule_to_invoke": "ImproperActivation",
            "collection_names": "relu_activations"
        }
    }
]

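To find a complete list of available Docker images for using the Debugger rules, see Use Debugger Docker Images for Built-In or Custom Rules. To find the key-value pairs for RuleParameters, see List of Debugger Built-in Rules.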
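In the custom rule sample, the rule's collection_names parameter refers to a collection that the hook must actually save (relu_activations). A quick consistency check can catch a mismatch between the two before the job starts. The following sketch is illustrative: the function name is hypothetical, and it assumes that collection_names holds one name or a comma-separated list of names.

```python
def missing_collections(debug_hook_config, rule_configurations):
    """Map each rule name to the collections it expects but the hook does not save."""
    saved = {
        entry["CollectionName"]
        for entry in debug_hook_config.get("CollectionConfigurations", [])
    }
    missing = {}
    for rule in rule_configurations:
        names = rule.get("RuleParameters", {}).get("collection_names", "")
        # Assumption: multiple names, if any, are separated by commas.
        wanted = {name.strip() for name in names.split(",") if name.strip()}
        absent = sorted(wanted - saved)
        if absent:
            missing[rule["RuleConfigurationName"]] = absent
    return missing


hook = {
    "CollectionConfigurations": [
        {"CollectionName": "relu_activations",
         "CollectionParameters": {"include_regex": "relu"}}
    ]
}
rules = [
    {"RuleConfigurationName": "improper_activation_job",
     "RuleParameters": {"rule_to_invoke": "ImproperActivation",
                        "collection_names": "relu_activations"}}
]
problems = missing_collections(hook, rules)
```

An empty result means that every collection named by a rule is configured in the hook. Rules without a collection_names parameter are skipped, because this sketch has no way to know which collections they read.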