Training
Note
This section shows how to run training on AWS Deep Learning Containers for Amazon Elastic Container Service (Amazon ECS) using PyTorch and TensorFlow.
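Important
If your account has already created the Amazon ECS service-linked role, that role is used by default for your service unless you specify a role here. The service-linked role is required if your task definition uses the awsvpc network mode or if the service is configured to use service discovery. The role is also required if the service uses an external deployment controller, multiple target groups, or Elastic Inference accelerators, in which case you should not specify a role here. For more information, see Using service-linked roles for Amazon ECS in the Amazon ECS Developer Guide.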
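As a rough illustration of the note above, the following shell sketch checks whether the Amazon ECS service-linked role already exists in the account. It assumes the AWS CLI is installed and configured with credentials that can read IAM roles, and that the role carries its standard name, AWSServiceRoleForECS. The function name is hypothetical and not part of any AWS tooling.

```shell
#!/bin/sh
# Sketch only: assumes the AWS CLI is installed and configured.

# Succeeds (exit status 0) if the ECS service-linked role exists in the account.
# The function name is hypothetical; AWSServiceRoleForECS is the standard role name.
ecs_service_linked_role_exists() {
    aws iam get-role --role-name AWSServiceRoleForECS >/dev/null 2>&1
}

# Example usage (not executed here):
#   if ecs_service_linked_role_exists; then
#       echo "service-linked role found"
#   else
#       # One way to create it when it is missing:
#       aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com
#   fi
```

The training tasks in this section use the bridge network mode, so this check mainly matters if you later adapt the definitions to awsvpc networking or to a service with service discovery.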
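PyTorch training
Before you can run a task on an Amazon ECS cluster, you must register a task definition. Task definitions are lists of containers grouped together. The following example uses a sample Docker image that adds training scripts to Deep Learning Containers.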
- Create a file named ecs-deep-learning-container-training-taskdef.json with the following contents.
  - For CPU

        {
          "requiresCompatibilities": ["EC2"],
          "containerDefinitions": [
            {
              "command": [
                "git clone https://github.com/pytorch/examples.git && python examples/mnist/main.py --no-cuda"
              ],
              "entryPoint": ["sh", "-c"],
              "name": "pytorch-training-container",
              "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.1-cpu-py36-ubuntu16.04",
              "memory": 4000,
              "cpu": 256,
              "essential": true,
              "portMappings": [
                { "containerPort": 80, "protocol": "tcp" }
              ],
              "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                  "awslogs-group": "/ecs/pytorch-training-cpu",
                  "awslogs-region": "us-east-1",
                  "awslogs-stream-prefix": "mnist",
                  "awslogs-create-group": "true"
                }
              }
            }
          ],
          "volumes": [],
          "networkMode": "bridge",
          "placementConstraints": [],
          "family": "pytorch"
        }

  - For GPU

        {
          "requiresCompatibilities": ["EC2"],
          "containerDefinitions": [
            {
              "command": [
                "git clone https://github.com/pytorch/examples.git && python examples/mnist/main.py"
              ],
              "entryPoint": ["sh", "-c"],
              "name": "pytorch-training-container",
              "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.1-gpu-py36-cu101-ubuntu16.04",
              "memory": 6111,
              "cpu": 256,
              "resourceRequirements": [
                { "type": "GPU", "value": "1" }
              ],
              "essential": true,
              "portMappings": [
                { "containerPort": 80, "protocol": "tcp" }
              ],
              "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                  "awslogs-group": "/ecs/pytorch-training-gpu",
                  "awslogs-region": "us-east-1",
                  "awslogs-stream-prefix": "mnist",
                  "awslogs-create-group": "true"
                }
              }
            }
          ],
          "volumes": [],
          "networkMode": "bridge",
          "placementConstraints": [],
          "family": "pytorch-training"
        }
- Register the task definition. Note the revision number in the output and use it in the next step.

      aws ecs register-task-definition --cli-input-json file://ecs-deep-learning-container-training-taskdef.json
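- Create a task using the task definition. You need the revision number from the previous step. The task definition is referenced as family:revision, so pytorch:1 matches the first revision of the CPU definition; for the GPU definition, use the pytorch-training family with its revision number.

      aws ecs run-task --cluster ecs-ec2-training-inference --task-definition pytorch:1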
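- Open the console at https://console.aws.amazon.com/ecs/v2.
- Select the ecs-ec2-training-inference cluster.
- On the Cluster page, choose Tasks.
- Once your task is in a RUNNING state, choose the task identifier.
- Under Logs, choose View logs in CloudWatch. This takes you to the CloudWatch console to view the training progress logs.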
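The register and run steps above can also be scripted so that the revision number does not have to be copied by hand. The following is a minimal sketch, assuming the AWS CLI is installed and configured; the function names are hypothetical, and the file and cluster names in the usage comment are the ones used in the steps above. It relies on the CLI's standard --query and --output options to pull the family, revision, and task ARN out of the responses.

```shell
#!/bin/sh
# Sketch only: assumes the AWS CLI is installed and configured.
# The function names below are hypothetical helpers, not AWS commands.

# Register a task definition file and print its reference as "family:revision".
register_taskdef() {
    # $1: path to the task definition JSON file
    aws ecs register-task-definition \
        --cli-input-json "file://$1" \
        --query 'taskDefinition.[family,revision]' \
        --output text | tr '\t' ':'
}

# Start one task from a "family:revision" reference and print the task ARN.
run_training_task() {
    # $1: cluster name
    # $2: task definition reference, such as pytorch:1
    aws ecs run-task \
        --cluster "$1" \
        --task-definition "$2" \
        --query 'tasks[0].taskArn' \
        --output text
}

# Example usage (not executed here):
#   ref=$(register_taskdef ecs-deep-learning-container-training-taskdef.json)
#   run_training_task ecs-ec2-training-inference "$ref"
```

Capturing the reference from the registration output avoids mismatches between the family name in the JSON file and the name passed to run-task.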
Next steps
To learn inference on Amazon ECS using PyTorch with Deep Learning Containers, see PyTorch inference.
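TensorFlow training
Before you can run a task on an Amazon ECS cluster, you must register a task definition. Task definitions are lists of containers grouped together. The following example uses a sample Docker image that adds training scripts to Deep Learning Containers. You can use this script with either TensorFlow or TensorFlow 2. To use it with TensorFlow 2, change the Docker image to a TensorFlow 2 image.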
- Create a file named ecs-deep-learning-container-training-taskdef.json with the following contents.
  - For CPU

        {
          "requiresCompatibilities": ["EC2"],
          "containerDefinitions": [
            {
              "command": [
                "mkdir -p /test && cd /test && git clone https://github.com/fchollet/keras.git && chmod +x -R /test/ && python keras/examples/mnist_cnn.py"
              ],
              "entryPoint": ["sh", "-c"],
              "name": "tensorflow-training-container",
              "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:1.15.2-cpu-py36-ubuntu18.04",
              "memory": 4000,
              "cpu": 256,
              "essential": true,
              "portMappings": [
                { "containerPort": 80, "protocol": "tcp" }
              ],
              "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                  "awslogs-group": "awslogs-tf-ecs",
                  "awslogs-region": "us-east-1",
                  "awslogs-stream-prefix": "tf",
                  "awslogs-create-group": "true"
                }
              }
            }
          ],
          "volumes": [],
          "networkMode": "bridge",
          "placementConstraints": [],
          "family": "TensorFlow"
        }

  - For GPU

        {
          "requiresCompatibilities": ["EC2"],
          "containerDefinitions": [
            {
              "command": [
                "mkdir -p /test && cd /test && git clone https://github.com/fchollet/keras.git && chmod +x -R /test/ && python keras/examples/mnist_cnn.py"
              ],
              "entryPoint": ["sh", "-c"],
              "name": "tensorflow-training-container",
              "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:1.15.2-gpu-py37-cu100-ubuntu18.04",
              "memory": 6111,
              "cpu": 256,
              "resourceRequirements": [
                { "type": "GPU", "value": "1" }
              ],
              "essential": true,
              "portMappings": [
                { "containerPort": 80, "protocol": "tcp" }
              ],
              "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                  "awslogs-group": "awslogs-tf-ecs",
                  "awslogs-region": "us-east-1",
                  "awslogs-stream-prefix": "tf",
                  "awslogs-create-group": "true"
                }
              }
            }
          ],
          "volumes": [],
          "networkMode": "bridge",
          "placementConstraints": [],
          "family": "tensorflow-training"
        }
- Register the task definition. Note the revision number in the output and use it in the next step.

      aws ecs register-task-definition --cli-input-json file://ecs-deep-learning-container-training-taskdef.json
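- Create a task using the task definition. You need the revision number from the previous step and the name of the cluster that you created during setup. Replace tf:1 with the family and revision reported when you registered the task definition, for example TensorFlow:1 for the first revision of the CPU definition.

      aws ecs run-task --cluster ecs-ec2-training-inference --task-definition tf:1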
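- Open the Amazon ECS classic console at https://console.aws.amazon.com/ecs/.
- Select the ecs-ec2-training-inference cluster.
- On the Cluster page, choose Tasks.
- Once your task is in a RUNNING state, choose the task identifier.
- Under Logs, choose View logs in CloudWatch. This takes you to the CloudWatch console to view the training progress logs.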
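The console steps above have a command-line counterpart, sketched below under a few assumptions: the AWS CLI is installed and configured, the task ARN comes from the run-task output, and the log group is the awslogs-tf-ecs group named in the task definitions. The function names are hypothetical. Note that aws logs tail is only available in AWS CLI version 2, and that the tasks-running waiter returns an error if the task stops before it reaches the RUNNING state.

```shell
#!/bin/sh
# Sketch only: assumes the AWS CLI is installed and configured.
# The function names below are hypothetical helpers, not AWS commands.

# Print the last reported status of a task (for example PENDING, RUNNING, or STOPPED).
task_status() {
    # $1: cluster name
    # $2: task ARN or task ID
    aws ecs describe-tasks \
        --cluster "$1" \
        --tasks "$2" \
        --query 'tasks[0].lastStatus' \
        --output text
}

# Wait until the task is running, then stream the training logs from CloudWatch Logs.
follow_training_logs() {
    # $1: cluster name
    # $2: task ARN or task ID
    # $3: CloudWatch Logs log group name
    aws ecs wait tasks-running --cluster "$1" --tasks "$2" &&
        aws logs tail "$3" --follow   # requires AWS CLI version 2
}

# Example usage (not executed here; TASK_ARN is whatever run-task returned):
#   task_status ecs-ec2-training-inference "$TASK_ARN"
#   follow_training_logs ecs-ec2-training-inference "$TASK_ARN" awslogs-tf-ecs
```

Streaming the log group from a terminal shows the same training progress output that the CloudWatch console displays.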
Next steps
To learn inference on Amazon ECS using TensorFlow with Deep Learning Containers, see TensorFlow inference.