Training - AWS Deep Learning Containers
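Training

This section shows how to run training on AWS Deep Learning Containers for Amazon Elastic Container Service (Amazon ECS) using MXNet, PyTorch, TensorFlow, and TensorFlow 2.

For a complete list of Deep Learning Containers, see Deep Learning Containers Images.

Note

MKL users: Read the AWS Deep Learning Containers Intel Math Kernel Library (MKL) Recommendations to get the best training or inference performance.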

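Important

If your account has already created the Amazon ECS service-linked role, that role is used by default for your service unless you specify a role here. The service-linked role is required if your task definition uses the awsvpc network mode or if the service is configured to use service discovery. The role is also required if the service uses an external deployment controller, multiple target groups, or Elastic Inference accelerators, in which case you should not specify a role here. For more information, see Using Service-Linked Roles for Amazon ECS in the Amazon ECS Developer Guide.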

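TensorFlow training

Before you can run a task on your Amazon ECS cluster, you must register a task definition. A task definition is a series of containers grouped together. The following example uses a sample Docker image that adds training scripts to Deep Learning Containers. You can use this script with either TensorFlow or TensorFlow 2. To use it with TensorFlow 2, change the Docker image to a TensorFlow 2 image.

  1. Create a file named ecs-deep-learning-container-training-taskdef.json with the following contents.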

    • For CPU

      {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
          {
            "command": [
              "mkdir -p /test && cd /test && git clone https://github.com/fchollet/keras.git && chmod +x -R /test/ && python keras/examples/mnist_cnn.py"
            ],
            "entryPoint": ["sh", "-c"],
            "name": "tensorflow-training-container",
            "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:1.15.0-cpu-py36-ubuntu18.04",
            "memory": 4000,
            "cpu": 256,
            "essential": true,
            "portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
            "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                "awslogs-group": "awslogs-tf-ecs",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "tf",
                "awslogs-create-group": "true"
              }
            }
          }
        ],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "TensorFlow"
      }

    • For GPU

      {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
          {
            "command": [
              "mkdir -p /test && cd /test && git clone https://github.com/fchollet/keras.git && chmod +x -R /test/ && python keras/examples/mnist_cnn.py"
            ],
            "entryPoint": ["sh", "-c"],
            "name": "tensorflow-training-container",
            "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:1.15.0-gpu-py36-cu100-ubuntu18.04",
            "memory": 6111,
            "cpu": 256,
            "resourceRequirements": [{ "type": "GPU", "value": "1" }],
            "essential": true,
            "portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
            "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                "awslogs-group": "awslogs-tf-ecs",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "tf",
                "awslogs-create-group": "true"
              }
            }
          }
        ],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "tensorflow-training"
      }

  2. Register the task definition. Note the revision number in the output; you need it in the next step.

    aws ecs register-task-definition --cli-input-json file://ecs-deep-learning-container-training-taskdef.json
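  3. Create a task using the task definition. The value passed to --task-definition has the form family:revision, where the family is the "family" value in the file you registered and the revision is the number from the previous step. The command below uses tensorflow-training, the family in the GPU definition; for the CPU definition, use TensorFlow instead.

    aws ecs run-task --cluster ecs-ec2-training-inference --task-definition tensorflow-training:1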
  4. Open the Amazon ECS console at https://console.amazonaws.cn/ecs/.

  5. Choose the ecs-ec2-training-inference cluster.

  6. On the Cluster page, choose Tasks.

  7. After your task is in a RUNNING state, choose the task identifier.

  8. Under Containers, expand the container details.

  9. Under Log Configuration, choose View logs in CloudWatch. This takes you to the CloudWatch console, where you can view the training progress logs.
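
Steps 2 and 3 hinge on carrying the family and revision from the register-task-definition output into the run-task command. The following is a minimal sketch of that hand-off, assuming the CLI output was captured as a JSON string. The helper name task_definition_ref and the abbreviated sample output are illustrative assumptions, not part of the AWS CLI.

```python
import json


def task_definition_ref(register_output: str) -> str:
    """Build the 'family:revision' string from register-task-definition JSON output."""
    task_def = json.loads(register_output)["taskDefinition"]
    return f"{task_def['family']}:{task_def['revision']}"


# Abbreviated, hypothetical output for the GPU definition; real output has more fields.
sample_output = json.dumps(
    {
        "taskDefinition": {
            "family": "tensorflow-training",
            "revision": 1,
            "status": "ACTIVE",
        }
    }
)

ref = task_definition_ref(sample_output)
# Print the run-task command that step 3 would use for this revision.
print(f"aws ecs run-task --cluster ecs-ec2-training-inference --task-definition {ref}")
```

Reading the family from the output rather than typing it by hand avoids a mismatch between the "family" field in the JSON file and the name given to run-task.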

Next steps

To learn about inference on Amazon ECS using TensorFlow with Deep Learning Containers, see TensorFlow inference.

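MXNet training

Before you can run a task on your Amazon Elastic Container Service cluster, you must register a task definition. A task definition is a series of containers grouped together. The following example uses a sample Docker image that adds training scripts to Deep Learning Containers.

  1. Create a file named ecs-deep-learning-container-training-taskdef.json with the following contents.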

    • For CPU

      {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
          {
            "command": [
              "git clone -b 1.4 https://github.com/apache/incubator-mxnet.git && python /incubator-mxnet/example/image-classification/train_mnist.py"
            ],
            "entryPoint": ["sh", "-c"],
            "name": "mxnet-training",
            "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.6.0-cpu-py36-ubuntu16.04",
            "memory": 4000,
            "cpu": 256,
            "essential": true,
            "portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
            "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                "awslogs-group": "/ecs/mxnet-training-cpu",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "mnist",
                "awslogs-create-group": "true"
              }
            }
          }
        ],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "mxnet"
      }

    • For GPU

      {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
          {
            "command": [
              "git clone -b 1.4 https://github.com/apache/incubator-mxnet.git && python /incubator-mxnet/example/image-classification/train_mnist.py --gpus 0"
            ],
            "entryPoint": ["sh", "-c"],
            "name": "mxnet-training",
            "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.6.0-gpu-py36-cu101-ubuntu16.04",
            "memory": 4000,
            "cpu": 256,
            "resourceRequirements": [{ "type": "GPU", "value": "1" }],
            "essential": true,
            "portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
            "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                "awslogs-group": "/ecs/mxnet-training-gpu",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "mnist",
                "awslogs-create-group": "true"
              }
            }
          }
        ],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "mxnet-training"
      }

  2. Register the task definition. Note the revision number in the output; you need it in the next step.

    aws ecs register-task-definition --cli-input-json file://ecs-deep-learning-container-training-taskdef.json
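  3. Create a task using the task definition. The value passed to --task-definition has the form family:revision, where the family is the "family" value in the file you registered and the revision is the number from the previous step. The command below uses mxnet-training, the family in the GPU definition; for the CPU definition, use mxnet instead.

    aws ecs run-task --cluster ecs-ec2-training-inference --task-definition mxnet-training:1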
  4. Open the Amazon ECS console at https://console.amazonaws.cn/ecs/.

  5. Choose the ecs-ec2-training-inference cluster.

  6. On the Cluster page, choose Tasks.

  7. After your task is in a RUNNING state, choose the task identifier.

  8. Under Containers, expand the container details.

  9. Under Log Configuration, choose View logs in CloudWatch. This takes you to the CloudWatch console, where you can view the training progress logs.
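
The CPU and GPU task definitions in step 1 differ in only a few fields: the image tag, the --gpus 0 flag, the GPU resource requirement, the log group, and the family. The following sketch generates either variant from one function so those differences are visible in one place. The function name mxnet_task_definition and the constant names are illustrative assumptions; all field values are copied from the definitions in step 1.

```python
import json

REGISTRY = "763104351884.dkr.ecr.us-east-1.amazonaws.com"
TRAIN_CMD = (
    "git clone -b 1.4 https://github.com/apache/incubator-mxnet.git && "
    "python /incubator-mxnet/example/image-classification/train_mnist.py"
)


def mxnet_task_definition(use_gpu: bool) -> dict:
    """Return the CPU or GPU MXNet training task definition as a dict."""
    device = "gpu" if use_gpu else "cpu"
    tag = "1.6.0-gpu-py36-cu101-ubuntu16.04" if use_gpu else "1.6.0-cpu-py36-ubuntu16.04"
    container = {
        # The GPU variant appends --gpus 0 to the training command.
        "command": [TRAIN_CMD + (" --gpus 0" if use_gpu else "")],
        "entryPoint": ["sh", "-c"],
        "name": "mxnet-training",
        "image": f"{REGISTRY}/mxnet-training:{tag}",
        "memory": 4000,
        "cpu": 256,
        "essential": True,
        "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": f"/ecs/mxnet-training-{device}",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "mnist",
                "awslogs-create-group": "true",
            },
        },
    }
    if use_gpu:
        # Only the GPU variant requests a GPU from the container instance.
        container["resourceRequirements"] = [{"type": "GPU", "value": "1"}]
    return {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [container],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "mxnet-training" if use_gpu else "mxnet",
    }


# Print the GPU variant; redirect the output to the taskdef JSON file to use it.
print(json.dumps(mxnet_task_definition(use_gpu=True), indent=2))
```

Redirecting the printed JSON to ecs-deep-learning-container-training-taskdef.json yields a file that can be registered as in step 2.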

Next steps

To learn about inference on Amazon ECS using MXNet with Deep Learning Containers, see MXNet inference.

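PyTorch training

Before you can run a task on your Amazon ECS cluster, you must register a task definition. A task definition is a series of containers grouped together. The following example uses a sample Docker image that adds training scripts to Deep Learning Containers.

  1. Create a file named ecs-deep-learning-container-training-taskdef.json with the following contents.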

    • For CPU

      {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
          {
            "command": [
              "git clone https://github.com/pytorch/examples.git && python examples/mnist/main.py --no-cuda"
            ],
            "entryPoint": ["sh", "-c"],
            "name": "pytorch-training-container",
            "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.3.1-cpu-py36-ubuntu16.04",
            "memory": 4000,
            "cpu": 256,
            "essential": true,
            "portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
            "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                "awslogs-group": "/ecs/pytorch-training-cpu",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "mnist",
                "awslogs-create-group": "true"
              }
            }
          }
        ],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "pytorch"
      }

    • For GPU

      {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
          {
            "command": [
              "git clone https://github.com/pytorch/examples.git && python examples/mnist/main.py"
            ],
            "entryPoint": ["sh", "-c"],
            "name": "pytorch-training-container",
            "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.3.1-gpu-py36-cu101-ubuntu16.04",
            "memory": 6111,
            "cpu": 256,
            "resourceRequirements": [{ "type": "GPU", "value": "1" }],
            "essential": true,
            "portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
            "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                "awslogs-group": "/ecs/pytorch-training-gpu",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "mnist",
                "awslogs-create-group": "true"
              }
            }
          }
        ],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "pytorch-training"
      }

  2. Register the task definition. Note the revision number in the output; you need it in the next step.

    aws ecs register-task-definition --cli-input-json file://ecs-deep-learning-container-training-taskdef.json
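  3. Create a task using the task definition. The value passed to --task-definition has the form family:revision, where the family is the "family" value in the file you registered and the revision is the number from the previous step. The command below uses pytorch, the family in the CPU definition; for the GPU definition, use pytorch-training instead.

    aws ecs run-task --cluster ecs-ec2-training-inference --task-definition pytorch:1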
  4. Open the Amazon ECS console at https://console.amazonaws.cn/ecs/.

  5. Choose the ecs-ec2-training-inference cluster.

  6. On the Cluster page, choose Tasks.

  7. After your task is in a RUNNING state, choose the task identifier.

  8. Under Containers, expand the container details.

  9. Under Log Configuration, choose View logs in CloudWatch. This takes you to the CloudWatch console, where you can view the training progress logs.
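
If you prefer to locate the logs directly in CloudWatch rather than through the console link, it helps to know how the awslogs driver names the stream: when awslogs-stream-prefix is set, the stream name is prefix/container-name/task-id. The following sketch derives that name from values in the GPU task definition above. The helper name log_stream_name and the task ARN are hypothetical placeholders; substitute the ARN of your own task.

```python
def log_stream_name(prefix: str, container_name: str, task_arn: str) -> str:
    """Compose the awslogs stream name: prefix/container-name/task-id."""
    # The task ID is the last path segment of the task ARN.
    task_id = task_arn.rsplit("/", 1)[-1]
    return f"{prefix}/{container_name}/{task_id}"


# Hypothetical task ARN, for illustration only.
task_arn = (
    "arn:aws:ecs:us-east-1:111122223333:task/"
    "ecs-ec2-training-inference/0123456789abcdef0123456789abcdef"
)

stream = log_stream_name("mnist", "pytorch-training-container", task_arn)
print("Log group:  /ecs/pytorch-training-gpu")
print("Log stream:", stream)
```

The log group comes from the awslogs-group option and the prefix from awslogs-stream-prefix, so the same helper applies to the CPU definition with /ecs/pytorch-training-cpu as the group.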

Next steps

To learn about inference on Amazon ECS using PyTorch with Deep Learning Containers, see PyTorch inference.