Training - Amazon Deep Learning Containers

Training

This section shows how to run training on Amazon Deep Learning Containers for Amazon Elastic Container Service (Amazon ECS) using Apache MXNet (Incubating), PyTorch, TensorFlow, and TensorFlow 2.

For a complete list of Deep Learning Containers, see Deep Learning Containers Images.

Note

MKL users: read the Amazon Deep Learning Containers Intel Math Kernel Library (MKL) Recommendations to get the best training or inference performance.

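Important

If your account has already created the Amazon ECS service-linked role, that role is used by default for your service unless you specify a role here. The service-linked role is required if your task definition uses the awsvpc network mode or if the service is configured to use service discovery. The role is also required if the service uses an external deployment controller, multiple target groups, or Elastic Inference accelerators, in which case you should not specify a role here. For more information, see Using Service-Linked Roles for Amazon ECS in the Amazon ECS Developer Guide.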

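TensorFlow training

Before you can run a task on your Amazon ECS cluster, you must register a task definition. A task definition is a list of containers grouped together. The following example uses the sample Docker image that adds training scripts to Deep Learning Containers. You can use this script with either TensorFlow or TensorFlow 2. To use it with TensorFlow 2, change the Docker image to a TensorFlow 2 image.

  1. Create a file named ecs-deep-learning-container-training-taskdef.json with the following contents.

    • For CPU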

      { "requiresCompatibilities": [ "EC2" ], "containerDefinitions": [{ "command": [ "mkdir -p /test && cd /test && git clone https://github.com/fchollet/keras.git && chmod +x -R /test/ && python keras/examples/mnist_cnn.py" ], "entryPoint": [ "sh", "-c" ], "name": "tensorflow-training-container", "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:1.15.2-cpu-py36-ubuntu18.04", "memory": 4000, "cpu": 256, "essential": true, "portMappings": [{ "containerPort": 80, "protocol": "tcp" }], "logConfiguration": { "logDriver": "awslogs", "options": { "awslogs-group": "awslogs-tf-ecs", "awslogs-region": "us-east-1", "awslogs-stream-prefix": "tf", "awslogs-create-group": "true" } } }], "volumes": [], "networkMode": "bridge", "placementConstraints": [], "family": "TensorFlow" }
    • For GPU

      { "requiresCompatibilities": [ "EC2" ], "containerDefinitions": [ { "command": [ "mkdir -p /test && cd /test && git clone https://github.com/fchollet/keras.git && chmod +x -R /test/ && python keras/examples/mnist_cnn.py" ], "entryPoint": [ "sh", "-c" ], "name": "tensorflow-training-container", "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:1.15.2-gpu-py37-cu100-ubuntu18.04", "memory": 6111, "cpu": 256, "resourceRequirements" : [{ "type" : "GPU", "value" : "1" }], "essential": true, "portMappings": [ { "containerPort": 80, "protocol": "tcp" } ], "logConfiguration": { "logDriver": "awslogs", "options": { "awslogs-group": "awslogs-tf-ecs", "awslogs-region": "us-east-1", "awslogs-stream-prefix": "tf", "awslogs-create-group": "true" } } } ], "volumes": [], "networkMode": "bridge", "placementConstraints": [], "family": "tensorflow-training" }
  2. Register the task definition. Note the revision number in the output and use it in the next step.

    aws ecs register-task-definition --cli-input-json file://ecs-deep-learning-container-training-taskdef.json
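  3. Create a task using the task definition. You need the revision number from the previous step and the name of the cluster you created during setup. The --task-definition value has the form family:revision, so replace tf with the family value from the file you registered.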

    aws ecs run-task --cluster ecs-ec2-training-inference --task-definition tf:1
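  4. Open the Amazon ECS classic console at https://console.aws.amazon.com/ecs/.

  5. Select the ecs-ec2-training-inference cluster.

  6. On the Cluster page, choose Tasks.

  7. After your task is in a RUNNING status, choose the task identifier.

  8. Under Logs, choose View logs in CloudWatch. This takes you to the CloudWatch console to view the training progress logs.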
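
A minimal sketch of deriving the `--task-definition` argument for step 3 from the JSON that `aws ecs register-task-definition` prints, rather than typing the family and revision by hand. The function name `task_definition_ref` and the trimmed sample output are illustrative assumptions, not part of the AWS CLI.

```python
import json


def task_definition_ref(register_output: str) -> str:
    """Return 'family:revision' from register-task-definition JSON output."""
    task_def = json.loads(register_output)["taskDefinition"]
    return f"{task_def['family']}:{task_def['revision']}"


# Hypothetical, heavily trimmed output; a real response has many more fields.
sample_output = json.dumps(
    {"taskDefinition": {"family": "tensorflow-training", "revision": 1}}
)

print(task_definition_ref(sample_output))
```

The printed string can be passed as the value of `--task-definition` when you call `aws ecs run-task`.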

Next steps

To learn inference on Amazon ECS using TensorFlow with Deep Learning Containers, see TensorFlow inference.

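Apache MXNet (Incubating) training

Before you can run a task on your Amazon ECS cluster, you must register a task definition. A task definition is a list of containers grouped together. The following example uses the sample Docker image that adds training scripts to Deep Learning Containers.

  1. Create a file named ecs-deep-learning-container-training-taskdef.json with the following contents.

    • For CPU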

      { "requiresCompatibilities":[ "EC2" ], "containerDefinitions":[ { "command":[ "git clone -b 1.4 https://github.com/apache/incubator-mxnet.git && python /incubator-mxnet/example/image-classification/train_mnist.py" ], "entryPoint":[ "sh", "-c" ], "name":"mxnet-training", "image":"763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.6.0-cpu-py36-ubuntu16.04", "memory":4000, "cpu":256, "essential":true, "portMappings":[ { "containerPort":80, "protocol":"tcp" } ], "logConfiguration":{ "logDriver":"awslogs", "options":{ "awslogs-group":"/ecs/mxnet-training-cpu", "awslogs-region":"us-east-1", "awslogs-stream-prefix":"mnist", "awslogs-create-group":"true" } } } ], "volumes":[ ], "networkMode":"bridge", "placementConstraints":[ ], "family":"mxnet" }
    • For GPU

      { "requiresCompatibilities":[ "EC2" ], "containerDefinitions":[ { "command":[ "git clone -b 1.4 https://github.com/apache/incubator-mxnet.git && python /incubator-mxnet/example/image-classification/train_mnist.py --gpus 0" ], "entryPoint":[ "sh", "-c" ], "name":"mxnet-training", "image":"763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.6.0-gpu-py36-cu101-ubuntu16.04", "memory":4000, "cpu":256, "resourceRequirements":[ { "type":"GPU", "value":"1" } ], "essential":true, "portMappings":[ { "containerPort":80, "protocol":"tcp" } ], "logConfiguration":{ "logDriver":"awslogs", "options":{ "awslogs-group":"/ecs/mxnet-training-gpu", "awslogs-region":"us-east-1", "awslogs-stream-prefix":"mnist", "awslogs-create-group":"true" } } } ], "volumes":[ ], "networkMode":"bridge", "placementConstraints":[ ], "family":"mxnet-training" }
  2. Register the task definition. Note the revision number in the output and use it in the next step.

    aws ecs register-task-definition --cli-input-json file://ecs-deep-learning-container-training-taskdef.json
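  3. Create a task using the task definition. You need the revision number from the previous step. The --task-definition value has the form family:revision, so replace mx with the family value from the file you registered.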

    aws ecs run-task --cluster ecs-ec2-training-inference --task-definition mx:1
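  4. Open the console at https://console.aws.amazon.com/ecs/v2.

  5. Select the ecs-ec2-training-inference cluster.

  6. On the Cluster page, choose Tasks.

  7. After your task is in a RUNNING status, choose the task identifier.

  8. Under Logs, choose View logs in CloudWatch. This takes you to the CloudWatch console to view the training progress logs.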
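
The CPU and GPU definitions in step 1 differ in only a few fields: the image tag, the `--gpus 0` flag on the training command, the `resourceRequirements` entry, and the log group and family names. The sketch below generates a definition from those inputs using only the Python standard library. `build_training_taskdef` and its parameters are hypothetical names chosen for illustration; the values are copied from the GPU example in step 1.

```python
import json


def build_training_taskdef(family, image, command, log_group, gpus=0):
    """Build an EC2-launch-type task definition with one training container."""
    container = {
        "command": [command],
        "entryPoint": ["sh", "-c"],
        "name": "mxnet-training",
        "image": image,
        "memory": 4000,
        "cpu": 256,
        "essential": True,
        "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": log_group,
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "mnist",
                "awslogs-create-group": "true",
            },
        },
    }
    if gpus > 0:
        # The GPU count is written as a string, as in the GPU example file.
        container["resourceRequirements"] = [{"type": "GPU", "value": str(gpus)}]
    return {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [container],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": family,
    }


TRAIN = ("git clone -b 1.4 https://github.com/apache/incubator-mxnet.git && "
         "python /incubator-mxnet/example/image-classification/train_mnist.py")
REGISTRY = "763104351884.dkr.ecr.us-east-1.amazonaws.com"

gpu_taskdef = build_training_taskdef(
    family="mxnet-training",
    image=f"{REGISTRY}/mxnet-training:1.6.0-gpu-py36-cu101-ubuntu16.04",
    command=TRAIN + " --gpus 0",
    log_group="/ecs/mxnet-training-gpu",
    gpus=1,
)
print(json.dumps(gpu_taskdef, indent=2))
```

Writing the result to ecs-deep-learning-container-training-taskdef.json with `json.dump` gives a file you can register as in step 2.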

Next steps

To learn inference on Amazon ECS using MXNet with Deep Learning Containers, see Apache MXNet (Incubating) inference.

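PyTorch training

Before you can run a task on your Amazon ECS cluster, you must register a task definition. A task definition is a list of containers grouped together. The following example uses the sample Docker image that adds training scripts to Deep Learning Containers.

  1. Create a file named ecs-deep-learning-container-training-taskdef.json with the following contents.

    • For CPU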

      { "requiresCompatibilities":[ "EC2" ], "containerDefinitions":[ { "command":[ "git clone https://github.com/pytorch/examples.git && python examples/mnist/main.py --no-cuda" ], "entryPoint":[ "sh", "-c" ], "name":"pytorch-training-container", "image":"763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.1-cpu-py36-ubuntu16.04", "memory":4000, "cpu":256, "essential":true, "portMappings":[ { "containerPort":80, "protocol":"tcp" } ], "logConfiguration":{ "logDriver":"awslogs", "options":{ "awslogs-group":"/ecs/pytorch-training-cpu", "awslogs-region":"us-east-1", "awslogs-stream-prefix":"mnist", "awslogs-create-group":"true" } } } ], "volumes":[ ], "networkMode":"bridge", "placementConstraints":[ ], "family":"pytorch" }
    • For GPU

      { "requiresCompatibilities": [ "EC2" ], "containerDefinitions": [ { "command": [ "git clone https://github.com/pytorch/examples.git && python examples/mnist/main.py" ], "entryPoint": [ "sh", "-c" ], "name": "pytorch-training-container", "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.1-gpu-py36-cu101-ubuntu16.04", "memory": 6111, "cpu": 256, "resourceRequirements" : [{ "type" : "GPU", "value" : "1" }], "essential": true, "portMappings": [ { "containerPort": 80, "protocol": "tcp" } ], "logConfiguration": { "logDriver": "awslogs", "options": { "awslogs-group": "/ecs/pytorch-training-gpu", "awslogs-region": "us-east-1", "awslogs-stream-prefix": "mnist", "awslogs-create-group": "true" } } } ], "volumes": [], "networkMode": "bridge", "placementConstraints": [], "family": "pytorch-training" }
  2. Register the task definition. Note the revision number in the output and use it in the next step.

    aws ecs register-task-definition --cli-input-json file://ecs-deep-learning-container-training-taskdef.json
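  3. Create a task using the task definition. You need the revision identifier from the previous step. The --task-definition value has the form family:revision; pytorch matches the CPU file, and the GPU file uses the family pytorch-training.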

    aws ecs run-task --cluster ecs-ec2-training-inference --task-definition pytorch:1
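  4. Open the console at https://console.aws.amazon.com/ecs/v2.

  5. Select the ecs-ec2-training-inference cluster.

  6. On the Cluster page, choose Tasks.

  7. After your task is in a RUNNING status, choose the task identifier.

  8. Under Logs, choose View logs in CloudWatch. This takes you to the CloudWatch console to view the training progress logs.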
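
If you would rather find the log stream by name than click through the console, note that the awslogs driver names each stream `prefix/container-name/task-id` when `awslogs-stream-prefix` is set. The sketch below derives the log group and stream from a task definition and a task ARN. The helper name `log_location` and the ARN are made up for illustration.

```python
def log_location(taskdef: dict, task_arn: str, container_index: int = 0):
    """Return (log_group, log_stream) for one container of a running task."""
    container = taskdef["containerDefinitions"][container_index]
    options = container["logConfiguration"]["options"]
    # The task ID is the last path segment of the task ARN.
    task_id = task_arn.rsplit("/", 1)[-1]
    stream = f"{options['awslogs-stream-prefix']}/{container['name']}/{task_id}"
    return options["awslogs-group"], stream


# Only the fields the helper reads, taken from the CPU example in step 1.
taskdef = {
    "containerDefinitions": [{
        "name": "pytorch-training-container",
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": "/ecs/pytorch-training-cpu",
                "awslogs-stream-prefix": "mnist",
            },
        },
    }]
}
# Hypothetical task ARN; use the one returned by `aws ecs run-task`.
arn = ("arn:aws:ecs:us-east-1:111122223333:task/"
       "ecs-ec2-training-inference/0123456789abcdef")

group, stream = log_location(taskdef, arn)
print(group, stream)
```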

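Amazon S3 Plugin for PyTorch

Deep Learning Containers include a plugin that enables you to use data from an Amazon S3 bucket for PyTorch training.

  1. To begin using the Amazon S3 plugin in Amazon ECS, set your AWS_REGION environment variable to the Region of your choice.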

    export AWS_REGION=us-east-1
  2. Create a file named ecs-deep-learning-container-pytorch-s3-plugin-taskdef.json with the following contents.

    • For CPU

      { "requiresCompatibilities":[ "EC2" ], "containerDefinitions":[ { "command":[ "git clone https://github.com/aws/amazon-s3-plugin-for-pytorch.git && python amazon-s3-plugin-for-pytorch/examples/s3_imagenet_example.py" ], "entryPoint":[ "sh", "-c" ], "name":"pytorch-s3-plugin-container", "image":"763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-cpu-py36-ubuntu18.04-v1.6", "memory":4000, "cpu":256, "essential":true, "portMappings":[ { "containerPort":80, "protocol":"tcp" } ], "logConfiguration":{ "logDriver":"awslogs", "options":{ "awslogs-group":"/ecs/pytorch-s3-plugin-cpu", "awslogs-region":"us-east-1", "awslogs-stream-prefix":"imagenet", "awslogs-create-group":"true" } } } ], "volumes":[ ], "networkMode":"bridge", "placementConstraints":[ ], "family":"pytorch-s3-plugin" }
    • For GPU

      { "requiresCompatibilities": [ "EC2" ], "containerDefinitions": [ { "command": [ "git clone https://github.com/aws/amazon-s3-plugin-for-pytorch.git && python amazon-s3-plugin-for-pytorch/examples/s3_imagenet_example.py" ], "entryPoint": [ "sh", "-c" ], "name": "pytorch-s3-plugin-container", "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04-v1.7", "memory": 6111, "cpu": 256, "resourceRequirements" : [{ "type" : "GPU", "value" : "1" }], "essential": true, "portMappings": [ { "containerPort": 80, "protocol": "tcp" } ], "logConfiguration": { "logDriver": "awslogs", "options": { "awslogs-group": "/ecs/pytorch-s3-plugin-gpu", "awslogs-region": "us-east-1", "awslogs-stream-prefix": "imagenet", "awslogs-create-group": "true" } } } ], "volumes": [], "networkMode": "bridge", "placementConstraints": [], "family": "pytorch-s3-plugin" }
  3. Register the task definition. Note the revision number in the output and use it in the next step.

    aws ecs register-task-definition --cli-input-json file://ecs-deep-learning-container-pytorch-s3-plugin-taskdef.json
  4. Create a task using the task definition. You need the revision identifier from the previous step.

    aws ecs run-task --cluster ecs-pytorch-s3-plugin --task-definition pytorch-s3-plugin:1
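  5. Open the console at https://console.aws.amazon.com/ecs/v2.

  6. Select the ecs-pytorch-s3-plugin cluster.

  7. On the Cluster page, choose Tasks.

  8. After your task is in a RUNNING status, choose the task identifier.

  9. Under Logs, choose View logs in CloudWatch. This takes you to the CloudWatch console to view the Amazon S3 plugin example logs.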
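
Step 1 exports `AWS_REGION` in the shell that runs the AWS CLI. If the plugin also needs that variable inside the container (an assumption; the steps above do not say), one way to supply it is the `environment` list of the container definition. The sketch below adds it to a loaded task definition. `with_env` is a hypothetical helper, not part of any AWS tooling.

```python
import copy


def with_env(taskdef: dict, name: str, value: str) -> dict:
    """Return a copy of taskdef with name=value set on every container."""
    updated = copy.deepcopy(taskdef)
    for container in updated["containerDefinitions"]:
        # Drop any existing entry with the same name so the new value wins.
        env = [e for e in container.get("environment", []) if e["name"] != name]
        env.append({"name": name, "value": value})
        container["environment"] = env
    return updated


# Trimmed stand-in for the task definition file created in step 2.
taskdef = {
    "family": "pytorch-s3-plugin",
    "containerDefinitions": [{"name": "pytorch-s3-plugin-container"}],
}
patched = with_env(taskdef, "AWS_REGION", "us-east-1")
print(patched["containerDefinitions"][0]["environment"])
```

Because the helper works on a copy, the definition you loaded stays unchanged and you can write the patched version to a separate file before registering it.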

For more information and additional examples, see the Amazon S3 Plugin for PyTorch repository.

Next steps

To learn inference on Amazon ECS using PyTorch with Deep Learning Containers, see PyTorch inference.