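Deep Learning AMI
Developer Guide
AWS services or features described in AWS documentation might vary by Region. To see the differences that apply to the China Regions, see Getting Started with AWS services in China.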

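Training

This section shows how to run training on AWS Deep Learning Containers on Amazon ECS using MXNet and TensorFlow.

For a complete list of AWS Deep Learning Containers, see Deep Learning Containers Images.

Note

MKL users: read the AWS Deep Learning Containers MKL recommendations to get the best training or inference performance.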

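TensorFlow training

Before you can run a task on an ECS cluster, you must register a task definition. A task definition is a list of containers grouped together. The following example uses a sample Docker image that adds training scripts to AWS Deep Learning Containers.

  1. Create a file named ecs-deep-learning-container-training-taskdef.json with the following contents.

    • For CPU: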

      {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
          {
            "command": [
              "mkdir -p /test && cd /test && git clone https://github.com/fchollet/keras.git && chmod +x -R /test/ && python keras/examples/mnist_cnn.py"
            ],
            "entryPoint": ["sh", "-c"],
            "name": "tensorflow-training-container",
            "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:1.13-cpu-py36-ubuntu16.04",
            "memory": 4000,
            "cpu": 256,
            "essential": true,
            "portMappings": [
              { "containerPort": 80, "protocol": "tcp" }
            ],
            "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                "awslogs-group": "awslogs-tf-ecs",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "tf",
                "awslogs-create-group": "true"
              }
            }
          }
        ],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "TensorFlow"
      }
    • For GPU:

      {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
          {
            "command": [
              "mkdir -p /test && cd /test && git clone https://github.com/fchollet/keras.git && chmod +x -R /test/ && python keras/examples/mnist_cnn.py"
            ],
            "entryPoint": ["sh", "-c"],
            "name": "tensorflow-training-container",
            "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:1.13-horovod-gpu-py36-cu100-ubuntu16.04",
            "memory": 6111,
            "cpu": 256,
            "resourceRequirements": [
              { "type": "GPU", "value": "1" }
            ],
            "essential": true,
            "portMappings": [
              { "containerPort": 80, "protocol": "tcp" }
            ],
            "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                "awslogs-group": "awslogs-tf-ecs",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "tf",
                "awslogs-create-group": "true"
              }
            }
          }
        ],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "tensorflow-training"
      }
  2. Register the task definition. Note the revision number in the output.

    aws ecs register-task-definition --cli-input-json file://ecs-deep-learning-container-training-taskdef.json
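  3. Create a task using the task definition. Specify the task definition as family:revision, using the family value from your JSON file and the revision number from the previous step. The following command runs revision 1 of the GPU example, whose family is tensorflow-training. For the CPU example, whose family is TensorFlow, use TensorFlow:1 instead.

    aws ecs run-task --cluster ecs-ec2-training-inference --task-definition tensorflow-training:1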
  4. Open the Amazon ECS console at https://console.amazonaws.cn/ecs/.

  5. Choose the ecs-ec2-training-inference cluster.

  6. On the Cluster page, choose Tasks.

  7. After your task is in the RUNNING state, choose the task ID.

  8. Under Containers, expand the container details.

  9. Under Log Configuration, choose View logs in CloudWatch. This takes you to the CloudWatch console, where you can view the training progress logs.
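
The CPU and GPU task definitions in step 1 of this procedure differ only in the image tag, the memory value, the family name, and the GPU resourceRequirements entry. As a minimal sketch of that difference, the following Python script builds either variant from a single function and writes it to the file name used in step 1. The helper names build_task_definition and write_task_definition are hypothetical and are not part of any AWS tool; all values are copied from the JSON in step 1.

```python
import json

# Values copied from the task definitions in step 1.
REGISTRY = "763104351884.dkr.ecr.us-east-1.amazonaws.com"
TRAIN_COMMAND = (
    "mkdir -p /test && cd /test && "
    "git clone https://github.com/fchollet/keras.git && "
    "chmod +x -R /test/ && python keras/examples/mnist_cnn.py"
)


def build_task_definition(use_gpu: bool) -> dict:
    """Return the TensorFlow training task definition (hypothetical helper)."""
    tag = (
        "1.13-horovod-gpu-py36-cu100-ubuntu16.04"
        if use_gpu
        else "1.13-cpu-py36-ubuntu16.04"
    )
    container = {
        "command": [TRAIN_COMMAND],
        "entryPoint": ["sh", "-c"],
        "name": "tensorflow-training-container",
        "image": f"{REGISTRY}/tensorflow-training:{tag}",
        "memory": 6111 if use_gpu else 4000,
        "cpu": 256,
        "essential": True,
        "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": "awslogs-tf-ecs",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "tf",
                "awslogs-create-group": "true",
            },
        },
    }
    if use_gpu:
        # Only the GPU variant reserves a GPU for the container.
        container["resourceRequirements"] = [{"type": "GPU", "value": "1"}]
    return {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [container],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "tensorflow-training" if use_gpu else "TensorFlow",
    }


def write_task_definition(path: str, use_gpu: bool) -> None:
    """Write the chosen variant as JSON for `aws ecs register-task-definition`."""
    with open(path, "w") as f:
        json.dump(build_task_definition(use_gpu), f, indent=2)
```

For example, calling write_task_definition("ecs-deep-learning-container-training-taskdef.json", use_gpu=True) produces a file with the same keys and values as the GPU JSON in step 1, which can then be registered as in step 2.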

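MXNet training

Before you can run a task on an ECS cluster, you must register a task definition. A task definition is a list of containers grouped together. The following example uses a sample Docker image that adds training scripts to AWS Deep Learning Containers.

  1. Create a file named ecs-deep-learning-container-training-taskdef.json with the following contents.

    • For CPU: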

      {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
          {
            "command": [
              "git clone -b 1.4 https://github.com/apache/incubator-mxnet.git && python /incubator-mxnet/example/image-classification/train_mnist.py"
            ],
            "entryPoint": ["sh", "-c"],
            "name": "mxnet-training",
            "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.4.0-cpu-py36-ubuntu16.04",
            "memory": 4000,
            "cpu": 256,
            "essential": true,
            "portMappings": [
              { "containerPort": 80, "protocol": "tcp" }
            ],
            "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                "awslogs-group": "/ecs/mxnet-training-cpu",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "mnist",
                "awslogs-create-group": "true"
              }
            }
          }
        ],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "mxnet"
      }
    • For GPU:

      {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
          {
            "command": [
              "git clone -b 1.4 https://github.com/apache/incubator-mxnet.git && python /incubator-mxnet/example/image-classification/train_mnist.py --gpus 0"
            ],
            "entryPoint": ["sh", "-c"],
            "name": "mxnet-training",
            "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.4.0-gpu-py36-cu90-ubuntu16.04",
            "memory": 4000,
            "cpu": 256,
            "resourceRequirements": [
              { "type": "GPU", "value": "1" }
            ],
            "essential": true,
            "portMappings": [
              { "containerPort": 80, "protocol": "tcp" }
            ],
            "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                "awslogs-group": "/ecs/mxnet-training-gpu",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "mnist",
                "awslogs-create-group": "true"
              }
            }
          }
        ],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "mxnet-training"
      }
  2. Register the task definition. Note the revision number in the output.

    aws ecs register-task-definition --cli-input-json file://ecs-deep-learning-container-training-taskdef.json
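  3. Create a task using the task definition. Specify the task definition as family:revision, using the family value from your JSON file and the revision number from the previous step. The following command runs revision 1 of the GPU example, whose family is mxnet-training. For the CPU example, whose family is mxnet, use mxnet:1 instead.

    aws ecs run-task --cluster ecs-ec2-training-inference --task-definition mxnet-training:1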
  4. Open the Amazon ECS console at https://console.amazonaws.cn/ecs/.

  5. Choose the ecs-ec2-training-inference cluster.

  6. On the Cluster page, choose Tasks.

  7. After your task is in the RUNNING state, choose the task ID.

  8. Under Containers, expand the container details.

  9. Under Log Configuration, choose View logs in CloudWatch. This takes you to the CloudWatch console, where you can view the training progress logs.
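
Steps 2 and 3 are linked by the family name and revision number that the register command returns. As a rough sketch, assuming you have saved the JSON output of aws ecs register-task-definition, the following Python function reads the family and revision fields from the taskDefinition object and assembles the matching run-task command. The function name run_task_args is hypothetical, and the sample output is abbreviated and illustrative (real output contains many more fields, and the revision value 3 is invented).

```python
import json


def run_task_args(register_output: str, cluster: str) -> list:
    """Build `aws ecs run-task` arguments from register-task-definition output."""
    task_def = json.loads(register_output)["taskDefinition"]
    # ECS addresses a specific revision of a task definition as family:revision.
    task_def_id = f"{task_def['family']}:{task_def['revision']}"
    return [
        "aws", "ecs", "run-task",
        "--cluster", cluster,
        "--task-definition", task_def_id,
    ]


# Abbreviated, illustrative output; the revision number is made up.
sample_output = '{"taskDefinition": {"family": "mxnet-training", "revision": 3}}'

print(" ".join(run_task_args(sample_output, "ecs-ec2-training-inference")))
# prints: aws ecs run-task --cluster ecs-ec2-training-inference --task-definition mxnet-training:3
```

Building the identifier from the command output rather than typing it by hand avoids running a stale revision after you re-register an edited task definition.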