Using deep learning DL1 instances on Amazon ECS - Amazon Elastic Container Service
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China.

Using deep learning DL1 instances on Amazon ECS

To use deep learning workloads on Amazon ECS, register Amazon EC2 DL1 instances to your clusters. Amazon EC2 DL1 instances are powered by Gaudi accelerators from Habana Labs (an Intel company). Use the Habana SynapseAI SDK to connect to the Habana Gaudi accelerators. The SDK supports the popular machine learning frameworks, TensorFlow and PyTorch.

Considerations

Before you begin deploying DL1 on Amazon ECS, consider the following:

  • Your clusters can contain a mix of DL1 and non-DL1 instances.

  • When creating a service or running a standalone task, you can use instance type attributes specifically when you configure task placement constraints to ensure that your task is launched on the container instance that you specify. Doing so ensures that your resources are used effectively and that your tasks for deep learning workloads are on your DL1 instances. For more information, see Amazon ECS task placement.

    The following example runs a task on a dl1.24xlarge instance on your default cluster.

    aws ecs run-task \ --cluster default \ --task-definition ecs-dl1-task-def \ --placement-constraints type=memberOf,expression="attribute:ecs.instance-type == dl1.24xlarge"

Using a DL1 AMI

You have three options for running an AMI on Amazon EC2 DL1 instances for Amazon ECS:

  • Amazon Web Services Marketplace AMIs that are provided by Habana here.

  • Habana Deep Learning AMIs that are provided by Amazon Web Services. Because it's not included, you need to install the Amazon ECS container agent separately.

  • Use Packer to build a custom AMI that's provided by the GitHub repo. For more information, see the Packer documentation.

Task definition requirements

To run Habana Gaudi accelerated deep learning containers on Amazon ECS, your task definition must contain the container definition for a pre-built container that serves the deep learning model for TensorFlow or PyTorch using Habana SynapseAI that's provided by Amazon Deep Learning Containers.

The following container image has TensorFlow 2.7.0 and Ubuntu 20.04. A complete list of pre-built Deep Learning Containers that's optimized for the Habana Gaudi accelerators is maintained on GitHub. For more information, see Habana Training Containers.

763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training-habana:2.7.0-hpu-py38-synapseai1.2.0-ubuntu20.04

The following is an example task definition for Linux containers on Amazon EC2, displaying the syntax to use. This example uses an image containing the Habana Labs System Management Interface Tool (HL-SMI) found here: vault.habana.ai/gaudi-docker/1.1.0/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.6.0:1.1.0-614

{ "family": "dl-test", "requiresCompatibilities": ["EC2"], "placementConstraints": [ { "type": "memberOf", "expression": "attribute:ecs.os-type == linux" }, { "type": "memberOf", "expression": "attribute:ecs.instance-type == dl1.24xlarge" } ], "networkMode": "host", "cpu": "10240", "memory": "1024", "containerDefinitions": [ { "entryPoint": [ "sh", "-c" ], "command": [ "hl-smi" ], "cpu": 8192, "environment": [ { "name": "HABANA_VISIBLE_DEVICES", "value": "all" } ], "image": "vault.habana.ai/gaudi-docker/1.1.0/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.6.0:1.1.0-614", "essential": true, "name": "tensorflow-installer-tf-hpu" } ] }