
Using inference Inf1 instances on Amazon ECS

You can register Amazon EC2 Inf1 instances to your clusters for machine learning inference workloads. Amazon EC2 Inf1 instances are powered by Amazon Inferentia chips, which are custom built by Amazon Web Services to deliver high performance and the lowest cost for inference in the cloud. Machine learning models are deployed to containers using Amazon Neuron, a specialized SDK consisting of a compiler, runtime, and profiling tools that optimize the machine learning inference performance of Inferentia chips. Amazon Neuron supports popular machine learning frameworks such as TensorFlow, PyTorch, and Apache MXNet (Incubating).

Considerations

Before you begin deploying Neuron on Amazon ECS, consider the following:

  • Your clusters can contain a mix of Inf1 and non-Inf1 instances.

  • We recommend that you place only one task with an Inferentia resource requirement per Inf1 instance.

  • When creating a service or running a standalone task, you can use instance type attributes when you configure task placement constraints. This ensures that the task is launched on the container instance that you specify. Doing so can help you optimize overall resource utilization and ensure that tasks for inference workloads are on your Inf1 instances. For more information, see Amazon ECS task placement.

    In the following example, a task is run on an inf1.xlarge instance on your default cluster.

    aws ecs run-task \
        --cluster default \
        --task-definition ecs-inference-task-def \
        --placement-constraints type=memberOf,expression="attribute:ecs.instance-type == inf1.xlarge"
  • Inferentia resource requirements can't be defined in a task definition. However, you can configure a container to use specific Inferentia devices available on the host container instance. You can do so by using the linuxParameters parameter and specifying the device details, as shown in the snippet following this list. For more information, see Task definition requirements.
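
    For example, the following linuxParameters fragment, taken from the complete task definition later in this topic, maps the /dev/neuron0 device from the host into the container with read and write permissions and adds the IPC_LOCK capability:

    "linuxParameters": {
        "devices": [
            {
                "containerPath": "/dev/neuron0",
                "hostPath": "/dev/neuron0",
                "permissions": ["read", "write"]
            }
        ],
        "capabilities": {
            "add": ["IPC_LOCK"]
        }
    }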

Using the Amazon ECS-optimized Amazon Linux 2 (Inferentia) AMI

Amazon ECS provides an Amazon ECS-optimized AMI that's based on Amazon Linux 2 for Inferentia workloads. It's pre-configured with Amazon Inferentia drivers and the Amazon Neuron runtime for Docker, which makes it easier to run machine learning inference workloads on Amazon ECS.

We recommend using the Amazon ECS-optimized Amazon Linux 2 (Inferentia) AMI when launching your Amazon EC2 Inf1 instances. You can retrieve the current Amazon ECS-optimized Amazon Linux 2 (Inferentia) AMI using the Amazon CLI with the following command.

aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2/inf/recommended
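
If you only want the AMI ID itself, the other Amazon ECS-optimized AMI parameters expose an image_id sub-parameter that you can combine with a JMESPath query; the sub-parameter path shown here is an assumption based on that pattern.

# The /image_id sub-parameter path is an assumption based on the other Amazon ECS-optimized AMI parameters
aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2/inf/recommended/image_id \
    --query "Parameters[0].Value" --output text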

The following table provides a link to retrieve the current Amazon ECS-optimized Amazon Linux 2 (Inferentia) AMI IDs by Region.

Task definition requirements

To deploy Neuron on Amazon ECS, your task definition must contain the container definition for a pre-built container that serves the inference model for TensorFlow and is provided by Amazon Deep Learning Containers. This container contains the Amazon Neuron runtime and the TensorFlow Serving application. At startup, the container fetches your model from Amazon S3, launches Neuron TensorFlow Serving with the saved model, and waits for prediction requests. In the following example, the container image has TensorFlow 1.15 and Ubuntu 18.04. A complete list of pre-built Deep Learning Containers optimized for Neuron is maintained on GitHub. For more information, see Neuron Inference Containers.

763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference-neuron:1.15.4-neuron-py37-ubuntu18.04
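
The image is hosted in an Amazon ECR registry, so the task execution role must be able to pull from Amazon ECR. If you want to inspect the image locally first, a minimal sketch of authenticating and pulling it with the Amazon CLI and Docker, assuming the us-east-1 registry shown above, looks like the following.

# Authenticate Docker to the Deep Learning Containers registry in us-east-1
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
# Pull the Neuron TensorFlow inference image referenced in the task definition
docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference-neuron:1.15.4-neuron-py37-ubuntu18.04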

Alternatively, you can build your own Neuron sidecar container image. For more information, see Tutorial: Neuron TensorFlow Serving on GitHub.

The following is an example Linux containers on Amazon EC2 task definition, displaying the syntax to use.

{ "family": "ecs-neuron", "executionRoleArn": "${YOUR_EXECUTION_ROLE}", "containerDefinitions": [ { "entryPoint": [ "/usr/local/bin/entrypoint.sh", "--port=8500", "--rest_api_port=9000", "--model_name=resnet50_neuron", "--model_base_path=s3://your-bucket-of-models/resnet50_neuron/" ], "portMappings": [ { "hostPort": 8500, "protocol": "tcp", "containerPort": 8500 }, { "hostPort": 8501, "protocol": "tcp", "containerPort": 8501 }, { "hostPort": 0, "protocol": "tcp", "containerPort": 80 } ], "linuxParameters": { "devices": [ { "containerPath": "/dev/neuron0", "hostPath": "/dev/neuron0", "permissions": [ "read", "write" ] } ], "capabilities": { "add": [ "IPC_LOCK" ] } }, "cpu": 0, "memoryReservation": 1000, "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference-neuron:1.15.4-neuron-py37-ubuntu18.04", "essential": true, "name": "resnet50" } ] }