Manage Neuron devices on Amazon EKS - Amazon EKS

Manage Neuron devices on Amazon EKS

Amazon Trainium and Amazon Inferentia are purpose-built machine learning chips designed by Amazon. Amazon EKS supports two mechanisms for managing Neuron devices in EKS clusters: the Neuron DRA driver and the Neuron Kubernetes device plugin.

For new deployments on EKS clusters running Kubernetes version 1.34 or later, we recommend the Neuron DRA driver. It provides topology-aware allocation, connected device subset scheduling, Logical NeuronCore (LNC) configuration, and UltraServer multi-node allocation without requiring custom scheduler extensions. The Neuron device plugin remains supported.

Neuron DRA driver vs. Neuron device plugin

| Feature | Neuron DRA driver | Neuron device plugin |
|---|---|---|
| Minimum Kubernetes version | 1.34 | All EKS-supported Kubernetes versions |
| Karpenter and EKS Auto Mode | Not supported | Supported |
| EKS-optimized AMI support | AL2023 | AL2023, Bottlerocket |
| Device advertisement | Rich attributes via ResourceSlice objects including device ID, instance type, topology, driver version, and EFA locality | Integer count of aws.amazon.com/neuron and aws.amazon.com/neuroncore extended resources |
| Connected device subsets | Allocate subsets of 1, 4, 8, or 16 connected Neuron devices using topology constraints | Requires the Neuron scheduler extension for contiguous device allocation |
| LNC configuration | Per-workload Logical NeuronCore configuration (LNC=1 or LNC=2) through ResourceClaimTemplate parameters | Requires pre-configuration in EC2 launch templates |
| Attribute-based selection | Filter devices by instance type, driver version, and other attributes using CEL expressions | Not supported |

Install the Neuron DRA driver

The Neuron DRA driver advertises Neuron devices as ResourceSlice objects with the DeviceClass name neuron.aws.com. The driver runs as a DaemonSet and automatically discovers Neuron devices and their topology attributes.

Detailed information about the Neuron DRA driver is available in the Neuron DRA documentation.

Using the Neuron DRA driver with Bottlerocket is not currently supported.

Prerequisites

  • An Amazon EKS cluster running Kubernetes version 1.34 or later.

  • Nodes with Amazon Trainium or Inferentia2 instance types.

  • Helm installed in your command-line environment. For more information, see the Setup Helm instructions.

  • kubectl configured to communicate with your cluster. For more information, see Install or update kubectl.

Procedure

Important

Do not install the Neuron DRA driver on nodes where the Neuron device plugin is running. The two mechanisms cannot coexist on the same node. See upstream Kubernetes KEP-5004 for updates.

  1. Install the Neuron DRA driver using Helm.

    helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
      --namespace neuron-dra-driver \
      --create-namespace \
      --set "devicePlugin.enabled=false" \
      --set "npd.enabled=false" \
      --set "draDriver.enabled=true"

    The driver is deployed as a DaemonSet in the neuron-dra-driver namespace by default with the DeviceClass neuron.aws.com.

  2. Verify that the DRA driver DaemonSet is running.

    kubectl get ds -n neuron-dra-driver neuron-dra-driver-kubelet-plugin

    An example output is as follows.

    NAME                               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    neuron-dra-driver-kubelet-plugin   1         1         1       1            1           <none>          60s

  3. Verify that the DeviceClass was created.

    kubectl get deviceclass neuron.aws.com

    An example output is as follows.

    NAME             AGE
    neuron.aws.com   60s

  4. Verify that ResourceSlice objects are advertised for your nodes.

    kubectl get resourceslice

See the Neuron DRA documentation for information on the available ResourceSlice object attributes.

Request Neuron devices in a Pod

To request Neuron devices using the DRA driver, create a ResourceClaimTemplate that references the neuron.aws.com DeviceClass and reference it in your Pod specification.

The following example requests all Neuron devices on a trn2.48xlarge instance:

    apiVersion: resource.k8s.io/v1
    kind: ResourceClaimTemplate
    metadata:
      name: all-neurons
    spec:
      spec:
        devices:
          requests:
          - name: neurons
            exactly:
              deviceClassName: neuron.aws.com
              selectors:
              - cel:
                  expression: "device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge'"
              allocationMode: All
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: neuron-workload
    spec:
      containers:
      - name: app
        ...
        resources:
          claims:
          - name: neurons
      resourceClaims:
      - name: neurons
        resourceClaimTemplateName: all-neurons

Allocate connected device subsets

The Neuron DRA driver can allocate subsets of connected Neuron devices without requiring the Neuron scheduler extension. Supported subset sizes are 1, 4, 8, or 16 devices. Use the matchAttribute constraint with a topology group ID to ensure devices are connected.

The following example requests 4 connected Neuron devices:

    apiVersion: resource.k8s.io/v1
    kind: ResourceClaimTemplate
    metadata:
      name: 1x4-connected-neurons
    spec:
      spec:
        devices:
          requests:
          - name: neurons
            exactly:
              deviceClassName: neuron.aws.com
              allocationMode: ExactCount
              count: 4
              selectors:
              - cel:
                  expression: "device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge'"
          constraints:
          - requests: ["neurons"]
            matchAttribute: "resource.aws.com/devicegroup4_id"

The supported matchAttribute values for connected subsets are resource.aws.com/devicegroup1_id, resource.aws.com/devicegroup4_id, resource.aws.com/devicegroup8_id, and resource.aws.com/devicegroup16_id.
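Because the subset size appears in both the count and the matchAttribute name, the two are easy to get out of sync when editing a manifest by hand. The following is a minimal sketch of one way to keep them aligned. The function render_connected_claim is a hypothetical helper written for this illustration and is not part of the Neuron tooling. It assumes the same trn2.48xlarge selector and the naming pattern used in the example above.

```shell
#!/bin/sh
# Sketch only: render_connected_claim is a hypothetical helper, not a Neuron tool.
# It prints a ResourceClaimTemplate that requests N connected Neuron devices,
# deriving both `count` and the matchAttribute name from the same argument.
render_connected_claim() {
  size="$1"
  case "$size" in
    1|4|8|16) ;;  # subset sizes listed as supported in this guide
    *)
      echo "unsupported subset size: $size (expected 1, 4, 8, or 16)" >&2
      return 1
      ;;
  esac
  cat <<EOF
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: 1x${size}-connected-neurons
spec:
  spec:
    devices:
      requests:
      - name: neurons
        exactly:
          deviceClassName: neuron.aws.com
          allocationMode: ExactCount
          count: ${size}
          selectors:
          - cel:
              expression: "device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge'"
      constraints:
      - requests: ["neurons"]
        matchAttribute: "resource.aws.com/devicegroup${size}_id"
EOF
}

# Example usage (requires kubectl access to a cluster):
#   render_connected_claim 8 | kubectl apply -f -
```

Rejecting unsupported sizes before anything is applied avoids creating a claim that can never be satisfied.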

Configure Logical NeuronCores (LNC)

The Neuron DRA driver allows per-workload Logical NeuronCore configuration through ResourceClaimTemplate parameters. This eliminates the need to pre-configure LNC in EC2 Launch Templates.

The following example requests all Neuron devices with LNC set to 1:

    apiVersion: resource.k8s.io/v1
    kind: ResourceClaimTemplate
    metadata:
      name: all-neurons-lnc-1
    spec:
      spec:
        devices:
          requests:
          - name: neurons
            exactly:
              deviceClassName: neuron.aws.com
              selectors:
              - cel:
                  expression: "device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge'"
              allocationMode: All
          config:
          - requests: ["neurons"]
            opaque:
              driver: neuron.aws.com
              parameters:
                apiVersion: neuron.aws.com/v1
                kind: NeuronConfig
                logicalNeuronCore: 1
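Since LNC is set per workload, the same template can be produced for either supported value (LNC=1 or LNC=2). The sketch below parameterizes the example above. The function render_lnc_claim is a hypothetical name introduced for illustration, and the template name simply follows the all-neurons-lnc-1 pattern from the example.

```shell
#!/bin/sh
# Sketch only: render_lnc_claim is a hypothetical helper, not a Neuron tool.
# It prints the ResourceClaimTemplate from the example above with the
# logicalNeuronCore parameter set to the requested value.
render_lnc_claim() {
  lnc="$1"
  case "$lnc" in
    1|2) ;;  # LNC values this guide lists as configurable
    *)
      echo "unsupported LNC value: $lnc (expected 1 or 2)" >&2
      return 1
      ;;
  esac
  cat <<EOF
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: all-neurons-lnc-${lnc}
spec:
  spec:
    devices:
      requests:
      - name: neurons
        exactly:
          deviceClassName: neuron.aws.com
          selectors:
          - cel:
              expression: "device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge'"
          allocationMode: All
      config:
      - requests: ["neurons"]
        opaque:
          driver: neuron.aws.com
          parameters:
            apiVersion: neuron.aws.com/v1
            kind: NeuronConfig
            logicalNeuronCore: ${lnc}
EOF
}

# Example usage (requires kubectl access to a cluster):
#   render_lnc_claim 2 | kubectl apply -f -
```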

Install the Neuron Kubernetes device plugin

The Neuron Kubernetes device plugin advertises Neuron devices as aws.amazon.com/neuron and NeuronCores as aws.amazon.com/neuroncore extended resources. You request Neuron devices in container resource requests and limits.

Prerequisites

  • An Amazon EKS cluster.

  • Nodes with host-level components installed for Amazon Trainium or Amazon Inferentia instances. These are included if using the EKS AL2023 accelerated AMIs or the EKS Bottlerocket AMIs.

  • Helm installed in your command-line environment. For more information, see the Setup Helm instructions.

  • kubectl configured to communicate with your cluster. For more information, see Install or update kubectl.

Procedure

  1. Install the Neuron Kubernetes device plugin using Helm.

    helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
      --set "npd.enabled=false"

  2. Verify the Neuron device plugin DaemonSet is running.

    kubectl get ds -n kube-system neuron-device-plugin

    An example output is as follows.

    NAME                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    neuron-device-plugin   1         1         1       1            1           <none>          60s

  3. Verify that your nodes have allocatable Neuron devices.

    kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronDevice:.status.allocatable.aws\.amazon\.com/neuron,NeuronCore:.status.allocatable.aws\.amazon\.com/neuroncore"

    An example output is as follows.

    NAME                                           NeuronDevice   NeuronCore
    ip-192-168-47-173.us-west-2.compute.internal   1              2
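
On clusters with many nodes, it can help to list only the nodes that do not report allocatable Neuron devices. The following sketch filters the output of the command in step 3. The function nodes_without_neuron is a hypothetical helper for illustration. It assumes that kubectl prints <none> for a missing field, and it also treats a value of 0 as having no usable devices, which is a heuristic.

```shell
#!/bin/sh
# Sketch only: nodes_without_neuron is a hypothetical helper, not a Neuron tool.
# It reads the custom-columns table from step 3 on stdin, skips the header row,
# and prints the names of nodes whose NeuronDevice column is "<none>" or "0".
nodes_without_neuron() {
  awk 'NR > 1 && ($2 == "<none>" || $2 == "0") { print $1 }'
}

# Example usage (requires kubectl access to a cluster):
#   kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronDevice:.status.allocatable.aws\.amazon\.com/neuron,NeuronCore:.status.allocatable.aws\.amazon\.com/neuroncore" \
#     | nodes_without_neuron
```

An empty result suggests that every listed node advertises at least one Neuron device.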

Verify Neuron devices with a test Pod

You can verify that Neuron devices are accessible by running the neuron-ls tool in a test Pod.

  1. Create a file named neuron-ls.yaml with the following contents. This manifest launches a Neuron Monitor container that has the neuron-ls tool installed.

    apiVersion: v1
    kind: Pod
    metadata:
      name: neuron-ls
    spec:
      restartPolicy: Never
      containers:
      - name: neuron-container
        image: public.ecr.aws/g4h4h0b5/neuron-monitor:1.0.0
        command: ["/bin/sh"]
        args: ["-c", "neuron-ls"]
        resources:
          limits:
            aws.amazon.com/neuron: 1
      tolerations:
      - key: "aws.amazon.com/neuron"
        operator: "Exists"
        effect: "NoSchedule"

  2. Apply the manifest.

    kubectl apply -f neuron-ls.yaml
  3. After the Pod has finished running, view its logs.

    kubectl logs neuron-ls

    An example output is as follows.

    instance-type: inf2.xlarge
    instance-id: ...
    +--------+--------+--------+---------+
    | NEURON | NEURON | NEURON | PCI     |
    | DEVICE | CORES  | MEMORY | BDF     |
    +--------+--------+--------+---------+
    | 0      | 2      | 32 GB  | 00:1f.0 |
    +--------+--------+--------+---------+

Note

When using the Neuron device plugin, contiguous device allocation on instances with multiple Neuron devices (such as trn2.48xlarge) requires the Neuron Kubernetes scheduler extension. The Neuron DRA driver handles this automatically through topology constraints.

For more information about using Neuron devices with Amazon EKS, see the Neuron documentation for running on EKS.