Use Triton Inference Server with Amazon SageMaker

SageMaker enables customers to deploy a model using custom code with NVIDIA Triton Inference Server. This functionality is available through Triton Inference Server Containers, which include NVIDIA Triton Inference Server, support for common ML frameworks, and useful environment variables that let you optimize performance on SageMaker. For a list of all available Deep Learning Containers images, see Available Deep Learning Containers Images. Deep Learning Containers images are maintained and regularly updated with security patches.

You can use the Triton Inference Server Container with the SageMaker Python SDK as you would any other container in your SageMaker models. However, using the SageMaker Python SDK is optional; you can also use Triton Inference Server Containers with the Amazon CLI and the Amazon SDK for Python (Boto3).
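
For example, the following is a minimal sketch of deploying a Triton Inference Server Container with the SageMaker Python SDK. The container image URI, S3 location, model name, instance type, and the SAGEMAKER_TRITON_DEFAULT_MODEL_NAME environment variable are placeholders and assumptions rather than values from this page; look up the Triton image for your Region in the Available Deep Learning Containers Images list and confirm the supported environment variables for your image version.

  import sagemaker
  from sagemaker.model import Model

  session = sagemaker.Session()
  role = sagemaker.get_execution_role()

  # Placeholder: the Triton Deep Learning Container image URI for your
  # Region and version, taken from the Available Deep Learning Containers
  # Images list.
  triton_image_uri = "<account>.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:<tag>"

  # model.tar.gz must contain a Triton model repository
  # (<model-name>/config.pbtxt plus the versioned model artifacts).
  model = Model(
      image_uri=triton_image_uri,
      model_data="s3://<your-bucket>/triton/model.tar.gz",
      role=role,
      # Assumed container environment variable naming the default model;
      # confirm it against the documentation for your image version.
      env={"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "resnet50"},
      sagemaker_session=session,
  )

  predictor = model.deploy(
      initial_instance_count=1,
      instance_type="ml.g4dn.xlarge",
  )

The equivalent deployment without the SDK uses the CreateModel, CreateEndpointConfig, and CreateEndpoint APIs through the Amazon CLI or Boto3.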

For more information on NVIDIA Triton Inference Server, see the Triton documentation.

Inference

Note

The Triton Python backend uses shared memory (SHMEM) to connect your code to Triton. SageMaker Inference provides up to half of the instance memory as SHMEM, so you can use an instance with more memory for a larger SHMEM size.

For inference, you can use your trained ML models with Triton Inference Server to deploy an inference job with SageMaker.

Some of the key features of the Triton Inference Server Container are:

  • Support for multiple frameworks: Triton can be used to deploy models from all major ML frameworks. Triton supports TensorFlow GraphDef and SavedModel, ONNX, PyTorch TorchScript, TensorRT, and custom Python/C++ model formats.

  • Model pipelines: A Triton model ensemble represents a pipeline of one or more models, such as a model together with its pre- and post-processing logic, and the connections of input and output tensors between them. A single inference request to an ensemble triggers the execution of the entire pipeline.

  • Concurrent model execution: Multiple instances of the same model can run simultaneously on the same GPU or on multiple GPUs.

  • Dynamic batching: For models that support batching, Triton has multiple built-in scheduling and batching algorithms that combine individual inference requests to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference (see the configuration sketch after this list).

  • Diverse CPU and GPU support: The models can be executed on CPUs or GPUs for maximum flexibility and to support heterogeneous computing requirements.
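
As a concrete illustration of the model repository that the container serves, the following hedged sketch writes a minimal config.pbtxt that enables dynamic batching and runs two model instances concurrently on the GPU. The model name, tensor names, shapes, and batching values are hypothetical; adapt them to your own model before packaging the repository.

  import os

  # Hypothetical repository layout: <model-name>/config.pbtxt and
  # <model-name>/<version>/model.pt (TorchScript).
  os.makedirs("model_repository/resnet50/1", exist_ok=True)

  config_pbtxt = """
  name: "resnet50"
  platform: "pytorch_libtorch"
  max_batch_size: 16
  input [
    {
      name: "INPUT__0"
      data_type: TYPE_FP32
      dims: [ 3, 224, 224 ]
    }
  ]
  output [
    {
      name: "OUTPUT__0"
      data_type: TYPE_FP32
      dims: [ 1000 ]
    }
  ]
  # Run two copies of the model on the GPU for concurrent execution.
  instance_group [
    {
      count: 2
      kind: KIND_GPU
    }
  ]
  # Let Triton combine individual requests into larger batches.
  dynamic_batching {
    preferred_batch_size: [ 4, 8, 16 ]
    max_queue_delay_microseconds: 100
  }
  """

  with open("model_repository/resnet50/config.pbtxt", "w") as f:
      f.write(config_pbtxt.strip() + "\n")

The repository directory is then packaged as model.tar.gz and uploaded to Amazon S3 as the model data for the container, as in the deployment sketch above.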

What do you want to do?

I want to deploy my trained PyTorch model in SageMaker.

For a sample Jupyter Notebook, see the Deploy your PyTorch Resnet50 model with Triton Inference Server example.
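
In addition to the notebook, the following is a hedged sketch of invoking a deployed Triton-backed endpoint with a KServe v2 style JSON request through Boto3. The endpoint name, tensor name, shape, and data are hypothetical and must match the config.pbtxt of the model you deployed.

  import json
  import boto3

  runtime = boto3.client("sagemaker-runtime")

  # Hypothetical input tensor matching a ResNet50 config.pbtxt
  # (name "INPUT__0", FP32, shape [1, 3, 224, 224]).
  payload = {
      "inputs": [
          {
              "name": "INPUT__0",
              "shape": [1, 3, 224, 224],
              "datatype": "FP32",
              "data": [0.0] * (3 * 224 * 224),  # replace with real image data
          }
      ]
  }

  response = runtime.invoke_endpoint(
      EndpointName="triton-resnet50-endpoint",  # hypothetical endpoint name
      ContentType="application/json",
      Body=json.dumps(payload),
  )
  result = json.loads(response["Body"].read())
  print(result["outputs"][0]["shape"])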

I want to deploy my trained Hugging Face model in SageMaker.

For a sample Jupyter Notebook, see the Deploy your PyTorch BERT model with Triton Inference Server example.
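
As a complement to the notebook, the following hedged sketch shows one way text might be tokenized with the Hugging Face transformers library and packed into Triton input tensors. The tensor names, sequence length, and datatypes are assumptions and must match the deployed model's config.pbtxt; the resulting payload can then be sent with invoke_endpoint as in the sketch above.

  import json
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  encoded = tokenizer(
      "Triton Inference Server runs on SageMaker.",
      padding="max_length",
      max_length=128,
      truncation=True,
  )

  # Hypothetical tensor names; they must match the model's config.pbtxt.
  payload = json.dumps({
      "inputs": [
          {"name": "input_ids", "shape": [1, 128], "datatype": "INT64",
           "data": encoded["input_ids"]},
          {"name": "attention_mask", "shape": [1, 128], "datatype": "INT64",
           "data": encoded["attention_mask"]},
      ]
  })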