Custom Containers Contract for Multi-Model Endpoints
To handle multiple models, your container must support a set of APIs that enable
        Amazon SageMaker AI to communicate with the container for loading, listing, getting, and unloading
        models as required. The model_name is used as the key
        input parameter in this set of APIs. The customer container is expected to keep track of the loaded models using
          model_name as the mapping key. Note that model_name is an opaque
        identifier and is not necessarily the value of the TargetModel parameter passed
        into the InvokeEndpoint API. The original TargetModel value in the
          InvokeEndpoint request is passed to the container in the APIs as an
          X-Amzn-SageMaker-Target-Model header that can be used for logging
        purposes.
Note
Multi-model endpoints for GPU-backed instances are currently supported only with SageMaker AI's NVIDIA Triton Inference Server container. This container already implements the contract defined below. Customers can use this container directly with their multi-model GPU endpoints, without any additional work.
You can configure the following APIs on your containers for CPU-backed multi-model endpoints.
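The sketches interleaved with the API descriptions below show one possible shape of such a container. They are minimal Flask examples, not a reference implementation; the LOADED_MODELS registry and every helper they name (load_into_memory, release, predict) are assumptions for illustration, not part of the SageMaker AI contract. A shared skeleton that the route sketches extend:

# Hypothetical Flask skeleton shared by the route sketches below.
# LOADED_MODELS maps the opaque model_name to the url from the Load
# Model request and whatever handle your framework returns on load.
from flask import Flask

app = Flask(__name__)
LOADED_MODELS = {}  # model_name -> {"url": ..., "model": loaded handle}

if __name__ == "__main__":
    # SageMaker AI inference containers serve requests on port 8080.
    app.run(host="0.0.0.0", port=8080)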
Load Model API
Instructs the container to load a particular model present in the url
          field of the body into the memory of the customer container and to keep track of it with
          the assigned model_name. After a model is loaded, the container should be
          ready to serve inference requests using this model_name.
POST /models HTTP/1.1
Content-Type: application/json
Accept: application/json

{
    "model_name" : "{model_name}",
    "url" : "/opt/ml/models/{model_name}/model"
}
Note
If model_name is already loaded, this API should return 409. Any time a
            model cannot be loaded due to a lack of memory or any other resource, this API should
            return a 507 HTTP status code to SageMaker AI, which then initiates unloading unused models to
            reclaim memory.
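As a minimal sketch, a Flask handler for this API might look like the following; load_into_memory is a hypothetical helper, and MemoryError stands in for whatever your framework raises on resource exhaustion:

from flask import request, jsonify

# Extends the hypothetical Flask skeleton above (app, LOADED_MODELS).
@app.route("/models", methods=["POST"])
def load_model():
    body = request.get_json()
    model_name, url = body["model_name"], body["url"]
    if model_name in LOADED_MODELS:
        return jsonify(error="model already loaded"), 409
    try:
        # load_into_memory is a hypothetical stand-in for your
        # framework-specific model loading.
        LOADED_MODELS[model_name] = {"url": url, "model": load_into_memory(url)}
    except MemoryError:
        # 507 tells SageMaker AI to unload unused models to reclaim memory.
        return jsonify(error="insufficient memory"), 507
    return jsonify(modelName=model_name, modelUrl=url), 200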
List Model API
Returns the list of models loaded into the memory of the customer container.
GET /models HTTP/1.1
Accept: application/json

Response = {
    "models": [
        {
            "modelName" : "{model_name}",
            "modelUrl" : "/opt/ml/models/{model_name}/model"
        },
        {
            "modelName" : "{model_name}",
            "modelUrl" : "/opt/ml/models/{model_name}/model"
        },
        ....
    ]
}
This API also supports pagination.
GET /models?next_page_token={nextPageToken} HTTP/1.1
Accept: application/json

Response = {
    "models": [
        {
            "modelName" : "{model_name}",
            "modelUrl" : "/opt/ml/models/{model_name}/model"
        },
        ....
    ],
    "nextPageToken" : "{nextPageToken}"
}
SageMaker AI can initially call the List Models API without providing a value for
            next_page_token. If a nextPageToken field is returned as part
          of the response, it will be provided as the value for next_page_token in a
          subsequent List Models call. If a nextPageToken is not returned, it means
          that there are no more models to return.
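A sketch of a paginated handler, extending the skeleton above; the integer-offset token and the page size are illustrative choices, since the contract only requires that nextPageToken round-trips:

from flask import request, jsonify

PAGE_SIZE = 100  # illustrative page size, not mandated by the contract

# Extends the hypothetical Flask skeleton above (app, LOADED_MODELS).
@app.route("/models", methods=["GET"])
def list_models():
    # next_page_token arrives as a query parameter; this sketch uses a
    # plain integer offset as the token, an arbitrary choice.
    start = int(request.args.get("next_page_token", 0))
    names = sorted(LOADED_MODELS)
    page = names[start:start + PAGE_SIZE]
    response = {"models": [
        {"modelName": name, "modelUrl": LOADED_MODELS[name]["url"]}
        for name in page
    ]}
    if start + PAGE_SIZE < len(names):
        # Omitting nextPageToken signals that no more models remain.
        response["nextPageToken"] = str(start + PAGE_SIZE)
    return jsonify(response), 200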
Get Model API
This is a simple read API on the model_name entity.
GET /models/{model_name} HTTP/1.1
Accept: application/json

{
    "modelName" : "{model_name}",
    "modelUrl" : "/opt/ml/models/{model_name}/model"
}
Note
If model_name is not loaded, this API should return 404.
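A corresponding sketch, under the same assumptions as above:

from flask import jsonify

# Extends the hypothetical Flask skeleton above (app, LOADED_MODELS).
@app.route("/models/<model_name>", methods=["GET"])
def get_model(model_name):
    entry = LOADED_MODELS.get(model_name)
    if entry is None:
        return jsonify(error="model not loaded"), 404
    return jsonify(modelName=model_name, modelUrl=entry["url"]), 200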
Unload Model API
Instructs the customer container to unload a model from
          memory. SageMaker AI calls this API to evict a candidate model, as determined by the platform,
          when it starts the process of loading a new model. The resources provisioned to
            model_name should be reclaimed by the container by the time this API returns a
          response.
DELETE /models/{model_name} HTTP/1.1
Note
If model_name is not loaded, this API should return 404.
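A sketch under the same assumptions; release is a hypothetical stand-in for framework-specific cleanup:

from flask import jsonify

# Extends the hypothetical Flask skeleton above (app, LOADED_MODELS).
@app.route("/models/<model_name>", methods=["DELETE"])
def unload_model(model_name):
    entry = LOADED_MODELS.pop(model_name, None)
    if entry is None:
        return jsonify(error="model not loaded"), 404
    # release is a hypothetical framework-specific cleanup; resources
    # provisioned to this model must be reclaimed before responding.
    release(entry["model"])
    return "", 200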
Invoke Model API
Makes a prediction request against the model identified by model_name. The
          SageMaker AI Runtime InvokeEndpoint request supports
            X-Amzn-SageMaker-Target-Model as a new header that takes the relative path
          of the model specified for invocation. The SageMaker AI system constructs the absolute path of the
          model by combining the prefix that is provided as part of the CreateModel API
          call with the relative path of the model.
POST /models/{model_name}/invoke HTTP/1.1
Content-Type: ContentType
Accept: Accept
X-Amzn-SageMaker-Custom-Attributes: CustomAttributes
X-Amzn-SageMaker-Target-Model: [relativePath]/{artifactName}.tar.gz
Note
If model_name is not loaded, this API should return 404.
Additionally, on GPU instances, if InvokeEndpoint fails due to a lack of
          memory or other resources, this API should return a 507 HTTP status code to SageMaker AI, which
          then initiates unloading unused models to reclaim memory.
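A sketch under the same assumptions; predict is a hypothetical stand-in for framework-specific inference, and MemoryError again stands in for whatever your framework raises on resource exhaustion:

from flask import request, jsonify

# Extends the hypothetical Flask skeleton above (app, LOADED_MODELS).
@app.route("/models/<model_name>/invoke", methods=["POST"])
def invoke_model(model_name):
    entry = LOADED_MODELS.get(model_name)
    if entry is None:
        return jsonify(error="model not loaded"), 404
    # The original TargetModel value is forwarded in this header and is
    # suitable for logging only; routing is keyed on the opaque model_name.
    target = request.headers.get("X-Amzn-SageMaker-Target-Model")
    app.logger.info("invoking %s (TargetModel: %s)", model_name, target)
    try:
        # predict is a hypothetical stand-in for framework-specific inference.
        result = predict(entry["model"], request.data)
    except MemoryError:
        # On GPU instances, 507 prompts SageMaker AI to unload unused models.
        return jsonify(error="insufficient resources"), 507
    return result, 200, {"Content-Type": "application/json"}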