Use a multi-container endpoint with direct invocation
SageMaker multi-container endpoints enable customers to deploy multiple containers to deploy different models on a SageMaker endpoint. You can host up to 15 different inference containers on a single endpoint. By using direct invocation, you can send a request to a specific inference container hosted on a multi-container endpoint.
Topics
Invoke a multi-container endpoint with direct invocation
To invoke a multi-container endpoint with direct invocation, call invoke_endpointTargetContainerHostname
parameter.
The following example directly invokes the secondContainer
of a
multi-container endpoint to get a prediction.
import boto3 runtime_sm_client = boto3.Session().client('sagemaker-runtime') response = runtime_sm_client.invoke_endpoint( EndpointName ='my-endpoint', ContentType = 'text/csv', TargetContainerHostname='secondContainer', Body = body)
For each direct invocation request to a multi-container endpoint, only the container
with the TargetContainerHostname
processes the invocation request. You will
get validation errors if you do any of the following:
-
Specify a
TargetContainerHostname
that does not exist in the endpoint -
Do not specify a value for
TargetContainerHostname
in a request to an endpoint configured for direct invocation -
Specify a value for
TargetContainerHostname
in a request to an endpoint that is not configured for direct invocation.
Security with multi-container endpoints with direct invocation
For multi-container endpoints with direct invocation, there are multiple containers hosted in a single instance by sharing memory and a storage volume. It's your responsibility to use secure containers, maintain the correct mapping of requests to target containers, and provide users with the correct access to target containers. SageMaker uses IAM roles to provide IAM identity-based policies that you use to specify whether access to a resource is allowed or denied to that role, and under what conditions. For information about IAM roles, see IAM roles in the Amazon Identity and Access Management User Guide. For information about identity-based policies, see Identity-based policies and resource-based policies.
By default, an IAM principal with InvokeEndpoint
permissions on a
multi-container endpoint with direct invocation can invoke any container inside the
endpoint with the endpoint name that you specify when you call
invoke_endpoint
. If you need to restrict invoke_endpoint
access to a limited set of containers inside a multi-container endpoint, use the
sagemaker:TargetContainerHostname
IAM condition key. The following
policies show how to limit calls to specific containers within an endpoint.
The following policy allows invoke_endpoint
requests only when the value of
the TargetContainerHostname
field matches one of the specified regular
expressions.
{ "Version": "2012-10-17", "Statement": [ { "Action": [ "sagemaker:InvokeEndpoint" ], "Effect": "Allow", "Resource": "arn:aws:sagemaker:
region
:account-id
:endpoint/endpoint_name
", "Condition": { "StringLike": { "sagemaker:TargetContainerHostname": ["customIps*", "common*"] } } } ] }
The following policy denies invoke_endpoint
requests when the value of the
TargetContainerHostname
field matches one of the specified regular
expressions in the Deny
statement.
{ "Version": "2012-10-17", "Statement": [ { "Action": [ "sagemaker:InvokeEndpoint" ], "Effect": "Allow", "Resource": "arn:aws:sagemaker:
region
:account-id
:endpoint/endpoint_name
", "Condition": { "StringLike": { "sagemaker:TargetContainerHostname": ["*"] } } }, { "Action": [ "sagemaker:InvokeEndpoint" ], "Effect": "Deny", "Resource": "arn:aws:sagemaker:region
:account-id
:endpoint/endpoint_name
", "Condition": { "StringLike": { "sagemaker:TargetContainerHostname": ["special*"] } } } ] }
For information about SageMaker condition keys, see Condition Keys for SageMaker in the Amazon Identity and Access Management User Guide.
Metrics for multi-container endpoints with direct invocation
In addition to the endpoint metrics that are listed in Monitor Amazon SageMaker with Amazon CloudWatch, SageMaker also provides per-container metrics.
Per-container metrics for multi-container endpoints with direct invocation are located
in CloudWatch and categorized into two namespaces: AWS/SageMaker
and
aws/sagemaker/Endpoints
. The AWS/SageMaker
namespace
includes invocation-related metrics, and the aws/sagemaker/Endpoints
namespace includes memory and CPU utilization metrics.
The following table lists the per-container metrics for multi-container endpoints with
direct invocation. All the metrics use the [EndpointName, VariantName,
ContainerName
] dimension, which filters metrics at a specific endpoint, for a
specific variant and corresponding to a specific container. These metrics share the same
metric names as in those for inference pipelines, but at a per-container level
[EndpointName, VariantName, ContainerName
].
Metric Name | Description | Dimension | NameSpace |
Invocations
|
The number of InvokeEndpoint requests sent to a
container inside an endpoint. To get the total number of requests sent
to that container, use the Sum statistic. Units: None Valid
statistics: Sum , Sample Count |
EndpointName , VariantName ,
ContainerName
|
AWS/SageMaker |
Invocation4XX Errors
|
The number of InvokeEndpoint requests that the model
returned a 4xx HTTP response code for on a specific
container. For each 4xx response, SageMaker sends a
1 . Units: None Valid statistics: Average ,
Sum
|
EndpointName , VariantName ,
ContainerName
|
AWS/SageMaker |
Invocation5XX Errors
|
The number of InvokeEndpoint requests that the model
returned a 5xx HTTP response code for on a specific
container. For each 5xx response, SageMaker sends a
1 . Units: None Valid statistics: Average ,
Sum
|
EndpointName , VariantName ,
ContainerName
|
AWS/SageMaker |
ContainerLatency
|
The time it took for the target container to respond as viewed from
SageMaker. ContainerLatency includes the time it took to send
the request, to fetch the response from the model's container, and to
complete inference in the container. Units: Microseconds Valid
statistics: Average , Sum , Min ,
Max , Sample Count |
EndpointName , VariantName ,
ContainerName
|
AWS/SageMaker |
OverheadLatency
|
The time added to the time taken to respond to a client request by
SageMaker for overhead. OverheadLatency is measured from the
time that SageMaker receives the request until it returns a response to the
client, minus theModelLatency . Overhead latency can vary
depending on request and response payload sizes, request frequency, and
authentication or authorization of the request, among other factors.
Units: Microseconds Valid statistics: Average ,
Sum , Min , Max , `Sample Count
` |
EndpointName , VariantName ,
ContainerName
|
AWS/SageMaker |
CPUUtilization
|
The percentage of CPU units that are used by each container running
on an instance. The value ranges from 0% to 100%, and is multiplied by
the number of CPUs. For example, if there are four CPUs,
CPUUtilization can range from 0% to 400%. For endpoints
with direct invocation, the number of CPUUtilization metrics equals the
number of containers in that endpoint. Units: Percent |
EndpointName , VariantName ,
ContainerName
|
aws/sagemaker/Endpoints |
MemoryUtilizaton
|
The percentage of memory that is used by each container running on an instance. This value ranges from 0% to 100%. Similar as CPUUtilization, in endpoints with direct invocation, the number of MemoryUtilization metrics equals the number of containers in that endpoint. Units: Percent |
EndpointName , VariantName ,
ContainerName
|
aws/sagemaker/Endpoints |
All the metrics in the previous table are specific to multi-container endpoints with
direct invocation. Besides these special per-container metrics, there are also metrics
at the variant level with dimension [EndpointName, VariantName]
for all the
metrics in the table expect ContainerLatency
.
Autoscale multi-container endpoints
If you want to configure automatic scaling for a multi-container endpoint using the
InvocationsPerInstance
metric, we recommend that the model in each
container exhibits similar CPU utilization and latency on each inference request.
This is recommended because if traffic to the multi-container endpoint shifts from a
low CPU utilization model to a high CPU utilization model, but the overall call
volume remains the same, the endpoint does not scale out and there may not be enough
instances to handle all the requests to the high CPU utilization model. For
information about automatically scaling endpoints, see Automatically Scale Amazon SageMaker Models.
Troubleshoot multi-container endpoints
The following sections can help you troubleshoot errors with multi-container endpoints.
Ping Health Check Errors
With multiple containers, endpoint memory and CPU are under higher pressure
during endpoint creation. Specifically, the MemoryUtilization
and
CPUUtilization
metrics are higher than for single-container
endpoints, because utilization pressure is proportional to the number of containers.
Because of this, we recommend that you choose instance types with enough memory and
CPU to ensure that there is enough memory on the instance to have all the models
loaded (the same guidance applies to deploying an inference pipeline). Otherwise,
your endpoint creation might fail with an error such as XXX did not pass the
ping health check
.
Missing accept-bind-to-port=true Docker label
The containers in a multi-container endpoints listen on the port specified in the
SAGEMAKER_BIND_TO_PORT
environment variable instead of port 8080.
When a container runs in a multi-container endpoint, SageMaker automatically provides
this environment variable to the container. If this environment variable isn't
present, containers default to using port 8080. To indicate that your container
complies with this requirement, use the following command to add a label to your
Dockerfile:
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
Otherwise, You will see an error message such as Your Ecr Image XXX does
not contain required
com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true Docker
label(s).
If your container needs to listen on a second port, choose a port in the range
specified by the SAGEMAKER_SAFE_PORT_RANGE
environment variable.
Specify the value as an inclusive range in the format
XXXX
-YYYY
, where XXXX and
YYYY are multi-digit integers. SageMaker provides this value automatically when you run
the container in a multi-container endpoint.