
Low latency real-time inference with Amazon PrivateLink

Amazon SageMaker provides low latency for real-time inferences while maintaining high availability and resiliency using multi-AZ deployment. The application latency is made up of two primary components: infrastructure or overhead latency and model inference latency. Reduction of overhead latency opens up new possibilities such as deploying more complex, deep, and accurate models or splitting monolithic applications into scalable and maintainable microservice modules. You can reduce the latency for real-time inferences with SageMaker using an Amazon PrivateLink deployment. With Amazon PrivateLink, you can privately access all SageMaker API operations from your Virtual Private Cloud (VPC) in a scalable manner by using interface VPC endpoints. An interface VPC endpoint is an elastic network interface in your subnet with private IP addresses that serves as an entry point for all SageMaker API calls.

By default, a SageMaker endpoint with two or more instances is deployed across at least two Availability Zones (AZs), and instances in any AZ can process invocations. This results in one or more AZ “hops” that contribute to the overhead latency. An Amazon PrivateLink deployment with the privateDNSEnabled option set to true alleviates this by achieving two objectives:

  • It keeps all inference traffic within your VPC.

  • It keeps invocation traffic in the same AZ as the client that originated it when using SageMaker Runtime. This avoids the “hops” between AZs, reducing the overhead latency.

The following sections of this guide demonstrate how you can reduce the latency for real-time inferences with Amazon PrivateLink deployment.

To deploy Amazon PrivateLink, first create an interface endpoint for the VPC from which you connect to the SageMaker endpoints. Follow the steps in Access an Amazon service using an interface VPC endpoint to create the interface endpoint. While creating the endpoint, select the following settings in the console:

  • Select the Enable DNS name checkbox under Additional Settings.

  • Select the appropriate security groups and the subnets to be used with the SageMaker endpoints.

Also make sure that the VPC has DNS hostnames turned on. For more information on how to change DNS attributes for your VPC, see View and update DNS attributes for your VPC.
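
If you prefer to create the interface endpoint programmatically, the following sketch shows one way to do it with boto3. The Region, VPC, subnet, and security group IDs are placeholders for your own values, and the sketch assumes the SageMaker Runtime endpoint service is named com.amazonaws.<region>.sagemaker.runtime (a similar call with com.amazonaws.<region>.sagemaker.api covers the SageMaker API itself).

import boto3

# Placeholder values -- replace with your own Region, VPC, subnets, and security group.
region = 'us-east-1'
vpc_id = 'vpc-0123456789abcdef0'
subnet_ids = ['subnet-0123456789abcdef0', 'subnet-0123456789abcdef1']
security_group_ids = ['sg-0123456789abcdef0']

ec2_client = boto3.client('ec2', region_name=region)

# Create an interface VPC endpoint for SageMaker Runtime with private DNS enabled,
# so that invoke_endpoint calls resolve to private IP addresses inside your VPC.
response = ec2_client.create_vpc_endpoint(
    VpcEndpointType='Interface',
    VpcId=vpc_id,
    ServiceName='com.amazonaws.' + region + '.sagemaker.runtime',
    SubnetIds=subnet_ids,
    SecurityGroupIds=security_group_ids,
    PrivateDnsEnabled=True,
)
print(response['VpcEndpoint']['VpcEndpointId'])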

Deploy a SageMaker endpoint in a VPC

To achieve low overhead latency, create a SageMaker endpoint using the same subnets that you specified when deploying Amazon PrivateLink. These subnets should match the AZs of your client application, as shown in the following code snippet.

model_name = '<the-name-of-your-model>'
vpc = 'vpc-0123456789abcdef0'
subnet_a = 'subnet-0123456789abcdef0'
subnet_b = 'subnet-0123456789abcdef1'
security_group = 'sg-0123456789abcdef0'

create_model_response = sagemaker_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=sagemaker_role,
    PrimaryContainer={
        'Image': container,
        'ModelDataUrl': model_url
    },
    VpcConfig={
        'SecurityGroupIds': [security_group],
        'Subnets': [subnet_a, subnet_b],
    },
)

The preceding code snippet assumes that you have followed the steps in Before you begin.
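
To finish the deployment, create an endpoint configuration and an endpoint that reference this model. The following is a minimal sketch; the endpoint configuration name, endpoint name, variant name, instance type, and instance count are illustrative placeholders rather than required values.

endpoint_config_name = '<endpoint-config-name>'
endpoint_name = '<endpoint-name>'

# Create an endpoint configuration that references the VPC-enabled model.
sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            'VariantName': 'AllTraffic',
            'ModelName': model_name,
            'InstanceType': 'ml.c5.xlarge',  # example instance type
            'InitialInstanceCount': 2,       # two or more instances span multiple AZs
        }
    ],
)

# Create the endpoint from the configuration.
sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

With an InitialInstanceCount of two or more, SageMaker places instances in at least two of the AZs covered by the subnets you specified, while the PrivateLink deployment keeps each invocation in the caller's AZ.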

Invoke the SageMaker endpoint

Finally, specify the SageMaker Runtime client and invoke the SageMaker endpoint as shown in the following code snippet.

import boto3

endpoint_name = '<endpoint-name>'
runtime_client = boto3.client('sagemaker-runtime')

# payload holds the request body for your model, for example a CSV string.
response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='text/csv',
    Body=payload
)
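
The Body field of the response is a stream. As a minimal usage example, assuming your model container returns text or CSV output, you can read and decode it as follows.

# Read and decode the prediction returned by the model container.
result = response['Body'].read().decode('utf-8')
print(result)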

For more information on endpoint configuration, see Deploy models for real-time inference.