
Interpret recommendation results

Each Inference Recommender job result includes InstanceType, InitialInstanceCount, and EnvironmentParameters, which are tuned environment variables for your container that improve its latency and throughput. The results also include performance and cost metrics such as MaxInvocations, ModelLatency, CostPerHour, CostPerInference, CpuUtilization, and MemoryUtilization.
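If you work with Inference Recommender programmatically, the following is a minimal sketch of reading these fields from a completed job with the Amazon SDK for Python (Boto3) and the DescribeInferenceRecommendationsJob API. The job name and Region shown are placeholders.

```python
import boto3

# Minimal sketch: read recommendation fields from a completed job.
# The job name and Region are placeholders.
sm_client = boto3.client("sagemaker", region_name="us-west-2")

response = sm_client.describe_inference_recommendations_job(
    JobName="my-inference-recommender-job"
)

for rec in response["InferenceRecommendations"]:
    endpoint_config = rec["EndpointConfiguration"]
    metrics = rec["Metrics"]
    env_params = rec["ModelConfiguration"].get("EnvironmentParameters", [])

    print(endpoint_config["InstanceType"], endpoint_config["InitialInstanceCount"])
    for param in env_params:  # tuned container environment variables
        print(f'  {param["Key"]}={param["Value"]}')
    print(
        f'  MaxInvocations={metrics["MaxInvocations"]}/min',
        f'ModelLatency={metrics["ModelLatency"]}ms',
        f'CostPerHour=${metrics["CostPerHour"]}',
        f'CostPerInference=${metrics["CostPerInference"]}',
    )
```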

The following table describes these metrics. They can help you narrow down your search for the endpoint configuration that best suits your use case. For example, if your goal is overall price performance with an emphasis on throughput, focus on CostPerInference.

ModelLatency
Description: The interval of time taken by a model to respond, as viewed from SageMaker. This interval includes the local communication time taken to send the request and to fetch the response from the model's container, and the time taken to complete the inference in the container.
Units: Milliseconds
Use case: Latency-sensitive workloads such as ad serving and medical diagnosis

MaxInvocations
Description: The maximum number of InvokeEndpoint requests sent to a model endpoint in a minute.
Units: None
Use case: Throughput-focused workloads such as video processing or batch inference

CostPerHour
Description: The estimated cost per hour for your real-time endpoint.
Units: US Dollars
Use case: Cost-sensitive workloads with no latency deadlines

CostPerInference
Description: The estimated cost per inference call for your real-time endpoint.
Units: US Dollars
Use case: Maximizing overall price performance with a focus on throughput

CpuUtilization
Description: The expected CPU utilization at the maximum invocations per minute for the endpoint instance.
Units: Percent
Use case: Understanding instance health during benchmarking through visibility into the instance's CPU utilization

MemoryUtilization
Description: The expected memory utilization at the maximum invocations per minute for the endpoint instance.
Units: Percent
Use case: Understanding instance health during benchmarking through visibility into the instance's memory utilization
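One way to connect these metrics (a back-of-envelope assumption on our part, not a formula from the Inference Recommender documentation): because MaxInvocations is a per-minute figure, the cost per inference at sustained peak load is roughly CostPerHour divided by 60 × MaxInvocations. A quick sketch:

```python
# Back-of-envelope estimate (assumption, not a documented formula):
# cost per inference at sustained peak load, from hourly cost and
# peak per-minute throughput.
cost_per_hour = 0.23     # example CostPerHour, in US Dollars
max_invocations = 1_200  # example MaxInvocations, requests per minute

# MaxInvocations is per minute, so scale to an hour of sustained peak load.
approx_cost_per_inference = cost_per_hour / (max_invocations * 60)
print(f"{approx_cost_per_inference:.8f}")  # ~0.00000319 USD per inference
```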

In some cases you might want to explore other SageMaker Endpoint Invocation metrics, such as CPUUtilization. Every Inference Recommender job result includes the names of the endpoints spun up during the load test. You can use CloudWatch to review the metrics and logs for these endpoints even after they've been deleted.
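For example, the following is a minimal sketch, using the Amazon SDK for Python (Boto3), of retrieving ModelLatency datapoints from CloudWatch for one of those endpoints. The endpoint name, variant name, Region, and time window are placeholders.

```python
import datetime

import boto3

# Minimal sketch: pull invocation metrics for one endpoint named in a
# recommendation result. Endpoint name, variant name, Region, and time
# window are placeholders.
cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "example-recommender-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.datetime(2023, 1, 1),
    EndTime=datetime.datetime(2023, 1, 2),
    Period=60,  # one data point per minute
    Statistics=["Average", "Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```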

The following image shows an example of the CloudWatch metrics and charts that you can review for a single endpoint from your recommendation result. This recommendation result is from a Default job. The scalar values in the recommendation results are based on the point in time when the Invocations graph first begins to level out. For example, the reported ModelLatency value is taken at the start of the plateau, around 03:00:31.

[Figure: charts for the following CloudWatch metrics: Invocations, ModelLatency, OverheadLatency, CPUUtilization, MemoryUtilization, DiskUtilization, Invocation4XXErrors, Invocation5XXErrors, and InvocationsPerInstance.]

For full descriptions of the CloudWatch metrics used in the preceding charts, see SageMaker Endpoint Invocation metrics.

You can also see performance metrics such as ClientInvocations and NumberOfUsers, which Inference Recommender publishes in the /aws/sagemaker/InferenceRecommendationsJobs namespace. For a full list of the metrics that Inference Recommender publishes and their descriptions, see SageMaker Inference Recommender jobs metrics.
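As a starting point, the following is a minimal sketch of listing the metrics available in that namespace with Boto3; the Region is a placeholder, and you can inspect the returned dimensions to see how to filter for a specific job.

```python
import boto3

# Minimal sketch: discover the metrics that Inference Recommender
# publishes for your jobs. The Region is a placeholder.
cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

paginator = cloudwatch.get_paginator("list_metrics")
pages = paginator.paginate(Namespace="/aws/sagemaker/InferenceRecommendationsJobs")

for page in pages:
    for metric in page["Metrics"]:
        print(metric["MetricName"], metric["Dimensions"])
```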

For an example of how to use the Amazon SDK for Python (Boto3) to explore CloudWatch metrics for your endpoints, see the Amazon SageMaker Inference Recommender - CloudWatch Metrics Jupyter notebook in the amazon-sagemaker-examples GitHub repository.