
Optimize model inference with Amazon SageMaker

With Amazon SageMaker, you can improve the performance of your generative AI models by applying inference optimization techniques. By optimizing your models, you can attain better cost-performance for your use case. When you optimize a model, you choose which of the supported optimization techniques to apply, including quantization, speculative decoding, and compilation. After your model is optimized, you can run an evaluation to see performance metrics for latency, throughput, and price.

For many models, SageMaker also provides several pre-optimized versions, each of which caters to different application needs for latency and throughput. For such models, you can deploy one of the pre-optimized versions without first optimizing the model yourself.

Optimization techniques

Amazon SageMaker supports the following optimization techniques.

Speculative decoding

Speculative decoding is a technique to speed up the decoding process of large language models (LLMs). It optimizes models for latency without compromising the quality of the generated text.

This technique uses a smaller but faster model called the draft model. The draft model generates candidate tokens, which are then validated by the larger but slower target model. At each iteration, the draft model generates multiple candidate tokens. The target model verifies the tokens, and if it finds that a particular token is not acceptable, it rejects and regenerates that token. So, the target model both verifies tokens and generates a small number of them itself.

The draft model is significantly faster than the target model. It generates the candidate tokens quickly and sends them in batches to the target model for verification. The target model evaluates them all in parallel, which speeds up the final response.
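The following toy sketch illustrates the propose-and-verify control flow in plain Python. The draft and target "models" here are stand-in functions rather than real LLMs, and for simplicity the sketch checks candidates one at a time, whereas a real target model scores all candidates in a single parallel pass.

# Toy illustration of the speculative decoding control flow.
# The "models" below are stand-in functions, not real LLMs.

def draft_model(prefix, k=4):
    """Quickly propose k candidate tokens (here: a canned continuation)."""
    candidates = ["the", "quick", "brown", "fox", "jumps"]
    return candidates[:k]

def target_model_accepts(prefix, token):
    """Stand-in for the slower target model verifying a candidate token."""
    return token != "brown"  # reject one token to show the fallback path

def target_model_generate(prefix):
    """The target model regenerates a token that it rejected."""
    return "lazy"

def speculative_decode(prefix, steps=2, k=4):
    output = list(prefix)
    for _ in range(steps):
        candidates = draft_model(output, k)  # fast draft proposals
        for token in candidates:
            if target_model_accepts(output, token):
                output.append(token)  # candidate accepted
            else:
                # candidate rejected: the target model generates the token itself,
                # and drafting restarts from the corrected prefix
                output.append(target_model_generate(output))
                break
    return output

print(speculative_decode(["Hello"]))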

SageMaker offers a pre-built draft model that you can use, so you don't have to build your own. If you prefer to use your own custom draft model, SageMaker also supports this option.

Quantization

Quantization is a technique to reduce the hardware requirements of a model by using a less precise data type for the weights and activations. After you optimize a model with quantization, you can host it on less expensive and more readily available GPUs. However, the quantized model might be less accurate than the source model that you optimized.

SageMaker supports Activation-aware Weight Quantization (AWQ) for GPUs. AWQ is an efficient, accurate, low-bit, weight-only quantization technique for LLMs.
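The following toy sketch illustrates the core idea behind weight-only quantization (it is not AWQ itself): weights are stored as low-bit integers and converted back to floating point for computation, trading a small amount of accuracy for a much smaller memory footprint.

# Toy illustration of weight-only quantization (not AWQ itself):
# map float weights to signed 4-bit integers and back, then measure the error.

weights = [0.12, -0.83, 0.47, 0.05, -0.29, 0.91, -0.66, 0.33]

# symmetric quantization to the 4-bit integer range [-7, 7]
scale = max(abs(w) for w in weights) / 7
quantized = [round(w / scale) for w in weights]   # stored as 4-bit integers
dequantized = [q * scale for q in quantized]      # recovered for computation

max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
print("quantized integers:", quantized)
print("max reconstruction error:", round(max_error, 4))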

Compilation

Compilation optimizes the model for the best available performance on the chosen hardware type without a loss in accuracy. You can apply model compilation to optimize LLMs for accelerated hardware, such as Amazon Trainium or Amazon Inferentia.

When you optimize a model with compilation, you benefit from ahead-of-time compilation. You reduce the model's deployment time and auto-scaling latency because the model weights don't require just-in-time compilation when the model deploys to a new instance.

Deploy a pre-optimized model

Some models in JumpStart are pre-optimized by SageMaker, which means that you can deploy optimized versions of these models without first creating an inference optimization job. For the list of models with pre-optimized options, see Supported models reference.

To deploy a pre-optimized model
  1. In SageMaker Studio, in the navigation menu on the left, choose JumpStart.

  2. On the All public models page, choose one of the models that are pre-optimized.

  3. On the model details page, choose Deploy.

  4. On the deployment page, some JumpStart models require you to sign an end-user license agreement (EULA) before you can proceed. If requested, review the license terms in the License agreement section. If the terms are acceptable for your use case, select the checkbox for I accept the EULA, and read the terms and conditions.

    For more information, see End-user license agreements.

  5. For Endpoint name and Initial instance count, accept the default values or set custom ones.

  6. For Instance type, keep the default value. Otherwise, you can't deploy a pre-optimized configuration.

  7. Under Models, expand the model configuration. Studio shows a table that provides the pre-optimized configurations that you can choose from. Each option has metrics for latency and throughput. Choose the option that best suits your application needs.

  8. Choose Deploy.

The code examples that follow demonstrate how to deploy a pre-optimized model with the Amazon SageMaker Python SDK.

Define a SageMaker model by using the ModelBuilder class:

from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder

# sample payload
response = "Hello, I'm a language model, and I'm here to help you with your English."
sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "do_sample": True},
}
sample_output = [{"generated_text": response}]

# specify the Model ID for JumpStart
model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-8b",
    schema_builder=SchemaBuilder(sample_input, sample_output),
    sagemaker_session=sagemaker_session,
    role_arn=my_role,
)

List pre-benchmarked configurations for the model:

model_builder.display_benchmark_metrics() # displays pre-benchmarking results

Set a deployment configuration by using the preferred instance_type and config_name values that were returned by the display_benchmark_metrics() call:

# set a pre-optimized deployment config
model_builder.set_deployment_config(
    instance_type="ml.g5.12xlarge",
    config_name="lmi-optimized",
)

Call .build() to build the model, and call .deploy() to deploy it to an endpoint. Then, test the model predictions:

# build the deployable model
model = model_builder.build()

# deploy the model to a SageMaker endpoint
predictor = model.deploy(accept_eula=True)

# use sample input payload to test the deployed endpoint
predictor.predict(sample_input)

Create an inference optimization job

You can create an inference optimization job by using Studio or the SageMaker Python SDK.

Instance pricing for inference optimization jobs

When you create an inference optimization job that applies quantization or compilation, SageMaker chooses which instance type to use to run the job. You are charged based on the instance used.

For the possible instance types and their pricing details, see the inference optimization pricing information on the Amazon SageMaker pricing page.

You incur no additional costs for jobs that apply speculative decoding.

Complete the following steps to create an inference optimization job in Studio.

To begin creating an optimization job
  1. In SageMaker Studio, create an optimization job through any of the following paths:

    • To create a job for a JumpStart model, do the following:

      1. In the navigation menu, choose JumpStart.

      2. On the All public models page, choose a model provider, and then choose one of the models that supports optimization.

      3. On the model details page, choose Optimize. This button is enabled only for models that support optimization.

      4. On the Create inference optimization job page, some JumpStart models require you to sign an end-user license agreement (EULA) before you can proceed. If requested, review the license terms in the License agreement section. If the terms are acceptable for your use case, select the checkbox for I accept the EULA, and read the terms and conditions.

    • To create a job for a fine-tuned JumpStart model, do the following:

      1. In the navigation menu, under Jobs, choose Training.

      2. On the Training Jobs page, choose the name of a job that you used to fine-tune a JumpStart model. These jobs have the type JumpStart training in the Job type column.

      3. On the details page for the training job, choose Optimize.

    • To create a job for a custom model, do the following:

      1. In the navigation menu, under Jobs, choose Inference optimization.

      2. Choose Create new job.

      3. On the Create inference optimization job page, choose Add model.

      4. In the Add model window, choose Custom Model.

      5. For Custom model name, enter a name.

      6. For S3 URI, enter the URI for the location in Amazon S3 where you've stored your model artifacts.

  2. On the Create inference optimization job page, for Job name, you can accept the default name that SageMaker assigns. Or, to enter a custom job name, choose the Job name field, and choose Enter job name.

To set the optimization configurations
  1. For Deployment instance type, choose the instance type that you want to optimize the model for.

    The instance type affects what optimization techniques you can choose. For most types that use GPU hardware, the supported techniques are Quantization and Speculative decoding. If you choose an instance that uses custom silicon, like the Amazon Inferentia instance ml.inf2.8xlarge, the supported technique is Compilation, which you can use to compile the model for that specific hardware type.

  2. Select one or more of the optimization techniques that Studio provides:

    • If you select Quantization, choose a data type for Precision data type.

    • If you select Speculative decoding, choose SageMaker draft model if you want to use the draft model that SageMaker provides. Or, if you want to use your own draft model, choose Use your own draft model, and provide the S3 URI that locates it.

    • If you chose an instance that uses custom silicon, Studio might show that Compilation is the only supported option. In that case, Studio selects this option for you.

  3. For Output, enter the URI of a location in Amazon S3. There, SageMaker stores the artifacts of the optimized model that your job creates.

  4. (Optional) Expand Advanced options for more fine-grained control over settings such as the IAM role, VPC, and environment variables. For more information, see Advanced options below.

  5. When you're finished configuring the job, choose Create job.

    Studio shows the job details page, which shows the job status and all of its settings.

Advanced options

You can set the following advanced options when you create an inference optimization job.

Under Configurations, you can set the following options:

Tensor parallel degree

A value for the degree of tensor parallelism. Tensor parallelism is a type of model parallelism in which specific model weights, gradients, and optimizer states are split across devices. The value must evenly divide the number of GPUs in your cluster. For example, if your instance has 8 GPUs, valid degrees include 1, 2, 4, and 8.

Maximum token length

The limit for the number of tokens to be generated by the model. Note that the model might not always generate the maximum number of tokens.

Concurrency

The ability to run multiple instances of a model on the same underlying hardware. Use concurrency to serve predictions to multiple users and to maximize hardware utilization.

Batch size

If your model does batch inferencing, use this option to control the size of the batches that your model processes.

Batch inferencing generates model predictions on a batch of observations. It's a good option for large datasets or if you don't need an immediate response to an inference request.
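If you configure a job with the SageMaker Python SDK instead of Studio, comparable settings are typically passed as container environment variables, as in the compilation example later on this page. The following sketch is illustrative only; the OPTION_* keys mirror that example, and the exact set of supported variables depends on the serving container.

# Illustrative only: passing advanced settings as container environment
# variables, mirroring the compilation example later on this page.
optimized_model = model_builder.optimize(
    instance_type="ml.inf2.48xlarge",
    accept_eula=True,
    compilation_config={
        "OverrideEnvironment": {
            "OPTION_TENSOR_PARALLEL_DEGREE": "2",  # tensor parallel degree
            "OPTION_MAX_ROLLING_BATCH_SIZE": "4",  # batch size
        }
    },
    output_path=f"s3://{output_bucket_name}/compiled/",
)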

Under Security, you can set the following options:

IAM Role

An IAM role that enables SageMaker to perform tasks on your behalf. During model optimization, SageMaker needs your permission to:

  • Read input data from an S3 bucket

  • Write model artifacts to an S3 bucket

  • Write logs to Amazon CloudWatch Logs

  • Publish metrics to Amazon CloudWatch

You grant permissions for all of these tasks to an IAM role.

For more information, see How to use SageMaker execution roles.

Encryption KMS key

A key in Amazon Key Management Service (Amazon KMS). SageMaker uses the key to encrypt the artifacts of the optimized model when SageMaker uploads the model to Amazon S3.

VPC

SageMaker uses this information to create network interfaces and attach them to your model containers. The network interfaces provide your model containers with a network connection within your VPC that is not connected to the internet. They also enable your model to connect to resources in your private VPC.

For more information, see Give SageMaker Hosted Endpoints Access to Resources in Your Amazon VPC.

Enable network isolation

Activate this option if you want to restrict your container's internet access. Containers that run with network isolation can’t make any outbound network calls.

Under Advanced container definition, you can set the following options:

Stopping condition

Specifies a limit to how long a job can run. When the job reaches the time limit, SageMaker ends the job. Use this option to cap costs.

Tags

Key-value pairs associated with the optimization job.

For more information about tags, see Tagging your Amazon resources in the Amazon Web Services General Reference.

Environment variables

Key-value pairs that define the environment variables to set in the model container.

The code examples that follow demonstrate how to optimize model inference with the Amazon SageMaker Python SDK.

Example code to define a SageMaker model with ModelBuilder
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder

# sample payload
response = "Hello, I'm a language model, and I'm here to help you with your English."
sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "do_sample": True},
}
sample_output = [{"generated_text": response}]

# specify the Model ID for JumpStart
model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-8b",
    schema_builder=SchemaBuilder(sample_input, sample_output),
    sagemaker_session=sagemaker_session,
    role_arn=my_role,
)
Example code to optimize with quantization
optimized_model = model_builder.optimize(
    instance_type="ml.g5.12xlarge",
    accept_eula=True,
    quantization_config={
        "OverrideEnvironment": {
            "OPTION_QUANTIZE": "awq",
        },
    },
    output_path=f"s3://{output_bucket_name}/quantized/",
)

# deploy the optimized model to a SageMaker endpoint
predictor = optimized_model.deploy(accept_eula=True)

# use sample input payload to test the deployed endpoint
predictor.predict(sample_input)
Example code to optimize with speculative decoding
optimized_model = model_builder.optimize(
    instance_type="ml.g5.12xlarge",
    accept_eula=True,
    speculative_decoding_config={
        # Use the SageMaker-provided draft model
        "ModelProvider": "SAGEMAKER",
    },
)

# deploy the optimized model to a SageMaker endpoint
predictor = optimized_model.deploy(accept_eula=True)

# use sample input payload to test the deployed endpoint
predictor.predict(sample_input)
Example code to optimize with compilation
optimized_model = model_builder.optimize(
    accept_eula=True,
    instance_type="ml.inf2.48xlarge",
    # config options for Inferentia2 instances
    compilation_config={
        "OverrideEnvironment": {
            "OPTION_TENSOR_PARALLEL_DEGREE": "2",
            "OPTION_N_POSITIONS": "2048",
            "OPTION_DTYPE": "fp16",
            "OPTION_ROLLING_BATCH": "auto",
            "OPTION_MAX_ROLLING_BATCH_SIZE": "4",
            "OPTION_NEURON_OPTIMIZE_LEVEL": "2",
        },
    },
    output_path="s3://<Enter your bucket name here>",
)

# deploy the compiled model to a SageMaker endpoint
predictor = optimized_model.deploy(accept_eula=True)

# use sample input payload to test the deployed endpoint
predictor.predict(sample_input)

View the optimization job results

After you've created one or more optimization jobs, you can use Studio to view a summary table of all of your jobs, and you can view the details for any individual job.

To view the optimization job summary table
  • In the Studio navigation menu, under Jobs, choose Inference optimization.

    The Inference optimization page shows a table that displays the jobs that you've created. For each job, it shows the optimization configurations that you applied and the job status.

To view the details for a job
  • On the Inference optimization page, in the summary table, choose the name of the job.

    Studio shows the job details page, which shows the job status and all of the settings that you applied when you created the job. If the job completed successfully, SageMaker stores the optimized model artifacts in the Amazon S3 location that is shown under Optimized model S3 URI.

Evaluate the performance of optimized models

After you use an optimization job to create an optimized model, you can run an evaluation of model performance. This evaluation yields metrics for latency, throughput, and price. Use these metrics to determine whether the optimized model meets the needs of your use case or whether it requires further optimization.

You can run performance evaluations only by using Studio. This feature is not provided through the Amazon SageMaker API or Python SDK.

Before you begin

Before you can create a performance evaluation, you must first optimize a model by creating an inference optimization job. In Studio, you can evaluate only the models that you create with these jobs.

Create the performance evaluation

Complete the following steps in Studio to create a performance evaluation for an optimized model.

  1. In the Studio navigation menu, under Jobs, choose Inference optimization.

  2. Choose the name of the job that created the optimized model that you want to evaluate.

  3. On the job details page, choose Evaluate performance.

  4. On the Evaluate performance page, some JumpStart models require you to sign an end-user license agreement (EULA) before you can proceed. If requested, review the license terms in the License agreement section. If the terms are acceptable for your use case, select the checkbox for I accept the EULA, and read the terms and conditions.

  5. For Select a model for tokenizer, accept the default, or choose a specific model to act as the tokenizer for your evaluation.

  6. For Input datasets, choose whether to:

    • Use the default sample datasets from SageMaker.

    • Provide an S3 URI that points to your own sample datasets.

  7. For S3 URI for performance results, provide a URI that points to the location in Amazon S3 where you want to store the evaluation results.

  8. Choose Evaluate.

    Studio shows the Performance evaluations page, where your evaluation job is shown in the table. The Status column shows the status of your evaluation.

  9. When the status is Completed, choose the name of the job to see the evaluation results.

    The evaluation details page shows tables that provide the performance metrics for latency, throughput, and price.

Metrics reference for inference performance evaluations

After you successfully evaluate the performance of an optimized model, the evaluation details page in Studio shows the following metrics.

Latency metrics

The Latency section shows the following metrics.

Concurrency

The number of concurrent users that the evaluation simulated to invoke the endpoint simultaneously.

Time to first token (ms)

The time that elapsed between when a request is sent and when the first token of a streaming response is received.

Inter-token latency (ms)

The time to generate an output token for each request.

Client latency (ms)

The request latency from the time the request is sent to the time the entire response is received.

Input tokens/sec (count)

The total number of generated input tokens, across all requests, divided by the total duration in seconds for the concurrency.

Output tokens/sec (count)

The total number of generated output tokens, across all requests, divided by total duration in seconds for the concurrency.

Client invocations (count)

The total number of inference requests sent to the endpoint across all users at a concurrency.

Client invocation errors (count)

The total number of inference requests sent to the endpoint across all users at a given concurrency that resulted in an invocation error.

Tokenizer failed (count)

The total number of inference requests where the tokenizer failed to parse the request or the response.

Empty inference response (count)

The total number of inference requests that resulted in zero output tokens or the tokenizer failing to parse the response.
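As a rough illustration of how the aggregate token-rate metrics above are derived, the following sketch computes input and output tokens per second from a set of simulated request records. The record format and field names are hypothetical and are not a SageMaker data format.

# Hypothetical request records from a load test at one concurrency level.
requests = [
    {"input_tokens": 24, "output_tokens": 115},
    {"input_tokens": 31, "output_tokens": 98},
    {"input_tokens": 18, "output_tokens": 127},
]
total_duration_seconds = 12.5  # wall-clock duration of the test at this concurrency

input_tokens_per_sec = sum(r["input_tokens"] for r in requests) / total_duration_seconds
output_tokens_per_sec = sum(r["output_tokens"] for r in requests) / total_duration_seconds

print(f"Input tokens/sec:  {input_tokens_per_sec:.2f}")
print(f"Output tokens/sec: {output_tokens_per_sec:.2f}")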

Throughput metrics

The Throughput section shows the following metrics.

Concurrency

The number of concurrent users that the evaluation simulated to invoke the endpoint simultaneously.

Input tokens/sec/req (count)

The total number of generated input tokens per second per request.

Output tokens/sec/req (count)

The total number of generated output tokens per second per request.

Input tokens (count)

The total number of generated input tokens per request.

Output tokens (count)

The total number of generated output tokens per request.

Price metrics

The Price section shows the following metrics.

Concurrency

The number of concurrent users that the evaluation simulated to invoke the endpoint simultaneously.

Price per million input tokens

Cost of processing 1M input tokens.

Price per million output tokens

Cost of generating 1M output tokens.

Supported models reference

The following table shows the models for which SageMaker supports inference optimization, along with the optimization techniques that each model supports.

Models that support inference optimization
Model name | JumpStart Model ID | Supports Quantization | Supports Speculative Decoding | Speculative Decoding with SageMaker Draft Model
Falcon | huggingface-llm-falcon-40b-bf16 | Yes | Yes | No
| huggingface-llm-falcon-40b-instruct-bf16 | Yes | Yes | No
| huggingface-llm-falcon-180b-chat-bf16 | No | Yes | No
| huggingface-llm-falcon-180b-bf16 | No | Yes | No
| huggingface-llm-amazon-falconlite | Yes | Yes | No
| huggingface-llm-amazon-falconlite2 | Yes | Yes | No
| huggingface-llm-tiiuae-falcon-rw-1b | Yes | Yes | No
| huggingface-llm-falcon-7b-bf16 | Yes | Yes | No
| huggingface-llm-falcon-7b-instruct-bf16 | Yes | Yes | No
| huggingface-llm-falcon2-11b | Yes | Yes | No
gpt-neox | huggingface-textgeneration2-gpt-neoxt-chat-base-20b-fp16 | Yes | Yes | No
| huggingface-textgeneration2-gpt-neox-20b-fp16 | Yes | Yes | No
LLaMA | meta-textgeneration-llama-3-70b-instruct | Yes | Yes | Yes
| meta-textgeneration-llama-3-70b | Yes | Yes | Yes
| meta-textgeneration-llama-3-8b | Yes | Yes | Yes
| meta-textgeneration-llama-3-8b-instruct | Yes | Yes | Yes
| meta-textgeneration-llama-2-7b | Yes | Yes | Yes
| meta-textgeneration-llama-2-7b-f | Yes | Yes | Yes
| meta-textgeneration-llama-2-13b | Yes | Yes | Yes
| meta-textgeneration-llama-2-13b-f | Yes | Yes | Yes
| meta-textgeneration-llama-2-70b | Yes | Yes | Yes
| meta-textgeneration-llama-2-70b-f | Yes | Yes | Yes
| meta-textgeneration-llama-codellama-7b | Yes | Yes | Yes
| meta-textgeneration-llama-codellama-7b-instruct | Yes | Yes | Yes
| meta-textgeneration-llama-codellama-7b-python | Yes | Yes | Yes
| meta-textgeneration-llama-codellama-13b | Yes | Yes | Yes
| meta-textgeneration-llama-codellama-13b-instruct | Yes | Yes | Yes
| meta-textgeneration-llama-codellama-13b-python | Yes | Yes | Yes
| meta-textgeneration-llama-codellama-34b | Yes | Yes | Yes
| meta-textgeneration-llama-codellama-34b-instruct | Yes | Yes | Yes
| meta-textgeneration-llama-codellama-34b-python | Yes | Yes | Yes
| meta-textgeneration-llama-codellama-70b | Yes | Yes | Yes
| meta-textgeneration-llama-codellama-70b-instruct | Yes | Yes | Yes
| meta-textgeneration-llama-codellama-70b-python | Yes | Yes | Yes
| meta-textgeneration-llama-guard-7b | Yes | Yes | Yes
Bloom | huggingface-textgeneration-bloom-1b7 | Yes | Yes | No
| huggingface-textgeneration-bloom-1b1 | Yes | Yes | No
| huggingface-textgeneration-bloom-560m | Yes | Yes | No
| huggingface-textgeneration-bloomz-560m | Yes | Yes | No
| huggingface-textgeneration-bloomz-1b1 | Yes | Yes | No
| huggingface-textgeneration-bloomz-1b7 | Yes | Yes | No
| huggingface-textgeneration1-bloomz-7b1-fp16 | Yes | Yes | No
| huggingface-textgeneration1-bloom-7b1 | Yes | Yes | No
| huggingface-textgeneration1-bloomz-3b-fp16 | Yes | Yes | No
| huggingface-textgeneration1-bloom-3b | Yes | Yes | No
| huggingface-textembedding-bloom-7b1 | Yes | Yes | No
| huggingface-textembedding-bloom-7b1-fp16 | Yes | Yes | No
Cohere | huggingface-llm-cohereforai-c4ai-command-r-plus | Yes | |
Gemma | huggingface-llm-gemma-7b | Yes | Yes | No
| huggingface-llm-gemma-7b-instruct | Yes | Yes | No
| huggingface-llm-gemma-2b | Yes | Yes | No
| huggingface-llm-gemma-2b-instruct | Yes | Yes | No
| huggingface-llm-zephyr-7b-gemma | Yes | Yes | No
gpt2 | huggingface-textgeneration-gpt2 | Yes | No | No
| huggingface-textgeneration-distilgpt2 | Yes | No | No
Mistral | huggingface-llm-mistral-7b | Yes | Yes | Yes
| huggingface-llm-mistral-7b-instruct | Yes | Yes | Yes
| huggingface-llm-mistral-7b-openorca-gptq | Yes | Yes | Yes
| huggingface-llm-amazon-mistrallite | Yes | Yes | Yes
| huggingface-llm-thebloke-mistral-7b-openorca-awq | Yes | Yes | Yes
| huggingface-llm-huggingfaceh4-mistral-7b-sft-beta | Yes | Yes | Yes
| huggingface-llm-huggingfaceh4-mistral-7b-sft-alpha | Yes | Yes | Yes
| huggingface-llm-teknium-openhermes-2-mistral-7b | Yes | Yes | Yes
| huggingface-llm-nousresearch-yarn-mistral-7b-128k | Yes | Yes | Yes
| huggingface-llm-dolphin-2-2-1-mistral-7b | Yes | Yes | Yes
| huggingface-llm-cultrix-mistraltrix-v1 | Yes | Yes | Yes
Mixtral | huggingface-llm-mixtral-8x7b-instruct | Yes | Yes | Yes
| huggingface-llm-mixtral-8x7b-instruct-gptq | Yes | Yes | Yes
| huggingface-llm-mixtral-8x7b | Yes | Yes | Yes
| huggingface-llm-mistralai-mixtral-8x22B-instruct-v0-1 | Yes | Yes | Yes
| huggingface-llm-dolphin-2-5-mixtral-8x7b | Yes | Yes | Yes
| huggingface-llm-dolphin-2-7-mixtral-8x7b | Yes | Yes | Yes
Phi | huggingface-llm-phi-2 | Yes | |

Pre-optimized JumpStart models

The following are the JumpStart models that have pre-optimized configurations.

Meta
  • Llama 3 8B Instruct

  • Llama 3 8B

  • Llama 3 70B Instruct

  • Llama 3 70B

  • Llama 2 70B Chat

  • Llama 2 7B Chat

  • Llama 2 13B Chat

HuggingFace
  • Mixtral 8x7B Instruct

  • Mixtral 8x7B

  • Mistral 7B Instruct

  • Mistral 7B

Pre-compiled JumpStart models

For some models and configurations, SageMaker provides models that are pre-compiled for specific Amazon Inferentia and Amazon Trainium instances. For these, if you create a compilation or optimization job, and you choose ml.inf2.48xlarge or ml.trn1.32xlarge as the deployment instance type, SageMaker fetches the compiled artifacts. Because the job uses a model that’s already compiled, it completes quickly without running the compilation from scratch.

The following are the JumpStart models for which SageMaker has pre-compiled models:

Meta
  • Llama3 8B

  • Llama3 70B

  • Llama2 7B

  • Llama2 70B

  • Llama2 13B

  • Code Llama 7B

  • Code Llama 70B

HuggingFace
  • Mistral 7B