

# Automatic scaling of Amazon SageMaker AI models

Amazon SageMaker AI supports automatic scaling (auto scaling) for your hosted models. *Auto scaling* dynamically adjusts the number of instances provisioned for a model in response to changes in your workload. When the workload increases, auto scaling brings more instances online. When the workload decreases, auto scaling removes unnecessary instances so that you don't pay for provisioned instances that you aren't using.

**Topics**
+ [Auto scaling policy overview](endpoint-auto-scaling-policy.md)
+ [Auto scaling prerequisites](endpoint-auto-scaling-prerequisites.md)
+ [Configure model auto scaling with the console](endpoint-auto-scaling-add-console.md)
+ [Register a model](endpoint-auto-scaling-add-policy.md)
+ [Define a scaling policy](endpoint-auto-scaling-add-code-define.md)
+ [Apply a scaling policy](endpoint-auto-scaling-add-code-apply.md)
+ [Instructions for editing a scaling policy](endpoint-auto-scaling-edit.md)
+ [Temporarily turn off scaling policies](endpoint-auto-scaling-suspend-scaling-activities.md)
+ [Delete a scaling policy](endpoint-auto-scaling-delete.md)
+ [Check the status of a scaling activity by describing scaling activities](endpoint-scaling-query-history.md)
+ [Scale an endpoint to zero instances](endpoint-auto-scaling-zero-instances.md)
+ [Load testing your auto scaling configuration](endpoint-scaling-loadtest.md)
+ [Use Amazon CloudFormation to create a scaling policy](endpoint-scaling-cloudformation.md)
+ [Update endpoints that use auto scaling](endpoint-scaling-update.md)
+ [Delete endpoints configured for auto scaling](endpoint-delete-with-scaling.md)

# Auto scaling policy overview


To use auto scaling, you define a scaling policy that adjusts the number of instances for your production variant in response to actual workloads.

To scale automatically as workload changes occur, you have two options: target tracking scaling policies and step scaling policies.

In most cases, we recommend using target tracking scaling policies. With target tracking, you choose an Amazon CloudWatch metric and a target value. Auto scaling creates and manages the CloudWatch alarms for the scaling policy and calculates the scaling adjustment based on the metric and the target value. The policy adds and removes instances as required to keep the metric at, or close to, the specified target value. For example, a scaling policy that uses the predefined `InvocationsPerInstance` metric with a target value of 70 keeps `InvocationsPerInstance` at, or close to, 70. For more information, see [Target tracking scaling policies](https://docs.amazonaws.cn/autoscaling/application/userguide/application-auto-scaling-target-tracking.html) in the *Application Auto Scaling User Guide*.
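The arithmetic behind target tracking can be pictured as a proportional rule. The following Python sketch is an illustration only, not Application Auto Scaling's implementation (the real service works through CloudWatch alarms, cooldowns, and scaling limits): the desired capacity follows the ratio of the observed metric to the target.

```python
import math

def desired_capacity(current_instances: int, metric_value: float, target_value: float) -> int:
    """Proportional sketch of target tracking: size the fleet so that the
    per-instance metric lands at, or close to, the target value."""
    if current_instances <= 0:
        raise ValueError("current_instances must be positive")
    return max(1, math.ceil(current_instances * metric_value / target_value))

# With 4 instances averaging 140 invocations per instance against a target of 70,
# the rule doubles capacity to 8 instances.
print(desired_capacity(4, 140.0, 70.0))
```

If the metric instead dropped to 35 invocations per instance, the same rule would halve the fleet to 2 instances.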

You can use step scaling when you require an advanced configuration, such as specifying how many instances to deploy under what conditions. For example, you must use step scaling if you want to enable an endpoint to scale out from zero active instances. For an overview of step scaling policies and how they work, see [Step scaling policies](https://docs.amazonaws.cn/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html) in the *Application Auto Scaling User Guide*.

To create a target tracking scaling policy, you specify the following:
+ **Metric** — The CloudWatch metric to track, such as average number of invocations per instance. 
+ **Target value** — The target value for the metric, such as 70 invocations per instance per minute.

You can create target tracking scaling policies with either predefined metrics or custom metrics. A predefined metric is defined in an enumeration so that you can specify it by name in code or use it in the SageMaker AI console. You can also use the Amazon CLI or the Application Auto Scaling API to apply a target tracking scaling policy based on either a predefined or a custom metric.

Note that scaling activities are performed with cooldown periods between them to prevent rapid fluctuations in capacity. You can optionally configure the cooldown periods for your scaling policy. 

For more information about the key concepts of auto scaling, see the following section.

## Schedule-based scaling


You can also create scheduled actions to perform scaling activities at specific times. A scheduled action can scale one time only or scale on a recurring schedule. After a scheduled action runs, your scaling policy can continue to make decisions about whether to scale dynamically as workload changes occur. You can manage scheduled scaling only from the Amazon CLI or the Application Auto Scaling API. For more information, see [Scheduled scaling](https://docs.amazonaws.cn/autoscaling/application/userguide/application-auto-scaling-scheduled-scaling.html) in the *Application Auto Scaling User Guide*.

## Minimum and maximum scaling limits


When configuring auto scaling, you must specify your scaling limits before creating a scaling policy. You set limits separately for the minimum and maximum values.

The minimum value must be at least 1, and equal to or less than the value specified for the maximum value.

The maximum value must be equal to or greater than the value specified for the minimum value. SageMaker AI auto scaling does not enforce a limit for this value.

To determine the scaling limits that you need for typical traffic, test your auto scaling configuration with the expected rate of traffic to your model.

If a variant’s traffic becomes zero, SageMaker AI automatically scales in to the minimum number of instances specified. In this case, SageMaker AI emits metrics with a value of zero.

There are three options for specifying the minimum and maximum capacity:

1. Use the console to update the **Minimum instance count** and **Maximum instance count** settings.

1. Use the Amazon CLI and include the `--min-capacity` and `--max-capacity` options when running the [register-scalable-target](https://docs.amazonaws.cn/cli/latest/reference/application-autoscaling/register-scalable-target.html) command.

1. Call the [RegisterScalableTarget](https://docs.amazonaws.cn/autoscaling/application/APIReference/API_RegisterScalableTarget.html) API and specify the `MinCapacity` and `MaxCapacity` parameters.

**Tip**  
You can manually scale out by increasing the minimum value, or manually scale in by decreasing the maximum value.
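The effect of these limits can be sketched in a few lines. This is an illustration, not SageMaker AI's implementation: whatever capacity a scaling policy computes is clamped to the registered minimum and maximum, which is also why raising the minimum forces a scale-out and lowering the maximum forces a scale-in.

```python
def clamp_capacity(desired: int, min_capacity: int, max_capacity: int) -> int:
    """Clamp a policy's desired instance count to the registered scaling limits."""
    return max(min_capacity, min(desired, max_capacity))

print(clamp_capacity(12, 1, 8))  # policy wants 12; the maximum caps it at 8
print(clamp_capacity(5, 6, 8))   # raising the minimum to 6 forces a scale-out to 6
```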

## Cooldown period


A *cooldown period* is used to protect against over-scaling when your model is scaling in (reducing capacity) or scaling out (increasing capacity). It does this by slowing down subsequent scaling activities until the period expires. Specifically, it blocks the deletion of instances for scale-in requests, and limits the creation of instances for scale-out requests. For more information, see [Define cooldown periods](https://docs.amazonaws.cn/autoscaling/application/userguide/target-tracking-scaling-policy-overview.html#target-tracking-cooldown) in the *Application Auto Scaling User Guide*. 

You configure the cooldown period in your scaling policy. 

If you don't specify a scale-in or a scale-out cooldown period, your scaling policy uses the default, which is 300 seconds for each.

If instances are being added or removed too quickly when you test your scaling configuration, consider increasing this value. You might see this behavior if the traffic to your model has a lot of spikes, or if you have multiple scaling policies defined for a variant.

If instances are not being added quickly enough to address increased traffic, consider decreasing this value.
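As a mental model, a cooldown works like a per-direction gate that blocks a new scaling activity until the configured number of seconds has elapsed since the last one. The following sketch is illustrative only (the service's actual behavior is more nuanced, especially for target tracking), using the 300-second default for each direction:

```python
class CooldownGate:
    """Illustrative sketch of direction-specific cooldown periods.
    300 seconds is the default for both scale in and scale out."""

    def __init__(self, scale_in_cooldown=300, scale_out_cooldown=300):
        self.cooldowns = {"in": scale_in_cooldown, "out": scale_out_cooldown}
        self.last_activity = {"in": None, "out": None}

    def allowed(self, direction, now):
        # A new activity is allowed once the cooldown for that direction expires.
        last = self.last_activity[direction]
        return last is None or (now - last) >= self.cooldowns[direction]

    def record(self, direction, now):
        self.last_activity[direction] = now

gate = CooldownGate(scale_in_cooldown=600, scale_out_cooldown=300)
gate.record("out", now=0)
print(gate.allowed("out", now=120))  # False: still cooling down
print(gate.allowed("out", now=300))  # True: scale-out cooldown expired
```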

## Related resources


For more information about configuring auto scaling, see the following resources:
+ [application-autoscaling](https://docs.amazonaws.cn/cli/latest/reference/application-autoscaling) section of the *Amazon CLI Command Reference*
+ [Application Auto Scaling API Reference](https://docs.amazonaws.cn/autoscaling/application/APIReference/)
+ [Application Auto Scaling User Guide](https://docs.amazonaws.cn/autoscaling/application/userguide/)

**Note**  
SageMaker AI recently introduced new inference capabilities built on real-time inference endpoints. You create a SageMaker AI endpoint with an endpoint configuration that defines the instance type and initial instance count for the endpoint. Then, create an inference component, which is a SageMaker AI hosting object that you can use to deploy a model to an endpoint. For information about scaling inference components, see [SageMaker AI adds new inference capabilities to help reduce foundation model deployment costs and latency](https://amazonaws-china.com/blogs/aws/amazon-sagemaker-adds-new-inference-capabilities-to-help-reduce-foundation-model-deployment-costs-and-latency/) and [Reduce model deployment costs by 50% on average using the latest features of SageMaker AI](https://amazonaws-china.com/blogs/machine-learning/reduce-model-deployment-costs-by-50-on-average-using-sagemakers-latest-features/) on the Amazon Blog.

# Auto scaling prerequisites

Before you can use auto scaling, you must have created an Amazon SageMaker AI model endpoint. You can have multiple model versions deployed to the same endpoint. Each deployed model is referred to as a [production (model) variant](model-ab-testing.md). For more information about deploying a model endpoint, see [Deploy the Model to SageMaker AI Hosting Services](ex1-model-deployment.md#ex1-deploy-model).

To activate auto scaling for a model, you can use the SageMaker AI console, the Amazon Command Line Interface (Amazon CLI), or an Amazon SDK through the Application Auto Scaling API. 
+ If this is your first time configuring scaling for a model, we recommend you [Configure model auto scaling with the console](endpoint-auto-scaling-add-console.md). 
+ When using the Amazon CLI or the Application Auto Scaling API, the flow is to register the model as a scalable target, define the scaling policy, and then apply it. On the SageMaker AI console, under **Inference** in the navigation pane, choose **Endpoints**. Find your model's endpoint name and then choose it to find the variant name. You must specify both the endpoint name and the variant name to activate auto scaling for a model.

Auto scaling is made possible by a combination of the Amazon SageMaker AI, Amazon CloudWatch, and Application Auto Scaling APIs. For information about the minimum required permissions, see [Application Auto Scaling identity-based policy examples](https://docs.amazonaws.cn/autoscaling/application/userguide/security_iam_id-based-policy-examples.html) in the *Application Auto Scaling User Guide*.

The `SagemakerFullAccessPolicy` IAM policy has all the IAM permissions required to perform auto scaling. For more information about SageMaker AI IAM permissions, see [How to use SageMaker AI execution roles](sagemaker-roles.md).

If you manage your own permission policy, you must include the following permissions:


```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:DescribeEndpoint",
        "sagemaker:DescribeEndpointConfig",
        "sagemaker:UpdateEndpointWeightsAndCapacities"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "application-autoscaling:*"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:CreateServiceLinkedRole",
      "Resource": "arn:aws-cn:iam::*:role/aws-service-role/sagemaker.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint",
      "Condition": {
        "StringLike": { "iam:AWSServiceName": "sagemaker.application-autoscaling.amazonaws.com" }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricAlarm",
        "cloudwatch:DescribeAlarms",
        "cloudwatch:DeleteAlarms"
      ],
      "Resource": "*"
    }
  ]
}
```


## Service-linked role


Auto scaling uses the `AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint` service-linked role. This service-linked role grants Application Auto Scaling permission to describe the alarms for your policies, to monitor current capacity levels, and to scale the target resource. This role is created for you automatically. For automatic role creation to succeed, you must have permission for the `iam:CreateServiceLinkedRole` action. For more information, see [Service-linked roles](https://docs.amazonaws.cn/autoscaling/application/userguide/application-auto-scaling-service-linked-roles.html) in the *Application Auto Scaling User Guide*.

# Configure model auto scaling with the console


**To configure auto scaling for a model (console)**

1. Open the Amazon SageMaker AI console at [https://console.amazonaws.cn/sagemaker/](https://console.amazonaws.cn/sagemaker/).

1. On the navigation pane, choose **Inference**, and then choose **Endpoints**. 

1. Choose your endpoint, and then for **Endpoint runtime settings**, choose the variant.

1. Choose **Configure auto scaling**.

1. On the **Configure variant automatic scaling** page, for **Variant automatic scaling**, do the following:

   1. For **Minimum instance count**, type the minimum number of instances that you want the scaling policy to maintain. At least 1 instance is required.

   1. For **Maximum instance count**, type the maximum number of instances that you want the scaling policy to maintain.

1. For **Built-in scaling policy**, do the following:

   1. For the **Target metric**, `SageMakerVariantInvocationsPerInstance` is automatically selected for the metric and cannot be changed.

   1. For the **Target value**, type the average number of invocations per instance per minute for the model. To determine this value, follow the guidelines in [Load testing](endpoint-scaling-loadtest.md).

   1. (Optional) For **Scale-in cool down (seconds)** and **Scale-out cool down (seconds)**, enter the amount of time, in seconds, for each cool down period.

   1. (Optional) Select **Disable scale in** if you don’t want auto scaling to terminate instances when traffic decreases.

1. Choose **Save**.

This procedure registers a model as a scalable target with Application Auto Scaling. When you register a model, Application Auto Scaling performs validation checks to ensure the following:
+ The model exists
+ The permissions are sufficient
+ You aren't registering a variant that uses a burstable performance instance type, such as T2
**Note**  
SageMaker AI doesn't support auto scaling for burstable instances such as T2, because they already allow for increased capacity under increased workloads. For information about burstable performance instances, see [Amazon EC2 instance types](http://www.amazonaws.cn/ec2/instance-types/).

# Register a model


Before you add a scaling policy to your model, you first must register your model for auto scaling and define the scaling limits for the model.

The following procedures cover how to register a model (production variant) for auto scaling using the Amazon Command Line Interface (Amazon CLI) or Application Auto Scaling API.

**Topics**
+ [Register a model (Amazon CLI)](#endpoint-auto-scaling-add-cli)
+ [Register a model (Application Auto Scaling API)](#endpoint-auto-scaling-add-api)

## Register a model (Amazon CLI)


To register your production variant, use the [register-scalable-target](https://docs.amazonaws.cn/cli/latest/reference/application-autoscaling/register-scalable-target.html) command with the following parameters:
+ `--service-namespace`—Set this value to `sagemaker`.
+ `--resource-id`—The resource identifier for the model (specifically, the production variant). For this parameter, the resource type is `endpoint` and the unique identifier is the name of the production variant. For example, `endpoint/my-endpoint/variant/my-variant`.
+ `--scalable-dimension`—Set this value to `sagemaker:variant:DesiredInstanceCount`.
+ `--min-capacity`—The minimum number of instances. This value must be set to at least 1 and must be equal to or less than the value specified for `max-capacity`.
+ `--max-capacity`—The maximum number of instances. This value must be set to at least 1 and must be equal to or greater than the value specified for `min-capacity`.

**Example**  
The following example shows how to register a variant named `my-variant`, running on the `my-endpoint` endpoint, that can be dynamically scaled to have one to eight instances.  

```
aws application-autoscaling register-scalable-target \
  --service-namespace sagemaker \
  --resource-id endpoint/my-endpoint/variant/my-variant \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --min-capacity 1 \
  --max-capacity 8
```

## Register a model (Application Auto Scaling API)


To register your model with Application Auto Scaling, use the [RegisterScalableTarget](https://docs.amazonaws.cn/autoscaling/application/APIReference/API_RegisterScalableTarget.html) Application Auto Scaling API action with the following parameters:
+ `ServiceNamespace`—Set this value to `sagemaker`.
+ `ResourceId`—The resource identifier for the production variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example, `endpoint/my-endpoint/variant/my-variant`.
+ `ScalableDimension`—Set this value to `sagemaker:variant:DesiredInstanceCount`.
+ `MinCapacity`—The minimum number of instances. This value must be set to at least 1 and must be equal to or less than the value specified for `MaxCapacity`.
+ `MaxCapacity`—The maximum number of instances. This value must be set to at least 1 and must be equal to or greater than the value specified for `MinCapacity`.

**Example**  
The following example shows how to register a variant named `my-variant`, running on the `my-endpoint` endpoint, that can be dynamically scaled to use one to eight instances.  

```
POST / HTTP/1.1
Host: application-autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.RegisterScalableTarget
X-Amz-Date: 20230506T182145Z
User-Agent: aws-cli/2.0.0 Python/3.7.5 Windows/10 botocore/2.0.0dev4
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 1,
    "MaxCapacity": 8
}
```

# Define a scaling policy


Before you add a scaling policy to your model, save your policy configuration as a JSON block in a text file. You use that text file when invoking the Amazon Command Line Interface (Amazon CLI) or the Application Auto Scaling API. You can optimize scaling by choosing an appropriate CloudWatch metric. However, before using a custom metric in production, you must test auto scaling with your custom metric.
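Because the CLI and API read the policy configuration from a JSON file, you can also generate `config.json` programmatically. The following sketch uses the predefined `SageMakerVariantInvocationsPerInstance` metric with an example target value of 70; substitute your own metric and target.

```python
import json

# Minimal target tracking configuration. The target value of 70 invocations
# per instance is an example; determine yours through load testing.
policy_config = {
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    },
}

# Write the file that you later pass to the CLI as file://config.json
with open("config.json", "w") as f:
    json.dump(policy_config, f, indent=4)
```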

**Topics**
+ [Specify a predefined metric (CloudWatch metric: InvocationsPerInstance)](#endpoint-auto-scaling-add-code-predefined)
+ [Specify a high-resolution predefined metric (CloudWatch metrics: ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy)](#endpoint-auto-scaling-add-code-high-res)
+ [Define a custom metric (CloudWatch metric: CPUUtilization)](#endpoint-auto-scaling-add-code-custom)
+ [Define a custom metric (CloudWatch metric: ExplanationsPerInstance)](#endpoint-auto-scaling-online-explainability)
+ [Specify cooldown periods](#endpoint-auto-scaling-add-code-cooldown)

This section shows you example policy configurations for target tracking scaling policies.

## Specify a predefined metric (CloudWatch metric: InvocationsPerInstance)


**Example**  
The following is an example target tracking policy configuration for a variant that keeps the average invocations per instance at 70. Save this configuration in a file named `config.json`.  

```
{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification":
    {
        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    }
}
```
For more information, see [TargetTrackingScalingPolicyConfiguration](https://docs.amazonaws.cn/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html) in the *Application Auto Scaling API Reference*.

## Specify a high-resolution predefined metric (CloudWatch metrics: ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy)


With the following high-resolution CloudWatch metrics, you can set scaling policies for the volume of concurrent requests that your models receive:

**ConcurrentRequestsPerModel**  
The number of concurrent requests being received by a model container.

**ConcurrentRequestsPerCopy**  
The number of concurrent requests being received by an inference component.

These metrics track the number of simultaneous requests that your model containers handle, including the requests that are queued inside the containers. For models that send their inference response as a stream of tokens, these metrics track each request until the model sends the last token for the request.

As high-resolution metrics, they emit data more frequently than standard CloudWatch metrics. Standard metrics, such as the `InvocationsPerInstance` metric, emit data once every minute. However, these high-resolution metrics emit data every 10 seconds. Therefore, as the concurrent traffic to your models increases, your policy reacts by scaling out much more quickly than it would for standard metrics. However, as the traffic to your models decreases, your policy scales in at the same speed as it would for standard metrics.
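The difference in reaction time follows directly from the emission frequency. As a rough sketch (the number of datapoints an alarm evaluates is an assumption chosen for illustration, not a documented SageMaker AI value), the minimum time a breach must persist before a scaling alarm can fire is the metric period multiplied by the datapoints evaluated:

```python
def detection_delay_seconds(period_seconds: int, datapoints_to_alarm: int) -> int:
    """Rough lower bound on how long a metric breach must persist
    before an alarm with the given settings can fire."""
    return period_seconds * datapoints_to_alarm

# Standard metric: one datapoint per minute; high-resolution: one every 10 seconds.
# Assuming the alarm evaluates 3 datapoints in both cases:
print(detection_delay_seconds(60, 3))  # 180 seconds with a standard metric
print(detection_delay_seconds(10, 3))  # 30 seconds with a high-resolution metric
```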

The following is an example target tracking policy configuration that adds instances if the number of concurrent requests per model exceeds 5. Save this configuration in a file named `config.json`.

```
{
    "TargetValue": 5.0,
    "PredefinedMetricSpecification":
    {
        "PredefinedMetricType": "SageMakerVariantConcurrentRequestsPerModelHighResolution"
    }
}
```

If you use inference components to deploy multiple models to the same endpoint, you can create an equivalent policy. In that case, set `PredefinedMetricType` to `SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution`.

For more information, see [TargetTrackingScalingPolicyConfiguration](https://docs.amazonaws.cn/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html) in the *Application Auto Scaling API Reference*.

## Define a custom metric (CloudWatch metric: CPUUtilization)


To create a target tracking scaling policy with a custom metric, specify the metric's name, namespace, unit, statistic, and zero or more dimensions. A dimension consists of a dimension name and a dimension value. You can use any production variant metric that changes in proportion to capacity. 

**Example**  
The following example configuration shows a target tracking scaling policy with a custom metric. The policy scales the variant based on an average CPU utilization of 50 percent across all instances. Save this configuration in a file named `config.json`.  

```
{
    "TargetValue": 50.0,
    "CustomizedMetricSpecification":
    {
        "MetricName": "CPUUtilization",
        "Namespace": "/aws/sagemaker/Endpoints",
        "Dimensions": [
            {"Name": "EndpointName", "Value": "my-endpoint" },
            {"Name": "VariantName","Value": "my-variant"}
        ],
        "Statistic": "Average",
        "Unit": "Percent"
    }
}
```
For more information, see [CustomizedMetricSpecification](https://docs.amazonaws.cn/autoscaling/application/APIReference/API_CustomizedMetricSpecification.html) in the *Application Auto Scaling API Reference*. 

## Define a custom metric (CloudWatch metric: ExplanationsPerInstance)


When the endpoint has online explainability activated, it emits an `ExplanationsPerInstance` metric that reports the average number of records explained per minute, per instance, for a variant. The resource utilization of explaining records can differ significantly from that of predicting records. We strongly recommend using this metric for target tracking scaling of endpoints that have online explainability activated.

You can create multiple target tracking policies for a scalable target. Consider adding the `InvocationsPerInstance` policy from the [Specify a predefined metric (CloudWatch metric: InvocationsPerInstance)](#endpoint-auto-scaling-add-code-predefined) section (in addition to the `ExplanationsPerInstance` policy). If most invocations don't return an explanation because of the threshold value set in the `EnableExplanations` parameter, then the endpoint can choose the `InvocationsPerInstance` policy. If there is a large number of explanations, the endpoint can use the `ExplanationsPerInstance` policy. 
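The interaction between the two policies can be pictured with a short sketch. This is an illustration only, built on the proportional rule from earlier and on Application Auto Scaling's behavior of scaling on whichever policy provides the largest capacity when multiple target tracking policies are in force:

```python
import math

def desired_from_policy(current: int, metric_value: float, target: float) -> int:
    """Proportional sketch of one target tracking policy's desired capacity."""
    return max(1, math.ceil(current * metric_value / target))

def combined_desired(current: int, policies) -> int:
    """With multiple policies in force, the largest computed capacity wins (sketch)."""
    return max(desired_from_policy(current, m, t) for m, t in policies)

# 2 instances; the invocations policy (metric 140, target 70) asks for 4,
# the explanations policy (metric 10, target 20) asks for 1, so 4 wins.
print(combined_desired(2, [(140.0, 70.0), (10.0, 20.0)]))
```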

**Example**  
The following example configuration shows a target tracking scaling policy with a custom metric. The policy adjusts the number of variant instances so that each instance has an `ExplanationsPerInstance` metric of 20. Save this configuration in a file named `config.json`.  

```
{
    "TargetValue": 20.0,
    "CustomizedMetricSpecification":
    {
        "MetricName": "ExplanationsPerInstance",
        "Namespace": "AWS/SageMaker",
        "Dimensions": [
            {"Name": "EndpointName", "Value": "my-endpoint" },
            {"Name": "VariantName","Value": "my-variant"}
        ],
        "Statistic": "Sum"
    }
}
```

For more information, see [CustomizedMetricSpecification](https://docs.amazonaws.cn/autoscaling/application/APIReference/API_CustomizedMetricSpecification.html) in the *Application Auto Scaling API Reference*. 

## Specify cooldown periods


You can optionally define cooldown periods in your target tracking scaling policy by specifying the `ScaleOutCooldown` and `ScaleInCooldown` parameters. 

**Example**  
The following is an example target tracking policy configuration for a variant that keeps the average invocations per instance at 70. The policy configuration provides a scale-in cooldown period of 10 minutes (600 seconds) and a scale-out cooldown period of 5 minutes (300 seconds). Save this configuration in a file named `config.json`.   

```
{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification":
    {
        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    },
    "ScaleInCooldown": 600,
    "ScaleOutCooldown": 300
}
```
For more information, see [TargetTrackingScalingPolicyConfiguration](https://docs.amazonaws.cn/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html) in the *Application Auto Scaling API Reference*. 

# Apply a scaling policy


After you register your model and define a scaling policy, apply the scaling policy to the registered model. This section shows how to apply a scaling policy using the Amazon Command Line Interface (Amazon CLI) or the Application Auto Scaling API. 

**Topics**
+ [Apply a target tracking scaling policy (Amazon CLI)](#endpoint-auto-scaling-add-code-apply-cli)
+ [Apply a scaling policy (Application Auto Scaling API)](#endpoint-auto-scaling-add-code-apply-api)

## Apply a target tracking scaling policy (Amazon CLI)


To apply a scaling policy to your model, use the [put-scaling-policy](https://docs.amazonaws.cn/cli/latest/reference/application-autoscaling/put-scaling-policy.html) Amazon CLI command with the following parameters:
+ `--policy-name`—The name of the scaling policy.
+ `--policy-type`—Set this value to `TargetTrackingScaling`.
+ `--resource-id`—The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example, `endpoint/my-endpoint/variant/my-variant`.
+ `--service-namespace`—Set this value to `sagemaker`.
+ `--scalable-dimension`—Set this value to `sagemaker:variant:DesiredInstanceCount`.
+ `--target-tracking-scaling-policy-configuration`—The target-tracking scaling policy configuration to use for the model.

**Example**  
The following example applies a target tracking scaling policy named `my-scaling-policy` to a variant named `my-variant`, running on the `my-endpoint` endpoint. For the `--target-tracking-scaling-policy-configuration` option, specify the `config.json` file that you created previously.   

```
aws application-autoscaling put-scaling-policy \
  --policy-name my-scaling-policy \
  --policy-type TargetTrackingScaling \
  --resource-id endpoint/my-endpoint/variant/my-variant \
  --service-namespace sagemaker \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --target-tracking-scaling-policy-configuration file://config.json
```

## Apply a scaling policy (Application Auto Scaling API)


To apply a scaling policy to a variant with the Application Auto Scaling API, use the [PutScalingPolicy](https://docs.amazonaws.cn/autoscaling/application/APIReference/API_PutScalingPolicy.html) Application Auto Scaling API action with the following parameters:
+ `PolicyName`—The name of the scaling policy.
+ `ServiceNamespace`—Set this value to `sagemaker`.
+ `ResourceId`—The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example, `endpoint/my-endpoint/variant/my-variant`.
+ `ScalableDimension`—Set this value to `sagemaker:variant:DesiredInstanceCount`.
+ `PolicyType`—Set this value to `TargetTrackingScaling`.
+ `TargetTrackingScalingPolicyConfiguration`—The target-tracking scaling policy configuration to use for the variant.

**Example**  
The following example applies a target tracking scaling policy named `my-scaling-policy` to a variant named `my-variant`, running on the `my-endpoint` endpoint. The policy configuration keeps the average invocations per instance at 70.  

```
POST / HTTP/1.1
Host: application-autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.PutScalingPolicy
X-Amz-Date: 20230506T182145Z
User-Agent: aws-cli/2.0.0 Python/3.7.5 Windows/10 botocore/2.0.0dev4
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "PolicyName": "my-scaling-policy",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 70.0,
        "PredefinedMetricSpecification":
        {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        }
    }
}
```

# Instructions for editing a scaling policy


After creating a scaling policy, you can edit any of its settings except the name.

To edit a target tracking scaling policy with the Amazon Web Services Management Console, use the same procedure that you used to [Configure model auto scaling with the console](endpoint-auto-scaling-add-console.md).

You can use the Amazon CLI or the Application Auto Scaling API to edit a scaling policy in the same way that you create a new scaling policy. For more information, see [Apply a scaling policy](endpoint-auto-scaling-add-code-apply.md).

# Temporarily turn off scaling policies


After you configure auto scaling, you have the following options if you need to investigate an issue without interference from scaling policies (dynamic scaling):
+ Temporarily suspend and then resume scaling activities by calling the [register-scalable-target](https://docs.amazonaws.cn/cli/latest/reference/application-autoscaling/register-scalable-target.html) CLI command or [RegisterScalableTarget](https://docs.amazonaws.cn/autoscaling/application/APIReference/API_RegisterScalableTarget.html) API action, specifying a Boolean value for both `DynamicScalingInSuspended` and `DynamicScalingOutSuspended`.   
**Example**  

  The following example shows how to suspend scaling policies for a variant named `my-variant`, running on the `my-endpoint` endpoint.

  ```
  aws application-autoscaling register-scalable-target \
    --service-namespace sagemaker \
    --resource-id endpoint/my-endpoint/variant/my-variant \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --suspended-state '{"DynamicScalingInSuspended":true,"DynamicScalingOutSuspended":true}'
  ```
+ Prevent specific target tracking scaling policies from scaling in your variant by disabling the policy's scale-in portion. This prevents the scaling policy from removing instances, while still allowing it to add them as needed.

  Temporarily disable and then enable scale-in activities by editing the policy using the [put-scaling-policy](https://docs.amazonaws.cn/cli/latest/reference/application-autoscaling/put-scaling-policy.html) CLI command or the [PutScalingPolicy](https://docs.amazonaws.cn/autoscaling/application/APIReference/API_PutScalingPolicy.html) API action, specifying a Boolean value for `DisableScaleIn`.  
**Example**  

  The following is an example of a target tracking configuration for a scaling policy that will scale out but not scale in. 

  ```
  {
      "TargetValue": 70.0,
      "PredefinedMetricSpecification":
      {
          "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
      },
      "DisableScaleIn": true
  }
  ```
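To illustrate the effect of `DisableScaleIn`, the following Python sketch models the target tracking decision in a simplified way. It is illustrative only: the `decide_scaling` helper is hypothetical and not part of any AWS SDK, and real target tracking also accounts for cooldowns and capacity bounds.

```python
def decide_scaling(config, current_metric):
    """Simplified model of a target tracking decision.

    config: dict shaped like a TargetTrackingScalingPolicyConfiguration.
    Returns "scale-out", "scale-in", or "no-change".
    """
    target = config["TargetValue"]
    if current_metric > target:
        return "scale-out"  # add capacity to bring the metric back down
    if current_metric < target:
        # DisableScaleIn blocks capacity removal, but never scale-out
        if config.get("DisableScaleIn", False):
            return "no-change"
        return "scale-in"
    return "no-change"


config = {
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    },
    "DisableScaleIn": True,
}

print(decide_scaling(config, 90.0))  # scale-out still allowed
print(decide_scaling(config, 40.0))  # scale-in suppressed
```

With `DisableScaleIn` set to `True`, a metric below the target produces no action instead of removing instances.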

# Delete a scaling policy


If you no longer need a scaling policy, you can delete it at any time.

**Topics**
+ [

## Delete all scaling policies and deregister the model (console)
](#endpoint-auto-scaling-delete-console)
+ [

## Delete a scaling policy (Amazon CLI or Application Auto Scaling API)
](#endpoint-auto-scaling-delete-code)

## Delete all scaling policies and deregister the model (console)


**To delete all scaling policies and deregister the variant as a scalable target**

1. Open the Amazon SageMaker AI console at [https://console.amazonaws.cn/sagemaker/](https://console.amazonaws.cn/sagemaker/).

1. On the navigation pane, choose **Endpoints**.

1. Choose your endpoint, and then for **Endpoint runtime settings**, choose the variant.

1. Choose **Configure auto scaling**.

1. Choose **Deregister auto scaling**.

## Delete a scaling policy (Amazon CLI or Application Auto Scaling API)


You can use the Amazon CLI or the Application Auto Scaling API to delete a scaling policy from a variant.

### Delete a scaling policy (Amazon CLI)


To delete a scaling policy from a variant, use the [delete-scaling-policy](https://docs.amazonaws.cn/cli/latest/reference/application-autoscaling/delete-scaling-policy.html) command with the following parameters:
+ `--policy-name`—The name of the scaling policy.
+ `--resource-id`—The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example, `endpoint/my-endpoint/variant/my-variant`.
+ `--service-namespace`—Set this value to `sagemaker`.
+ `--scalable-dimension`—Set this value to `sagemaker:variant:DesiredInstanceCount`.

**Example**  
The following example deletes a target tracking scaling policy named `my-scaling-policy` from a variant named `my-variant`, running on the `my-endpoint` endpoint.  

```
aws application-autoscaling delete-scaling-policy \
  --policy-name my-scaling-policy \
  --resource-id endpoint/my-endpoint/variant/my-variant \
  --service-namespace sagemaker \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount
```

### Delete a scaling policy (Application Auto Scaling API)


To delete a scaling policy from your variant, use the [DeleteScalingPolicy](https://docs.amazonaws.cn/autoscaling/application/APIReference/API_DeleteScalingPolicy.html) Application Auto Scaling API action with the following parameters:
+ `PolicyName`—The name of the scaling policy.
+ `ServiceNamespace`—Set this value to `sagemaker`.
+ `ResourceId`—The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example, `endpoint/my-endpoint/variant/my-variant`.
+ `ScalableDimension`—Set this value to `sagemaker:variant:DesiredInstanceCount`.

**Example**  
The following example deletes a target tracking scaling policy named `my-scaling-policy` from a variant named `my-variant`, running on the `my-endpoint` endpoint.  

```
POST / HTTP/1.1
Host: application-autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.DeleteScalingPolicy
X-Amz-Date: 20230506T182145Z
User-Agent: aws-cli/2.0.0 Python/3.7.5 Windows/10 botocore/2.0.0dev4
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "PolicyName": "my-scaling-policy",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount"
}
```

# Check the status of a scaling activity by describing scaling activities
Check the status of a scaling activity

You can check the status of a scaling activity for your auto scaled endpoint by describing scaling activities. Application Auto Scaling provides descriptive information about the scaling activities in the specified namespace from the previous six weeks. For more information, see [Scaling activities for Application Auto Scaling](https://docs.amazonaws.cn/autoscaling/application/userguide/application-auto-scaling-scaling-activities.html) in the *Application Auto Scaling User Guide*.

To check the status of a scaling activity, use the [describe-scaling-activities](https://docs.amazonaws.cn/cli/latest/reference/application-autoscaling/describe-scaling-activities.html) command. You can't check the status of a scaling activity using the console.

**Topics**
+ [

## Describe scaling activities (Amazon CLI)
](#endpoint-how-to)
+ [

## Identify blocked scaling activities from instance quotas (Amazon CLI)
](#endpoint-identify-blocked-autoscaling)

## Describe scaling activities (Amazon CLI)


To describe scaling activities for all SageMaker AI resources that are registered with Application Auto Scaling, use the [describe-scaling-activities](https://docs.amazonaws.cn/cli/latest/reference/application-autoscaling/describe-scaling-activities.html) command, specifying `sagemaker` for the `--service-namespace` option.

```
aws application-autoscaling describe-scaling-activities \
  --service-namespace sagemaker
```

To describe scaling activities for a specific resource, include the `--resource-id` option. 

```
aws application-autoscaling describe-scaling-activities \
  --service-namespace sagemaker \
  --resource-id endpoint/my-endpoint/variant/my-variant
```

The following example shows a scaling activity from the output of this command.

```
{
    "ActivityId": "activity-id",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "Description": "string",
    "Cause": "string",
    "StartTime": timestamp,
    "EndTime": timestamp,
    "StatusCode": "string",
    "StatusMessage": "string"
}
```

## Identify blocked scaling activities from instance quotas (Amazon CLI)


When you scale out (add more instances), you might reach your account-level instance quota. You can use the [describe-scaling-activities](https://docs.amazonaws.cn/cli/latest/reference/application-autoscaling/describe-scaling-activities.html) command to check whether you have reached your instance quota. When you exceed your quota, auto scaling is blocked. 

To check if you have reached your instance quota, use the [describe-scaling-activities](https://docs.amazonaws.cn/cli/latest/reference/application-autoscaling/describe-scaling-activities.html) command and specify the resource ID for the `--resource-id` option. 

```
aws application-autoscaling describe-scaling-activities \
    --service-namespace sagemaker \
    --resource-id endpoint/my-endpoint/variant/my-variant
```

Within the response, check the [StatusCode](https://docs.amazonaws.cn/autoscaling/application/APIReference/API_ScalingActivity.html#autoscaling-Type-ScalingActivity-StatusCode) and [StatusMessage](https://docs.amazonaws.cn/autoscaling/application/APIReference/API_ScalingActivity.html#autoscaling-Type-ScalingActivity-StatusMessage) keys and their associated values. If you have reached your instance quota, `StatusCode` returns `Failed`, and `StatusMessage` contains a message indicating that the account-level service quota was reached. The following is an example of what that message might look like: 

```
{
    "ActivityId": "activity-id",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "Description": "string",
    "Cause": "minimum capacity was set to 110",
    "StartTime": timestamp,
    "EndTime": timestamp,
    "StatusCode": "Failed",
    "StatusMessage": "Failed to set desired instance count to 110. Reason: The 
    account-level service limit 'ml.xx.xxxxxx for endpoint usage' is 1000 
    Instances, with current utilization of 997 Instances and a request delta 
    of 20 Instances. Please contact Amazon support to request an increase for this 
    limit. (Service: AmazonSageMaker; Status Code: 400; 
    Error Code: ResourceLimitExceeded; Request ID: request-id)."
}
```
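If you need to check for this condition programmatically, you can scan the response for failed activities. The following Python sketch is illustrative only: the `find_quota_failures` helper is hypothetical, and the sample response is abbreviated.

```python
def find_quota_failures(response):
    """Return activities that failed due to an account-level resource limit.

    response: dict shaped like describe-scaling-activities output,
    i.e. {"ScalingActivities": [...]}.
    """
    return [
        a for a in response.get("ScalingActivities", [])
        if a.get("StatusCode") == "Failed"
        and "ResourceLimitExceeded" in a.get("StatusMessage", "")
    ]


# Abbreviated sample response for illustration.
sample = {
    "ScalingActivities": [
        {
            "ActivityId": "activity-id",
            "ResourceId": "endpoint/my-endpoint/variant/my-variant",
            "StatusCode": "Failed",
            "StatusMessage": "... Error Code: ResourceLimitExceeded; ...",
        },
        {"ActivityId": "ok-id", "StatusCode": "Successful", "StatusMessage": ""},
    ]
}

failures = find_quota_failures(sample)
print(len(failures))  # 1
```

A non-empty result is a signal to request a quota increase before scale-out can proceed.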

# Scale an endpoint to zero instances


When you set up auto scaling for an endpoint, you can allow the scale-in process to reduce the number of in-service instances to zero. By doing so, you save costs during periods when your endpoint isn't serving inference requests and therefore doesn't require any active instances. 

However, after scaling in to zero instances, your endpoint can't respond to any incoming inference requests until it provisions at least one instance. To automate the provisioning process, you create a step scaling policy with Application Auto Scaling. Then, you assign the policy to an Amazon CloudWatch alarm.

After you set up the step scaling policy and the alarm, your endpoint will automatically provision an instance soon after it receives an inference request that it can't respond to. Be aware that the provisioning process takes several minutes. During that time, any attempts to invoke the endpoint will produce an error.

The following procedures explain how to set up auto scaling for an endpoint so that it scales in to, and out from, zero instances. The procedures use commands with the Amazon CLI.

**Before you begin**

Before your endpoint can scale in to, and out from, zero instances, it must meet the following requirements:
+ It is in service.
+ It hosts one or more inference components. An endpoint can scale to and from zero instances only if it hosts inference components.

  For information about hosting inference components on SageMaker AI endpoints, see [Deploy models for real-time inference](realtime-endpoints-deploy-models.md).
+ In the endpoint configuration, for the production variant `ManagedInstanceScaling` object, you've set the `MinInstanceCount` parameter to `0`.

  For reference information about this parameter, see [ProductionVariantManagedInstanceScaling](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_ProductionVariantManagedInstanceScaling.html).
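For example, a production variant in your endpoint configuration might include a `ManagedInstanceScaling` object like the following (the variant name, instance type, and maximum count are illustrative):

```
{
  "VariantName": "my-variant",
  "InstanceType": "ml.c5.xlarge",
  "ManagedInstanceScaling": {
    "Status": "ENABLED",
    "MinInstanceCount": 0,
    "MaxInstanceCount": 4
  }
}
```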

**To enable an endpoint to scale in to zero instances (Amazon CLI)**

For each inference component that the endpoint hosts, do the following:

1. Register the inference component as a scalable target. When you register it, set the minimum capacity to `0`, as shown by the following command:

   ```
   aws application-autoscaling register-scalable-target \
     --service-namespace sagemaker \
     --resource-id inference-component/inference-component-name \
     --scalable-dimension sagemaker:inference-component:DesiredCopyCount \
     --min-capacity 0 \
     --max-capacity n
   ```

   In this example, replace *inference-component-name* with the name of your inference component. Replace *n* with the maximum number of inference component copies to provision when scaling out.

   For more information about this command and each of its parameters, see [register-scalable-target](https://docs.amazonaws.cn/cli/latest/reference/application-autoscaling/register-scalable-target.html) in the *Amazon CLI Command Reference*.

1. Apply a target tracking policy to the inference component, as shown by the following command:

   ```
   aws application-autoscaling put-scaling-policy \
     --policy-name my-scaling-policy \
     --policy-type TargetTrackingScaling \
     --resource-id inference-component/inference-component-name \
     --service-namespace sagemaker \
     --scalable-dimension sagemaker:inference-component:DesiredCopyCount \
     --target-tracking-scaling-policy-configuration file://config.json
   ```

   In this example, replace *inference-component-name* with the name of your inference component.

   In the example, the `config.json` file contains a target tracking policy configuration, such as the following:

   ```
   {
     "PredefinedMetricSpecification": {
         "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
     },
     "TargetValue": 1,
     "ScaleInCooldown": 300,
     "ScaleOutCooldown": 300
   }
   ```

   For more example tracking policy configurations, see [Define a scaling policy](endpoint-auto-scaling-add-code-define.md).

   For more information about this command and each of its parameters, see [put-scaling-policy](https://docs.amazonaws.cn/cli/latest/reference/application-autoscaling/put-scaling-policy.html) in the *Amazon CLI Command Reference*.

**To enable an endpoint to scale out from zero instances (Amazon CLI)**

For each inference component that the endpoint hosts, do the following:

1. Apply a step scaling policy to the inference component, as shown by the following command:

   ```
   aws application-autoscaling put-scaling-policy \
     --policy-name my-scaling-policy \
     --policy-type StepScaling \
     --resource-id inference-component/inference-component-name \
     --service-namespace sagemaker \
     --scalable-dimension sagemaker:inference-component:DesiredCopyCount \
     --step-scaling-policy-configuration file://config.json
   ```

   In this example, replace *my-scaling-policy* with a unique name for your policy. Replace *inference-component-name* with the name of your inference component.

   In the example, the `config.json` file contains a step scaling policy configuration, such as the following:

   ```
   {
       "AdjustmentType": "ChangeInCapacity",
       "MetricAggregationType": "Maximum",
       "Cooldown": 60,
       "StepAdjustments":
         [
            {
              "MetricIntervalLowerBound": 0,
              "ScalingAdjustment": 1
            }
         ]
   }
   ```

   When this step scaling policy is triggered, SageMaker AI provisions the necessary instances to support the inference component copies.

   After you create the step scaling policy, take note of its Amazon Resource Name (ARN). You need the ARN for the CloudWatch alarm in the next step.

   For more information about step scaling polices, see [Step scaling policies](https://docs.amazonaws.cn/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html) in the *Application Auto Scaling User Guide*.

1. Create a CloudWatch alarm and assign the step scaling policy to it, as shown by the following example:

   ```
   aws cloudwatch put-metric-alarm \
   --alarm-actions step-scaling-policy-arn \
   --alarm-description "Alarm when SM IC endpoint invoked that has 0 instances." \
   --alarm-name ic-step-scaling-alarm \
   --comparison-operator GreaterThanThreshold  \
   --datapoints-to-alarm 1 \
   --dimensions "Name=InferenceComponentName,Value=inference-component-name" \
   --evaluation-periods 1 \
   --metric-name NoCapacityInvocationFailures \
   --namespace AWS/SageMaker \
   --period 60 \
   --statistic Sum \
   --threshold 1
   ```

   In this example, replace *step-scaling-policy-arn* with the ARN of your step scaling policy. Replace *ic-step-scaling-alarm* with a name of your choice. Replace *inference-component-name* with the name of your inference component. 

   This example sets the `--metric-name` parameter to `NoCapacityInvocationFailures`. SageMaker AI emits this metric when an endpoint receives an inference request, but the endpoint has no active instances to serve the request. When that event occurs, the alarm initiates the step scaling policy in the previous step.

   For more information about this command and each of its parameters, see [put-metric-alarm](https://docs.amazonaws.cn/cli/latest/reference/cloudwatch/put-metric-alarm.html) in the *Amazon CLI Command Reference*.
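The alarm settings above can be read as: evaluate one 60-second period at a time, take the `Sum` of `NoCapacityInvocationFailures` datapoints, and fire when enough periods breach the threshold. The following Python sketch models that evaluation in a simplified way (the `alarm_fires` helper is hypothetical and not part of any AWS SDK):

```python
def alarm_fires(period_sums, threshold=1.0, datapoints_to_alarm=1):
    """Simplified model of the CloudWatch alarm defined above:
    statistic Sum, comparison GreaterThanThreshold.

    period_sums: per-period sums of NoCapacityInvocationFailures.
    """
    breaching = sum(1 for s in period_sums if s > threshold)
    return breaching >= datapoints_to_alarm


print(alarm_fires([0.0]))  # no failed invocations: alarm stays OK
print(alarm_fires([3.0]))  # several failures in one period: alarm fires
```

When the alarm fires, it invokes the step scaling policy ARN supplied in `--alarm-actions`, which provisions the first inference component copy.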

# Load testing your auto scaling configuration
Load testing

Perform load tests to choose a scaling configuration that works the way you want.

The following guidelines for load testing assume you are using a scaling policy that uses the predefined target metric `SageMakerVariantInvocationsPerInstance`.

**Topics**
+ [

## Determine the performance characteristics
](#endpoint-scaling-loadtest-variant)
+ [

## Calculate the target load
](#endpoint-scaling-loadtest-calc)

## Determine the performance characteristics


Perform load testing to find the peak `InvocationsPerInstance` that your model's production variant can handle, and the latency of requests, as concurrency increases.

This value depends on the instance type chosen, payloads that clients of your model typically send, and the performance of any external dependencies your model has.

**To find the peak requests-per-second (RPS) your model's production variant can handle and latency of requests**

1. Set up an endpoint with your model using a single instance. For information about how to set up an endpoint, see [Deploy the Model to SageMaker AI Hosting Services](ex1-model-deployment.md#ex1-deploy-model).

1. Use a load testing tool to generate an increasing number of parallel requests, and monitor the RPS and model latency in the output of the load testing tool. 
**Note**  
You can also monitor requests-per-minute instead of RPS. In that case, omit the multiplication by 60 in the equation for `SageMakerVariantInvocationsPerInstance` shown below.

   The point at which model latency increases or the proportion of successful transactions decreases is the peak RPS that your model can handle.

## Calculate the target load


After you find the performance characteristics of the variant, you can determine the maximum RPS that should be sent to an instance. The threshold used for scaling must be less than this maximum value. Use the following equation in combination with load testing to determine the correct value for the `SageMakerVariantInvocationsPerInstance` target metric in your scaling configuration.

```
SageMakerVariantInvocationsPerInstance = (MAX_RPS * SAFETY_FACTOR) * 60
```

Where `MAX_RPS` is the maximum RPS that you determined previously, and `SAFETY_FACTOR` is the safety factor that you chose to ensure that your clients don't exceed the maximum RPS. The multiplication by 60 converts RPS to invocations-per-minute, matching the per-minute CloudWatch metric that SageMaker AI uses to implement auto scaling. (Skip this conversion if you measured requests-per-minute instead of requests-per-second.)

**Note**  
SageMaker AI recommends that you start testing with a `SAFETY_FACTOR` of 0.5. Test your scaling configuration to ensure it operates in the way you expect with your model for both increasing and decreasing customer traffic on your endpoint.
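As a quick sanity check, the calculation can be scripted. The following Python sketch is illustrative only; the function name `invocations_per_instance_target` is not part of any AWS SDK.

```python
def invocations_per_instance_target(max_rps, safety_factor=0.5):
    """Convert a measured peak RPS into the per-minute target value for
    the SageMakerVariantInvocationsPerInstance metric."""
    return max_rps * safety_factor * 60


# Example: load testing found a peak of 20 RPS on a single instance.
print(invocations_per_instance_target(20))  # 600.0 invocations per minute
```

You would then set `TargetValue` in your target tracking configuration to the returned value.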

# Use Amazon CloudFormation to create a scaling policy


The following example shows how to configure model auto scaling on an endpoint using Amazon CloudFormation.

```
  Endpoint:
    Type: "AWS::SageMaker::Endpoint"
    Properties:
      EndpointName: yourEndpointName
      EndpointConfigName: yourEndpointConfigName

  ScalingTarget:
    Type: "AWS::ApplicationAutoScaling::ScalableTarget"
    Properties:
      MaxCapacity: 10
      MinCapacity: 2
      ResourceId: endpoint/my-endpoint/variant/my-variant
      RoleARN: arn
      ScalableDimension: sagemaker:variant:DesiredInstanceCount
      ServiceNamespace: sagemaker

  ScalingPolicy:
    Type: "AWS::ApplicationAutoScaling::ScalingPolicy"
    Properties:
      PolicyName: my-scaling-policy
      PolicyType: TargetTrackingScaling
      ScalingTargetId:
        Ref: ScalingTarget
      TargetTrackingScalingPolicyConfiguration:
        TargetValue: 70.0
        ScaleInCooldown: 600
        ScaleOutCooldown: 30
        PredefinedMetricSpecification:
          PredefinedMetricType: SageMakerVariantInvocationsPerInstance
```

For more information, see [Create Application Auto Scaling resources with Amazon CloudFormation](https://docs.amazonaws.cn/autoscaling/application/userguide/creating-resources-with-cloudformation.html) in the *Application Auto Scaling User Guide*.

# Update endpoints that use auto scaling


When you update an endpoint, Application Auto Scaling checks to see whether any of the models on that endpoint are targets for auto scaling. If the update would change the instance type for any model that is a target for auto scaling, the update fails. 

In the Amazon Web Services Management Console, you see a warning that you must deregister the model from auto scaling before you can update it. If you are trying to update the endpoint by calling the [UpdateEndpoint](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API, the call fails. Before you update the endpoint, delete any scaling policies configured for it and deregister the variant as a scalable target by calling the [DeregisterScalableTarget](https://docs.amazonaws.cn/autoscaling/application/APIReference/API_DeregisterScalableTarget.html) Application Auto Scaling API action. After you update the endpoint, you can register the updated variant as a scalable target and attach a scaling policy.

There is one exception. If you change the model for a variant that is configured for auto scaling, Amazon SageMaker AI auto scaling allows the update. This is because changing the model doesn't typically affect performance enough to change scaling behavior. If you do update a model for a variant configured for auto scaling, ensure that the change to the model doesn't significantly affect performance and scaling behavior.

When you update SageMaker AI endpoints that have auto scaling applied, complete the following steps:

**To update an endpoint that has auto scaling applied**

1. Deregister the endpoint as a scalable target by calling [DeregisterScalableTarget](https://docs.amazonaws.cn/autoscaling/application/APIReference/API_DeregisterScalableTarget.html).

1. Because auto scaling is blocked while the update operation is in progress (or if you turned off auto scaling in the previous step), you might want to take the additional precaution of increasing the number of instances for your endpoint during the update. To do this, update the instance counts for the production variants hosted at the endpoint by calling [UpdateEndpointWeightsAndCapacities](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_UpdateEndpointWeightsAndCapacities.html).

1. Call [DescribeEndpoint](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_DescribeEndpoint.html) repeatedly until the value of the `EndpointStatus` field of the response is `InService`.

1. Call [DescribeEndpointConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_DescribeEndpointConfig.html) to get the values of the current endpoint config.

1. Create a new endpoint config by calling [CreateEndpointConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateEndpointConfig.html). For the production variants where you want to keep the existing instance count or weight, use the same variant names that appear in the [DescribeEndpointConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_DescribeEndpointConfig.html) response from the previous step, and use that response's values for all other fields.

1. Update the endpoint by calling [UpdateEndpoint](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_UpdateEndpoint.html). Specify the endpoint config you created in the previous step as the `EndpointConfig` field. If you want to retain the variant properties like instance count or weight, set the value of the `RetainAllVariantProperties` parameter to `True`. This specifies that production variants with the same name are updated with the most recent `DesiredInstanceCount` from the `DescribeEndpoint` response, regardless of the values of the `InitialInstanceCount` field in the new `EndpointConfig`.

1. (Optional) Re-activate auto scaling by calling [RegisterScalableTarget](https://docs.amazonaws.cn/autoscaling/application/APIReference/API_RegisterScalableTarget.html) and [PutScalingPolicy](https://docs.amazonaws.cn/autoscaling/application/APIReference/API_PutScalingPolicy.html).

**Note**  
Steps 1 and 7 are required only if you are updating an endpoint with the following changes:  
+ Changing the instance type for a production variant that has auto scaling configured.
+ Removing a production variant that has auto scaling configured.

# Delete endpoints configured for auto scaling


If you delete an endpoint, Application Auto Scaling checks to see whether any of the models on that endpoint are targets for auto scaling. If any are and you have permission to deregister the model, Application Auto Scaling deregisters those models as scalable targets without notifying you. If you use a custom permission policy that doesn't provide permission for the [DeregisterScalableTarget](https://docs.amazonaws.cn/autoscaling/application/APIReference/API_DeregisterScalableTarget.html) action, you must request access to this action before deleting the endpoint.

**Note**  
As an IAM user, you might not have sufficient permission to delete an endpoint if another user configured auto scaling for a variant on that endpoint.