Configuring provisioned concurrency - Amazon Lambda
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Configuring provisioned concurrency

In Lambda, concurrency is the number of in-flight requests that your function is currently handling. There are two types of concurrency controls available:

  • Reserved concurrency – This represents the maximum number of concurrent instances allocated to your function. When a function has reserved concurrency, no other function can use that concurrency. Configuring reserved concurrency for a function incurs no additional charges.

  • Provisioned concurrency – This is the number of pre-initialized execution environments allocated to your function. These execution environments are ready to respond immediately to incoming function requests. Configuring provisioned concurrency incurs additional charges to your Amazon Web Services account.

This topic details how to manage and configure provisioned concurrency. For a conceptual overview of these two types of concurrency controls, see Reserved concurrency and provisioned concurrency. For more information on configuring reserved concurrency, see Configuring reserved concurrency.

Note

Lambda functions linked to an Amazon MQ event source mapping have a default maximum concurrency. For Apache Active MQ, the maximum number of concurrent instances is 5. For Rabbit MQ, the maximum number of concurrent instances is 1. Setting reserved or provisioned concurrency for your function doesn't change these limits. To request an increase in the default maximum concurrency when using Amazon MQ, contact Amazon Web Services Support.

Configuring provisioned concurrency

You can configure provisioned concurrency settings for a function using the Lambda console or the Lambda API.

To allocate provisioned concurrency for a function (console)
  1. Open the Functions page of the Lambda console.

  2. Choose the function you want to allocate provisioned concurrency for.

  3. Choose Configuration and then choose Concurrency.

  4. Under Provisioned concurrency configurations, choose Add configuration.

  5. Choose Reserve concurrency. Enter the amount of concurrency to reserve for the function.

  6. Choose the qualifier type, and alias or version.

    Note

    You cannot use provisioned concurrency with the $LATEST version of any function.

    If your function has an event source, make sure that event source points to the correct function alias or version. Otherwise, your function won't use provisioned concurrency environments.

  7. Enter a number under Provisioned concurrency. Lambda provides an estimate of monthly costs.

  8. Choose Save.

You can configure up to the Unreserved account concurrency in your account, minus 100. The remaining 100 units of concurrency are for functions that aren't using reserved concurrency. For example, if your account has a concurrency limit of 1,000, and you haven't assigned any reserved or provisioned concurrency to any of your other functions, you can configure a maximum of 900 provisioned concurrency units for a single function.


        An error occurs if you try to allocate too much provisioned concurrency.

Configuring provisioned concurrency for a function has an impact on the concurrency pool available to other functions. For instance, if you configure 100 units of provisioned concurrency for function-a, other functions in your account must share the remaining 900 units of concurrency. This is true even if function-a doesn't use all 100 units.

It's possible to allocate both reserved concurrency and provisioned concurrency for the same function. In such cases, the provisioned concurrency cannot exceed the reserved concurrency.

This limitation extends to function versions. The maximum provisioned concurrency you can assign to a specific function version is the function's reserved concurrency minus the provisioned concurrency on other function versions.

To configure provisioned concurrency with the Lambda API, use the following API operations.

For example, to configure provisioned concurrency with the Amazon Command Line Interface (CLI), use the put-provisioned-concurrency-config command. The following command allocates 100 units of provisioned concurrency for the BLUE alias of a function named my-function:

aws lambda put-provisioned-concurrency-config --function-name my-function \ --qualifier BLUE \ --provisioned-concurrent-executions 100

You should see output that looks like the following:

{ "Requested ProvisionedConcurrentExecutions": 100, "Allocated ProvisionedConcurrentExecutions": 0, "Status": "IN_PROGRESS", "LastModified": "2023-01-21T11:30:00+0000" }

Accurately estimating required provisioned concurrency

You can view any active function's concurrency metrics using CloudWatch metrics. Specifically, the ConcurrentExecutions metric shows you the number of concurrent invocations for functions in your account.


        Graph showing concurrency for a function over time.

The previous graph suggests that this function serves an average of 5 to 10 concurrent requests at any given time, and peaks at 20 requests. Suppose that there are many other functions in your account. If this function is critical to your application and you need a low-latency response on every invocation, configure at least 20 units of provisioned concurrency.

Recall that you can also calculate concurrency using the following formula:

Concurrency = (average requests per second) * (average request duration in seconds)

To estimate how much concurrency you need, multiply average requests per second with the average request duration in seconds. You can estimate average requests per second using the Invocation metric, and the average request duration in seconds using the Duration metric.

When configuring provisioned concurrency, Lambda suggests adding a 10% buffer on top of the amount of concurrency your function typically needs. For example, if your function usually peaks at 200 concurrent requests, set the provisioned concurrency to 220 (200 concurrent requests + 10% = 220 provisioned concurrency).

Optimizing latency with provisioned concurrency

To optimize for latency, the structure of your function code can vary based on whether you use provisioned concurrency or on-demand environments. For functions running on provisioned concurrency, Lambda runs any initialization code, such as loading libraries and instantiating clients, during allocation time. Therefore, it's advisable to move as much initialization outside of the main function handler to avoid impacting latency during actual function invocations. In contrast, initializing libraries or instantiating clients within your main handler code means your function must run this each time it's invoked, regardless of whether you're using provisioned concurrency.

For on-demand invocations, Lambda may need to rerun your initialization code every time your function experiences a cold start. For such functions, you may choose to defer initialization of a specific capability until your function needs it. For example, consider the following control flow for a Lambda handler:

def handler(event, context): ... if ( some_condition ): // Initialize CLIENT_A to perform a task else: // Do nothing

In the previous example, instead of initializing CLIENT_A outside of the main handler, the developer initialized it within the if statement. By doing this, Lambda runs this code only if some_condition is met. If you initialize CLIENT_A outside the main handler, Lambda runs that code on every cold start. This can increase overall latency.

It's possible for your function to use up all of its provisioned concurrency. Lambda uses on-demand instances to handle any excess traffic. To determine the type of initialization Lambda used for a specific environment, check the value of the AWS_LAMBDA_INITIALIZATION_TYPE environment variable. This variable has two possible values: provisioned-concurrency or on-demand. The value of AWS_LAMBDA_INITIALIZATION_TYPE is immutable and remains constant throughout the lifetime of the environment.

If you're using the .NET 6 or .NET 7 runtimes, you can configure the AWS_LAMBDA_DOTNET_PREJIT environment variable to improve the latency for functions, even if they don't use provisioned concurrency. The .NET runtime employs lazy compilation and initialization for each library that your code calls for the first time. As a result, the first invocation of a Lambda function may take longer than subsequent ones. To mitigate this, you can choose one of three values for AWS_LAMBDA_DOTNET_PREJIT:

  • ProvisionedConcurrency: Lambda performs ahead-of-time JIT compilation for all environments using provisioned concurrency. This is the default value.

  • Always: Lambda performs ahead-of-time JIT compilation for every environment, even if the function doesn't use provisioned concurrency.

  • Never: Lambda disables ahead-of-time JIT compilation for all environments.

For provisioned concurrency environments, your function's initialization code runs during allocation, and periodically as Lambda recycles instances of your environment. You can see the initialization time in logs and traces after an environment instance processes a request. It's important to note that Lambda bills you for initialization even if the environment instance never processes a request. Provisioned concurrency runs continually and incurs separate billing from initialization and invocation costs. For more details, see Amazon Lambda Pricing.

Also, when you configure a Lambda function with provisioned concurrency, Lambda pre-initializes that execution environment so that it's available in advance of function invocation requests. However, your function publishes invocation logs to CloudWatch only when the function is actually invoked. Therefore, the Init Duration field appears in the REPORT log line of the first function invocation, even though the initialization happened ahead of time. This does not mean the function experienced a cold start.

For additional guidance on optimizing functions using provisioned concurrency, see Lambda execution environments in Serverless Land.

Managing provisioned concurrency with Application Auto Scaling

You can use Application Auto Scaling to manage provisioned concurrency on a schedule or based on utilization. If your function receives predictable traffic patterns, use scheduled scaling. If you want your function to maintain a specific utilization percentage, use a target tracking scaling policy.

Scheduled scaling

With Application Auto Scaling, you can set your own scaling schedule according to predictable load changes. For more information and examples, see Scheduled scaling for Application Auto Scaling in the Application Auto Scaling User Guide, and Scheduling Amazon Lambda Provisioned Concurrency for recurring peak usage on the Amazon Compute Blog.

Target tracking

With target tracking, Application Auto Scaling creates and manages a set of CloudWatch alarms based on how you define your scaling policy. When these alarms activate, Application Auto Scaling automatically adjusts the amount of environments allocated using provisioned concurrency. Use target tracking for applications that don't have predictable traffic patterns.

To scale provisioned concurrency using target tracking, use the RegisterScalableTarget and PutScalingPolicy Application Auto Scaling API operations. For example, if you're using the Amazon Command Line Interface (CLI), follow these steps:

  1. Register a function's alias as a scaling target. The following example registers the BLUE alias of a function named my-function:

    aws application-autoscaling register-scalable-target --service-namespace lambda \ --resource-id function:my-function:BLUE --min-capacity 1 --max-capacity 100 \ --scalable-dimension lambda:function:ProvisionedConcurrency
  2. Apply a scaling policy to the target. The following example configures Application Auto Scaling to adjust the provisioned concurrency configuration for an alias to keep utilization near 70 percent, but you can apply any value between 10% and 90%.

    aws application-autoscaling put-scaling-policy \ --service-namespace lambda \ --scalable-dimension lambda:function:ProvisionedConcurrency \ --resource-id function:my-function:BLUE \ --policy-name my-policy \ --policy-type TargetTrackingScaling \ --target-tracking-scaling-policy-configuration '{ "TargetValue": 0.7, "PredefinedMetricSpecification": { "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization" }}'

You should see output that looks like this:

{ "PolicyARN": "arn:aws:autoscaling:us-east-2:123456789012:scalingPolicy:12266dbb-1524-xmpl-a64e-9a0a34b996fa:resource/lambda/function:my-function:BLUE:policyName/my-policy", "Alarms": [ { "AlarmName": "TargetTracking-function:my-function:BLUE-AlarmHigh-aed0e274-xmpl-40fe-8cba-2e78f000c0a7", "AlarmARN": "arn:aws:cloudwatch:us-east-2:123456789012:alarm:TargetTracking-function:my-function:BLUE-AlarmHigh-aed0e274-xmpl-40fe-8cba-2e78f000c0a7" }, { "AlarmName": "TargetTracking-function:my-function:BLUE-AlarmLow-7e1a928e-xmpl-4d2b-8c01-782321bc6f66", "AlarmARN": "arn:aws:cloudwatch:us-east-2:123456789012:alarm:TargetTracking-function:my-function:BLUE-AlarmLow-7e1a928e-xmpl-4d2b-8c01-782321bc6f66" } ] }

Application Auto Scaling creates two alarms in CloudWatch. The first alarm triggers when the utilization of provisioned concurrency consistently exceeds 70%. When this happens, Application Auto Scaling allocates more provisioned concurrency to reduce utilization. The second alarm triggers when utilization is consistently less than 63% (90 percent of the 70% target). When this happens, Application Auto Scaling reduces the alias's provisioned concurrency.

In the following example, a function scales between a minimum and maximum amount of provisioned concurrency based on utilization.


          Autoscaling provisioned concurrency with Application Auto Scaling target tracking.
Legend
  • Function instances

  • Open requests

  • Provisioned concurrency

  • Standard concurrency

When the number of open requests increase, Application Auto Scaling increases provisioned concurrency in large steps until it reaches the configured maximum. After this, the function can continue to scale on standard, unreserved concurrency if you haven't reached your account concurrency limit. When utilization drops and stays low, Application Auto Scaling decreases provisioned concurrency in smaller periodic steps.

Both of the Application Auto Scaling alarms use the average statistic by default. Functions that experience quick bursts of traffic may not trigger these alarms. For example, suppose your Lambda function executes quickly (i.e. 20-100 ms) and your traffic comes in quick bursts. In this case, the number of requests exceeds the allocated provisioned concurrency during the burst. However, Application Auto Scaling requires the burst load to sustain for at least 3 minutes in order to provision additional environments. Additionally, both CloudWatch alarms require 3 data points that hit the target average to activate the auto scaling policy. If your function experiences quick bursts of traffic, using the Maximum statistic instead of the Average statistic can be more effective at scaling provisioned concurrency to minimize cold starts.

For more information on target tracking scaling policies, see Target tracking scaling policies for Application Auto Scaling.