Troubleshooting Lambda Managed Instances
Throttling and scaling issues
High error rates during scale-up
Problem: You experience throttling errors (HTTP 429) when traffic increases rapidly.
Cause: Lambda Managed Instances scale asynchronously based on CPU resource utilization and multi-concurrency saturation. If your traffic more than doubles within 5 minutes, you may see throttles as Lambda scales up instances and execution environments to meet demand.
Solution:
- Adjust target resource utilization: If your workload has predictable traffic patterns, set a lower target resource utilization to maintain additional headroom for traffic bursts.
- Pre-warm capacity: For planned traffic increases, ramp up traffic gradually over a longer period so that scaling can keep pace.
- Monitor scaling metrics: Track throttle metrics to identify the cause of throttling and capacity scaling issues.
- Review function configuration: Ensure your function memory and vCPU settings support multi-concurrent executions. Increase function memory or vCPU allocation if needed.
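The headroom trade-off behind a lower target utilization can be sketched with a back-of-envelope calculation. This formula is an illustration for capacity planning, not Lambda's actual scaling algorithm, and all input values below are hypothetical:

```python
import math

def instances_for_headroom(peak_concurrency, per_instance_concurrency,
                           target_utilization):
    """Estimate how many instances keep steady-state load at or below
    the target utilization, leaving the remainder as burst headroom."""
    usable = per_instance_concurrency * target_utilization
    return math.ceil(peak_concurrency / usable)

# Lowering target utilization from 0.8 to 0.6 buys extra burst headroom:
print(instances_for_headroom(400, 64, 0.8))  # 8 instances
print(instances_for_headroom(400, 64, 0.6))  # 11 instances
```

With the lower target, roughly three extra instances absorb a sudden burst before throttling would otherwise occur.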
Slow scale-down
Problem: Instances take a long time to scale down after traffic decreases.
Cause: Lambda Managed Instances scale down gradually to maintain availability and avoid rapid capacity changes that could impact performance.
Solution:
This is expected behavior. Lambda scales down instances conservatively to ensure stability. Monitor your CloudWatch metrics to track the number of running instances.
Concurrency issues
Execution environments with low concurrency experience throttles
Problem: Your functions experience throttling despite having available capacity.
Cause: Execution environments with very low maximum concurrency may have difficulty scaling effectively. Lambda Managed Instances are designed for multi-concurrent applications.
Solution:
- Increase maximum concurrency: If your function invocations use very little CPU, increase the maximum concurrency setting up to 64 per vCPU.
- Optimize function code: Review your function code to reduce CPU consumption per invocation, allowing higher concurrency.
- Adjust function memory and vCPU: Ensure your function has sufficient resources to handle multiple concurrent invocations.
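As a quick sanity check against the 64-per-vCPU limit mentioned above, you can compute the smallest vCPU allocation that accommodates a desired concurrency. The numbers are hypothetical; only the 64-per-vCPU cap comes from this guide:

```python
import math

MAX_CONCURRENCY_PER_VCPU = 64  # upper limit described in this guide

def min_vcpus_for(target_concurrency):
    """Smallest vCPU allocation that keeps target_concurrency
    within the per-vCPU concurrency cap."""
    return math.ceil(target_concurrency / MAX_CONCURRENCY_PER_VCPU)

print(min_vcpus_for(64))   # 1 vCPU
print(min_vcpus_for(150))  # 3 vCPUs
```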
Thread safety issues (Java runtime)
Problem: Your Java function produces incorrect results or experiences race conditions under load.
Cause: Multiple threads execute the handler method simultaneously, and shared state is not thread-safe.
Solution:
- Use AtomicInteger or AtomicLong for counters instead of primitive types
- Replace HashMap with ConcurrentHashMap
- Use Collections.synchronizedList() to wrap ArrayList
- Use ThreadLocal for request-specific state
- Access trace IDs from the Lambda Context object, not environment variables
For detailed guidance, see the Java runtime for Lambda Managed Instances documentation.
State isolation issues (Node.js runtime)
Problem: Your Node.js function returns data from different requests or experiences data corruption.
Cause: Global variables are shared across concurrent invocations on the same worker thread. When async operations yield control, other invocations can modify shared state.
Solution:
- Install and use @aws/lambda-invoke-store for all request-specific state
- Replace global variables with InvokeStore.set() and InvokeStore.get()
- Use unique file names in /tmp with request IDs
- Access trace IDs using InvokeStore.getXRayTraceId() instead of environment variables
For detailed guidance, see the Node.js runtime for Lambda Managed Instances documentation.
File conflicts (Python runtime)
Problem: Your Python function reads incorrect data from files in /tmp.
Cause: Multiple processes share the /tmp directory. Concurrent writes to the same file can cause data corruption.
Solution:
- Use unique file names with request IDs: /tmp/request_{context.request_id}.txt
- Use file locking with fcntl.flock() for shared files
- Clean up temporary files with os.remove() after use
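The three practices above can be combined in a short sketch. The request ID value is illustrative; in a real handler it would come from the context object:

```python
import fcntl
import os

def write_private_file(request_id, data):
    # A unique per-request file name prevents concurrent requests
    # from clobbering each other's files in /tmp.
    path = f"/tmp/request_{request_id}.txt"
    with open(path, "w") as f:
        f.write(data)
    return path

def append_to_shared_file(path, line):
    # For a file that genuinely must be shared across requests, an
    # exclusive fcntl.flock() serializes concurrent writers.
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.write(line + "\n")
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

def cleanup(path):
    # Remove temporary files after use so /tmp does not fill up.
    if os.path.exists(path):
        os.remove(path)
```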
For detailed guidance, see the Python runtime for Lambda Managed Instances documentation.
Performance issues
High memory utilization
Problem: Your functions experience high memory utilization or out-of-memory errors.
Cause: Each concurrent request in Python runs in a separate process with its own memory space. Total memory usage equals per-process memory multiplied by concurrent processes.
Solution:
- Monitor the MemoryUtilization metric in CloudWatch
- Reduce the MaxConcurrency setting if memory usage approaches the function's memory limit
- Increase function memory allocation to support higher concurrency
- Optimize memory usage by loading data on-demand instead of during initialization
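The on-demand loading advice can look like this in Python. The data source here is a hypothetical placeholder for whatever large object your function uses; the point is that the load happens on first use rather than at module import time, so processes that never need the data pay no memory cost:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def reference_data():
    # Runs on the first call only; later calls in the same process
    # reuse the cached object instead of re-loading it.
    return {"rows": list(range(100_000))}  # placeholder for a large load

def handler(event, context):
    data = reference_data()  # loaded lazily, not at module import time
    return len(data["rows"])
```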
Inconsistent performance
Problem: Function performance varies significantly between invocations.
Cause: Lambda may select different instance types based on availability, or functions may be running on instances with varying resource availability.
Solution:
- Specify allowed instance types: If you have specific performance requirements, configure allowed instance types in your capacity provider to limit the instance types Lambda can select.
- Monitor instance-level metrics: Track CPUUtilization and MemoryUtilization at the capacity provider level to identify resource constraints.
- Review capacity metrics: Check vCPUAvailable and MemoryAvailable to ensure sufficient resources are available on your instances.
Capacity provider issues
Function version fails to become ACTIVE
Problem: Your function version remains in a pending state after publishing.
Cause: Lambda is launching Managed Instances and starting execution environments. This process takes time, especially for the first function version on a new capacity provider.
Solution:
Wait for Lambda to complete the initialization process. Lambda launches three instances by default for AZ resiliency and starts three execution environments before marking your function version ACTIVE. This typically takes several minutes.
Cannot delete capacity provider
Problem: You receive an error when attempting to delete a capacity provider.
Cause: You cannot delete a capacity provider that has function versions attached to it.
Solution:
- Identify all function versions using the capacity provider with the ListFunctionVersionsByCapacityProvider API.
- Delete or update those function versions to remove the capacity provider association.
- Retry deleting the capacity provider.
Generic error messages during function publishing
Problem: You encounter generic error messages such as "Internal error occurred during publishing" when publishing functions.
Solution:
- Check IAM permissions: Ensure you have the lambda:PassCapacityProvider permission for the capacity provider you're trying to use.
- Verify capacity provider configuration: Confirm that your capacity provider is in the ACTIVE state using the GetCapacityProvider API.
- Review VPC configuration: Ensure the subnets and security groups specified in your capacity provider are correctly configured and accessible.
- Check Amazon CloudTrail logs: Review CloudTrail logs for detailed error information about the failed operation.
Monitoring and observability issues
Missing CloudWatch metrics
Problem: You don't see expected metrics in CloudWatch for your capacity provider or functions.
Cause: Metrics are published at 5-minute intervals. New capacity providers or functions may not have metrics available immediately.
Solution:
Wait at least 5-10 minutes after publishing a function version before expecting metrics to appear in CloudWatch. Verify you're looking at the correct namespace (AWS/Lambda) and dimensions (CapacityProviderName, FunctionName, or InstanceType).
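When verifying the namespace and dimensions, it can help to build the query parameters explicitly. The sketch below constructs arguments for CloudWatch's get_metric_statistics call; the namespace, metric, and dimension names are taken from this guide, and the client call is shown only in a comment because it requires boto3 and AWS credentials:

```python
from datetime import datetime, timedelta, timezone

def capacity_metric_query(metric_name, capacity_provider_name, minutes=30):
    """Build get_metric_statistics arguments using the AWS/Lambda
    namespace and the CapacityProviderName dimension."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Lambda",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "CapacityProviderName", "Value": capacity_provider_name},
        ],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 300,  # metrics are published at 5-minute intervals
        "Statistics": ["Average"],
    }

# Usage (requires boto3 and AWS credentials):
# cw = boto3.client("cloudwatch")
# resp = cw.get_metric_statistics(**capacity_metric_query(
#     "CPUUtilization", "my-capacity-provider"))
```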
Cannot find CloudWatch logs
Problem: Your function executes successfully, but you cannot find logs in CloudWatch Logs.
Cause: Lambda Managed Instances run in your VPC and require network connectivity to send logs to CloudWatch Logs. Without proper VPC connectivity configuration, your functions cannot reach the CloudWatch Logs service endpoint.
Solution:
Configure VPC connectivity to enable your functions to send logs to CloudWatch Logs. You have three options:
Option 1: VPC endpoint for CloudWatch Logs (recommended for production)
- Open the Amazon VPC console at console.aws.amazon.com/vpc/.
- In the navigation pane, choose Endpoints.
- Choose Create endpoint.
- For Service category, choose Amazon services.
- For Service name, select com.amazonaws.region.logs (replace region with your Amazon Region).
- For VPC, select the VPC used by your capacity provider.
- For Subnets, select the subnets where you want to create endpoint network interfaces. For high availability, select subnets in multiple Availability Zones.
- For Security groups, select security groups that allow inbound HTTPS traffic (port 443) from your function's security group.
- Enable Private DNS for the endpoint.
- Choose Create endpoint.
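The console steps in Option 1 can also be scripted. This sketch builds the parameters for EC2's create_vpc_endpoint API; the region and resource IDs in the usage comment are hypothetical placeholders, and the call itself is commented out because it requires boto3 and AWS credentials:

```python
def logs_endpoint_params(region, vpc_id, subnet_ids, security_group_ids):
    """Parameters for ec2.create_vpc_endpoint: an interface endpoint
    for CloudWatch Logs with private DNS enabled."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.logs",
        "SubnetIds": subnet_ids,                  # multiple AZs for HA
        "SecurityGroupIds": security_group_ids,   # must allow inbound 443
        "PrivateDnsEnabled": True,
    }

# Usage (requires boto3 and AWS credentials; IDs are placeholders):
# ec2 = boto3.client("ec2", region_name="us-east-1")
# ec2.create_vpc_endpoint(**logs_endpoint_params(
#     "us-east-1", "vpc-0abc", ["subnet-0abc", "subnet-0def"], ["sg-0abc"]))
```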
Option 2: Public subnet with internet gateway
If your capacity provider uses public subnets, ensure:
- An internet gateway is attached to your VPC
- The route table routes 0.0.0.0/0 traffic to the internet gateway
- Security groups allow outbound HTTPS traffic on port 443
Option 3: Private subnet with NAT gateway
If your capacity provider uses private subnets, ensure:
- A NAT gateway exists in a public subnet
- The private subnet route table routes 0.0.0.0/0 traffic to the NAT gateway
- The public subnet route table routes 0.0.0.0/0 traffic to an internet gateway
- Security groups allow outbound HTTPS traffic on port 443
For detailed guidance on VPC connectivity options, see VPC connectivity for Lambda Managed Instances.
Difficulty correlating logs from concurrent requests
Problem: Logs from different requests are interleaved, making it difficult to trace individual requests.
Cause: Log interleaving is expected behavior in multi-concurrent systems: multiple requests write to the same log stream at the same time.
Solution:
- Use structured logging with JSON format: Include the request ID in all log statements
- Java: Use Log4j with ThreadContext to automatically include the request ID
- Node.js: Use console.log() with JSON formatting and include InvokeStore.getRequestId()
- Python: Use the standard logging module with JSON formatting and include context.request_id
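For the Python case, a minimal structured-logging setup might look like the sketch below. The JSON field names are illustrative; the request ID is attached per log call via the extra dict so the formatter itself stays request-agnostic:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line carrying the request ID,
    so interleaved logs can be filtered per request."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        })

def make_logger(stream=None):
    logger = logging.getLogger("handler")
    logger.handlers.clear()  # avoid duplicate handlers on re-init
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

def format_demo():
    # Show what one formatted line looks like, without a live logger.
    record = logging.LogRecord("handler", logging.INFO, __file__, 0,
                               "processing", None, None)
    record.request_id = "req-123"
    return JsonFormatter().format(record)

# In a handler: logger.info("processing", extra={"request_id": context.request_id})
```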
For detailed guidance, see the runtime-specific documentation pages.
Getting additional help
If you continue to experience issues after trying these solutions:
- Review CloudWatch metrics: Check capacity provider and execution environment metrics to identify resource constraints or scaling issues.
- Check Amazon CloudTrail logs: Review CloudTrail logs for detailed information about API calls and errors.
- Contact Amazon Support: If you cannot resolve the issue, contact Amazon Support with details about your capacity provider configuration, function configuration, and the specific error messages you're encountering.
Next steps
- Learn about capacity providers for Lambda Managed Instances
- Understand scaling for Lambda Managed Instances
- Review runtime-specific guides for Java, Node.js, and Python
- Monitor Lambda Managed Instances with CloudWatch metrics