Amazon Managed Service for Apache Flink was previously known as Amazon Kinesis Data Analytics for Apache Flink.
Troubleshoot performance issues
This section contains a list of symptoms that you can check to diagnose and fix performance issues.
If your data source is a Kinesis stream, performance issues typically present as a high or increasing millisBehindLatest metric. For other sources, you can check a similar metric that represents lag in reading from the source.
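As a rough illustration of what "high or increasing" can mean in practice, the following sketch flags a series of lag readings. The function name and both thresholds are hypothetical values chosen for the example, not service defaults; tune them for your workload.

```python
from statistics import mean


def lag_is_concerning(samples_ms, high_ms=60_000, growth_ratio=1.5):
    """Return True if lag is high or clearly increasing.

    samples_ms: lag readings in milliseconds, oldest first.
    high_ms and growth_ratio are illustrative assumptions, not AWS defaults.
    """
    if not samples_ms:
        return False
    # "High": the most recent reading is at or above the threshold.
    if samples_ms[-1] >= high_ms:
        return True
    # Too few points to judge a trend.
    if len(samples_ms) < 4:
        return False
    # "Increasing": the newer half averages well above the older half.
    half = len(samples_ms) // 2
    older = mean(samples_ms[:half])
    newer = mean(samples_ms[half:])
    return newer > older * growth_ratio
```

The same check applies to any source-lag metric, as long as the readings are ordered oldest to newest.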
Understand the data path
When investigating a performance issue with your application, consider the entire path that your data takes. The following application components may become performance bottlenecks and create backpressure if they are not properly designed or provisioned:
Data sources and destinations: Ensure that the external resources your application interacts with are properly provisioned for the throughput your application will experience.
State data: Ensure that your application doesn't interact with the state store too frequently. You can optimize the serializer your application is using. The default Kryo serializer can handle any serializable type, but you can use a more performant serializer if your application only stores data in POJO types. For information about Apache Flink serializers, see Data Types & Serialization in the Apache Flink documentation.
Operators: Ensure that the business logic implemented by your operators isn't too complicated, and that you aren't creating or using resources with every record processed. Also ensure that your application isn't creating sliding or tumbling windows too frequently.
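The point about not creating resources with every record can be sketched as follows. This is plain Python, not a Flink API: ExpensiveClient and ReusingOperator are hypothetical names, and the open/process split only mirrors the general idea of initializing a resource once per operator instance.

```python
class ExpensiveClient:
    """Hypothetical stand-in for a costly resource (connection, parser, HTTP client)."""

    instances_created = 0

    def __init__(self):
        # Count constructions so the cost of per-record creation is visible.
        ExpensiveClient.instances_created += 1

    def transform(self, record):
        return record.upper()


class ReusingOperator:
    """Creates the client once, then reuses it for every record."""

    def open(self):
        # One-time setup, done before any record is processed.
        self._client = ExpensiveClient()

    def process(self, record):
        # No allocation here: the per-record path only uses the existing client.
        return self._client.transform(record)
```

An operator that constructed ExpensiveClient inside process would pay the setup cost once per record instead of once per operator instance.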
Performance troubleshooting solutions
This section contains potential solutions to performance issues.
CloudWatch monitoring levels
Verify that the CloudWatch monitoring levels are not set to an overly verbose setting.
The Debug Monitoring Log Level setting generates a large amount of traffic, which can create backpressure. You should only use it while actively investigating issues with the application.
If your application has a high Parallelism setting, using the Parallelism Monitoring Metrics Level will similarly generate a large amount of traffic that can lead to backpressure. Only use this metrics level when Parallelism for your application is low, or while investigating issues with the application.
For more information, see Control application monitoring levels.
Application CPU metric
Check the application's CPU metric. If this metric is above 75 percent, you can allow the application to allocate more resources for itself by enabling auto scaling.
If auto scaling is enabled, the application allocates more resources if CPU usage is over 75 percent for 15 minutes. For more information about scaling, see the Manage scaling properly section following, and Implement application scaling in Managed Service for Apache Flink.
Note
An application will only scale automatically in response to CPU usage. The application will not auto scale in response to other system metrics, such as heapMemoryUtilization. If your application has a high level of usage for other metrics, increase your application's parallelism manually.
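To reason about the "over 75 percent for 15 minutes" condition against your own CPU datapoints, a minimal sketch might look like the following. It assumes one sample per minute and treats the condition as "every sample in the most recent window exceeds the threshold"; how the service itself evaluates the window is not described here, so treat this only as an approximation for your own checks.

```python
def sustained_high_cpu(cpu_percent_by_minute, threshold=75.0, minutes=15):
    """Return True if the most recent `minutes` samples all exceed `threshold`.

    cpu_percent_by_minute: one CPU utilization reading per minute, oldest first.
    The one-sample-per-minute granularity is an assumption of this sketch.
    """
    if len(cpu_percent_by_minute) < minutes:
        # Not enough history to cover the full window.
        return False
    recent = cpu_percent_by_minute[-minutes:]
    return all(value > threshold for value in recent)
```

A single dip below the threshold inside the window makes this check return False, which is why short CPU spikes would not be expected to satisfy it.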
Application parallelism
Increase the application's parallelism. You update the application's parallelism using the ParallelismConfigurationUpdate parameter of the UpdateApplication action.
The default maximum number of KPUs for an application is 64. You can raise this limit by requesting a limit increase.
It is important to also assign parallelism to each operator based on its workload, rather than just increasing application parallelism alone. See Operator parallelism following.
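The following sketch builds a request body for the UpdateApplication action that changes only the parallelism settings. The helper name and its default argument values are hypothetical; the application name and version ID are placeholders you would supply yourself.

```python
def build_parallelism_update(app_name, current_version_id, parallelism,
                             parallelism_per_kpu=1, auto_scaling=True):
    """Build an UpdateApplication request that sets a custom parallelism.

    Hypothetical helper: it only assembles the request dictionary and
    does not call any AWS API.
    """
    if parallelism < 1 or parallelism_per_kpu < 1:
        raise ValueError("parallelism values must be at least 1")
    return {
        "ApplicationName": app_name,
        "CurrentApplicationVersionId": current_version_id,
        "ApplicationConfigurationUpdate": {
            "FlinkApplicationConfigurationUpdate": {
                "ParallelismConfigurationUpdate": {
                    # CUSTOM is required to override the default parallelism.
                    "ConfigurationTypeUpdate": "CUSTOM",
                    "ParallelismUpdate": parallelism,
                    "ParallelismPerKPUUpdate": parallelism_per_kpu,
                    "AutoScalingEnabledUpdate": auto_scaling,
                }
            }
        },
    }
```

Assuming boto3 is installed and credentials are configured, the result could be passed as keyword arguments to the update_application method of a kinesisanalyticsv2 client.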
Application logging
Check if the application is logging an entry for every record being processed. Writing a log entry for each record during times when the application has high throughput will cause severe bottlenecks in data processing. To check for this condition, query your logs for log entries that your application writes with every record it processes. For more information about reading application logs, see Analyze logs with CloudWatch Logs Insights.
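If you find per-record logging and still want some visibility, one option is to sample the log calls. The sketch below is a hypothetical wrapper around Python's standard logging module that emits the first message and then one in every N; the class name and the default interval are assumptions made for the example.

```python
import logging


class SampledLogger:
    """Emits the first message, then one message per `every` calls."""

    def __init__(self, logger, every=10_000):
        if every < 1:
            raise ValueError("every must be at least 1")
        self._logger = logger
        self._every = every
        self._count = 0

    def info(self, msg, *args):
        """Log at INFO level if this call is sampled; return whether it was."""
        self._count += 1
        if (self._count - 1) % self._every == 0:
            self._logger.info(msg, *args)
            return True
        return False
```

For example, wrapping logging.getLogger(__name__) with every=10_000 reduces one million per-record calls to 100 log entries.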
Operator parallelism
Verify that your application's workload is distributed evenly among worker processes.
For information about tuning the workload of your application's operators, see Operator scaling.
Application logic
Examine your application logic for inefficient or non-performant operations, such as accessing an external dependency (for example, a database or a web service) or accessing application state. An external dependency can also hinder performance if it is not performant or not reliably accessible, which may lead to the external dependency returning HTTP 500 errors.
If your application uses an external dependency to enrich or otherwise process incoming data, consider using asynchronous IO instead. For more information, see Async I/O in the Apache Flink documentation.
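The benefit of asynchronous IO can be illustrated with a small asyncio sketch. This is not the Apache Flink Async I/O API: lookup is a hypothetical stand-in for a slow external call, and capacity is an assumed limit on in-flight requests.

```python
import asyncio


async def lookup(key):
    """Hypothetical stand-in for a slow external call (database, web service)."""
    await asyncio.sleep(0.05)
    return f"{key}-enriched"


async def enrich_all(keys, capacity=10):
    """Enrich keys concurrently, with at most `capacity` requests in flight."""
    limit = asyncio.Semaphore(capacity)

    async def enrich_one(key):
        async with limit:
            return await lookup(key)

    # gather returns results in the same order as the input keys.
    return await asyncio.gather(*(enrich_one(k) for k in keys))
```

With 20 keys and a capacity of 10, the waits overlap in two batches instead of running as 20 sequential calls, so the processing path is no longer blocked for the full duration of each request.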
Application memory
Check your application for resource leaks. If your application is not properly disposing of threads or memory, you might see the millisBehindLatest, CheckpointSize, and CheckpointDuration metrics spiking or gradually increasing. This condition may also lead to task manager or job manager failures.