Amazon Managed Service for Apache Flink (Amazon MSF) was previously known as Amazon Kinesis Data Analytics for Apache Flink.
Troubleshoot performance issues
This section contains a list of symptoms that you can check to diagnose and fix performance issues.
If your data source is a Kinesis stream, performance issues typically present as a high or increasing millisBehindLatest metric. For other sources, you can check a similar metric that represents lag in reading from the source.
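To tell an increasing lag metric apart from one that is high but stable, you can fit a trend line to recent samples. The following is a minimal sketch, assuming you have already exported evenly spaced lag values (in milliseconds) from your monitoring system. The function name and the sample values are hypothetical.

```python
from typing import Sequence


def lag_trend_ms_per_sample(lag_ms: Sequence[float]) -> float:
    """Return the least-squares slope of the lag series, in ms per sample.

    A clearly positive slope suggests the application is falling further
    behind the source; a slope near zero suggests a stable backlog.
    """
    n = len(lag_ms)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(lag_ms) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(lag_ms))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


# Hypothetical samples: lag grows by one second per sample.
growing = lag_trend_ms_per_sample([0, 1000, 2000, 3000])
# Hypothetical samples: lag is non-zero but not growing.
stable = lag_trend_ms_per_sample([500, 500, 500])
```

What counts as a concerning slope depends on your throughput and latency requirements, so treat any threshold you pick as application-specific.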
Understand the data path
When investigating a performance issue with your application, consider the entire path that your data takes. The following application components may become performance bottlenecks and create backpressure if they are not properly designed or provisioned:
Data sources and destinations: Ensure that the external resources your application interacts with are properly provisioned for the throughput your application will experience.
State data: Ensure that your application doesn't interact with the state store too frequently.
You can optimize the serializer your application is using. The default Kryo serializer can handle any serializable type, but you can use a more performant serializer if your application only stores data in POJO types. For information about Apache Flink serializers, see Data Types & Serialization in the Apache Flink documentation.
Operators: Ensure that the business logic implemented by your operators isn't too complicated, and that you aren't creating or using resources with every record processed. Also ensure that your application isn't creating sliding or tumbling windows too frequently.
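The point about creating resources with every record can be illustrated with a small sketch. It uses plain Python classes rather than the Flink API. LookupClient, PerRecordOperator, and ReusingOperator are hypothetical names, and the open() method only mimics the one-time initialization hook that operator frameworks commonly provide.

```python
class LookupClient:
    """Hypothetical stand-in for an expensive resource, such as a database connection."""

    instances_created = 0

    def __init__(self) -> None:
        LookupClient.instances_created += 1

    def lookup(self, key: str) -> str:
        return key.upper()


class PerRecordOperator:
    """Anti-pattern: builds a new client for every record it processes."""

    def process(self, record: str) -> str:
        client = LookupClient()
        return client.lookup(record)


class ReusingOperator:
    """Builds the client once in open() and reuses it for every record."""

    def open(self) -> None:
        self.client = LookupClient()

    def process(self, record: str) -> str:
        return self.client.lookup(record)


records = ["a", "b", "c"]

slow = PerRecordOperator()
slow_out = [slow.process(r) for r in records]
created_per_record = LookupClient.instances_created  # one client per record

LookupClient.instances_created = 0
fast = ReusingOperator()
fast.open()
fast_out = [fast.process(r) for r in records]
created_reused = LookupClient.instances_created  # a single shared client
```

Both operators produce the same output, but the first one pays the construction cost once per record, which grows with throughput.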
Performance troubleshooting solutions
This section contains potential solutions to performance issues.
CloudWatch monitoring levels
Verify that the CloudWatch Monitoring Levels are not set to an overly verbose setting.
The Debug Monitoring Log Level setting generates a large amount of traffic, which can create backpressure. You should only use it while actively investigating issues with the application.
If your application has a high Parallelism setting, using the Parallelism Monitoring Metrics Level will similarly generate a large amount of traffic that can lead to backpressure. Only use this metrics level when Parallelism for your application is low, or while investigating issues with the application.
For more information, see Control application monitoring levels.
Application CPU metric
Check the application's CPU metric. If this metric is above 75 percent, you can allow the application to allocate more resources for itself by enabling auto scaling.
If auto scaling is enabled, the application allocates more resources if CPU usage is over 75 percent for 15 minutes. For more information about scaling, see the Manage scaling properly section following, and Implement application scaling.
Note
An application will only scale automatically in response to CPU usage. The application will not auto scale in response to other system metrics, such as heapMemoryUtilization. If your application has a high level of usage for other metrics, increase your application's parallelism manually.
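To reason about whether the scale-up condition described in this section would have been met for a given period, you can replay CPU samples against the rule. The following is a rough sketch, assuming one CPU sample per minute. The thresholds come from the text above, but how the service actually evaluates the metric is not described here, so the function is only an approximation.

```python
from typing import Sequence

CPU_THRESHOLD_PERCENT = 75.0  # threshold stated in this section
SUSTAINED_MINUTES = 15        # duration stated in this section


def would_scale_up(cpu_per_minute: Sequence[float]) -> bool:
    """Return True if the latest 15 one-minute samples all exceed 75 percent."""
    if len(cpu_per_minute) < SUSTAINED_MINUTES:
        return False  # not enough history to satisfy the duration
    recent = cpu_per_minute[-SUSTAINED_MINUTES:]
    return all(sample > CPU_THRESHOLD_PERCENT for sample in recent)


# Hypothetical sample series, one value per minute.
sustained = would_scale_up([50.0] + [90.0] * 15)
too_short = would_scale_up([90.0] * 14)
interrupted = would_scale_up([90.0] * 14 + [70.0])
```

Note that a series with high heap usage but low CPU would never satisfy this check, which is why the note above recommends increasing parallelism manually in that case.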
Application parallelism
Increase the application's parallelism. You can update the application's parallelism using the ParallelismConfigurationUpdate parameter of the UpdateApplication action.
The maximum number of KPUs for an application is 64 by default, and can be increased by requesting a limit increase.
It is important to also assign parallelism to each operator based on its workload, rather than just increasing application parallelism alone. See Operator parallelism following.
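As an illustration of the UpdateApplication request shape, the following sketch builds the request body as a plain dictionary. The helper name build_parallelism_update and the application name, version ID, and parallelism values are hypothetical. The sketch assumes you want a custom parallelism configuration and that you look up the current application version ID yourself before sending the request.

```python
def build_parallelism_update(
    application_name: str,
    current_version_id: int,
    parallelism: int,
    parallelism_per_kpu: int = 1,
) -> dict:
    """Build an UpdateApplication request that changes only parallelism."""
    if parallelism < 1 or parallelism_per_kpu < 1:
        raise ValueError("parallelism values must be positive integers")
    return {
        "ApplicationName": application_name,
        "CurrentApplicationVersionId": current_version_id,
        "ApplicationConfigurationUpdate": {
            "FlinkApplicationConfigurationUpdate": {
                "ParallelismConfigurationUpdate": {
                    "ConfigurationTypeUpdate": "CUSTOM",
                    "ParallelismUpdate": parallelism,
                    "ParallelismPerKPUUpdate": parallelism_per_kpu,
                }
            }
        },
    }


# Hypothetical application name and version ID.
request = build_parallelism_update("my-flink-app", 3, parallelism=8)

# With the AWS SDK for Python installed and credentials configured, the
# request could be sent like this (not executed in this sketch):
#   import boto3
#   boto3.client("kinesisanalyticsv2").update_application(**request)
```

Building the request separately from sending it makes it easy to review the change before applying it to a running application.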
Application logging
Check if the application is logging an entry for every record being processed. Writing a log entry for each record during times when the application has high throughput will cause severe bottlenecks in data processing. To check for this condition, query your logs for log entries that your application writes with every record it processes. For more information about reading application logs, see Analyze logs with CloudWatch Logs Insights.
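If you find per-record log statements that you still want to keep for diagnostics, one possible mitigation is to sample them. This section does not prescribe that approach, so treat the following as a sketch only. It uses Python's standard logging module, and the EveryNthFilter and ListHandler classes, the logger name, and the sampling interval of 1000 are all hypothetical.

```python
import logging


class EveryNthFilter(logging.Filter):
    """Lets through the first log record and then every Nth one after it."""

    def __init__(self, n: int) -> None:
        super().__init__()
        self.n = n
        self.seen = 0

    def filter(self, record: logging.LogRecord) -> bool:
        self.seen += 1
        return (self.seen - 1) % self.n == 0


class ListHandler(logging.Handler):
    """Collects formatted messages in memory so the effect is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


logger = logging.getLogger("record_processor")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)
logger.addFilter(EveryNthFilter(1000))

# Simulate an operator that logs once per processed record.
for i in range(5000):
    logger.info("processed record %d", i)

emitted = len(handler.messages)  # 5 entries instead of 5000
```

Sampling keeps some visibility into record flow while reducing log volume by roughly the sampling interval.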
Operator parallelism
Verify that your application's workload is distributed evenly among worker processes.
For information about tuning the workload of your application's operators, see Operator scaling.
Application logic
Examine your application logic for inefficient or non-performant operations, such as accessing an external dependency (for example, a database or a web service) or accessing application state. An external dependency can also hinder performance if it is not performant or not reliably accessible, which may lead to the external dependency returning HTTP 500 errors.
If your application uses an external dependency to enrich or otherwise process incoming data, consider using asynchronous I/O instead. For more information, see Async I/O in the Apache Flink documentation.
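The benefit of asynchronous I/O for enrichment can be seen in a small sketch that uses Python's asyncio rather than the Flink Async I/O API. The lookup function is hypothetical, and its sleep stands in for the latency of a call to an external service.

```python
import asyncio
import time


async def lookup(key: str) -> str:
    """Hypothetical external call; the sleep simulates network latency."""
    await asyncio.sleep(0.05)
    return f"{key}-enriched"


async def enrich_sequentially(keys):
    # Each lookup waits for the previous one to finish.
    return [await lookup(k) for k in keys]


async def enrich_concurrently(keys):
    # All lookups are in flight at once; gather preserves input order.
    return list(await asyncio.gather(*(lookup(k) for k in keys)))


def timed(coro_fn, keys):
    start = time.perf_counter()
    result = asyncio.run(coro_fn(keys))
    return result, time.perf_counter() - start


keys = [f"k{i}" for i in range(10)]
seq_result, seq_seconds = timed(enrich_sequentially, keys)
con_result, con_seconds = timed(enrich_concurrently, keys)
```

Both variants return the same enriched values, but the concurrent one overlaps the waiting time of the individual requests, so total latency approaches that of a single lookup rather than the sum of all of them.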
Application memory
Check your application for resource leaks. If your application is not properly disposing of threads or memory, you might see the millisBehindLatest, CheckpointSize, and CheckpointDuration metrics spiking or gradually increasing. This condition may also lead to task manager or job manager failures.