Production readiness - Managed Service for Apache Flink
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Amazon Managed Service for Apache Flink was previously known as Amazon Kinesis Data Analytics for Apache Flink.

Production readiness

This is a collection of important aspects of running production applications on Managed Service for Apache Flink. It's not an exhaustive list, but rather the bare minimum of what you should pay attention to before putting an application into production.

Load testing applications

Some problems with applications only manifest under heavy load. We have seen cases where applications seemed healthy and an operational event substantially amplified the load on the application. This can happen completely independent of the application itself: If the data source or the data sink is unavailable for a couple of hours, the Flink application cannot make progress. Once that issue is fixed there is a backlog of unprocessed data has accumulated that can completely exhaust the available resources. The load can then amplify bugs or performance issues that have not been emerging before.

It is therefore essential to run proper load tests for production applications. Questions that should be answered during those load tests include:

  • Is the application stable under sustained high load?

  • Can the application still take a savepoint under peak load?

  • How long does it take to process a backlog of 1 hour? And how long for 24 hours (depending on the max retention of the data in the stream)?

  • Does the throughput of the application increase when the application is scaled?

When consuming from a data stream, these scenarios can be simulated by producing into the stream for some time. Then start the application and have it consume data from the beginning of time, e.g., use a start position of TRIM_HORIZON in the case of a Kinesis Data Stream.

Setting an explicit max parallelism for each of your operators within your DAG

This should be a function of your total parallelism for the application. An example would be if your total parallelism is 10, an individual operator could be set to half of the total parallelism, such as:

.setParallelism(MAX_PARALLELISM/2)

where MAX_PARALLELISM is a variable equal to the result of env.getParallelism().

Set a UUID for all operators

A UUID is used in the operation in which Flink maps a savepoint back to an individual operator. Setting a specific UUID for each operator gives a stable mapping for the savepoint process to restore.

.map(...).uid("my-map-function")

For more information, see Production Readiness Checklist.