Amazon Managed Service for Apache Flink was previously known as Amazon Kinesis Data Analytics for Apache Flink.
Application issues
This section contains solutions for error conditions that you may encounter with your Managed Service for Apache Flink application.
Topics
- Application is stuck in a transient status
- Snapshot creation fails
- Cannot access resources in a VPC
- Data is lost when writing to an Amazon S3 bucket
- Application is in the RUNNING status but isn't processing data
- Snapshot, application update, or application stop error: InvalidApplicationConfigurationException
- java.nio.file.NoSuchFileException: /usr/local/openjdk-8/lib/security/cacerts
Application is stuck in a transient status
If your application stays in a transient status (STARTING, UPDATING, STOPPING, or AUTOSCALING), you can stop your application by using the StopApplication action with the Force parameter set to true. You can't force stop an application in the DELETING status. Alternatively, if the application is in the UPDATING or AUTOSCALING status, you can roll it back to the previous running version. When you roll back an application, it loads state data from the last successful snapshot. If the application has no snapshots, Managed Service for Apache Flink rejects the rollback request. For more information about rolling back an application, see RollbackApplication action.
Note
Force-stopping your application may lead to data loss or duplication. To prevent data loss or duplicate processing of data during application restarts, we recommend that you take frequent snapshots of your application.
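The recovery choice described above can be sketched in Python. This is a minimal illustration, not an official procedure: it assumes `client` behaves like `boto3.client("kinesisanalyticsv2")`, and the function name `recover_stuck_application` and its `prefer_rollback` flag are hypothetical. It does not handle the case where the service rejects a rollback because no snapshot exists.

```python
# Statuses named in this section as transient (DELETING is excluded
# because a force stop is not allowed in that status).
TRANSIENT = {"STARTING", "UPDATING", "STOPPING", "AUTOSCALING"}
# Statuses from which a rollback to the previous running version is possible.
ROLLBACK_ELIGIBLE = {"UPDATING", "AUTOSCALING"}


def recover_stuck_application(client, application_name, prefer_rollback=True):
    """Roll back or force-stop an application that is in a transient status.

    Returns "rollback", "force_stop", or None if no action was taken.
    """
    detail = client.describe_application(
        ApplicationName=application_name
    )["ApplicationDetail"]
    status = detail["ApplicationStatus"]

    if status not in TRANSIENT:
        return None

    if prefer_rollback and status in ROLLBACK_ELIGIBLE:
        # Rollback restores state from the last successful snapshot.
        client.rollback_application(
            ApplicationName=application_name,
            CurrentApplicationVersionId=detail["ApplicationVersionId"],
        )
        return "rollback"

    # Force stop may cause data loss or duplication.
    client.stop_application(ApplicationName=application_name, Force=True)
    return "force_stop"
```

With real credentials, a call might look like `recover_stuck_application(boto3.client("kinesisanalyticsv2"), "my-app")`, where `my-app` is a placeholder name.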
Causes for stuck applications include the following:
- Application state is too large: Having an application state that is too large or too persistent can cause the application to become stuck during a checkpoint or snapshot operation. Check your application's lastCheckpointDuration and lastCheckpointSize metrics for steadily increasing values or abnormally high values.
- Application code is too large: Verify that your application JAR file is smaller than 512 MB. JAR files larger than 512 MB are not supported.
- Application snapshot creation fails: Managed Service for Apache Flink takes a snapshot of the application during an UpdateApplication or StopApplication request. The service then uses this snapshot state and restores the application using the updated application configuration to provide exactly-once processing semantics. If automatic snapshot creation fails, see Snapshot creation fails following.
- Restoring from a snapshot fails: If you remove or change an operator in an application update and attempt to restore from a snapshot, the restore will fail by default if the snapshot contains state data for the missing operator. In addition, the application will be stuck in either the STOPPED or UPDATING status. To change this behavior and allow the restore to succeed, change the AllowNonRestoredState parameter of the application's FlinkRunConfiguration to true. This will allow the resume operation to skip state data that cannot be mapped to the new program.
- Application initialization taking longer: Managed Service for Apache Flink uses an internal timeout of 5 minutes (soft setting) while waiting for a Flink job to start. If your job is failing to start within this timeout, you will see a CloudWatch log as follows:
  Flink job did not start within a total timeout of 5 minutes for application: %s under account: %s
  If you encounter the above error, it means that your operations defined under Flink job’s main method are taking more than 5 minutes, causing the Flink job creation to time out on the Managed Service for Apache Flink end. We suggest you check the Flink JobManager logs as well as your application code to see if this delay in the main method is expected. If not, you need to take steps to address the issue so it completes in under 5 minutes.
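For the snapshot-restore case in the list above, the following sketch builds the arguments for an UpdateApplication request that sets AllowNonRestoredState to true. It assumes the boto3 `kinesisanalyticsv2` request shape, where the flag sits under `RunConfigurationUpdate`; the helper name `allow_non_restored_state_request` is hypothetical.

```python
def allow_non_restored_state_request(application_name, current_version_id):
    """Build UpdateApplication arguments that set AllowNonRestoredState to true."""
    return {
        "ApplicationName": application_name,
        # The version ID must match the application's current version.
        "CurrentApplicationVersionId": current_version_id,
        "RunConfigurationUpdate": {
            # Lets the restore skip state that no longer maps to an operator.
            "FlinkRunConfiguration": {"AllowNonRestoredState": True},
        },
    }
```

The result could be passed to a client as `client.update_application(**allow_non_restored_state_request("my-app", 5))`, where the name and version ID are placeholders. Note that skipped state is discarded, so this is only appropriate when the removed operator's state is no longer needed.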
You can check your application status using either the ListApplications or the DescribeApplication actions.
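As one way to apply the status check, the sketch below pages through ListApplications and reports applications that are currently in a transient status. It assumes `client` behaves like `boto3.client("kinesisanalyticsv2")`; the function name `find_transient_applications` is hypothetical.

```python
TRANSIENT_STATUSES = {"STARTING", "UPDATING", "STOPPING", "AUTOSCALING"}


def find_transient_applications(client):
    """Return (name, status) pairs for applications in a transient status."""
    found = []
    kwargs = {}
    while True:
        page = client.list_applications(**kwargs)
        for summary in page.get("ApplicationSummaries", []):
            if summary["ApplicationStatus"] in TRANSIENT_STATUSES:
                found.append(
                    (summary["ApplicationName"], summary["ApplicationStatus"])
                )
        # Follow pagination until the service stops returning a token.
        token = page.get("NextToken")
        if not token:
            return found
        kwargs = {"NextToken": token}
```

A single observation of a transient status does not mean the application is stuck; comparing results across several runs a few minutes apart gives a better signal.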
Snapshot creation fails
The Managed Service for Apache Flink service can't take a snapshot under the following circumstances:
- The application exceeded the snapshot limit. The limit for snapshots is 1,000. For more information, see Manage application backups using snapshots.
- The application doesn't have permissions to access its source or sink.
- The application code isn't functioning properly.
- The application is experiencing other configuration issues.
If you get an exception while taking a snapshot during an application update or while stopping the application, set the SnapshotsEnabled property of your application's ApplicationSnapshotConfiguration to false and retry the request.
Snapshots can fail if your application's operators are not properly provisioned. For information about tuning operator performance, see Operator scaling.
After the application returns to a healthy state, we recommend that you set the application's SnapshotsEnabled property to true.
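Disabling and later re-enabling snapshots could be sketched as follows. This assumes the boto3 `kinesisanalyticsv2` UpdateApplication request shape, in which the setting is changed through `ApplicationSnapshotConfigurationUpdate` and its `SnapshotsEnabledUpdate` field; the helper name `snapshots_enabled_request` is hypothetical.

```python
def snapshots_enabled_request(application_name, current_version_id, enabled):
    """Build UpdateApplication arguments that turn snapshots on or off."""
    return {
        "ApplicationName": application_name,
        "CurrentApplicationVersionId": current_version_id,
        "ApplicationConfigurationUpdate": {
            "ApplicationSnapshotConfigurationUpdate": {
                "SnapshotsEnabledUpdate": bool(enabled),
            },
        },
    }
```

For example, `client.update_application(**snapshots_enabled_request("my-app", 5, False))` would disable snapshots for a placeholder application at version 5. Each successful update changes the application version, so read the current version ID again (for example with DescribeApplication) before sending the request that re-enables snapshots.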
Cannot access resources in a VPC
If your application uses a VPC running on Amazon VPC, do the following to verify that your application has access to its resources:
- Check your CloudWatch logs for the following error. This error indicates that your application cannot access resources in your VPC:
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
If you see this error, verify that your route tables are set up correctly, and that your connectors have the correct connection settings.
For information about setting up and analyzing CloudWatch logs, see Logging and monitoring in Amazon Managed Service for Apache Flink.
Data is lost when writing to an Amazon S3 bucket
Some data loss might occur when writing output to an Amazon S3 bucket using Apache Flink version 1.6.2. We recommend using the latest supported version of Apache Flink when using Amazon S3 for output directly. To write to an Amazon S3 bucket using Apache Flink 1.6.2, we recommend using Firehose. For more information about using Firehose with Managed Service for Apache Flink, see Firehose sink.
Application is in the RUNNING status but isn't processing data
You can check your application status by using either the ListApplications or the DescribeApplication actions. If your application enters the RUNNING status but isn't writing data to your sink, you can troubleshoot the issue by adding an Amazon CloudWatch log stream to your application. For more information, see Work with application CloudWatch logging options. The log stream contains messages that you can use to troubleshoot application issues.
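Attaching a log stream programmatically might look like the sketch below. It assumes `client` behaves like `boto3.client("kinesisanalyticsv2")` and that the log stream already exists; the function name `add_log_stream` and the ARN passed to it are hypothetical placeholders.

```python
def add_log_stream(client, application_name, log_stream_arn):
    """Attach an existing CloudWatch log stream to an application."""
    # The request needs the application's current version ID.
    detail = client.describe_application(
        ApplicationName=application_name
    )["ApplicationDetail"]
    return client.add_application_cloud_watch_logging_option(
        ApplicationName=application_name,
        CurrentApplicationVersionId=detail["ApplicationVersionId"],
        CloudWatchLoggingOption={"LogStreamARN": log_stream_arn},
    )
```

The application also needs permission to write to the log stream; this sketch does not set that up.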
Snapshot, application update, or application stop error: InvalidApplicationConfigurationException
An error similar to the following might occur during a snapshot operation, or during an operation that creates a snapshot, such as updating or stopping an application:
An error occurred (InvalidApplicationConfigurationException) when calling the UpdateApplication operation: Failed to take snapshot for the application xxxx at this moment. The application is currently experiencing downtime. Please check the application's CloudWatch metrics or CloudWatch logs for any possible errors and retry the request. You can also retry the request after disabling the snapshots in the Managed Service for Apache Flink console or by updating the ApplicationSnapshotConfiguration through the Amazon SDK
This error occurs when the application is unable to create a snapshot.
If you encounter this error during a snapshot operation or an operation that creates a snapshot, do the following:
- Disable snapshots for your application. You can do this either in the Managed Service for Apache Flink console, or by using the SnapshotsEnabledUpdate parameter of the UpdateApplication action.
- Investigate why snapshots cannot be created. For more information, see Application is stuck in a transient status.
- Reenable snapshots when the application returns to a healthy state.
java.nio.file.NoSuchFileException: /usr/local/openjdk-8/lib/security/cacerts
The location of the SSL truststore was updated in a previous deployment. Use the following value for the ssl.truststore.location parameter instead:
/usr/lib/jvm/java-11-amazon-corretto/lib/security/cacerts