Resilience in Amazon Kinesis Data Streams
The Amazon global infrastructure is built around Amazon Regions and Availability Zones. Amazon Regions provide multiple physically separated and isolated Availability Zones, which are connected with low-latency, high-throughput, and highly redundant networking. With Availability Zones, you can design and operate applications and databases that automatically fail over between Availability Zones without interruption. Availability Zones are more highly available, fault tolerant, and scalable than traditional single or multiple data center infrastructures.
For more information about Amazon Regions and Availability Zones, see Amazon Global Infrastructure.
In addition to the Amazon global infrastructure, Kinesis Data Streams offers several features to help support your data resiliency and backup needs.
Disaster recovery in Amazon Kinesis Data Streams
Failure can occur at the following levels when you use an Amazon Kinesis Data Streams application to process data from a stream:
- A record processor could fail
- A worker could fail, or the instance of the application that instantiated the worker could fail
- An EC2 instance that is hosting one or more instances of the application could fail
Record processor failure
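The worker invokes record processor methods using Java ExecutorService tasks. If a task fails, the worker retains control of the shard that the record processor was processing, and starts a new record processor task to process that shard.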
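To illustrate the task-based model, the following minimal sketch shows how a supervising thread can keep responsibility for a shard when an ExecutorService task fails, and submit a fresh task for the same shard. It uses only java.util.concurrent and is not Kinesis Client Library code. The class name RecordProcessorRetrySketch, the method runWithResubmit, and the maxAttempts limit are hypothetical and exist only for this example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: not part of the Kinesis Client Library.
class RecordProcessorRetrySketch {

    /**
     * Submits a processing task for one shard. If the task throws, the caller
     * keeps responsibility for the shard and submits a fresh task, up to
     * maxAttempts times. Returns the number of attempts that were needed.
     */
    static int runWithResubmit(ExecutorService executor, String shardId,
                               Callable<Void> task, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Future<Void> future = executor.submit(task);
            try {
                future.get(); // a task failure surfaces here as ExecutionException
                return attempt;
            } catch (ExecutionException e) {
                System.err.println("Task for " + shardId + " failed on attempt "
                        + attempt + ": " + e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new IllegalStateException("Interrupted while waiting for " + shardId, e);
            }
        }
        throw new IllegalStateException("Giving up on " + shardId
                + " after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        AtomicInteger calls = new AtomicInteger();
        // Simulated record processor that fails twice, then succeeds.
        Callable<Void> flaky = () -> {
            if (calls.incrementAndGet() < 3) {
                throw new IllegalStateException("simulated processing error");
            }
            return null;
        };
        try {
            int attempts = runWithResubmit(executor, "shardId-000000000000", flaky, 5);
            System.out.println("Succeeded after " + attempts + " attempts");
        } finally {
            executor.shutdown();
        }
    }
}
```

The key point is that the failure is confined to the task: the thread that submitted it observes the exception through Future.get and decides what to do next, so a single bad batch of records does not take down the whole worker.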
Worker or application failure
If a worker, or an instance of the Amazon Kinesis Data Streams application, fails, you should detect and handle the situation. For example, if the Worker.run method throws an exception, you should catch and handle it.
If the application itself fails, you should detect this and restart it. When the application starts up, it instantiates a new worker, which in turn instantiates new record processors that are automatically assigned shards to process. These could be the same shards that these record processors were processing before the failure, or shards that are new to these processors.
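The detect-and-restart pattern described above can be sketched as a small supervisor loop. This is an illustration under stated assumptions, not a prescribed implementation: a plain Runnable stands in for the KCL worker so that the example runs without the library on the classpath, and the names WorkerSupervisorSketch, superviseWorker, maxRestarts, and backoffMillis are hypothetical. The sketch catches only RuntimeException; a production supervisor would need its own policy for other failures.

```java
import java.util.function.Supplier;

// Hypothetical sketch: a Runnable stands in for the KCL worker.
class WorkerSupervisorSketch {

    /**
     * Runs a worker and, if its run() method throws, builds a new worker from
     * the factory and runs that instead. Returns the number of restarts that
     * were needed before a worker exited normally.
     */
    static int superviseWorker(Supplier<Runnable> workerFactory,
                               int maxRestarts, long backoffMillis) {
        int restarts = 0;
        while (true) {
            Runnable worker = workerFactory.get(); // fresh worker for every attempt
            try {
                worker.run(); // blocks until the worker stops or throws
                return restarts;
            } catch (RuntimeException e) {
                System.err.println("Worker failed: " + e);
                if (restarts >= maxRestarts) {
                    throw e; // give up and let the caller or the process manager react
                }
                restarts++;
                sleep(backoffMillis * restarts); // simple linear backoff
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted during backoff", e);
        }
    }

    public static void main(String[] args) {
        int[] created = {0};
        // Simulated factory: the first two workers crash, the third exits normally.
        Supplier<Runnable> factory = () -> {
            int id = ++created[0];
            return () -> {
                if (id < 3) {
                    throw new IllegalStateException("simulated crash in worker " + id);
                }
                System.out.println("Worker " + id + " finished normally");
            };
        };
        int restarts = superviseWorker(factory, 5, 10L);
        System.out.println("Restarts needed: " + restarts);
    }
}
```

Creating a new worker on each attempt, rather than calling run() again on the failed one, mirrors the text: the restarted application instantiates a new worker, which then creates new record processors.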
If the worker or application fails without the failure being detected, and other instances of the application are running on other EC2 instances, the workers on those instances handle the failure. They create additional record processors to process the shards that the failed worker is no longer processing. The load on these other EC2 instances increases accordingly.
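The effect on the surviving workers can be illustrated with a simplified in-memory model. This is not how the library coordinates shard ownership, only a sketch of the outcome: the shards of the failed worker are handed to the remaining workers, least loaded first, so each survivor ends up with more shards than before. The class ShardReassignmentSketch, the method reassign, and the worker and shard names are hypothetical. The example assumes Java 9 or later for List.of and Map.of.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: an in-memory model of shard takeover, not KCL code.
class ShardReassignmentSketch {

    /**
     * Returns a new assignment in which the shards of failedWorker are spread
     * over the surviving workers, always choosing the least-loaded survivor.
     * Ties are broken by worker name, because the result is a sorted map.
     */
    static Map<String, List<String>> reassign(Map<String, List<String>> assignments,
                                              String failedWorker) {
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : assignments.entrySet()) {
            if (!e.getKey().equals(failedWorker)) {
                result.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
        }
        List<String> orphaned = assignments.getOrDefault(failedWorker, List.of());
        if (result.isEmpty() && !orphaned.isEmpty()) {
            throw new IllegalStateException("No surviving workers to take over shards");
        }
        for (String shard : orphaned) {
            String target = null;
            for (Map.Entry<String, List<String>> e : result.entrySet()) {
                if (target == null || e.getValue().size() < result.get(target).size()) {
                    target = e.getKey();
                }
            }
            result.get(target).add(shard); // survivor takes over the orphaned shard
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> before = Map.of(
                "worker-a", List.of("shard-0", "shard-1"),
                "worker-b", List.of("shard-2"),
                "worker-c", List.of("shard-3", "shard-4"));
        Map<String, List<String>> after = reassign(before, "worker-c");
        System.out.println(after);
    }
}
```

In this run, worker-b picks up shard-3 and worker-a picks up shard-4, which is the load increase on the remaining instances that the text describes.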
The scenario described here assumes that although the worker or application has failed, the hosting EC2 instance is still running and is therefore not restarted by an Auto Scaling group.
Amazon EC2 instance failure
We recommend that you run the EC2 instances for your application in an Auto Scaling group. This way, if one of the EC2 instances fails, the Auto Scaling group automatically launches a new instance to replace it. You should configure the instances to launch your Amazon Kinesis Data Streams application at startup.