

# Fault testing on Amazon EBS

Amazon Fault Injection Service (Amazon FIS) is a fully managed service that helps you perform fault injection experiments on your Amazon workloads. With EBS actions in Amazon FIS, you can test how your applications respond to storage faults that can result in I/O interruptions and degraded performance on your volumes. This controlled testing environment enables you to observe how your applications respond to disruptions so you can identify weaknesses in your architecture and improve the overall resilience of your applications. Using the pause I/O action and the latency injection action, you can test your monitoring and recovery mechanisms such as Amazon CloudWatch alarms and failover workflows, and improve the resiliency of your mission-critical applications to storage faults. For more information about Amazon FIS, see the [Amazon Fault Injection Service User Guide](https://docs.amazonaws.cn/fis/latest/userguide/what-is.html).

## Available experiments


Amazon EBS currently supports two Amazon FIS fault injections:
+ [Pause I/O fault injection](ebs-fis-pause-io.md)
+ [Latency injection](ebs-fis-latency-injection.md)

## Considerations


The following considerations apply:
+ All Amazon EBS volume types are supported. Both root volumes and data volumes are supported. Instance store volumes are not supported.
+ Volumes must be attached to [Nitro-based EC2 instances](https://docs.amazonaws.cn/AWSEC2/latest/UserGuide/instance-types.html#instance-hypervisor-type).
+ Your volumes resume their original I/O performance when the experiment completes, that is, when the configured duration elapses. You can also stop a running experiment before it completes, or create a stop condition that stops the experiment when a CloudWatch alarm that you define reaches its threshold.
+ You can use Amazon FIS with Multi-Attach enabled volumes. All of the attached instances are impacted. You can't select a specific volume-instance attachment for experiments.
+ You can test up to 5 volumes in the same Availability Zone simultaneously when specifying volume ARNs in the console.
+ Amazon FIS is currently not available in Local Zones, Outposts, or Wavelength Zones, so you can't use it with volumes created in those locations.
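
If you have more volumes than the five-volume limit allows in a single experiment, you can plan the experiments in batches. The following is a minimal Python sketch of that planning step, not part of any Amazon API. The function name `plan_batches` and the volume ARNs are hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict

# Limit stated in the considerations above: up to 5 volumes per Availability Zone.
MAX_VOLUMES_PER_EXPERIMENT = 5


def plan_batches(volumes):
    """Group (volume_arn, availability_zone) pairs into per-AZ batches of at most 5."""
    by_az = defaultdict(list)
    for arn, az in volumes:
        by_az[az].append(arn)

    batches = []
    for az in sorted(by_az):
        arns = by_az[az]
        for start in range(0, len(arns), MAX_VOLUMES_PER_EXPERIMENT):
            batches.append((az, arns[start:start + MAX_VOLUMES_PER_EXPERIMENT]))
    return batches


# Hypothetical inventory: seven volumes in one zone, two in another.
inventory = [
    (f"arn:aws-cn:ec2:cn-north-1:111122223333:volume/vol-example{i}", "cn-north-1a")
    for i in range(7)
] + [
    (f"arn:aws-cn:ec2:cn-north-1:111122223333:volume/vol-example{i}", "cn-north-1b")
    for i in range(7, 9)
]

for az, arns in plan_batches(inventory):
    print(az, len(arns))
```

Each batch contains volumes from a single Availability Zone, so each batch can be used as the target list of one experiment.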

# Pause I/O fault injection


Use Amazon Fault Injection Service and the Pause I/O action to temporarily stop I/O between an Amazon EBS volume and the instances to which it is attached, so that you can test how your workloads handle I/O interruptions.

For more information about Amazon FIS, see the [Amazon Fault Injection Service User Guide](https://docs.amazonaws.cn/fis/latest/userguide/what-is.html).

**Considerations**

Keep in mind the following considerations for pausing volume I/O:
+ Pause I/O is supported on all [Nitro-based instance types](https://docs.amazonaws.cn/ec2/latest/instancetypes/ec2-nitro-instances.html).
+ To test your OS timeout configuration, set the experiment duration equal to or greater than the value specified for `nvme_core.io_timeout`. For more information, see [NVMe I/O operation timeout for Amazon EBS volumes](timeout-nvme-ebs-volumes.md).
+ If you drive I/O to a volume that has I/O paused, the following happens:
  + The volume's status transitions to `impaired` within 120 seconds. For more information, see [Amazon EBS volume status checks](monitoring-volume-checks.md).
  + The CloudWatch metric for `VolumeStalledIOCheck` will be `1` if volume I/O is paused for over 60 seconds. For more information, see [Metrics for Amazon EBS volumes](using_cloudwatch_ebs.md#ebs-volume-metrics).
  + The CloudWatch metric for queue length (`VolumeQueueLength`) will be non-zero. Configure any alarms or monitoring to watch for a non-zero queue length.
  + The CloudWatch metrics for `VolumeReadOps` or `VolumeWriteOps` will be `0`, which indicates that the volume is no longer processing I/O.
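
The metric signals in the preceding list can be combined into a simple check. The following is a minimal Python sketch that assumes you have already retrieved one datapoint per metric for the same period. The `VolumeDatapoint` class and both function names are hypothetical and are not part of CloudWatch or Amazon FIS.

```python
from dataclasses import dataclass


@dataclass
class VolumeDatapoint:
    """One period of CloudWatch values for a single volume (hypothetical container)."""
    queue_length: float      # VolumeQueueLength
    read_ops: float          # VolumeReadOps
    write_ops: float         # VolumeWriteOps
    stalled_io_check: int    # VolumeStalledIOCheck (0 or 1)


def looks_paused(dp):
    """True when I/O is queued but no read or write operations completed."""
    return dp.queue_length > 0 and dp.read_ops == 0 and dp.write_ops == 0


def confirmed_stalled(dp):
    """True when the queue signal and the stalled I/O check both indicate a pause."""
    return looks_paused(dp) and dp.stalled_io_check == 1


sample = VolumeDatapoint(queue_length=12, read_ops=0, write_ops=0, stalled_io_check=1)
print(looks_paused(sample), confirmed_stalled(sample))
```

Because `VolumeStalledIOCheck` reports `1` only after I/O has been paused for over 60 seconds, `looks_paused` can be true before `confirmed_stalled` is.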

You can perform a basic experiment from the Amazon EC2 console, or you can perform more advanced experiments using the Amazon FIS console. For more information about performing advanced experiments using the Amazon FIS console, see [Tutorials for Amazon FIS](https://docs.amazonaws.cn/fis/latest/userguide/fis-tutorials.html) in the *Amazon Fault Injection Service User Guide*.

**To perform a basic experiment using the Amazon EC2 console**

1. Open the Amazon EC2 console at [https://console.amazonaws.cn/ec2/](https://console.amazonaws.cn/ec2/).

1. In the navigation pane, choose **Volumes**.

1. Select the volume for which to pause I/O and choose **Actions**, **Fault injection**, **Pause volume I/O**.

1. For **Duration**, enter the duration for which to pause I/O between the volume and the instances. The field next to the Duration dropdown list shows the duration in ISO 8601 format.

1. In the **Service access** section, select the IAM service role for Amazon FIS to assume to perform the experiment. You can use either the default role, or an existing role that you created. For more information, see [Create an IAM role for Amazon FIS experiments](https://docs.amazonaws.cn/fis/latest/userguide/getting-started-iam-service-role.html).

1. Choose **Pause volume I/O**. When prompted, enter `start` in the confirmation field and choose **Start experiment**.

1. Monitor the progress and impact of your experiment. For more information, see [Monitoring Amazon FIS](https://docs.amazonaws.cn/fis/latest/userguide/monitoring-experiments.html) in the *Amazon FIS User Guide*.
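
If you prefer to define the experiment as a template instead of using the console, the following Python sketch shows one way the request body might be assembled. It only builds a dictionary and does not call any service. Treat the details as assumptions to verify against the Amazon FIS documentation: the action ID `aws:ebs:pause-volume-io`, the `aws:ec2:ebs-volume` resource type, and the `Volumes` target key are given from memory of the Amazon FIS actions reference, and the function names, ARNs, and target name are hypothetical placeholders.

```python
import json


def to_iso8601_duration(total_seconds):
    """Format a positive whole number of seconds as an ISO 8601 duration, such as PT2M."""
    if total_seconds <= 0:
        raise ValueError("duration must be positive")
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    text = "PT"
    if hours:
        text += f"{hours}H"
    if minutes:
        text += f"{minutes}M"
    if seconds:
        text += f"{seconds}S"
    return text


def build_pause_io_template(volume_arns, role_arn, duration_seconds, alarm_arn=None):
    """Assemble a template dictionary for a pause I/O experiment (assumed field names)."""
    if alarm_arn:
        # Stop the experiment when the CloudWatch alarm goes into the alarm state.
        stop_conditions = [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}]
    else:
        stop_conditions = [{"source": "none"}]
    return {
        "description": "Pause I/O on selected EBS volumes",
        "targets": {
            "Volumes-Target-1": {
                "resourceType": "aws:ec2:ebs-volume",
                "resourceArns": list(volume_arns),
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "pause-io": {
                "actionId": "aws:ebs:pause-volume-io",
                "parameters": {"duration": to_iso8601_duration(duration_seconds)},
                "targets": {"Volumes": "Volumes-Target-1"},
            }
        },
        "stopConditions": stop_conditions,
        "roleArn": role_arn,
    }


# Placeholder ARNs for illustration only.
template = build_pause_io_template(
    ["arn:aws-cn:ec2:cn-north-1:111122223333:volume/vol-example1"],
    "arn:aws-cn:iam::111122223333:role/ExampleFisRole",
    duration_seconds=120,
)
print(json.dumps(template, indent=2))
```

To test your OS timeout configuration, pass a `duration_seconds` value that is equal to or greater than the value you configured for `nvme_core.io_timeout`.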

# Latency injection


Use the Latency Injection action (`aws:ebs:volume-io-latency`) in Amazon FIS to simulate elevated I/O latency on your Amazon EBS volumes to test how your applications respond to storage performance degradation. This action allows you to specify the latency value to be injected as well as the percentage of I/O that will be impacted on the target volume. With Amazon FIS, you can use pre-configured latency experiment templates to get started with testing different I/O latency patterns that may be observed during storage faults. These templates are designed as an initial set of scenarios you can use to introduce disruptions to your applications to test resiliency. They are not designed to encompass all types of impact your applications can experience in the real world. We recommend that you adapt them to run multiple different tests based on the performance needs of your applications. You can customize the available templates or create new experiment templates to test for your application-specific requirements.

**Pre-configured latency experiment templates**  
Amazon EBS provides the following latency experiment templates through the EBS Console and the [Amazon FIS scenario library](https://docs.amazonaws.cn/fis/latest/userguide/scenario-library-scenarios.html). You can directly use these templates on your target volumes to run a latency injection experiment.
+ **Sustained Latency** — Simulates constant latency. This experiment utilizes one latency injection action and has a total duration of 15 minutes. This experiment simulates persistent latency on 50 percent of read I/O and 100 percent of write I/O: 500 ms for 15 minutes.
+ **Increasing Latency** — Simulates gradually increasing latency. This experiment utilizes five latency injection actions and has a total duration of 15 minutes. This experiment will simulate a gradual increase in latency on 10 percent of read I/O and 25 percent of write I/O: 50 ms for 3 minutes, 200 ms for 3 minutes, 700 ms for 3 minutes, 1 second for 3 minutes, and 15 seconds for 3 minutes.
+ **Intermittent Latency** — Simulates sharp intermittent latency spikes with periods of recovery in between. This experiment utilizes three latency injection actions and has a total duration of 15 minutes. This experiment will simulate three latency spikes on 0.1 percent of read and write I/O: 30 second spike that lasts for 1 minute, 10 second spike that lasts for 2 minutes, and 20 second spike that lasts for 2 minutes. There will be 5 minute periods of recovery between each latency spike. 
+ **Decreasing Latency** — Simulates gradually decreasing latency. This experiment utilizes five latency injection actions and has a total duration of 15 minutes. This experiment will simulate a gradual decrease in latency on 10 percent of read I/O and write I/O: 20 seconds for 3 minutes, 5 seconds for 3 minutes, 900 ms for 3 minutes, 300 ms for 3 minutes, and 40 ms for 3 minutes.
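
The four schedules above can be written down as data, which makes it easy to check that each one adds up to 15 minutes and uses the stated number of latency injection actions. The following Python sketch does only that bookkeeping. The `TEMPLATES` structure and the function names are hypothetical and do not reflect how Amazon FIS stores the templates.

```python
MINUTE = 60

# Each phase is (latency_ms, duration_seconds). A latency of None marks a recovery period.
TEMPLATES = {
    "Sustained": {
        "read_pct": 50, "write_pct": 100,
        "phases": [(500, 15 * MINUTE)],
    },
    "Increasing": {
        "read_pct": 10, "write_pct": 25,
        "phases": [(50, 3 * MINUTE), (200, 3 * MINUTE), (700, 3 * MINUTE),
                   (1_000, 3 * MINUTE), (15_000, 3 * MINUTE)],
    },
    "Intermittent": {
        "read_pct": 0.1, "write_pct": 0.1,
        "phases": [(30_000, 1 * MINUTE), (None, 5 * MINUTE), (10_000, 2 * MINUTE),
                   (None, 5 * MINUTE), (20_000, 2 * MINUTE)],
    },
    "Decreasing": {
        "read_pct": 10, "write_pct": 10,
        "phases": [(20_000, 3 * MINUTE), (5_000, 3 * MINUTE), (900, 3 * MINUTE),
                   (300, 3 * MINUTE), (40, 3 * MINUTE)],
    },
}


def total_minutes(name):
    """Total experiment length in minutes, including recovery periods."""
    return sum(seconds for _, seconds in TEMPLATES[name]["phases"]) // MINUTE


def injection_action_count(name):
    """Number of phases that inject latency (recovery periods are excluded)."""
    return sum(1 for latency, _ in TEMPLATES[name]["phases"] if latency is not None)


for name in TEMPLATES:
    print(name, total_minutes(name), injection_action_count(name))
```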

**Customize preconfigured scenarios**

You can customize the preconfigured templates above or create your own experiment templates using the following customizable parameters.
+ `readIOPercentage` — Percentage of read I/O operations that latency will be injected on. This is the percentage of all read I/O operations on the volume that will be impacted by the action.

  Range: Min 0.1%, Max 100%
+ `readIOLatencyMilliseconds` — Amount of latency injected on read I/O operations. This is the latency value that will be observed on the specified percentage of the read I/O during the experiment.

  Range: Min 1 ms (io2) / 10 ms (non-io2), Max 60 seconds
+ `writeIOPercentage` — Percentage of write I/O operations that latency will be injected on. This is the percentage of all write I/O operations on the volume that will be impacted by the action.

  Range: Min 0.1%, Max 100%
+ `writeIOLatencyMilliseconds` — Amount of latency injected on write I/O operations. This is the latency value that will be observed on the specified percentage of the write I/O during the experiment.

  Range: Min 1 ms (io2) / 10 ms (non-io2), Max 60 seconds
+ `duration` — Duration for which the latency will be injected on the percentage of I/O selected.

  Range: Min 1 second, Max 12 hours
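
Before you create a custom template, it can help to check your values against the ranges listed above. The following Python sketch is one way to do that locally. The function name `validate_latency_parameters` is hypothetical, and for simplicity the sketch takes numeric values and a duration in seconds, which is an assumption about your own input format and not the format that Amazon FIS expects.

```python
MAX_LATENCY_MS = 60_000            # 60 seconds
MAX_DURATION_SECONDS = 12 * 3600   # 12 hours


def validate_latency_parameters(params, volume_type, duration_seconds):
    """Return a list of range violations; an empty list means all values are in range."""
    errors = []
    # io2 volumes allow a 1 ms minimum; all other volume types start at 10 ms.
    min_latency_ms = 1 if volume_type == "io2" else 10

    for key in ("readIOPercentage", "writeIOPercentage"):
        if not 0.1 <= params[key] <= 100:
            errors.append(f"{key} must be between 0.1 and 100")

    for key in ("readIOLatencyMilliseconds", "writeIOLatencyMilliseconds"):
        if not min_latency_ms <= params[key] <= MAX_LATENCY_MS:
            errors.append(f"{key} must be between {min_latency_ms} and {MAX_LATENCY_MS} ms")

    if not 1 <= duration_seconds <= MAX_DURATION_SECONDS:
        errors.append("duration must be between 1 second and 12 hours")
    return errors


candidate = {
    "readIOPercentage": 10,
    "readIOLatencyMilliseconds": 5,
    "writeIOPercentage": 25,
    "writeIOLatencyMilliseconds": 10,
}
print(validate_latency_parameters(candidate, "gp3", 180))
print(validate_latency_parameters(candidate, "io2", 180))
```

The same 5 ms read latency is rejected for a `gp3` volume and accepted for an `io2` volume, which reflects the different minimums.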

**Monitoring latency injection**  
You can monitor the performance impact on your volumes in the following ways:
+ Use average latency metrics in CloudWatch to get per-minute average I/O latency. For more information, see [Monitor your EBS volumes using CloudWatch](https://docs.amazonaws.cn/ebs/latest/userguide/using_cloudwatch_ebs.html).
+ Use EBS detailed performance statistics available through NVMe-CLI, CloudWatch agent, and Prometheus to get per-second average I/O latency. The detailed metrics also provide I/O latency histograms that you can use to analyze latency variance on your volumes. For more information, see [NVMe detailed performance statistics](https://docs.amazonaws.cn/ebs/latest/userguide/nvme-detailed-performance-stats.html).
+ Use the [Amazon EBS volume status checks](monitoring-volume-checks.md). When you inject I/O latency, the volume's status transitions to the `warning` state.
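
When you read average latency metrics during an experiment, it helps to have a rough expectation of the value. The following Python sketch uses a simplified model that is an assumption, not documented behavior: the affected percentage of I/O observes the injected latency, and the remaining I/O observes your usual baseline latency. The function name and the baseline value are hypothetical.

```python
def expected_average_latency_ms(baseline_ms, injected_ms, io_percentage):
    """Weighted average of injected and baseline latency under the simplified model."""
    fraction = io_percentage / 100
    return fraction * injected_ms + (1 - fraction) * baseline_ms


# Sustained template read path: 500 ms on 50 percent of reads, assumed 1 ms baseline.
print(expected_average_latency_ms(baseline_ms=1.0, injected_ms=500, io_percentage=50))
```

With a small percentage of affected I/O, the average moves much less than the injected value suggests, so the latency histograms are often a better place to look for such experiments.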

**Considerations**  
Consider the following when using EBS latency injection:
+ Latency injection is supported on all [Nitro-based instance types](https://docs.amazonaws.cn/ec2/latest/instancetypes/ec2-nitro-instances.html), except: P4d, P5, P5e, Trn2u, G6, G6f, Gr6, Gr6f, M8i, M8i-flex, C8i-flex, R8i, R8i-flex, I8ge, Mac-m4pro, and Mac-m4.
+ You might see up to 5 percent variance between the latency value specified in the experiment and the resultant latency observed.
+ If you drive a very small number of I/O operations, the percentage of I/O specified in the action parameters might not match the actual percentage of I/O impacted by the action.
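
The last two considerations can be turned into quick sanity checks when you review experiment results. The following Python sketch is illustrative only, and both function names are hypothetical. It checks whether an observed latency falls within 5 percent of the specified value and computes how many I/O operations you would expect to be affected for a given workload size.

```python
VARIANCE_FRACTION = 0.05  # up to 5 percent variance, as stated above


def within_expected_variance(specified_ms, observed_ms):
    """True when the observed latency is within 5 percent of the specified latency."""
    return abs(observed_ms - specified_ms) <= VARIANCE_FRACTION * specified_ms


def expected_impacted_ios(total_ios, io_percentage):
    """Average number of I/O operations expected to be affected."""
    return total_ios * io_percentage / 100


print(within_expected_variance(specified_ms=500, observed_ms=520))
print(expected_impacted_ios(total_ios=200, io_percentage=0.1))
```

If the expected count is below one, as it is for 200 operations at 0.1 percent, the workload is too small for the observed percentage to match the configured percentage reliably.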

**To run a latency injection experiment on an Amazon EBS volume**

1. Open the Amazon EC2 console at [https://console.amazonaws.cn/ec2/](https://console.amazonaws.cn/ec2/).

1. In the navigation pane, choose **Volumes**.

1. Select the volumes on which to run the experiment and choose **Actions**, **Resilience testing**, **Inject volume I/O latency**.

   The Amazon Fault Injection Service console opens. 

1. In the **Create experiment** window, select the type of experiment to run: **Intermittent**, **Increasing**, **Sustained**, or **Decreasing**.

1. For **IAM role selection**, choose **Create a new role** to create a new role that Amazon FIS will use to conduct the experiments on your behalf. Alternatively, choose **Use an existing IAM role** if you previously created an IAM role with the required permissions.

1. The **Pricing estimate** section gives you an estimate of the cost of running the experiment. With Amazon FIS, you are charged per minute that an action runs, from start to finish, based on the number of target accounts for your experiment.

1. Choose **Start experiment**.
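
For a custom experiment with several latency steps, the actions can be chained so that each step starts when the previous one finishes. The following Python sketch builds the actions portion of a template for a schedule like the Increasing Latency template. It only builds a dictionary and does not call any service. The action ID and parameter names come from this page. The `startAfter` field, the `Volumes` target key, and the use of string parameter values are assumptions based on the Amazon FIS template format that you should verify. The function name, action names, and target name are hypothetical.

```python
import json


def build_latency_actions(phases, read_pct, write_pct, target_name="Volumes-Target-1"):
    """Build chained latency injection actions from (latency_ms, iso_duration) phases."""
    actions = {}
    previous_name = None
    for index, (latency_ms, duration) in enumerate(phases, start=1):
        name = f"latency-step-{index}"
        action = {
            "actionId": "aws:ebs:volume-io-latency",
            "parameters": {
                "readIOPercentage": str(read_pct),
                "readIOLatencyMilliseconds": str(latency_ms),
                "writeIOPercentage": str(write_pct),
                "writeIOLatencyMilliseconds": str(latency_ms),
                "duration": duration,
            },
            "targets": {"Volumes": target_name},
        }
        if previous_name is not None:
            # Run this step only after the previous step has completed.
            action["startAfter"] = [previous_name]
        actions[name] = action
        previous_name = name
    return actions


# Schedule of the Increasing Latency template: five 3-minute steps.
increasing = [(50, "PT3M"), (200, "PT3M"), (700, "PT3M"), (1000, "PT3M"), (15000, "PT3M")]
actions = build_latency_actions(increasing, read_pct=10, write_pct=25)
print(json.dumps(actions, indent=2))
```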