Fault tolerance
You can use the following checks for the fault tolerance category.
Check names
- Amazon DocumentDB Single AZ Clusters
- Amazon EBS Snapshots
- Amazon ECS AmazonLogs driver in blocking mode
- Amazon ElastiCache Multi-AZ clusters
- Amazon MemoryDB Multi-AZ clusters
- Amazon MSK Cluster Multi-AZ
- Amazon RDS Backups
- Amazon S3 Bucket Logging
- Auto Scaling Group Health Check
- Auto Scaling Group Resources
- Amazon Direct Connect Location Resiliency
- Amazon Outposts Single Rack deployment
- CLB Connection Draining
- ELB Target Imbalance
- Load Balancer Optimization
- Network Firewall Multi-AZ
Amazon DocumentDB Single AZ Clusters
- Description
-
Checks if there are Amazon DocumentDB clusters configured as Single-AZ.
Running Amazon DocumentDB workloads in a Single-AZ architecture is not sufficient for highly critical workloads, and it can take up to 10 minutes to recover from a component failure. Deploy replica instances in additional Availability Zones to maintain availability during maintenance, instance failures, component failures, or Availability Zone failures.
Note
Results for this check are automatically refreshed one or more times each day, and refresh requests are not allowed. It might take a few hours for changes to appear. Currently, you can’t exclude resources from this check.
- Check ID
-
c15vnddn2x
- Alert Criteria
-
Yellow: Amazon DocumentDB cluster has instances in fewer than three Availability Zones.
Green: Amazon DocumentDB cluster has instances in three Availability Zones.
- Recommended Action
If your application requires high availability, modify your DB instance to enable Multi-AZ using replica instances. For more information, see Amazon DocumentDB High Availability and Replication.
- Additional Resources
- Report columns
-
-
Status
-
Region
-
Availability Zone
-
DB Cluster Identifier
-
DB Cluster ARN
-
Last Updated Time
-
Amazon EBS Snapshots
- Description
-
Checks the age of the snapshots for your Amazon EBS volumes (either available or in-use). Failures can occur even if Amazon EBS volumes are replicated. Snapshots are persisted to Amazon S3 for durable storage and point-in-time recovery.
- Check ID
-
H7IgTzjTYb
- Alert Criteria
-
-
Yellow: The most recent volume snapshot is between 7 and 30 days old.
-
Red: The most recent volume snapshot is more than 30 days old.
-
Red: The volume does not have a snapshot.
-
- Recommended Action
-
Create weekly or monthly snapshots of your volumes. For more information, see Creating an Amazon EBS Snapshot.
To automate the creation of EBS snapshots, you can consider using Amazon Backup or Amazon Data Lifecycle Manager.
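The alert criteria above can be read as a simple age threshold. The following is a minimal sketch, not the check's actual implementation: the function name `snapshot_status` is hypothetical, and the Green result for snapshots newer than 7 days is an assumption, because the criteria list only Yellow and Red.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def snapshot_status(latest_snapshot: Optional[datetime], now: datetime) -> str:
    """Map the age of a volume's most recent snapshot to a status color."""
    if latest_snapshot is None:
        return "Red"  # the volume does not have a snapshot
    age = now - latest_snapshot
    if age > timedelta(days=30):
        return "Red"  # more than 30 days old
    if age >= timedelta(days=7):
        return "Yellow"  # between 7 and 30 days old
    return "Green"  # assumption: not stated in the alert criteria


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
print(snapshot_status(now - timedelta(days=10), now))
```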
- Additional Resources
- Report columns
-
-
Status
-
Region
-
Volume ID
-
Volume Name
-
Snapshot ID
-
Snapshot Name
-
Snapshot Age
-
Volume Attachment
-
Reason
-
Amazon ECS AmazonLogs driver in blocking mode
- Description
-
Checks for Amazon ECS task definitions configured with the awslogs logging driver in blocking mode. In blocking mode, a container can stall while it waits to write logs, which puts system availability at risk.
Note
Results for this check are automatically refreshed one or more times each day, and refresh requests are not allowed. It might take a few hours for changes to appear. Currently, you can’t exclude resources from this check.
- Check ID
-
c1dvkm4z6b
- Alert Criteria
-
Yellow: The awslogs driver logging configuration parameter mode is set to blocking or is missing. A missing mode parameter indicates the default blocking configuration.
Green: Amazon ECS task definition is not using the awslogs driver or the awslogs driver is configured in non-blocking mode.
- Recommended Action
To mitigate the availability risk, consider changing the awslogs driver configuration in the task definition from blocking to non-blocking. With non-blocking mode, you need to set a value for the max-buffer-size parameter. For more information and guidance on configuration parameters, see Preventing log loss with non-blocking mode in the AmazonLogs container log driver.
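As an illustration of the criteria above, the following sketch scans a task definition document for containers that would be flagged. It assumes the standard task definition JSON layout (`containerDefinitions`, `logConfiguration`, `logDriver`, `options`). The function name, the log group names, and the `25m` buffer size are hypothetical values chosen for the example.

```python
from typing import Any, Dict, List


def blocking_awslogs_containers(task_definition: Dict[str, Any]) -> List[str]:
    """Return names of containers whose awslogs driver is in blocking mode."""
    flagged = []
    for container in task_definition.get("containerDefinitions", []):
        log_config = container.get("logConfiguration") or {}
        if log_config.get("logDriver") != "awslogs":
            continue  # other log drivers are outside the scope of this check
        options = log_config.get("options") or {}
        # A missing mode parameter is treated as the default blocking mode.
        if options.get("mode", "blocking") == "blocking":
            flagged.append(container["name"])
    return flagged


task_definition = {
    "containerDefinitions": [
        {   # mode is missing, so this container is flagged
            "name": "web",
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {"awslogs-group": "/ecs/web"},
            },
        },
        {   # non-blocking with an explicit buffer size, so not flagged
            "name": "worker",
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/worker",
                    "mode": "non-blocking",
                    "max-buffer-size": "25m",
                },
            },
        },
        {"name": "sidecar"},  # no log configuration at all
    ]
}

print(blocking_awslogs_containers(task_definition))  # ['web']
```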
- Additional Resources
-
Using the awslogs log driver
Choosing container logging options to avoid backpressure
Preventing log loss with non-blocking mode in the AmazonLogs container log driver
- Report columns
-
-
Status
-
Region
-
Task Definition ARN
-
Container Definition Names
-
Last Updated Time
-
Amazon ElastiCache Multi-AZ clusters
- Description
-
Checks for ElastiCache clusters that deploy in a single Availability Zone (AZ). This check alerts you if Multi-AZ is inactive in a cluster.
Deployments in multiple AZs enhance ElastiCache cluster availability by asynchronously replicating to read-only replicas in a different AZ. When planned cluster maintenance occurs, or a primary node is unavailable, ElastiCache automatically promotes a replica to primary. This failover allows cluster write operations to resume, and doesn't require an administrator to intervene.
Note
Results for this check are automatically refreshed several times daily, and refresh requests are not allowed. It might take a few hours for changes to appear. Currently, you can’t exclude resources from this check.
- Check ID
-
ECHdfsQ402
- Alert Criteria
-
-
Green: Multi-AZ is active in the cluster.
-
Yellow: Multi-AZ is inactive in the cluster.
-
- Recommended Action
-
Create at least one replica per shard, in an AZ that is different than the primary.
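To make the recommendation concrete, the following sketch lists the shards that still lack a replica outside the primary's AZ. The input shape (`name`, `primary_az`, `replica_azs`) is a hypothetical simplification for illustration and is not an ElastiCache API response format.

```python
from typing import Any, Dict, List


def shards_missing_cross_az_replica(shards: List[Dict[str, Any]]) -> List[str]:
    """Return names of shards with no replica in a different AZ than the primary."""
    missing = []
    for shard in shards:
        primary_az = shard["primary_az"]
        # A replica in the primary's own AZ does not protect against an AZ failure.
        if not any(az != primary_az for az in shard["replica_azs"]):
            missing.append(shard["name"])
    return missing


shards = [
    {"name": "0001", "primary_az": "az-a", "replica_azs": ["az-b"]},
    {"name": "0002", "primary_az": "az-a", "replica_azs": ["az-a"]},
    {"name": "0003", "primary_az": "az-b", "replica_azs": []},
]
print(shards_missing_cross_az_replica(shards))
```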
- Additional Resources
-
For more information, see Minimizing downtime in ElastiCache (Redis OSS) with Multi-AZ.
- Report columns
-
-
Status
-
Region
-
Cluster Name
-
Last Updated Time
-
Amazon MemoryDB Multi-AZ clusters
- Description
-
Checks for MemoryDB clusters that deploy in a single Availability Zone (AZ). This check alerts you if Multi-AZ is inactive in a cluster.
Deployments in multiple AZs enhance MemoryDB cluster availability by asynchronously replicating to read-only replicas in a different AZ. When planned cluster maintenance occurs, or a primary node is unavailable, MemoryDB automatically promotes a replica to primary. This failover allows cluster write operations to resume, and doesn't require an administrator to intervene.
Note
Results for this check are automatically refreshed several times daily, and refresh requests are not allowed. It might take a few hours for changes to appear. Currently, you can’t exclude resources from this check.
- Check ID
-
MDBdfsQ401
- Alert Criteria
-
-
Green: Multi-AZ is active in the cluster.
-
Yellow: Multi-AZ is inactive in the cluster.
-
- Recommended Action
-
Create at least one replica per shard, in an AZ that is different than the primary.
- Additional Resources
-
For more information, see Minimizing downtime in MemoryDB with Multi-AZ.
- Report columns
-
-
Status
-
Region
-
Cluster Name
-
Last Updated Time
-
Amazon MSK Cluster Multi-AZ
- Description
-
Checks the number of Availability Zones (AZs) for your Amazon MSK provisioned cluster. The Amazon MSK cluster is formed of several brokers that work together and distribute the data and load. Production might be interrupted during maintenance or broker issues in a 2-AZ cluster.
- Check ID
-
90046ff5b5
- Alert Criteria
-
-
Yellow: The Amazon MSK cluster is provisioned with brokers in only two AZs.
-
Green: The Amazon MSK cluster is provisioned with brokers across three or more AZs.
-
- Recommended Action
-
To increase the availability of the cluster, you can create another cluster in a three-AZ setup, and then migrate the existing cluster to the new one. You can use Amazon MSK replication for this migration.
- Additional Resources
- Report columns
-
-
Status
-
Region
-
MSK Cluster ARN
-
Number of AZs
-
Last Updated Time
-
Amazon RDS Backups
- Description
-
Checks for automated backups of Amazon RDS DB instances.
By default, backups are enabled with a retention period of one day. Backups reduce the risk of unexpected data loss and allow for point-in-time recovery.
- Check ID
-
opQPADkZvH
- Alert Criteria
-
Red: A DB instance has the backup retention period set to 0 days.
- Recommended Action
-
Set the retention period for the automated DB instance backup to 1 to 35 days as appropriate to the requirements of your application. See Working With Automated Backups.
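The retention rule can be expressed as a small validation helper. This is a sketch under stated assumptions: the function name is hypothetical, the 0 to 35 day range comes from the recommended action above, and the Green result is an assumption because the alert criteria list only Red.

```python
def backup_retention_status(retention_days: int) -> str:
    """Classify a DB instance by its automated backup retention period."""
    if not 0 <= retention_days <= 35:
        raise ValueError("retention period must be between 0 and 35 days")
    if retention_days == 0:
        return "Red"  # automated backups are effectively turned off
    return "Green"  # assumption: any retention of 1 to 35 days passes


print(backup_retention_status(0))
print(backup_retention_status(7))
```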
- Additional Resources
- Report columns
-
-
Status
-
Region/AZ
-
DB Instance
-
VPC ID
-
Backup Retention Period
-
Amazon S3 Bucket Logging
- Description
-
Checks the logging configuration of Amazon Simple Storage Service (Amazon S3) buckets.
When server access logging is enabled, detailed access logs are delivered hourly to a bucket that you choose. An access log record contains details about each request, such as the request type, the resources specified in the request, and the time and date the request was processed. By default, bucket logging is not enabled. You should enable logging if you want to perform security audits or learn more about users and usage patterns.
When logging is initially enabled, the configuration is automatically validated. However, future modifications can result in logging failures. This check examines explicit Amazon S3 bucket permissions, but it does not examine associated bucket policies that might override the bucket permissions.
- Check ID
-
BueAdJ7NrP
- Alert Criteria
-
-
Yellow: The bucket does not have server access logging enabled.
-
Yellow: The target bucket permissions do not include the root account, so Trusted Advisor cannot check it.
-
Red: The target bucket does not exist.
-
Red: The target bucket and the source bucket have different owners.
-
Red: The log deliverer does not have write permissions for the target bucket.
-
- Recommended Action
-
Enable bucket logging for most buckets. See Enabling Logging Using the Console and Enabling Logging Programmatically.
If the target bucket permissions do not include the root account and you want Trusted Advisor to check the logging status, add the root account as a grantee. See Editing Bucket Permissions.
If the target bucket does not exist, select an existing bucket as a target or create a new one and select it. See Managing Bucket Logging.
If the target and source have different owners, change the target bucket to one that has the same owner as the source bucket. See Managing Bucket Logging.
If the log deliverer does not have write permissions for the target (write not enabled), grant Upload/Delete permissions to the Log Delivery group. See Editing Bucket Permissions.
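The alert criteria and the report columns (Target Exists, Same Owner, Write Enabled) suggest a decision order like the one sketched below. The class, field names, and the order of evaluation are assumptions made for illustration, and the Green result is also an assumption because the criteria list only Yellow and Red.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class BucketLogging:
    target_name: Optional[str]     # None when server access logging is not enabled
    target_exists: bool = True
    target_readable: bool = True   # False when the root account is not a grantee
    same_owner: bool = True
    write_enabled: bool = True     # log deliverer can write to the target


def logging_status(bucket: BucketLogging) -> Tuple[str, str]:
    """Return a (status, reason) pair for one bucket's logging configuration."""
    if bucket.target_name is None:
        return "Yellow", "server access logging is not enabled"
    if not bucket.target_exists:
        return "Red", "target bucket does not exist"
    if not bucket.target_readable:
        return "Yellow", "target bucket permissions cannot be checked"
    if not bucket.same_owner:
        return "Red", "target and source buckets have different owners"
    if not bucket.write_enabled:
        return "Red", "log deliverer cannot write to the target bucket"
    return "Green", "logging is configured"  # assumption: not in the criteria


print(logging_status(BucketLogging(target_name="logs", write_enabled=False)))
```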
- Additional Resources
- Report columns
-
-
Status
-
Region
-
Bucket Name
-
Target Name
-
Target Exists
-
Same Owner
-
Write Enabled
-
Reason
-
Auto Scaling Group Health Check
- Description
-
Examines the health check configuration for Auto Scaling groups.
If Elastic Load Balancing is being used for an Auto Scaling group, the recommended configuration is to enable an Elastic Load Balancing health check. If an Elastic Load Balancing health check is not used, Auto Scaling can only act upon the health of the Amazon Elastic Compute Cloud (Amazon EC2) instance. Auto Scaling will not act on the application running on the instance.
- Check ID
-
CLOG40CDO8
- Alert Criteria
-
-
Yellow: An Auto Scaling group has an associated load balancer, but the Elastic Load Balancing health check is not enabled.
-
Yellow: An Auto Scaling group does not have an associated load balancer, but the Elastic Load Balancing health check is enabled.
-
- Recommended Action
-
If the Auto Scaling group has an associated load balancer, but the Elastic Load Balancing health check is not enabled, see Add an Elastic Load Balancing Health Check to your Auto Scaling Group.
If the Elastic Load Balancing health check is enabled, but no load balancer is associated with the Auto Scaling group, see Set Up an Auto-Scaled and Load-Balanced Application.
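Both Yellow cases describe a mismatch between the load balancer association and the health check type. A minimal sketch follows, assuming the Auto Scaling health check type is reported as `EC2` or `ELB`. The function name is hypothetical, and the Green result is an assumption because the criteria list only Yellow.

```python
def health_check_status(has_load_balancer: bool, health_check_type: str) -> str:
    """Flag Auto Scaling groups whose health check does not match their setup."""
    elb_check_enabled = health_check_type == "ELB"
    # Yellow when exactly one of the two is present without the other.
    if has_load_balancer != elb_check_enabled:
        return "Yellow"
    return "Green"  # assumption: not stated in the alert criteria


print(health_check_status(True, "EC2"))
print(health_check_status(True, "ELB"))
```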
- Additional Resources
- Report columns
-
-
Status
-
Region
-
Auto Scaling Group Name
-
Load Balancer Associated
-
Health Check
-
Auto Scaling Group Resources
- Description
-
Checks the availability of resources associated with launch configurations and your Auto Scaling groups.
Auto Scaling groups that point to unavailable resources cannot launch new Amazon Elastic Compute Cloud (Amazon EC2) instances. When properly configured, Auto Scaling causes the number of Amazon EC2 instances to increase seamlessly during demand spikes, and decrease automatically during demand lulls. Auto Scaling groups and launch configurations that point to unavailable resources do not operate as intended.
- Check ID
-
8CNsSllI5v
- Alert Criteria
-
-
Red: An Auto Scaling group is associated with a deleted load balancer.
-
Red: A launch configuration is associated with a deleted Amazon Machine Image (AMI).
-
- Recommended Action
-
If the load balancer has been deleted, either create a new load balancer or target group then associate it to the Auto Scaling group, or create a new Auto Scaling group without the load balancer. For information about creating a new Auto Scaling group with a new load balancer, see Set Up an Auto-Scaled and Load-Balanced Application. For information about creating a new Auto Scaling group without a load balancer, see Create Auto Scaling Group in Getting Started With Auto Scaling Using the Console.
If the AMI has been deleted, create a new launch template or launch template version using a valid AMI and associate it with an Auto Scaling group. See Create Launch Configuration in Getting Started With Auto Scaling Using the Console.
- Additional Resources
- Report columns
-
-
Status
-
Region
-
Auto Scaling Group Name
-
Launch Type
-
Resource Type
-
Resource Name
-
Amazon Direct Connect Location Resiliency
- Description
-
Checks the resilience of the Amazon Direct Connect connections that link your on-premises environment to each Direct Connect gateway or virtual private gateway.
This check alerts you if any Direct Connect gateway or virtual private gateway isn't configured with virtual interfaces across at least two distinct Direct Connect locations. Lack of location resiliency can result in unexpected downtime during maintenance, a fiber cut, a device failure, or a complete location failure.
Note
Results for this check are automatically refreshed several times daily, and refresh requests are not allowed. It might take a few hours for changes to appear.
Note
Direct Connect is implemented with Transit Gateway using Direct Connect gateway.
- Check ID
-
c1dfpnchv2
- Alert Criteria
-
Red: The Direct Connect gateway or virtual private gateway is configured with one or more virtual interfaces on a single Direct Connect device.
Yellow: The Direct Connect gateway or virtual private gateway is configured with virtual interfaces across multiple Direct Connect devices in a single Direct Connect location.
Green: The Direct Connect gateway or virtual private gateway is configured with virtual interfaces across two or more distinct Direct Connect locations.
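The three levels above depend only on how many distinct locations and devices terminate a gateway's virtual interfaces. The following sketch assumes each virtual interface is represented as a hypothetical `(location, device)` pair. The identifiers in the example are made up.

```python
from typing import Iterable, Tuple


def location_resiliency(virtual_interfaces: Iterable[Tuple[str, str]]) -> str:
    """Classify one gateway from the (location, device) of each virtual interface."""
    endpoints = set(virtual_interfaces)  # distinct (location, device) pairs
    if not endpoints:
        raise ValueError("gateway has no virtual interfaces")
    locations = {location for location, _ in endpoints}
    if len(locations) >= 2:
        return "Green"   # two or more distinct locations
    if len(endpoints) >= 2:
        return "Yellow"  # multiple devices, but all in a single location
    return "Red"         # everything terminates on a single device


print(location_resiliency([("LOC1", "dev-1"), ("LOC1", "dev-2")]))
```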
- Recommended Action
To build Direct Connect location resiliency, you can configure the Direct Connect gateway or virtual private gateway to connect to at least two distinct Direct Connect locations. For more information, see Amazon Direct Connect Resiliency Recommendation.
- Additional Resources
- Report columns
-
-
Status
-
Region
-
Last Updated Time
-
Resiliency Status
-
Location
-
Connection ID
-
Gateway ID
-
Amazon Outposts Single Rack deployment
- Description
-
Checks how Outposts racks are balanced. This evaluates whether a customer's Outposts instances are deployed across multiple Outposts racks or to a single Outposts rack. A single Outposts rack creates a single point of failure for issues that involve a single rack (for example, environmental failures). These scenarios can be mitigated by deploying Outposts across multiple racks.
- Check ID
-
c243hjzrhn
- Alert Criteria
-
-
Yellow: Your Outpost is deployed on a single rack.
-
Green: Your Outpost is deployed across multiple racks.
-
- Recommended Action
-
If you are running production workloads on Amazon Outposts, it's a best practice to use a resilient architecture. A single Amazon Outposts rack creates a single point of failure. Consider adding a second Amazon Outposts rack to that location with enough capacity for a failover event, and then distribute workloads across racks.
- Additional Resources
- Report columns
-
-
Status
-
Resource ARN
-
AZ
-
Number of Racks
-
Last Updated Time
-
CLB Connection Draining
- Description
-
Checks for Classic load balancers that do not have connection draining enabled.
When connection draining is not enabled and you deregister an Amazon EC2 instance from a Classic load balancer, the Classic load balancer stops routing traffic to that instance and closes the connection. When connection draining is enabled, the Classic load balancer stops sending new requests to the deregistered instance but keeps the connection open to serve active requests.
- Check ID
-
7qGXsKIUw
- Alert Criteria
-
-
Yellow: Connection draining is not enabled for a Classic load balancer.
-
Green: Connection draining is enabled for the Classic load balancer.
-
- Recommended Action
-
Enable connection draining for the Classic load balancer. For more information, see Connection Draining and Enable or Disable Connection Draining for Your Load Balancer.
- Additional Resources
- Report columns
-
-
Status
-
Region
-
Load Balancer Name
-
Reason
-
ELB Target Imbalance
- Description
-
Checks the target groups’ target distribution across Availability Zones (AZs) for Application Load Balancer (ALB), Network Load Balancer (NLB), and Gateway Load Balancer (GWLB).
This check doesn’t include load balancers that are configured with a single AZ and where the difference in the number of targets between the most and least populated AZs is less than or equal to 1.
- Check ID
-
b92b83d667
- Alert Criteria
-
-
Red: A single AZ represents more than 66% of the load balancer capacity.
-
Yellow: A single AZ represents more than 50% of the load balancer capacity.
-
Green: No AZ represents more than 50% of the load balancer capacity.
-
- Recommended Action
-
For better resilience, make sure that your target groups have the same number of targets in each AZ.
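The thresholds above can be applied to per-AZ target counts as sketched below. Several points are assumptions: capacity is approximated by the number of registered targets, a group is treated as excluded (returns `None`) when it has a single AZ or when the most and least populated AZs differ by at most 1, and the function name is hypothetical.

```python
from typing import Dict, Optional


def target_imbalance_status(targets_per_az: Dict[str, int]) -> Optional[str]:
    """Classify a target group by the share of targets in its largest AZ."""
    counts = list(targets_per_az.values())
    total = sum(counts)
    # Assumed exclusions: single AZ, no targets, or a difference of 1 or less.
    if len(counts) < 2 or total == 0 or max(counts) - min(counts) <= 1:
        return None
    largest = max(counts)
    # Integer arithmetic avoids floating-point edge cases at the thresholds.
    if largest * 100 > 66 * total:
        return "Red"
    if largest * 100 > 50 * total:
        return "Yellow"
    return "Green"


print(target_imbalance_status({"az-a": 5, "az-b": 3}))
```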
- Additional Resources
-
Target groups for your Application Load Balancers
Register targets with your Application Load Balancer target group
- Report columns
-
-
Status
-
Region
-
Load Balancer Name
-
Load Balancer Type
-
Target Group ARN (arn)
-
Difference in registered targets across AZs
-
Last Updated Time
-
Load Balancer Optimization
- Description
-
Checks your load balancer configuration.
To help increase the level of fault tolerance in Amazon Elastic Compute Cloud (Amazon EC2) when using Elastic Load Balancing, we recommend running an equal number of instances across multiple Availability Zones in a Region. A load balancer that is configured accrues charges, so this is a cost-optimization check as well.
- Check ID
-
iqdCTZKCUp
- Alert Criteria
-
-
Yellow: A load balancer is enabled for a single Availability Zone.
-
Yellow: A load balancer is enabled for an Availability Zone that has no active instances.
-
Yellow: The Amazon EC2 instances that are registered with a load balancer are unevenly distributed across Availability Zones. (The difference between the highest and lowest instance counts in utilized Availability Zones is more than 1, and the difference is more than 20% of the highest count.)
-
- Recommended Action
-
Ensure that your load balancer points to active and healthy instances in at least two Availability Zones. For more information, see Add Availability Zone.
If your load balancer is configured for an Availability Zone with no healthy instances, or if there is an imbalance of instances across the Availability Zones, determine if all the Availability Zones are necessary. Omit any unnecessary Availability Zones and ensure there is a balanced distribution of instances across the remaining Availability Zones. For more information, see Remove Availability Zone.
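The uneven-distribution rule combines an absolute and a relative threshold. The following sketch shows one reading of it, assuming that a "utilized" Availability Zone is one with at least one registered instance. The function name is hypothetical.

```python
from typing import Dict


def unevenly_distributed(instances_per_az: Dict[str, int]) -> bool:
    """Apply the 'more than 1 and more than 20% of the highest count' rule."""
    # Assumption: only zones that hold at least one instance count as utilized.
    counts = [count for count in instances_per_az.values() if count > 0]
    if len(counts) < 2:
        return False
    highest, lowest = max(counts), min(counts)
    difference = highest - lowest
    # difference > 20% of highest, written with integers: 5 * difference > highest
    return difference > 1 and 5 * difference > highest


print(unevenly_distributed({"a": 10, "b": 8}))
print(unevenly_distributed({"a": 10, "b": 7}))
```

With 10 and 8 instances the difference of 2 equals 20% of the highest count, so the rule does not trigger. With 10 and 7 it does.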
- Additional Resources
- Report columns
-
-
Status
-
Region
-
Load Balancer Name
-
# of Zones
-
Zone a Instances
-
Zone b Instances
-
Zone c Instances
-
Zone d Instances
-
Zone e Instances
-
Zone f Instances
-
Reason
-
Network Firewall Multi-AZ
- Description
-
Checks if your Network Firewalls are configured to use more than one Availability Zone (AZ) for firewall endpoints.
An AZ is a distinct location that’s insulated from failures in other zones. If the Network Firewall endpoint is deployed in only one AZ, it can be a single point of failure and can impair workloads in other AZs that use the Network Firewall for traffic inspection. It’s a best practice to configure your Network Firewalls in multiple AZs in the same Region to improve your workload availability.
- Check ID
-
c2vlfg0gqd
- Alert Criteria
-
-
Yellow: Network Firewall endpoint is deployed in 1 AZ.
-
Green: Network Firewall endpoints are deployed in at least two AZs.
-
- Recommended Action
-
Make sure that your Network Firewall is configured with at least two AZs for production workloads.
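Because the check counts Availability Zones rather than subnets, two firewall subnets in the same AZ still count as one. A minimal sketch, assuming you already have the AZ of each firewall subnet (the report columns Network Firewall Subnets and Network Firewall Subnets AZs suggest this pairing). The function name and AZ labels are hypothetical.

```python
from typing import Iterable


def firewall_az_status(subnet_azs: Iterable[str]) -> str:
    """Classify a firewall by the number of distinct AZs its subnets cover."""
    distinct_azs = set(subnet_azs)
    if not distinct_azs:
        raise ValueError("firewall has no subnet mappings")
    return "Green" if len(distinct_azs) >= 2 else "Yellow"


# Two subnets in the same AZ are still a single point of failure.
print(firewall_az_status(["az-a", "az-a"]))
print(firewall_az_status(["az-a", "az-b"]))
```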
- Additional Resources
-
VPC subnet configuration for Amazon Network Firewall
Amazon Well-Architected Tool - Deploy the workload to multiple locations
- Report columns
-
-
Status
-
Region
-
Network Firewall Arn
-
VPC Id
-
Network Firewall Subnets
-
Network Firewall Subnets AZs
-
Last Updated Time
-