Unified Operations Getting started: Onboard critical alarms to rapid incident management - Amazon Web Services Support

Unified Operations Getting started: Onboard critical alarms to rapid incident management

To be notified quickly of critical incidents, complete the following steps to onboard your alarms to Amazon Incident Detection and Response:

  1. Define and configure your critical alarms for rapid incident management. For detailed information, see Define and configure alarms in Incident Detection and Response in the Incident Detection and Response User Guide.

    1. For steps to set up alarms using Amazon CloudWatch, see Define and configure alarms in Incident Detection and Response in the Incident Detection and Response User Guide. For Amazon recommendations on critical alarm types for various Amazon Web Services services, see Incident Detection and Response (IDR). Contact your Amazon Unified Operations team if you want Amazon to automate the creation of critical Amazon alarms for your tagged Amazon resources.

    2. To redirect or ingest critical alarms from third-party APM tools that have direct Amazon EventBridge integration, such as Datadog and New Relic, see Ingest alarms from APMs that have direct integration with Amazon EventBridge in the Amazon Incident Detection and Response User Guide. You must deploy a set of Amazon resources (Amazon Lambda functions and Amazon EventBridge event bus rules) to transform and redirect your alarm (event) to Amazon Incident Detection and Response. Your Amazon Unified Operations team can provide the Amazon CloudFormation template to install these resources.

    3. To redirect or ingest critical alarms from your custom monitoring tool through a third-party APM tool that doesn’t have direct integration with Amazon EventBridge, such as DataDog, NewRelic, and so on, see Ingest alarms from APMs that have direct integration with Amazon EventBridge in the Amazon Incident Detection and Response User Guide. You must deploy a set of Amazon resources (API Gateway, Amazon Lambda functions, and Amazon EventBridge event bus rules) to transform and redirect your alarm (event) to Amazon Incident Detection and Response. Your Amazon Unified Operations team can provide the Amazon CloudFormation template to install these resources.

  2. Provide workload architecture details, point-of-contact information, and runbook information on mitigation actions for critical alarms. To do this, complete the following step:

    1. Download and complete the Amazon Incident Detection and Response Workload onboarding questionnaire for each critical workload or application and the Alarm ingestion questionnaire related to each unique workload.

      The information in these questionnaires helps the Amazon team develop an incident remediation runbook. This runbook enables appropriate actions to be taken to quickly troubleshoot and remediate critical alarms before they cause business downtime. For examples and sample information, see Workload onboarding and alarm ingestion questionnaires in Amazon Incident Detection and Response.

  3. Provide access to onboard your critical alarms to Amazon Incident Detection and Response:

    1. Deploy the AWSServiceRoleForHealth_EventProcessor service-linked role (SLR) in your Amazon Web Services account running the critical workload to be monitored by the Amazon incident management team. For more information, see Provision access for alert ingestion to Amazon Incident Detection and Response.

      Note

      To assist you with onboarding a large number of Amazon Web Services accounts, Amazon can provide you with an Amazon Command Line Interface script to fast-track the provisioning of this SLR.

    2. (Optional) If your alarms are in Amazon CloudWatch, make sure that the Amazon Identity and Access Management (IAM) user or role that's used for alarm testing has the cloudwatch:SetAlarmState permission in the Amazon Web Services account that's running the critical workload. This permission is needed for alarm testing (gameday), which takes place after onboarding and before go-live. For more information, see Test onboarded workloads in Amazon Incident Detection and Response.

  4. Create an Amazon Web Services Support case to subscribe a workload for rapid incident management. Your Amazon Web Services account is automatically enabled for inbound rapid incident management, which means that you can raise a case to the Unified Operations Incident Detection and Response queue through the Support Center Console, the Amazon Command Line Interface, or the Amazon SDK for quick action. For Amazon to proactively monitor and create incidents with an outbound Amazon Web Services Support case, create an Amazon Web Services Support case for your critical workload. To do this, complete the following steps:

    1. Sign in to the Amazon Support Center Console, select Create case, and then select Technical support.

    2. For Service, select Incident Detection and Response.

    3. For Category, select Onboard new workload.

    4. For Severity, select General guidance.

    5. Attach the Workload and Alarm questionnaires that you completed in step 2.
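
For the optional alarm-testing permission in step 3, the following is a minimal sketch of an IAM policy document that allows only the cloudwatch:SetAlarmState action on specific alarms. The helper name build_alarm_test_policy, the statement ID, the account ID, the Region, and the alarm name are all hypothetical placeholders, and the aws-cn partition default is an assumption based on this page covering the China Regions. Confirm the exact policy that you need with your Amazon Unified Operations team before you attach it to the user or role used for alarm testing.

```python
import json


def build_alarm_test_policy(account_id, region, alarm_names, partition="aws-cn"):
    """Build an IAM policy document that allows cloudwatch:SetAlarmState
    on the named CloudWatch alarms only.

    All arguments are illustrative; the partition default is an assumption.
    """
    # CloudWatch alarm ARNs follow the pattern:
    # arn:<partition>:cloudwatch:<region>:<account-id>:alarm:<alarm-name>
    resources = [
        f"arn:{partition}:cloudwatch:{region}:{account_id}:alarm:{name}"
        for name in alarm_names
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowAlarmStateTesting",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": "cloudwatch:SetAlarmState",
                "Resource": resources,
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID, Region, and alarm name.
    policy = build_alarm_test_policy(
        "111122223333", "cn-north-1", ["checkout-api-5xx-critical"]
    )
    print(json.dumps(policy, indent=2))
```

Scoping the Resource element to the alarms under test, rather than using a wildcard, keeps the testing identity from changing the state of unrelated alarms in the same account.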