Using CloudWatch Alarms with Amazon Kinesis Data Analytics for Apache Flink

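Using Amazon CloudWatch metric alarms, you watch a CloudWatch metric over a time period that you specify. The alarm performs one or more actions based on the value of the metric or expression relative to a threshold over a number of time periods. An example of an action is sending a notification to an Amazon Simple Notification Service (Amazon SNS) topic.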

For more information about CloudWatch alarms, see Using Amazon CloudWatch Alarms.
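As a rough illustration of how such an alarm is defined, the following Python sketch builds the arguments for the CloudWatch `PutMetricAlarm` API for an application's `downtime` metric. The helper name `downtime_alarm_params`, the application name, the SNS topic ARN, the 60-second period, and the single evaluation period are hypothetical choices made for this example. The `AWS/KinesisAnalytics` namespace and the `Application` dimension are assumptions that you should verify against the metrics your application actually publishes.

```python
def downtime_alarm_params(application_name, sns_topic_arn):
    """Build keyword arguments for CloudWatch PutMetricAlarm (illustrative only)."""
    return {
        "AlarmName": f"{application_name}-downtime",
        # Assumed namespace and dimension; confirm them in the CloudWatch console.
        "Namespace": "AWS/KinesisAnalytics",
        "MetricName": "downtime",
        "Dimensions": [{"Name": "Application", "Value": application_name}],
        "Statistic": "Average",
        "Period": 60,             # seconds per evaluated period (example value)
        "EvaluationPeriods": 1,   # periods that must breach before alarming
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        # Action taken when the alarm fires: notify an SNS topic.
        "AlarmActions": [sns_topic_arn],
    }


# Placeholder values, not real resources.
params = downtime_alarm_params(
    "my-flink-app", "arn:aws:sns:us-east-1:123456789012:my-topic"
)

# To create the alarm (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameters in a plain dictionary makes it possible to review or unit test the alarm definition before anything is sent to AWS.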

This section contains the recommended alarms for monitoring Kinesis Data Analytics applications.

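The table describes the recommended alarms and has the following columns:

  • Metric expression: The metric or metric expression to test against the threshold.

  • Statistic: The statistic used to check the metric, for example, Average.

  • Threshold: Using this alarm requires you to determine a threshold that defines the limit of expected application performance. You need to determine this threshold by monitoring your application under normal conditions.

  • Description: Causes that might trigger this alarm, and possible solutions for the condition.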
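One possible way to derive a threshold from observations taken under normal conditions is sketched below. The function name `baseline_thresholds` and the optional `margin` parameter are hypothetical and are not part of any AWS API. The sketch assumes non-negative metric samples and simply takes the observed minimum (for alarms that fire when a metric falls below a threshold) and the observed maximum (for alarms that fire when a metric rises above it), optionally widened by a safety margin.

```python
def baseline_thresholds(samples, margin=0.0):
    """Derive candidate alarm thresholds from metric samples collected
    while the application was running normally.

    samples: non-negative metric values observed under normal conditions.
    margin:  fraction by which to widen the observed range (hypothetical knob).
    """
    if not samples:
        raise ValueError("at least one baseline sample is required")
    low, high = min(samples), max(samples)
    return {
        # Candidate for "metric < threshold" alarms, such as throughput.
        "lower": low * (1 - margin),
        # Candidate for "metric > threshold" alarms, such as latency.
        "upper": high * (1 + margin),
    }


# Example with made-up records-per-second samples and a 50 percent margin.
thresholds = baseline_thresholds([120.0, 150.0, 135.0], margin=0.5)
```

A wider margin reduces false alarms at the cost of slower detection, so the right value depends on how variable the application's normal behavior is.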

| Metric expression | Statistic | Threshold | Description |
| --- | --- | --- | --- |
| downtime > 0 | Average | 0 | Recommended for all applications. The downtime metric measures the duration of an outage. A downtime greater than zero indicates that the application has failed. For troubleshooting, see Application is restarting. |
| RATE(numberOfFailedCheckpoints) > 0 | Average | 0 | Recommended for all applications. Use this metric to monitor application health and checkpointing progress. The application saves state data to checkpoints when it's healthy. Checkpointing can fail due to timeouts if the application isn't making progress in processing the input data. For troubleshooting, see Checkpointing is timing out. |
| Operator.numRecordsOutPerSecond < threshold | Average | The minimum number of records emitted from the application during normal conditions. | Recommended for all applications. Falling below this threshold can indicate that the application isn't making expected progress on the input data. For troubleshooting, see Throughput is too slow. |
| records_lag_max or millisBehindLatest > threshold | Maximum | The maximum expected latency during normal conditions. | Recommended for all applications. Use the records_lag_max metric for a Kafka source, or the millisBehindLatest metric for a Kinesis stream source. Rising above this threshold can indicate that the application isn't making expected progress on the input data. For troubleshooting, see Throughput is too slow. |
| lastCheckpointDuration > threshold | Maximum | The maximum expected checkpoint duration during normal conditions. | If lastCheckpointDuration continuously increases, rising above this threshold can indicate that the application isn't making expected progress on the input data, or that there are problems with application health such as backpressure. For troubleshooting, see Application state data is accumulating. |
| lastCheckpointSize > threshold | Maximum | The maximum expected checkpoint size during normal conditions. | If lastCheckpointSize continuously increases, rising above this threshold can indicate that the application is accumulating state data. If the state data becomes too large, the application can run out of memory when recovering from a checkpoint, or recovering from a checkpoint might take too long. For troubleshooting, see Application state data is accumulating. |
| heapMemoryUtilization > threshold | Maximum | The maximum expected heapMemoryUtilization value during normal conditions, with a recommended value of 90 percent. | You can use this metric to monitor the maximum memory utilization of task managers across the application. If the application reaches this threshold, you need to provision more resources. You do this by enabling automatic scaling or increasing the application parallelism. For more information about increasing resources, see Scaling. |
| cpuUtilization > threshold | Maximum | The maximum expected cpuUtilization value during normal conditions, with a recommended value of 80 percent. | You can use this metric to monitor the maximum CPU utilization of task managers across the application. If the application reaches this threshold, you need to provision more resources. You do this by enabling automatic scaling or increasing the application parallelism. For more information about increasing resources, see Scaling. |
| threadsCount > threshold | Maximum | The maximum expected threadsCount value during normal conditions. | You can use this metric to watch for thread leaks in task managers across the application. If this metric reaches this threshold, check your application code for threads being created without being closed. |
| (oldGarbageCollectionTime * 100)/60_000 over a 1-minute period > threshold | Maximum | The maximum expected oldGarbageCollectionTime duration. We recommend setting a threshold such that typical garbage collection time is 60 percent of the specified threshold, but the correct threshold for your application will vary. | If this metric is continually increasing, this can indicate that there is a memory leak in task managers across the application. |
| RATE(oldGarbageCollectionCount) > threshold | Maximum | The maximum expected oldGarbageCollectionCount under normal conditions. The correct threshold for your application will vary. | If this metric is continually increasing, this can indicate that there is a memory leak in task managers across the application. |
| Operator.currentOutputWatermark - Operator.currentInputWatermark > threshold | Minimum | The minimum expected watermark increment under normal conditions. The correct threshold for your application will vary. | If this metric is continually increasing, this can indicate that either the application is processing increasingly older events, or that an upstream subtask has not sent a watermark in an increasingly long time. |
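Several of the recommended alarms test a metric expression rather than a raw metric. The Python sketch below shows one way such an alarm might be described for the CloudWatch `PutMetricAlarm` API, using its `Metrics` list with a metric math expression. The helper names `expression_alarm_params` and `gc_time_percent`, the alarm names, and the threshold values are hypothetical. The `AWS/KinesisAnalytics` namespace and `Application` dimension are assumptions to verify for your application. The garbage collection expression divides by 60000 because a 1-minute period contains 60,000 milliseconds, so the result is the percentage of the period spent in old-generation garbage collection.

```python
NAMESPACE = "AWS/KinesisAnalytics"  # assumed namespace; verify in CloudWatch


def gc_time_percent(old_gc_time_ms):
    """Percentage of a 1-minute period (60,000 ms) spent in old-generation GC."""
    return old_gc_time_ms * 100 / 60000


def expression_alarm_params(application_name, alarm_suffix, metric_name,
                            stat, expression, threshold, period=60):
    """Build PutMetricAlarm arguments for an alarm on a metric math expression.

    The expression refers to the raw metric by the id "m1".
    """
    return {
        "AlarmName": f"{application_name}-{alarm_suffix}",
        "Metrics": [
            {   # Raw metric, used only as input to the expression.
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": NAMESPACE,
                        "MetricName": metric_name,
                        "Dimensions": [
                            {"Name": "Application", "Value": application_name}
                        ],
                    },
                    "Period": period,
                    "Stat": stat,
                },
                "ReturnData": False,
            },
            {   # Expression whose result is compared with the threshold.
                "Id": "e1",
                "Expression": expression,
                "Label": alarm_suffix,
                "ReturnData": True,
            },
        ],
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Alarm when the rate of failed checkpoints rises above zero.
failed_checkpoints = expression_alarm_params(
    "my-flink-app", "failed-checkpoints", "numberOfFailedCheckpoints",
    "Average", "RATE(m1)", 0)

# Alarm when old-generation GC time exceeds an example 10 percent of a minute.
gc_time = expression_alarm_params(
    "my-flink-app", "old-gc-time", "oldGarbageCollectionTime",
    "Maximum", "(m1 * 100) / 60000", 10)

# To create the alarms (requires boto3 and AWS credentials):
#   import boto3
#   cloudwatch = boto3.client("cloudwatch")
#   cloudwatch.put_metric_alarm(**failed_checkpoints)
#   cloudwatch.put_metric_alarm(**gc_time)
```

Only the entry with `ReturnData` set to `True` is evaluated against the threshold; the raw metric serves as input to the expression.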