
Monitoring AWS Glue Using Amazon CloudWatch Metrics

You can profile and monitor AWS Glue operations using the AWS Glue job profiler. It collects raw data from AWS Glue and processes it into readable, near real-time metrics stored in Amazon CloudWatch. These statistics are retained and aggregated in CloudWatch so that you can access historical information for a better perspective on how your application is performing.

Overview of AWS Glue Metrics

When you interact with AWS Glue, it sends metrics to CloudWatch. You can view these metrics using the AWS Glue console (the preferred method), the CloudWatch console dashboard, or the AWS Command Line Interface (AWS CLI).

To view metrics using the AWS Glue console dashboard

You can view summary or detailed graphs of metrics for a job, or detailed graphs for a job run. For details about the graphs and metrics you can access in the AWS Glue console dashboard, see Working with Jobs on the AWS Glue Console.

  1. Sign in to the AWS Management Console and open the AWS Glue console at https://console.amazonaws.cn/glue/.

  2. In the navigation pane, choose Jobs.

  3. Choose a job from the Jobs list.

  4. Choose the Metrics tab.

  5. Choose View additional metrics to see more detailed metrics.

To view metrics using the CloudWatch console dashboard

Metrics are grouped first by the service namespace, and then by the various dimension combinations within each namespace.

  1. Open the CloudWatch console at https://console.amazonaws.cn/cloudwatch/.

  2. In the navigation pane, choose Metrics.

  3. Choose the Glue namespace.

To view metrics using the AWS CLI

  • At a command prompt, use the following command.

    aws cloudwatch list-metrics --namespace "Glue"

AWS Glue reports metrics to CloudWatch every 30 seconds, and the CloudWatch metrics dashboards are configured to display them every minute. The AWS Glue metrics represent delta values from the previously reported values. Where appropriate, metrics dashboards aggregate (sum) the 30-second values to obtain a value for the entire last minute. AWS Glue metrics are enabled at initialization of a GlueContext in a script, and are generally updated only at the end of an Apache Spark task. They represent the aggregate values across all completed Spark tasks so far.

However, the Spark metrics that AWS Glue passes on to CloudWatch are generally absolute values representing the current state at the time they are reported. AWS Glue reports them to CloudWatch every 30 seconds, and the metrics dashboards generally show the average across the data points received in the last 1 minute.
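
The two aggregation rules can be sketched as follows; the 30-second sample values are hypothetical:

```python
# A minimal sketch of how one dashboard minute is formed: delta
# ("count") metrics are summed, absolute ("gauge") Spark metrics are
# averaged. Sample values below are made up for illustration.

def aggregate_minute(samples, metric_type):
    """Combine the 30-second samples that fall within one minute."""
    if metric_type == "count":
        return sum(samples)                  # SUM of delta values
    if metric_type == "gauge":
        return sum(samples) / len(samples)   # average of absolute values
    raise ValueError(f"unknown metric type: {metric_type}")

# Two 30-second reports landing in the same dashboard minute:
bytes_read = aggregate_minute([1_000_000, 2_500_000], "count")  # summed
heap_usage = aggregate_minute([0.42, 0.48], "gauge")            # averaged
print(bytes_read, heap_usage)
```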

All AWS Glue metric names are preceded by one of the following types of prefix:

  • glue.driver. – Metrics whose names begin with this prefix either represent AWS Glue metrics that are aggregated from all executors at the Spark driver, or Spark metrics corresponding to the Spark driver.

  • glue.executorId. – executorId is the number of a specific Spark executor. It corresponds with the executors listed in the logs.

  • glue.ALL. – Metrics whose names begin with this prefix aggregate values from all Spark executors.
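
The three prefixes can be told apart mechanically; the metric names below are illustrative, not a complete list:

```python
# Classify metric names by the scope segment that follows "glue.".
# The names here are hypothetical examples of the three prefix kinds.
metrics = [
    "glue.driver.jvm.heap.usage",
    "glue.1.jvm.heap.usage",
    "glue.2.jvm.heap.usage",
    "glue.ALL.jvm.heap.usage",
]

def prefix_kind(name):
    scope = name.split(".")[1]
    if scope == "driver":
        return "driver"
    if scope == "ALL":
        return "all-executors"
    return f"executor-{scope}"   # an executorId such as "1" or "2"

print([prefix_kind(m) for m in metrics])
```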

AWS Glue Metrics

AWS Glue profiles and sends the following metrics to CloudWatch every 30 seconds, and the AWS Glue Metrics Dashboard reports them once a minute:


glue.driver.aggregate.bytesRead

The number of bytes read from all data sources by all completed Spark tasks running in all executors.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Bytes

Can be used to monitor:

  • Bytes read.

  • Job progress.

  • JDBC data sources.

  • Job Bookmark Issues.

  • Variance across Job Runs.

This metric can be used the same way as the glue.ALL.s3.filesystem.read_bytes metric, with the difference that this metric is updated at the end of a Spark task and captures non-S3 data sources as well.

glue.driver.aggregate.elapsedTime

The ETL elapsed time in milliseconds (does not include the job bootstrap times).

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Milliseconds

Can be used to determine how long, on average, a job run takes.

Some ways to use the data:

  • Set alarms for stragglers.

  • Measure variance across job runs.
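
A straggler check of this kind can be sketched with hypothetical run times; the 2x-mean threshold is an assumption to tune for your workload, not an AWS default:

```python
# Hypothetical glue.driver.aggregate.elapsedTime totals (ms) for recent
# runs of the same job. The last run is a straggler.
elapsed_ms = [180_000, 175_000, 190_000, 610_000]

mean = sum(elapsed_ms) / len(elapsed_ms)
threshold = 2 * mean  # assumed alarm level
stragglers = [t for t in elapsed_ms if t > threshold]
print(stragglers)
```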

glue.driver.aggregate.numCompletedStages

The number of completed stages in the job.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

  • Job progress.

  • Per-stage timeline of job execution, when correlated with other metrics.

Some ways to use the data:

  • Identify demanding stages in the execution of a job.

  • Set alarms for correlated spikes (demanding stages) across job runs.

glue.driver.aggregate.numCompletedTasks

The number of completed tasks in the job.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

  • Job progress.

  • Parallelism within a stage.

glue.driver.aggregate.numFailedTasks

The number of failed tasks.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

  • Data abnormalities that cause job tasks to fail.

  • Cluster abnormalities that cause job tasks to fail.

  • Script abnormalities that cause job tasks to fail.

The data can be used to set alarms for increased failures that might suggest abnormalities in the data, cluster, or scripts.

glue.driver.aggregate.numKilledTasks

The number of tasks killed.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

  • Abnormalities in Data Skew that result in exceptions (OOMs) that kill tasks.

  • Script abnormalities that result in exceptions (OOMs) that kill tasks.

Some ways to use the data:

  • Set alarms for increased failures indicating data abnormalities.

  • Set alarms for increased failures indicating cluster abnormalities.

  • Set alarms for increased failures indicating script abnormalities.

glue.driver.aggregate.recordsRead

The number of records read from all data sources by all completed Spark tasks running in all executors.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

  • Records read.

  • Job progress.

  • JDBC data sources.

  • Job Bookmark Issues.

  • Skew in Job Runs over days.

This metric can be used in a similar way to the glue.ALL.s3.filesystem.read_bytes metric, with the difference that this metric is updated at the end of a Spark task.

glue.driver.aggregate.shuffleBytesWritten

The number of bytes written by all executors to shuffle data between them since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes written for this purpose during the previous minute).

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Bytes

Can be used to monitor: Data shuffle in jobs (large joins, groupBy, repartition, coalesce).

Some ways to use the data:

  • Repartition or decompress large input files before further processing.

  • Repartition data more uniformly to avoid hot keys.

  • Pre-filter data before joins or groupBy operations.

glue.driver.aggregate.shuffleLocalBytesRead

The number of bytes read by all executors to shuffle data between them since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes read for this purpose during the previous minute).

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Bytes

Can be used to monitor: Data shuffle in jobs (large joins, groupBy, repartition, coalesce).

Some ways to use the data:

  • Repartition or decompress large input files before further processing.

  • Repartition data more uniformly to avoid hot keys.

  • Pre-filter data before joins or groupBy operations.

glue.driver.BlockManager.disk.diskSpaceUsed_MB

The number of megabytes of disk space used across all executors.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (gauge).

Valid Statistics: Average. This is a Spark metric, reported as an absolute value.

Unit: Megabytes

Can be used to monitor:

  • Disk space used for blocks that represent cached RDD partitions.

  • Disk space used for blocks that represent intermediate shuffle outputs.

  • Disk space used for blocks that represent broadcasts.

Some ways to use the data:

  • Identify job failures due to increased disk usage.

  • Identify large partitions resulting in spilling or shuffling.

  • Increase provisioned DPU capacity to correct these issues.

glue.driver.ExecutorAllocationManager.executors.numberAllExecutors

The number of actively running job executors.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (gauge).

Valid Statistics: Average. This is a Spark metric, reported as an absolute value.

Unit: Count

Can be used to monitor:

  • Job activity.

  • Straggling executors (with only a few executors running).

  • Current executor-level parallelism.

Some ways to use the data:

  • Repartition or decompress large input files beforehand if cluster is under-utilized.

  • Identify stage or job execution delays due to straggler scenarios.

  • Compare with numberMaxNeededExecutors to understand the backlog for provisioning more DPUs.

glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors

The number of maximum (actively running and pending) job executors needed to satisfy the current load.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (gauge).

Valid Statistics: Maximum. This is a Spark metric, reported as an absolute value.

Unit: Count

Can be used to monitor:

  • Job activity.

  • Current executor-level parallelism and backlog of pending tasks not yet scheduled because of unavailable executors due to DPU capacity or killed/failed executors.

Some ways to use the data:

  • Identify pending/backlog of scheduling queue.

  • Identify stage or job execution delays due to straggler scenarios.

  • Compare with numberAllExecutors to understand backlog for provisioning more DPUs.

  • Increase provisioned DPU capacity to correct the pending executor backlog.
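
The comparison with numberAllExecutors can be sketched as follows; the gauge readings are hypothetical:

```python
# Pending-executor backlog: how many more executors the job could use
# beyond those actively running. Readings are hypothetical gauge values
# from numberAllExecutors and numberMaxNeededExecutors.

def executor_backlog(number_all, number_max_needed):
    """Executors still needed beyond those currently running."""
    return max(0, number_max_needed - number_all)

print(executor_backlog(number_all=9, number_max_needed=14))  # backlog of 5
```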

glue.driver.jvm.heap.usage

glue.executorId.jvm.heap.usage

glue.ALL.jvm.heap.usage

The fraction of JVM heap memory used (scale: 0-1) by the driver, an executor identified by executorId, or ALL executors.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (gauge).

Valid Statistics: Average. This is a Spark metric, reported as an absolute value.

Unit: Percentage

Can be used to monitor:

  • Driver out-of-memory conditions (OOM) using glue.driver.jvm.heap.usage.

  • Executor out-of-memory conditions (OOM) using glue.ALL.jvm.heap.usage.

Some ways to use the data:

  • Identify memory-consuming executor ids and stages.

  • Identify straggling executor ids and stages.

  • Identify a driver out-of-memory condition (OOM).

  • Identify an executor out-of-memory condition (OOM) and obtain the corresponding executor ID so as to be able to get a stack trace from the executor log.

  • Identify files or partitions that may have data skew resulting in stragglers or out-of-memory conditions (OOMs).
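
A minimal sketch of such an OOM check, using hypothetical gauge values and an assumed 0.95 alarm threshold (not an AWS default):

```python
# Hypothetical heap-usage gauges (scale 0-1), keyed by the metric's
# scope: "driver" or an executorId.
heap_usage = {"driver": 0.55, "1": 0.97, "2": 0.61}

OOM_THRESHOLD = 0.95  # assumed alarm level; tune to your workload

# Scopes whose executor logs deserve a stack-trace check.
at_risk = sorted(k for k, v in heap_usage.items() if v >= OOM_THRESHOLD)
print(at_risk)
```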

glue.driver.jvm.heap.used

glue.executorId.jvm.heap.used

glue.ALL.jvm.heap.used

The number of memory bytes used by the JVM heap for the driver, the executor identified by executorId, or ALL executors.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (gauge).

Valid Statistics: Average. This is a Spark metric, reported as an absolute value.

Unit: Bytes

Can be used to monitor:

  • Driver out-of-memory conditions (OOM).

  • Executor out-of-memory conditions (OOM).

Some ways to use the data:

  • Identify memory-consuming executor ids and stages.

  • Identify straggling executor ids and stages.

  • Identify a driver out-of-memory condition (OOM).

  • Identify an executor out-of-memory condition (OOM) and obtain the corresponding executor ID so as to be able to get a stack trace from the executor log.

  • Identify files or partitions that may have data skew resulting in stragglers or out-of-memory conditions (OOMs).

glue.driver.s3.filesystem.read_bytes

glue.executorId.s3.filesystem.read_bytes

glue.ALL.s3.filesystem.read_bytes

The number of bytes read from Amazon S3 by the driver, an executor identified by executorId, or ALL executors since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes read during the previous minute).

Valid dimensions: JobName, JobRunId, and Type (gauge).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard a SUM statistic is used for aggregation. The area under the curve on the AWS Glue Metrics Dashboard can be used to visually compare bytes read by two different job runs.

Unit: Bytes.

Can be used to monitor:

  • ETL data movement.

  • Job progress.

  • Job bookmark issues (data processed, reprocessed, and skipped).

  • Comparison of reads to ingestion rate from external data sources.

  • Variance across job runs.

Resulting data can be used for:

  • DPU capacity planning.

  • Setting alarms for large spikes or dips in data read for job runs and job stages.
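
Comparing the area under the curve for two job runs can be sketched with hypothetical per-minute SUM values:

```python
# Hypothetical one-minute SUM values of glue.ALL.s3.filesystem.read_bytes
# for two runs of the same job. The totals are the "area under the curve";
# a large gap may point at a bookmark skip or an upstream data dip.
run_a = [5_000_000, 7_000_000, 6_000_000]
run_b = [5_000_000, 2_000_000, 1_000_000]

total_a, total_b = sum(run_a), sum(run_b)
ratio = total_b / total_a
print(total_a, total_b, round(ratio, 2))
```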

glue.driver.s3.filesystem.write_bytes

glue.executorId.s3.filesystem.write_bytes

glue.ALL.s3.filesystem.write_bytes

The number of bytes written to Amazon S3 by the driver, an executor identified by executorId, or ALL executors since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes written during the previous minute).

Valid dimensions: JobName, JobRunId, and Type (gauge).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard a SUM statistic is used for aggregation. The area under the curve on the AWS Glue Metrics Dashboard can be used to visually compare bytes written by two different job runs.

Unit: Bytes

Can be used to monitor:

  • ETL data movement.

  • Job progress.

  • Job bookmark issues (data processed, reprocessed, and skipped).

  • Comparison of reads to ingestion rate from external data sources.

  • Variance across job runs.

Some ways to use the data:

  • DPU capacity planning.

  • Setting alarms for large spikes or dips in data written for job runs and job stages.

glue.driver.system.cpuSystemLoad

glue.executorId.system.cpuSystemLoad

glue.ALL.system.cpuSystemLoad

The fraction of CPU system load used (scale: 0-1) by the driver, an executor identified by executorId, or ALL executors.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (gauge).

Valid Statistics: Average. This metric is reported as an absolute value.

Unit: Percentage

Can be used to monitor:

  • Driver CPU load.

  • Executor CPU load.

  • Detecting CPU-bound or IO-bound executors or stages in a Job.

Some ways to use the data:

  • DPU capacity planning along with I/O metrics (bytes read / shuffle bytes, task parallelism) and the number of maximum needed executors metric.

  • Identify the CPU/IO-bound ratio. This allows for repartitioning and increasing provisioned capacity for long-running jobs with splittable datasets that have lower CPU utilization.
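
A rough sketch of classifying an executor stage, using hypothetical cpuSystemLoad samples and an assumed 0.5 cutoff (not an AWS default):

```python
# Low CPU alongside steady reads suggests an I/O-bound stage: a
# candidate for repartitioning splittable input and adding DPU capacity.
cpu_load = [0.15, 0.20, 0.18]  # hypothetical cpuSystemLoad samples (0-1)

avg_cpu = sum(cpu_load) / len(cpu_load)
label = "io-bound" if avg_cpu < 0.5 else "cpu-bound"  # assumed cutoff
print(label)
```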

Dimensions for AWS Glue Metrics

AWS Glue metrics use the AWS Glue namespace and provide metrics for the following dimensions:


JobName

This dimension filters for metrics of all job runs of a specific AWS Glue job.

JobRunId

This dimension filters for metrics of a specific AWS Glue job run by a JobRun ID, or ALL.

Type

This dimension filters for metrics by either count (an aggregate number) or gauge (a value at a point in time).
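
As a sketch, these dimensions can be combined into a request for CloudWatch's GetMetricStatistics API (for example via a boto3 cloudwatch client). The job name my-etl-job is a placeholder, and no AWS call is made here:

```python
from datetime import datetime, timedelta, timezone

# Parameters you might pass as
# boto3.client("cloudwatch").get_metric_statistics(**params).
end = datetime.now(timezone.utc)
params = {
    "Namespace": "Glue",
    "MetricName": "glue.driver.aggregate.bytesRead",
    "Dimensions": [
        {"Name": "JobName", "Value": "my-etl-job"},   # placeholder job
        {"Name": "JobRunId", "Value": "ALL"},         # all runs of the job
        {"Name": "Type", "Value": "count"},           # delta metric
    ],
    "StartTime": end - timedelta(hours=1),
    "EndTime": end,
    "Period": 60,           # dashboard granularity: one minute
    "Statistics": ["Sum"],  # SUM for delta (count) metrics
}
print(sorted(d["Name"] for d in params["Dimensions"]))
```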

For more information, see the Amazon CloudWatch User Guide.