Amazon CloudWatch
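User Guide
AWS services or features described in AWS documentation may vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon AWS.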

AWS Glue Metrics and Dimensions

AWS Glue sends metrics to CloudWatch. For more information, see Monitoring AWS Glue Using CloudWatch Metrics in the AWS Glue Developer Guide.

AWS Glue Metrics

AWS Glue profiles and sends the following metrics to CloudWatch every 30 seconds, and the AWS Glue Metrics Dashboard reports them once a minute:

Metric Description

glue.driver.aggregate.bytesRead

The number of bytes read from all data sources by all completed Spark tasks running in all executors.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Bytes

Can be used to monitor:

  • Bytes read.

  • Job progress.

  • JDBC data sources.

  • Job Bookmark Issues.

  • Variance across Job Runs.

This metric can be used the same way as the glue.ALL.s3.filesystem.read_bytes metric, with the difference that this metric is updated at the end of a Spark task and captures non-S3 data sources as well.
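
Because this metric is reported as a delta from the last reported value, summing the datapoints gives a running total of bytes read. The following is a minimal sketch of that aggregation; the function name and the sample values are hypothetical, and the datapoints are assumed to have been retrieved from CloudWatch already.

```python
from itertools import accumulate


def cumulative_totals(deltas):
    """Turn per-report delta values into a running total."""
    return list(accumulate(deltas))


# Hypothetical delta datapoints (bytes read in each reporting interval).
sample_deltas = [1_000, 2_500, 0, 4_000]
print(cumulative_totals(sample_deltas))  # [1000, 3500, 3500, 7500]
```

A flat segment in the running total (a zero delta) means that no Spark task completed during that interval, because the metric is updated only at the end of a task.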

glue.driver.aggregate.elapsedTime

The ETL elapsed time in milliseconds (does not include the job bootstrap times).

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Milliseconds

Can be used to determine how long it takes a job run to run on average.

Some ways to use the data:

  • Set alarms for stragglers.

  • Measure variance across job runs.

glue.driver.aggregate.numCompletedStages

The number of completed stages in the job.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

  • Job progress.

  • Per-stage timeline of job execution, when correlated with other metrics.

Some ways to use the data:

  • Identify demanding stages in the execution of a job.

  • Set alarms for correlated spikes (demanding stages) across job runs.

glue.driver.aggregate.numCompletedTasks

The number of completed tasks in the job.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

  • Job progress.

  • Parallelism within a stage.

glue.driver.aggregate.numFailedTasks

The number of failed tasks.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

  • Data abnormalities that cause job tasks to fail.

  • Cluster abnormalities that cause job tasks to fail.

  • Script abnormalities that cause job tasks to fail.

The data can be used to set alarms for increased failures that might suggest abnormalities in data, cluster or scripts.
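
An alarm on increased task failures could be described as a set of request parameters. The sketch below only builds those parameters and does not call AWS. The alarm naming scheme, threshold, and period are hypothetical, and the namespace string "Glue" is an assumption that should be verified in the CloudWatch console for your account.

```python
def failed_tasks_alarm_params(job_name, threshold=1, period_seconds=60):
    """Build keyword arguments for a CloudWatch PutMetricAlarm request."""
    return {
        "AlarmName": f"glue-{job_name}-failed-tasks",  # hypothetical naming scheme
        "Namespace": "Glue",  # assumption: confirm the namespace in your account
        "MetricName": "glue.driver.aggregate.numFailedTasks",
        "Dimensions": [
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": "ALL"},
            {"Name": "Type", "Value": "count"},
        ],
        "Statistic": "Sum",  # the metric is a delta, so SUM is the valid statistic
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no datapoints while the job is idle
    }


params = failed_tasks_alarm_params("my-etl-job", threshold=5)
print(params["AlarmName"])  # glue-my-etl-job-failed-tasks
```

With the AWS SDK for Python installed, the resulting dictionary could be passed as boto3.client("cloudwatch").put_metric_alarm(**params).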

glue.driver.aggregate.numKilledTasks

The number of tasks killed.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

  • Abnormalities in Data Skew that result in exceptions (OOMs) that kill tasks.

  • Script abnormalities that result in exceptions (OOMs) that kill tasks.

Some ways to use the data:

  • Set alarms for increased failures indicating data abnormalities.

  • Set alarms for increased failures indicating cluster abnormalities.

  • Set alarms for increased failures indicating script abnormalities.

glue.driver.aggregate.recordsRead

The number of records read from all data sources by all completed Spark tasks running in all executors.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

  • Records read.

  • Job progress.

  • JDBC data sources.

  • Job Bookmark Issues.

  • Skew in Job Runs over days.

This metric can be used in a similar way to the glue.ALL.s3.filesystem.read_bytes metric, with the difference that this metric is updated at the end of a Spark task.

glue.driver.aggregate.shuffleBytesWritten

The number of bytes written by all executors to shuffle data between them since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes written for this purpose during the previous minute).

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Bytes

Can be used to monitor: Data shuffle in jobs (large joins, groupBy, repartition, coalesce).

Some ways to use the data:

  • Repartition or decompress large input files before further processing.

  • Repartition data more uniformly to avoid hot keys.

  • Pre-filter data before joins or groupBy operations.

glue.driver.aggregate.shuffleLocalBytesRead

The number of bytes read by all executors to shuffle data between them since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes read for this purpose during the previous minute).

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Bytes

Can be used to monitor: Data shuffle in jobs (large joins, groupBy, repartition, coalesce).

Some ways to use the data:

  • Repartition or decompress large input files before further processing.

  • Repartition data more uniformly to avoid hot keys.

  • Pre-filter data before joins or groupBy operations.

glue.driver.BlockManager.disk.diskSpaceUsed_MB

The number of megabytes of disk space used across all executors.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (gauge).

Valid Statistics: Average. This is a Spark metric, reported as an absolute value.

Unit: Megabytes

Can be used to monitor:

  • Disk space used for blocks that represent cached RDD partitions.

  • Disk space used for blocks that represent intermediate shuffle outputs.

  • Disk space used for blocks that represent broadcasts.

Some ways to use the data:

  • Identify job failures due to increased disk usage.

  • Identify large partitions resulting in spilling or shuffling.

  • Increase provisioned DPU capacity to correct these issues.

glue.driver.ExecutorAllocationManager.executors.numberAllExecutors

The number of actively running job executors.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (gauge).

Valid Statistics: Average. This is a Spark metric, reported as an absolute value.

Unit: Count

Can be used to monitor:

  • Job activity.

  • Straggling executors (with only a few executors running).

  • Current executor-level parallelism.

Some ways to use the data:

  • Repartition or decompress large input files beforehand if cluster is under-utilized.

  • Identify stage or job execution delays due to straggler scenarios.

  • Compare with numberMaxNeededExecutors to understand backlog for provisioning more DPUs.

glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors

The number of maximum (actively running and pending) job executors needed to satisfy the current load.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (gauge).

Valid Statistics: Maximum. This is a Spark metric, reported as an absolute value.

Unit: Count

Can be used to monitor:

  • Job activity.

  • Current executor-level parallelism and backlog of pending tasks not yet scheduled because of unavailable executors due to DPU capacity or killed/failed executors.

Some ways to use the data:

  • Identify pending/backlog of scheduling queue.

  • Identify stage or job execution delays due to straggler scenarios.

  • Compare with numberAllExecutors to understand backlog for provisioning more DPUs.

  • Increase provisioned DPU capacity to correct the pending executor backlog.
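
The comparison between numberMaxNeededExecutors and numberAllExecutors can be sketched as simple arithmetic. The helper names and sample values below are hypothetical, and the result is only a rough indicator of how far the job could scale out, not a sizing formula from AWS Glue.

```python
def executor_backlog(max_needed, all_executors):
    """Executors the current load could use beyond those already running."""
    return max(0, max_needed - all_executors)


def scale_out_factor(max_needed, all_executors):
    """Ratio of needed to running executors; values above 1 indicate a backlog."""
    if all_executors <= 0:
        raise ValueError("all_executors must be positive")
    return max_needed / all_executors


# Hypothetical gauge readings taken at the same point in time.
needed, running = 40, 10
print(executor_backlog(needed, running))  # 30
print(scale_out_factor(needed, running))  # 4.0
```

A factor that stays well above 1 for much of a job run suggests that additional DPUs could shorten the run, while a factor near 1 suggests the provisioned capacity already matches the load.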

glue.driver.jvm.heap.usage

glue.executorId.jvm.heap.usage

glue.ALL.jvm.heap.usage

The fraction of memory used by the JVM heap (scale: 0-1) for the driver, the executor identified by executorId, or ALL executors.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (gauge).

Valid Statistics: Average. This is a Spark metric, reported as an absolute value.

Unit: Percentage

Can be used to monitor:

  • Driver out-of-memory conditions (OOM) using glue.driver.jvm.heap.usage.

  • Executor out-of-memory conditions (OOM) using glue.ALL.jvm.heap.usage.

Some ways to use the data:

  • Identify memory-consuming executor ids and stages.

  • Identify straggling executor ids and stages.

  • Identify a driver out-of-memory condition (OOM).

  • Identify an executor out-of-memory condition (OOM) and obtain the corresponding executor ID so as to be able to get a stack trace from the executor log.

  • Identify files or partitions that may have data skew resulting in stragglers or out-of-memory conditions (OOMs).
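
To find the executor IDs worth inspecting in the executor logs, the per-executor heap usage values can be filtered against a threshold. This is a minimal sketch: the 0.9 threshold and the sample readings are hypothetical, and the input is assumed to map metric names to their most recent values.

```python
def executors_over_threshold(latest_usage, threshold=0.9):
    """Return executor IDs whose heap usage fraction exceeds the threshold.

    latest_usage maps metric names such as 'glue.3.jvm.heap.usage' to the
    most recent reported value (a fraction between 0 and 1).
    """
    flagged = []
    for metric_name, fraction in latest_usage.items():
        source = metric_name.split(".")[1]  # 'driver', 'ALL', or an executor ID
        if source in ("driver", "ALL"):
            continue  # only individual executors are of interest here
        if fraction > threshold:
            flagged.append(source)
    return sorted(flagged)


# Hypothetical latest readings for one job run.
sample = {
    "glue.driver.jvm.heap.usage": 0.35,
    "glue.ALL.jvm.heap.usage": 0.70,
    "glue.1.jvm.heap.usage": 0.42,
    "glue.2.jvm.heap.usage": 0.95,
    "glue.3.jvm.heap.usage": 0.91,
}
print(executors_over_threshold(sample))  # ['2', '3']
```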

glue.driver.jvm.heap.used

glue.executorId.jvm.heap.used

glue.ALL.jvm.heap.used

The number of memory bytes used by the JVM heap for the driver, the executor identified by executorId, or ALL executors.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (gauge).

Valid Statistics: Average. This is a Spark metric, reported as an absolute value.

Unit: Bytes

Can be used to monitor:

  • Driver out-of-memory conditions (OOM).

  • Executor out-of-memory conditions (OOM).

Some ways to use the data:

  • Identify memory-consuming executor ids and stages.

  • Identify straggling executor ids and stages.

  • Identify a driver out-of-memory condition (OOM).

  • Identify an executor out-of-memory condition (OOM) and obtain the corresponding executor ID so as to be able to get a stack trace from the executor log.

  • Identify files or partitions that may have data skew resulting in stragglers or out-of-memory conditions (OOMs).

glue.driver.s3.filesystem.read_bytes

glue.executorId.s3.filesystem.read_bytes

glue.ALL.s3.filesystem.read_bytes

The number of bytes read from Amazon S3 by the driver, an executor identified by executorId, or ALL executors since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes read during the previous minute).

Valid dimensions: JobName, JobRunId, and Type (gauge).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard a SUM statistic is used for aggregation. The area under the curve on the AWS Glue Metrics Dashboard can be used to visually compare bytes read by two different job runs.

Unit: Bytes

Can be used to monitor:

  • ETL data movement.

  • Job progress.

  • Job bookmark issues (data processed, reprocessed, and skipped).

  • Comparison of reads to ingestion rate from external data sources.

  • Variance across job runs.

Resulting data can be used for:

  • DPU capacity planning.

  • Setting alarms for large spikes or dips in data read for job runs and job stages.
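
Comparing the bytes read by two job runs amounts to comparing the sums of their delta datapoints, which is the numeric counterpart of comparing the areas under the two curves. The sketch below uses hypothetical helper names and sample values.

```python
def total_bytes(deltas):
    """Total bytes for a job run, given its per-interval delta datapoints."""
    return sum(deltas)


def relative_change(baseline, current):
    """Fractional change of current relative to baseline (-0.5 means 50% less)."""
    if baseline == 0:
        raise ValueError("baseline must be nonzero")
    return (current - baseline) / baseline


# Hypothetical delta datapoints for two runs of the same job.
previous_run = [100, 200, 300]
latest_run = [50, 100, 150]

change = relative_change(total_bytes(previous_run), total_bytes(latest_run))
print(change)  # -0.5
```

A large negative change could point to skipped input (for example, a job bookmark issue), and a large positive change could point to reprocessed data.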

glue.driver.s3.filesystem.write_bytes

glue.executorId.s3.filesystem.write_bytes

glue.ALL.s3.filesystem.write_bytes

The number of bytes written to Amazon S3 by the driver, an executor identified by executorId, or ALL executors since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes written during the previous minute).

Valid dimensions: JobName, JobRunId, and Type (gauge).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard a SUM statistic is used for aggregation. The area under the curve on the AWS Glue Metrics Dashboard can be used to visually compare bytes written by two different job runs.

Unit: Bytes

Can be used to monitor:

  • ETL data movement.

  • Job progress.

  • Job bookmark issues (data processed, reprocessed, and skipped).

  • Comparison of reads to ingestion rate from external data sources.

  • Variance across job runs.

Some ways to use the data:

  • DPU capacity planning.

  • Setting alarms for large spikes or dips in data written for job runs and job stages.

glue.driver.system.cpuSystemLoad

glue.executorId.system.cpuSystemLoad

glue.ALL.system.cpuSystemLoad

The fraction of CPU system load used (scale: 0-1) by the driver, an executor identified by executorId, or ALL executors.

Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), and Type (gauge).

Valid Statistics: Average. This metric is reported as an absolute value.

Unit: Percentage

Can be used to monitor:

  • Driver CPU load.

  • Executor CPU load.

  • Detecting CPU-bound or IO-bound executors or stages in a Job.

Some ways to use the data:

  • DPU capacity planning along with IO metrics (bytes read, shuffle bytes, task parallelism) and the number of maximum needed executors metric.

  • Identify the CPU/IO-bound ratio. This allows for repartitioning and increasing provisioned capacity for long-running jobs with splittable datasets having lower CPU utilization.

Dimensions for AWS Glue Metrics

AWS Glue metrics use the AWS Glue namespace and provide metrics for the following dimensions:

Dimension Description

JobName

This dimension filters for metrics of all job runs of a specific AWS Glue job.

JobRunId

This dimension filters for metrics of a specific AWS Glue job run by a JobRun ID, or ALL.

Type

This dimension filters for metrics by either count (an aggregate number) or gauge (a value at a point in time).
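
The three dimensions combine into a single metric query. The following sketch builds the parameters for a CloudWatch GetMetricStatistics request without calling AWS. The function name, defaults, and job name are hypothetical, and the namespace string "Glue" is an assumption that should be checked against the metrics listed in your account.

```python
from datetime import datetime, timedelta, timezone


def glue_metric_query(metric_name, job_name, job_run_id="ALL",
                      metric_type="gauge", statistic="Average",
                      minutes=60, now=None):
    """Build keyword arguments for a CloudWatch GetMetricStatistics request."""
    if metric_type not in ("count", "gauge"):
        raise ValueError("metric_type must be 'count' or 'gauge'")
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "Glue",  # assumption: confirm the namespace in your account
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": job_run_id},
            {"Name": "Type", "Value": metric_type},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # one datapoint per minute
        "Statistics": [statistic],
    }


query = glue_metric_query(
    "glue.driver.aggregate.bytesRead", "my-etl-job",
    metric_type="count", statistic="Sum",
)
print(query["Dimensions"])
```

With the AWS SDK for Python installed, the dictionary could be passed as boto3.client("cloudwatch").get_metric_statistics(**query).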