
Monitoring Amazon Glue streaming jobs

Monitoring your streaming job is a critical part of building your ETL pipeline. In addition to using the Spark UI, you can use Amazon CloudWatch to monitor the metrics. The following sections describe the streaming metrics emitted by the Amazon Glue framework. For a complete list of all the Amazon Glue metrics, see Monitoring Amazon Glue using Amazon CloudWatch metrics.

Amazon Glue uses a structured streaming framework to process the input events. You can either use the Spark API directly in your code or use the forEachBatch function provided by GlueContext, which publishes these metrics. To understand these metrics, you first need to understand windowSize.

windowSize: windowSize is the micro-batch interval that you provide. If you specify a window size of 60 seconds, the Amazon Glue streaming job waits for 60 seconds (or longer, if the previous batch hasn't completed by then) before it reads a batch of data from the streaming source and applies the transformations provided in forEachBatch. This is also referred to as the trigger interval.
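
The following is a minimal sketch of how these pieces fit together: a streaming source read through GlueContext and processed in micro-batches with forEachBatch, using windowSize as the trigger interval. The database, table, and checkpoint locations are hypothetical placeholders.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from a Data Catalog table that points to a streaming source
# (database and table names are placeholders).
source_df = glue_context.create_data_frame.from_catalog(
    database="my_streaming_db",
    table_name="my_kinesis_table",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"},
)

def process_batch(data_frame, batch_id):
    # Transformations applied to each micro-batch. The streaming metrics
    # (streaming.numRecords, streaming.batchProcessingTimeInMs, and so on)
    # are published around this processing.
    if data_frame.count() > 0:
        data_frame.show(5)

# windowSize is the trigger interval: one micro-batch is read roughly every 60 seconds.
glue_context.forEachBatch(
    frame=source_df,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",
        "checkpointLocation": "s3://amzn-s3-demo-bucket/checkpoints/",  # placeholder bucket
    },
)

job.commit()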

Let's review the metrics in greater detail to understand their health and performance characteristics.

Note

The metrics are emitted every 30 seconds. If your windowSize is less than 30 seconds, then the reported metrics are an aggregation. For example, say your windowSize is 10 seconds and you are steadily processing 20 records per micro-batch. In this scenario, the emitted metric value for numRecords would be 60.

A metric is not emitted if there is no data available for it. Also, in the case of the consumer lag metric, you have to enable the feature to get metrics for it.

Visualizing metrics

To plot visual metrics:

  1. Go to Metrics in the Amazon CloudWatch console, choose the Browse tab, and then choose Glue under "Custom namespaces".

    The screenshot shows accessing metrics in the Amazon CloudWatch console when monitoring Amazon Glue streaming jobs.
  2. Choose Job Metrics to show you the metrics for all your jobs.

  3. Filter the metrics by JobName=glue-feb-monitoring and then JobRunId=ALL. You can choose the "+" sign, as shown in the figure below, to add it to the search filter.

  4. Select the checkbox for the metrics that you are interested in. In the figure below, we have selected numberAllExecutors and numberMaxNeededExecutors.

    The screenshot shows applying an average to metrics when monitoring streaming jobs.
  5. Once you have selected these metrics, you can go to the Graphed metrics tab and apply your statistics.

  6. Since the metrics are emitted every minute, you can apply the "average" over a minute for batchProcessingTimeInMs and maxConsumerLagInMs. For numRecords, you can apply the "sum" over every minute.

  7. You can add a horizontal windowSize annotation to your graph using the Options tab.

    The screenshot shows adding a windowSize annotation to your graph when monitoring streaming jobs.
  8. Once you have your metrics selected, create a dashboard and add them to it. Here is a sample dashboard.

    The screenshot shows a sample dashboard for monitoring streaming jobs.
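
If you want to pull the same numbers programmatically instead of (or in addition to) graphing them in the console, a sketch like the following retrieves the per-minute sum of streaming.numRecords with boto3. The job name matches the example above; the Type dimension value is an assumption, so verify the dimensions that your job actually publishes.

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Per-minute sum of records processed by the example job over the last hour.
response = cloudwatch.get_metric_statistics(
    Namespace="Glue",
    MetricName="streaming.numRecords",
    Dimensions=[
        {"Name": "JobName", "Value": "glue-feb-monitoring"},
        {"Name": "JobRunId", "Value": "ALL"},
        {"Name": "Type", "Value": "count"},  # assumption: check the dimensions shown in the console
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Sum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])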

Metrics deep dive

This section describes each of the metrics and how they correlate with each other.

Number of records (metric: streaming.numRecords)

This metric indicates how many records are being processed.

The screenshot shows monitoring the number of records in streaming jobs.

This streaming metric provides visibility into the number of records you are processing in a window. Along with the number of records being processed, it also helps you understand the behavior of the input traffic.

  • Indicator #1 shows an example of stable traffic without any bursts. Typically, this comes from applications like IoT sensors that collect data at regular intervals and send it to the streaming source.

  • Indicator #2 shows an example of a sudden burst in traffic on an otherwise stable load. This could happen in a clickstream application when there is a marketing event like Black Friday and there is a burst in the number of clicks.

  • Indicator #3 shows an example of unpredictable traffic. Unpredictable traffic doesn't mean there is a problem. It is just the nature of the input data. Going back to the IoT sensor example, you can think of hundreds of sensors that are sending weather change events to the streaming source. As the weather change is not predictable, neither is the data. Understanding the traffic pattern is key to sizing your executors. If the input is very spiky, you may consider using autoscaling (more on that later).

The screenshot shows monitoring using the number of records and Kinesis PutRecords metrics in streaming jobs.

You can combine this metric with the Kinesis PutRecords metric to make sure the number of events being ingested and the number of records being read are nearly the same. This is especially useful when you are trying to understand lag. As the ingestion rate increases, so do the numRecords read by Amazon Glue.

Batch processing time (metric: streaming.batchProcessingTimeInMs)

The batch processing time metric helps you determine if the cluster is underprovisioned or overprovisioned.

The screenshot shows monitoring batch processing time in streaming jobs.

This metric indicates the number of milliseconds that it took to process each micro-batch of records. The main goal here is to monitor this time to make sure it is less than the windowSize interval. It is okay if batchProcessingTimeInMs goes over temporarily, as long as it recovers within the following window interval. Indicator #1 shows a more or less stable time taken to process the job. However, if the number of input records is increasing, the time it takes to process the job will increase as well, as shown by indicator #2. If numRecords is not going up but the processing time is going up, then you need to take a deeper look into the job processing on the executors. It is a good practice to set a threshold and alarm to make sure batchProcessingTimeInMs doesn't stay over 120% of your windowSize for more than 10 minutes. For more information on setting alarms, see Using Amazon CloudWatch alarms.
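
As a sketch of such an alarm, the following boto3 call assumes a 60-second windowSize, so the threshold of 72,000 ms corresponds to 120% of it. The alarm name, job name, and Type dimension value are example assumptions.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm if the average batch processing time stays above 120% of a 60-second
# windowSize (72,000 ms) for 10 consecutive one-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="glue-streaming-batch-processing-time",  # example name
    Namespace="Glue",
    MetricName="streaming.batchProcessingTimeInMs",
    Dimensions=[
        {"Name": "JobName", "Value": "glue-feb-monitoring"},
        {"Name": "JobRunId", "Value": "ALL"},
        {"Name": "Type", "Value": "gauge"},  # assumption: check the dimensions shown in the console
    ],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=10,
    Threshold=72000,  # 1.2 * 60,000 ms windowSize (example)
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)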

Consumer lag (metric: streaming.maxConsumerLagInMs)

The consumer lag metric helps you understand whether there is a lag in processing events. If your lag is too high, you could miss the processing SLA that your business depends on, even though you have a correct windowSize. You have to explicitly enable this metric using the emitConsumerLagMetrics connection option. For more information, see KinesisStreamingSourceOptions.

The screenshot shows monitoring lag in streaming jobs.
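
A minimal sketch of enabling the option on a Kinesis source follows; the stream ARN is a placeholder, and KinesisStreamingSourceOptions remains the authoritative reference for the option names.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Consumer lag metrics are opt-in; the stream ARN below is a placeholder.
source_df = glue_context.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options={
        "streamARN": "arn:aws:kinesis:us-east-1:111122223333:stream/my-example-stream",
        "startingPosition": "TRIM_HORIZON",
        "inferSchema": "true",
        "emitConsumerLagMetrics": "true",  # required for streaming.maxConsumerLagInMs
    },
)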

Derived metrics

To gain deeper insights, you can create derived metrics to understand more about your streaming jobs in Amazon CloudWatch.

The screenshot shows monitoring derived metrics in streaming jobs.

You can build a graph with derived metrics to decide if you need to use more DPUs. While autoscaling helps you do this automatically, you can use derived metrics to determine if autoscaling is working effectively.

  • InputRecordsPerSecond indicates the rate at which you are getting input records. It is derived as follows: number of input records (glue.driver.streaming.numRecords) / windowSize (in seconds).

  • ProcessingRecordsPerSecond indicates the rate at which your records are being processed. It is derived as follows: number of input records (glue.driver.streaming.numRecords) / (batchProcessingTimeInMs / 1000).

If the input rate is higher than the processing rate, then you may need to add more capacity to process your job or increase the parallelism.
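
One way to build these derived metrics is with CloudWatch metric math, as in the following sketch. The job name, the Type dimension values, and the 60-second windowSize are example assumptions.

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

def glue_dimensions(metric_type):
    # The Type dimension values are assumptions; verify them in the CloudWatch console.
    return [
        {"Name": "JobName", "Value": "glue-feb-monitoring"},
        {"Name": "JobRunId", "Value": "ALL"},
        {"Name": "Type", "Value": metric_type},
    ]

response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "records",
            "MetricStat": {
                "Metric": {
                    "Namespace": "Glue",
                    "MetricName": "streaming.numRecords",
                    "Dimensions": glue_dimensions("count"),
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {
            "Id": "batch_time_ms",
            "MetricStat": {
                "Metric": {
                    "Namespace": "Glue",
                    "MetricName": "streaming.batchProcessingTimeInMs",
                    "Dimensions": glue_dimensions("gauge"),
                },
                "Period": 60,
                "Stat": "Average",
            },
            "ReturnData": False,
        },
        # Derived rates; 60 is the assumed windowSize in seconds.
        {"Id": "input_rps", "Expression": "records / 60", "Label": "InputRecordsPerSecond"},
        {"Id": "processing_rps", "Expression": "records / (batch_time_ms / 1000)", "Label": "ProcessingRecordsPerSecond"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
)

for result in response["MetricDataResults"]:
    print(result["Label"], list(zip(result["Timestamps"], result["Values"])))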

Autoscaling metrics

When your input traffic is spiky, you should consider enabling autoscaling and specifying the maximum number of workers. With that, you get two additional metrics, numberAllExecutors and numberMaxNeededExecutors.

  • numberAllExecutors is the number of actively running job executors.

  • numberMaxNeededExecutors is the number of maximum (actively running and pending) job executors needed to satisfy the current load.

These two metrics will help you understand if your autoscaling is working correctly.

The screenshot shows monitoring autoscaling in streaming jobs.

Amazon Glue monitors the batchProcessingTimeInMs metric over a few micro-batches and does one of two things: it scales out the executors if batchProcessingTimeInMs is close to the windowSize, or it scales in the executors if batchProcessingTimeInMs is comparatively lower than the windowSize. Also, it uses an algorithm for step-scaling the executors.

  • Indicator #1 shows how the active executors scaled out to catch up with the max needed executors in order to process the load.

  • Indicator #2 shows how the active executors scaled in because the batchProcessingTimeInMs was low.

You can use these metrics to monitor current executor-level parallelism and adjust the number of max workers in your auto-scaling configuration accordingly.
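
A sketch of enabling the feature with boto3 follows. It assumes autoscaling is turned on through the --enable-auto-scaling job parameter, with NumberOfWorkers acting as the maximum number of workers; the job name, role, script location, and versions are placeholders. Note that update_job overwrites the job definition, so carry over your existing settings.

import boto3

glue = boto3.client("glue")

# Enable autoscaling on an existing streaming job (all values are examples).
glue.update_job(
    JobName="glue-feb-monitoring",
    JobUpdate={
        "Role": "arn:aws:iam::111122223333:role/GlueStreamingJobRole",  # placeholder role
        "Command": {
            "Name": "gluestreaming",
            "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/streaming_job.py",  # placeholder script
        },
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": 10,  # acts as the maximum when autoscaling is on
        "DefaultArguments": {"--enable-auto-scaling": "true"},
    },
)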

How to get the best performance

Spark tries to create one task per shard to read from in the Amazon Kinesis stream. The data in each shard becomes a partition. Spark then distributes these tasks across the executors/workers, depending on the number of cores on each worker (the number of cores per worker depends on the worker type you select, such as G.025X or G.1X). However, how the tasks are distributed is non-deterministic. All tasks run in parallel on their respective cores. If there are more shards than available executor cores, the tasks are queued up.

You can use a combination of the above metrics and the number of shards to provision your executors for a stable load, with some room for bursts. It is recommended that you run a few iterations of your job to determine the approximate number of workers. For an unstable or spiky workload, you can do the same by setting up autoscaling and max workers.

Set the windowSize according to the SLA requirement of your business. For example, if your business requires that the processed data can be no more than 120 seconds stale, then set your windowSize to at most 60 seconds so that your average consumer lag is less than 120 seconds (refer to the section on consumer lag above). From there, depending on the numRecords and the number of shards, plan for the capacity in DPUs, making sure your batchProcessingTimeInMs is less than 70% of your windowSize most of the time.
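
To make the sizing guidance concrete, here is a small illustrative calculation; every input value is an example assumption rather than a recommendation.

# Illustrative capacity-planning arithmetic; all inputs are example assumptions.
window_size_sec = 60                 # trigger interval
observed_batch_time_ms = 38_000      # average streaming.batchProcessingTimeInMs
observed_num_records = 24_000        # streaming.numRecords per micro-batch

# Rule of thumb from above: batch processing time should stay under 70% of windowSize.
target_batch_time_ms = 0.7 * window_size_sec * 1000
print(f"Target batch time: {target_batch_time_ms:.0f} ms, observed: {observed_batch_time_ms} ms")
print("Within budget" if observed_batch_time_ms <= target_batch_time_ms else "Consider more DPUs or shards")

# Rough worst-case staleness: one windowSize of waiting plus one batch of processing.
worst_case_staleness_sec = window_size_sec + observed_batch_time_ms / 1000
print(f"Approximate worst-case staleness: {worst_case_staleness_sec:.0f} s (example SLA: 120 s)")

# Derived throughput rates, matching the derived metrics described earlier.
input_rps = observed_num_records / window_size_sec
processing_rps = observed_num_records / (observed_batch_time_ms / 1000)
print(f"InputRecordsPerSecond: {input_rps:.0f}, ProcessingRecordsPerSecond: {processing_rps:.0f}")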

Note

Hot shards can cause data skew, which means that some shards/partitions are much bigger than others. This can cause some tasks that run in parallel to take longer, resulting in straggler tasks. As a result, the next batch can't start until all tasks from the previous one complete. This will impact the batchProcessingTimeInMs and the maximum consumer lag.