Monitoring Ray jobs with metrics - Amazon Glue

Monitoring Ray jobs with metrics

You can monitor Ray jobs using Amazon Glue Studio and Amazon CloudWatch. CloudWatch collects and processes raw metrics from Amazon Glue for Ray and makes them available for analysis. Amazon Glue Studio visualizes these metrics in its console, so you can monitor your job as it runs.

For a general overview of how to monitor Amazon Glue, see Monitoring Amazon Glue using Amazon CloudWatch metrics. For a general overview of how to use CloudWatch metrics that are published by Amazon Glue, see Monitoring with Amazon CloudWatch.

Monitoring Ray jobs in the Amazon Glue console

On the details page for a job run, below the Run details section, you can view pre-built aggregated graphs that visualize your available job metrics. Amazon Glue Studio sends job metrics to CloudWatch for every job run. With these metrics, you can build a profile of your cluster and tasks, and access detailed information about each node.

For more information about available metrics graphs, see Viewing Amazon CloudWatch metrics for a Ray job run.

Overview of Ray jobs metrics in CloudWatch

We publish Ray metrics when detailed monitoring is enabled in CloudWatch. Metrics are published to the Glue/Ray CloudWatch namespace.

  • Instance metrics

    We publish metrics about the CPU, memory, and disk utilization of the instances assigned to a job. These metrics are identified by dimensions such as ExecutorId, ExecutorType, and host. They are a subset of the standard Linux CloudWatch agent metrics. You can find information about metric names and dimensions in the CloudWatch documentation. For more information, see Metrics collected by the CloudWatch agent.

  • Ray cluster metrics

    We forward metrics from the Ray processes that run your script to this namespace, and surface the ones that are most critical for monitoring your job. The metrics that are available might differ by Ray version. For more information about which Ray version your job is running, see Amazon Glue versions.

    Ray collects metrics at the instance level. It also provides metrics for tasks and the cluster. For more information about Ray's underlying metric strategy, see Metrics in the Ray documentation.
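To see which metrics a job has actually published, you can list the contents of the Glue/Ray namespace. The following is a minimal sketch, assuming the boto3 CloudWatch client and its list_metrics paginator. The helper name list_ray_metrics is hypothetical, and the metric names it returns depend on your Ray version and on which agent metrics are published.

```python
from collections import defaultdict

RAY_NAMESPACE = "Glue/Ray"  # namespace named in this section


def list_ray_metrics(cloudwatch):
    """Return {metric name: [dimension dicts]} for the Glue/Ray namespace.

    `cloudwatch` is expected to behave like boto3.client("cloudwatch").
    """
    metrics = defaultdict(list)
    paginator = cloudwatch.get_paginator("list_metrics")
    for page in paginator.paginate(Namespace=RAY_NAMESPACE):
        for metric in page["Metrics"]:
            # Flatten [{"Name": ..., "Value": ...}] into a plain dict,
            # for example {"ExecutorId": "1", "host": "..."}.
            dims = {d["Name"]: d["Value"] for d in metric["Dimensions"]}
            metrics[metric["MetricName"]].append(dims)
    return dict(metrics)


# Usage (requires boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   found = list_ray_metrics(boto3.client("cloudwatch"))
#   for name in sorted(found):
#       print(name, len(found[name]), "dimension combinations")
```

Each entry in the result maps a metric name to the dimension combinations it was published with, which is a quick way to check which ExecutorId and host values exist before you build a query.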

Note

We don't publish Ray metrics to the Glue/Job Metrics/ namespace, which is only used for Amazon Glue ETL jobs.
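
Because Ray metrics are not in the ETL namespace, a query has to name Glue/Ray explicitly. The following is a rough sketch of such a query, assuming the boto3 CloudWatch client and its get_metric_statistics call. The helper names are hypothetical. The metric name mem_used_percent is a standard CloudWatch agent metric used here only as an illustration, and whether it is published for your job is an assumption.

```python
from datetime import datetime, timedelta, timezone


def build_stats_request(metric_name, dimensions, minutes=60, period=300, now=None):
    """Build keyword arguments for get_metric_statistics on Glue/Ray."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "Glue/Ray",  # not the Amazon Glue ETL namespace
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": name, "Value": value}
            for name, value in sorted(dimensions.items())
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,  # seconds per datapoint
        "Statistics": ["Average", "Maximum"],
    }


def fetch_datapoints(cloudwatch, request):
    """Run the query and return datapoints ordered by timestamp."""
    response = cloudwatch.get_metric_statistics(**request)
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


# Usage (requires boto3 and AWS credentials; dimension values are placeholders):
#   import boto3
#   request = build_stats_request("mem_used_percent", {"ExecutorId": "1"})
#   for point in fetch_datapoints(boto3.client("cloudwatch"), request):
#       print(point["Timestamp"], point["Average"], point["Maximum"])
```

The dimensions you pass must match a combination that the metric was actually published with, otherwise CloudWatch returns no datapoints.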