
Generative AI troubleshooting for Apache Spark in Amazon Glue

The generative AI troubleshooting for Apache Spark preview is available for jobs running on Amazon Glue 4.0, and in the following Amazon Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), US West (N. California), Europe (Ireland), Europe (Stockholm), Asia Pacific (Tokyo), Asia Pacific (Mumbai), and Asia Pacific (Sydney). Preview features are subject to change.

Generative AI Troubleshooting for Apache Spark jobs in Amazon Glue is a new capability that helps data engineers and scientists diagnose and fix issues in their Spark applications. Using machine learning and generative AI, it analyzes issues in Spark jobs and provides a detailed root cause analysis along with actionable recommendations to resolve them.

How does Generative AI Troubleshooting for Apache Spark work?

For your failed Spark jobs, Generative AI Troubleshooting analyzes the job metadata, together with the metrics and logs associated with the job's error signature, to generate a root cause analysis. It then recommends specific solutions and best practices to help address the failure.

Setting up Generative AI Troubleshooting for Apache Spark for your jobs

Note

During preview, this feature helps troubleshoot Amazon Glue 4.0 jobs that fail within the first 30 minutes of their execution time.
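The preview limits above can be expressed as a small pre-check. The following is a minimal sketch, not part of the feature itself: it assumes a job run is represented as a dictionary shaped like the `JobRun` objects returned by the Amazon Glue `GetJobRuns` API (`JobRunState`, `GlueVersion`, and `ExecutionTime` in seconds), and it assumes that `ExecutionTime` is a reasonable stand-in for "the first 30 minutes of execution". The function and constant names are hypothetical.

```python
from typing import Any, Mapping

# Preview limits stated in this document.
PREVIEW_GLUE_VERSION = "4.0"
PREVIEW_MAX_SECONDS = 30 * 60  # 30 minutes


def is_preview_candidate(job_run: Mapping[str, Any]) -> bool:
    """Return True if a job run looks eligible for troubleshooting during preview.

    Assumption: job_run mirrors a Glue JobRun dictionary, and ExecutionTime
    (seconds) approximates how long the job ran before it failed.
    """
    return (
        job_run.get("JobRunState") == "FAILED"
        and job_run.get("GlueVersion") == PREVIEW_GLUE_VERSION
        and job_run.get("ExecutionTime", 0) <= PREVIEW_MAX_SECONDS
    )


# Illustrative values only; not taken from a real job.
example_run = {"JobRunState": "FAILED", "GlueVersion": "4.0", "ExecutionTime": 420}
print(is_preview_candidate(example_run))
```

A check like this only filters out runs that fall outside the documented preview limits; it does not guarantee that an analysis will be produced for the remaining runs.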

Configuring IAM permissions

To use the APIs that Spark Troubleshooting calls for your jobs in Amazon Glue, your IAM identity needs the appropriate permissions. You can grant them by attaching the following custom policy to your IAM identity (such as a user, role, or group).

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:StartCompletion",
                "glue:GetCompletion"
            ],
            "Resource": [
                "arn:aws:glue:*:*:completion/*"
            ]
        }
    ]
}

Note

During preview, Spark Troubleshooting does not expose APIs that you can call programmatically through the Amazon SDK. The IAM policy lists two APIs, StartCompletion and GetCompletion, which enable this experience in the Amazon Glue Studio console.
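Although the troubleshooting APIs themselves are not callable through the SDK during preview, the IAM policy can still be managed in code. The sketch below builds the same policy document in Python and, optionally, attaches it to a role as an inline policy through the boto3 IAM `put_role_policy` call. The function names, the default policy name, and the `partition` parameter are assumptions added for illustration; confirm the correct ARN partition for your Region before using it.

```python
import json


def build_troubleshooting_policy(partition: str = "aws") -> dict:
    """Build the policy document shown above.

    Assumption: `partition` is a hypothetical parameter; the policy in this
    document uses "aws". Confirm the right value for your Region.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["glue:StartCompletion", "glue:GetCompletion"],
                "Resource": [f"arn:{partition}:glue:*:*:completion/*"],
            }
        ],
    }


def attach_inline_policy(role_name: str,
                         policy_name: str = "GlueSparkTroubleshooting") -> None:
    """Attach the policy to an IAM role as an inline policy.

    Requires boto3 and credentials that are allowed to call iam:PutRolePolicy.
    The role and policy names are placeholders chosen for this sketch.
    """
    import boto3  # imported lazily so the builder above works without boto3

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(build_troubleshooting_policy()),
    )


# Print the document for review before attaching it anywhere.
print(json.dumps(build_troubleshooting_policy(), indent=2))
```

Generating the document from code keeps the policy reviewable in version control; attaching it to a user or group instead of a role would use the corresponding IAM calls.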

Assigning permissions

To provide access, add permissions to your users, groups, or roles.

Running troubleshooting analysis from a failed job run

You can access the troubleshooting feature through multiple paths in the Amazon Glue console. Here's how to get started:

Option 1: From the Jobs List page

  1. Open the Amazon Glue console at https://console.aws.amazon.com/glue/.

  2. In the navigation pane, choose ETL Jobs.

  3. Locate your failed job in the jobs list.

  4. Select the Runs tab in the job details section.

  5. Choose the failed job run you want to analyze.

  6. Choose Troubleshoot with AI to start the analysis.

  7. When the troubleshooting analysis is complete, you can view the root-cause analysis and recommendations in the Troubleshooting analysis tab at the bottom of the screen.

Animation: a failed job run being analyzed end to end with the Troubleshoot with AI feature.

Option 2: Using the Job Run Monitoring page

  1. Navigate to the Job run monitoring page.

  2. Locate your failed job run.

  3. Choose the Actions drop-down menu.

  4. Choose Troubleshoot with AI.


Option 3: From the Job Run Details page

  1. Navigate to your failed job run's details page by either clicking View details on a failed run from the Runs tab or selecting the job run from the Job run monitoring page.

  2. In the job run details page, find the Troubleshooting analysis tab.
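Each console path above starts from a failed job run. If you want to find such runs outside the console first, the following sketch lists failed runs for a job using the Amazon Glue `GetJobRuns` API through a boto3 paginator. It only locates runs; it does not start a troubleshooting analysis, which during preview is available through the console. The helper names are hypothetical.

```python
from typing import Any, Iterator


def iter_failed_runs(glue_client: Any, job_name: str) -> Iterator[dict]:
    """Yield job runs whose state is FAILED for the given job.

    glue_client is expected to behave like boto3.client("glue").
    """
    paginator = glue_client.get_paginator("get_job_runs")
    for page in paginator.paginate(JobName=job_name):
        for run in page.get("JobRuns", []):
            if run.get("JobRunState") == "FAILED":
                yield run


def print_failed_runs(job_name: str) -> None:
    """Print the run ID and error message of each failed run.

    Requires boto3 and credentials that are allowed to call glue:GetJobRuns.
    """
    import boto3  # imported lazily so iter_failed_runs can be tested without it

    glue = boto3.client("glue")
    for run in iter_failed_runs(glue, job_name):
        print(run["Id"], run.get("ErrorMessage", ""))
```

For example, `print_failed_runs("my-etl-job")` (a placeholder job name) would print one line per failed run, which you can then open in the console and analyze with Troubleshoot with AI.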

Supported troubleshooting categories (preview)

This service focuses on three primary categories of issues that data engineers and developers frequently encounter in their Spark applications:

  • Resource setup and access errors: When running Spark applications in Amazon Glue, resource setup and access errors are among the most common yet challenging issues to diagnose. These errors often occur when your Spark application attempts to interact with Amazon resources but encounters permission issues, missing resources, or configuration problems.

  • Spark driver and executor memory issues: Memory-related errors in Apache Spark jobs can be complex to diagnose and resolve. These errors often manifest when your data processing requirements exceed the available memory resources, either on the driver node or executor nodes.

  • Spark disk capacity issues: Storage-related errors in Amazon Glue Spark jobs often emerge during shuffle operations, data spilling, or when dealing with large-scale data transformations. These errors can be particularly tricky because they might not manifest until your job has been running for a while, potentially wasting valuable compute time and resources.
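To make the three categories more concrete, the sketch below guesses a category from the text of an error message. This is only an illustrative heuristic and is not how the service performs its analysis: the marker strings are common error fragments chosen as assumptions for this example, and the function name is hypothetical.

```python
from typing import Optional

# Assumed, non-exhaustive markers for each category described above.
CATEGORY_MARKERS = {
    "resource setup and access": (
        "accessdenied",
        "access denied",
        "entitynotfoundexception",
        "nosuchbucket",
    ),
    "driver or executor memory": (
        "outofmemoryerror",
        "exceeding memory limits",
    ),
    "disk capacity": (
        "no space left on device",
    ),
}


def guess_category(error_message: str) -> Optional[str]:
    """Return the first category whose marker appears in the message, else None."""
    text = error_message.lower()
    for category, markers in CATEGORY_MARKERS.items():
        if any(marker in text for marker in markers):
            return category
    return None


print(guess_category("java.lang.OutOfMemoryError: Java heap space"))
print(guess_category("java.io.IOException: No space left on device"))
```

A keyword match like this can help triage a backlog of failed runs, but it cannot replace the root cause analysis, which also draws on job metadata, metrics, and logs.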

Note

Review any suggested changes thoroughly before implementing them in your production environment. The service provides recommendations based on patterns and best practices, but your specific use case might require additional considerations.