Enabling the Apache Spark web UI for Amazon Glue jobs
You can use the Apache Spark web UI to monitor and debug Amazon Glue ETL jobs running on the Amazon Glue job system. You can configure the Spark UI using the Amazon Glue console or the Amazon Command Line Interface (Amazon CLI).
Every 30 seconds, Amazon Glue backs up the Spark event logs to the Amazon S3 path that you specify.
Configuring the Spark UI (console)
Follow these steps to configure the Spark UI by using the Amazon Web Services Management Console. When you create an Amazon Glue job, the Spark UI is enabled by default.
To turn on the Spark UI when you create or edit a job
- Sign in to the Amazon Web Services Management Console and open the Amazon Glue console at https://console.amazonaws.cn/glue/.
- In the navigation pane, choose Jobs.
- Choose Add job, or select an existing one.
- In Job details, open the Advanced properties.
- Under the Spark UI tab, choose Write Spark UI logs to Amazon S3.
- Specify an Amazon S3 path for storing the Spark event logs for the job. Note that if you use a security configuration in the job, the encryption also applies to the Spark UI log file. For more information, see Encrypting data written by Amazon Glue.
- Under Spark UI logging and monitoring configuration:
  - Select Standard if you are generating logs to view in the Amazon Glue console.
  - Select Legacy if you are generating logs to view on a Spark history server.
  - You can also choose to generate both.
Configuring the Spark UI (Amazon CLI)
To generate logs that you can view with the Spark UI in the Amazon Glue console, use the Amazon CLI to pass the following job parameters to Amazon Glue jobs. For more information, see Using job parameters in Amazon Glue jobs.
'--enable-spark-ui': 'true', '--spark-event-logs-path': 's3://s3-event-log-path'
To distribute logs to their legacy locations, set the --enable-spark-ui-legacy-path parameter to "true". If you do not want to generate logs in both formats, remove the --enable-spark-ui parameter.
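The parameter combinations described above can be sketched as a small Python helper that builds the job-parameter dictionary. This is a minimal illustration, not part of Amazon Glue: the function name `spark_ui_arguments` and its `standard` and `legacy` flags are hypothetical, and only the parameter names come from the text.

```python
def spark_ui_arguments(event_log_path, standard=True, legacy=False):
    """Build Amazon Glue job parameters for Spark UI logging (hypothetical helper).

    standard: generate logs for viewing in the Amazon Glue console.
    legacy:   also distribute logs to their legacy locations
              (for viewing on a Spark history server).
    """
    if not (standard or legacy):
        raise ValueError("choose at least one of standard or legacy logging")
    if not event_log_path.startswith("s3://"):
        raise ValueError("event_log_path must be an Amazon S3 URI")

    args = {"--spark-event-logs-path": event_log_path}
    if standard:
        args["--enable-spark-ui"] = "true"
    if legacy:
        args["--enable-spark-ui-legacy-path"] = "true"
    return args


# Both formats: standard and legacy logs are generated.
both = spark_ui_arguments("s3://s3-event-log-path", legacy=True)
print(both)
```

A dictionary like this could then be supplied as the job's default arguments, for example through the `DefaultArguments` field of a job definition or the CLI's `--default-arguments` option.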
Configuring the Spark UI for sessions using Notebooks
Warning
Amazon Glue interactive sessions do not currently support Spark UI in the console. Configure a Spark history server.
If you use Amazon Glue notebooks, set up the Spark UI configuration before starting the session. To do this, use the %%configure cell magic:
%%configure { "--enable-spark-ui": "true", "--spark-event-logs-path": "s3://path" }
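The body of the %%configure magic is JSON, which requires straight double quotes; typographic quotes copied from a formatted page will not parse. One way to avoid hand-typing the payload is to generate the cell text with Python's `json` module. This is a sketch only: the helper name `render_configure_cell` is hypothetical, and the path is the placeholder used above.

```python
import json


def render_configure_cell(event_log_path):
    """Return the text of a %%configure cell enabling Spark UI logs (hypothetical helper)."""
    payload = {
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": event_log_path,
    }
    # json.dumps always emits straight double quotes, so the result is valid JSON.
    return "%%configure " + json.dumps(payload)


cell = render_configure_cell("s3://path")
print(cell)
```

Paste the printed line into the first cell of the notebook, before the session starts.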
Enable rolling logs
Enabling SparkUI and rolling log event files for Amazon Glue jobs provides several benefits:
- Rolling Log Event Files – With rolling log event files enabled, Amazon Glue generates separate log files for each step of the job execution, making it easier to identify and troubleshoot issues specific to a particular stage or transformation.
- Better Log Management – Rolling log event files help in managing log files more efficiently. Instead of having a single, potentially large log file, the logs are split into smaller, more manageable files based on the job execution stages. This can simplify log archiving, analysis, and troubleshooting.
- Improved Fault Tolerance – If an Amazon Glue job fails or is interrupted, the rolling log event files can provide valuable information about the last successful stage, making it easier to resume the job from that point rather than starting from scratch.
- Cost Optimization – By enabling rolling log event files, you can save on storage costs associated with log files. Instead of storing a single, potentially large log file, you store smaller, more manageable log files, which can be more cost-effective, especially for long-running or complex jobs.
In a new environment, users can explicitly enable rolling logs through:
'--conf': 'spark.eventLog.rolling.enabled=true'
or
'--conf': 'spark.eventLog.rolling.enabled=true --conf spark.eventLog.rolling.maxFileSize=128m'
When rolling logs are activated, spark.eventLog.rolling.maxFileSize specifies the maximum size of the event log file before it rolls over. This parameter is optional: the default value is 128 MB, and the minimum is 10 MB.
The combined size of all generated rolled log event files cannot exceed 2 GB. For Amazon Glue jobs without rolling log support, the maximum log event file size supported for the Spark UI is 0.5 GB.
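As an illustration of the rolling-log settings above, the following Python sketch assembles the `--conf` job parameter and rejects sizes below the stated 10 MB minimum. The helper name `rolling_log_conf` and its argument are hypothetical; the Spark property names and the size limits are taken from the text.

```python
MIN_ROLL_SIZE_MB = 10  # minimum maxFileSize stated in the text


def rolling_log_conf(max_file_size_mb=None):
    """Build the '--conf' job parameter that enables rolling event logs (hypothetical helper).

    If max_file_size_mb is None, the property is omitted and the
    documented default of 128 MB applies.
    """
    parts = ["spark.eventLog.rolling.enabled=true"]
    if max_file_size_mb is not None:
        if max_file_size_mb < MIN_ROLL_SIZE_MB:
            raise ValueError(
                f"maxFileSize must be at least {MIN_ROLL_SIZE_MB} MB, got {max_file_size_mb}"
            )
        parts.append(f"spark.eventLog.rolling.maxFileSize={max_file_size_mb}m")
    # Additional Spark properties are chained inside the single '--conf' value.
    return {"--conf": " --conf ".join(parts)}


print(rolling_log_conf())
print(rolling_log_conf(128))
```

Validating the size before submitting the job is a design choice for this sketch; it surfaces a misconfiguration locally rather than at run time.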
You can turn off rolling logs for a streaming job by passing an additional configuration. Note that very large log files may be costly to maintain.
To turn off rolling logs, provide the following configuration:
'--spark-ui-event-logs-path': 'true', '--conf': 'spark.eventLog.rolling.enabled=false'