Running Spark ETL jobs with reduced startup times
Amazon Glue versions 2.0 and later provide an upgraded infrastructure for running Apache Spark ETL (extract, transform, and load) jobs in Amazon Glue with reduced startup times. With the reduced wait times, data engineers can be more productive and increase their interactivity with Amazon Glue. The reduced variance in job start times can help you meet or exceed your SLAs of making data available for analytics.
To use this feature with your Amazon Glue ETL jobs, choose 2.0 or a later version for the Glue version when creating your jobs.
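If you create jobs programmatically rather than in the console, the same choice can be sketched as follows. This is a minimal illustration, assuming the request will later be passed to the boto3 Glue client's create_job call; the helper name build_job_request and the job name, role, and script location are placeholders, not values from this guide.

```python
def build_job_request(name, role, script_location, glue_version="2.0"):
    """Assemble keyword arguments for boto3's Glue create_job call.

    Nothing is sent to AWS here; the function only builds the request.
    """
    major = int(glue_version.split(".")[0])
    if major < 2:
        raise ValueError("reduced startup times require Glue version 2.0 or later")
    return {
        "Name": name,
        "Role": role,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": glue_version,
    }


# Placeholder values; replace them with your own job name, role, and script.
request = build_job_request(
    name="example-etl-job",
    role="ExampleGlueServiceRole",
    script_location="s3://example-bucket/scripts/job.py",
)
print(request["GlueVersion"])
```

To actually create the job, you would pass the dictionary to boto3.client("glue").create_job(**request), which requires valid credentials and permissions.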
New features supported
This section describes new features supported with Amazon Glue versions 2.0 and later.
Support for specifying additional Python modules at the job level
Amazon Glue versions 2.0 and later also let you provide additional Python modules, or different versions of existing modules, at the job level. You can use the --additional-python-modules option with a list of comma-separated Python modules to add a new module or change the version of an existing module.
For example, to update or add a new scikit-learn module, use the following key/value: "--additional-python-modules", "scikit-learn==0.21.3".
Also, within the --additional-python-modules option you can specify an Amazon S3 path to a Python wheel module. For example:
--additional-python-modules s3://aws-glue-native-spark/tests/j4.2/ephem-3.7.7.1-cp37-cp37m-linux_x86_64.whl,s3://aws-glue-native-spark/tests/j4.2/fbprophet-0.6-py3-none-any.whl,scikit-learn==0.21.3
Amazon Glue uses the Python Package Installer (pip3) to install the additional modules. You can pass additional options to pip3 by specifying them with the --python-modules-installer-option parameter. Any incompatibilities or limitations of pip3 apply.
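The comma-separated value can be assembled from a list of module specifiers. The following is a minimal sketch, not an official helper: the function name python_module_arguments is hypothetical, the installer option key is written with leading dashes like other job parameters (an assumption), and "--upgrade" is used only as an example of a pip3 option.

```python
def python_module_arguments(modules, installer_option=None):
    """Build job arguments for extra Python modules.

    `modules` may mix pinned package specifiers and S3 paths to wheel files.
    """
    specs = [m.strip() for m in modules if m.strip()]
    if not specs:
        raise ValueError("at least one module is required")
    arguments = {"--additional-python-modules": ",".join(specs)}
    if installer_option:
        # Assumed key spelling; the value is handed to pip3 as-is.
        arguments["--python-modules-installer-option"] = installer_option
    return arguments


module_args = python_module_arguments(
    [
        "s3://aws-glue-native-spark/tests/j4.2/fbprophet-0.6-py3-none-any.whl",
        "scikit-learn==0.21.3",
    ],
    installer_option="--upgrade",
)
for key, value in module_args.items():
    print(key, value)
```

The resulting dictionary is meant to be merged into the job's default arguments or into the arguments of a single job run.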
Python modules already provided in Amazon Glue version 2.0
Amazon Glue version 2.0 supports the following Python modules out of the box:
setuptools—45.2.0
subprocess32—3.5.4
ptvsd—4.3.2
pydevd—1.9.0
PyMySQL—0.9.3
docutils—0.15.2
jmespath—0.9.4
six—1.14.0
python_dateutil—2.8.1
urllib3—1.25.8
botocore—1.15.4
s3transfer—0.3.3
boto3—1.12.4
certifi—2019.11.28
chardet—3.0.4
idna—2.9
requests—2.23.0
pyparsing—2.4.6
enum34—1.1.9
pytz—2019.3
numpy—1.18.1
cycler—0.10.0
kiwisolver—1.1.0
scipy—1.4.1
pandas—1.0.1
pyarrow—0.16.0
matplotlib—3.1.3
pyhocon—0.3.54
mpmath—1.1.0
sympy—1.5.1
patsy—0.5.1
statsmodels—0.11.1
fsspec—0.6.2
s3fs—0.4.0
Cython—0.29.15
joblib—0.14.1
pmdarima—1.5.3
scikit-learn—0.22.1
tbats—1.0.9
Logging behavior
Amazon Glue versions 2.0 and later support different default logging behavior. The differences include:
Logging occurs in real time.
There are separate streams for drivers and executors.
For each driver and executor there are two streams, the output stream and the error stream.
Driver and executor streams
Driver streams are identified by the job run ID. Executor streams are identified by <job run ID>_<executor task ID>. For example:
"logStreamName": "jr_8255308b426fff1b4e09e00e0bd5612b1b4ec848d7884cebe61ed33a31789..._g-f65f617bd31d54bd94482af755b6cdf464542..."
Output and error streams
The output stream has the standard output (stdout) from your code. The error stream has logging messages from your code or libraries.
Log streams:
Driver log streams have <jr>, where <jr> is the job run ID.
Executor log streams have <jr>_<g>, where <g> is the task ID for the executor. You can look up the executor task ID in the driver error log.
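The naming scheme above can be expressed as two small functions. This is an illustrative sketch only; the function names and the sample IDs are hypothetical and do not correspond to a real job run.

```python
def driver_log_stream(job_run_id):
    """Driver stream name: the job run ID itself."""
    return job_run_id


def executor_log_stream(job_run_id, executor_task_id):
    """Executor stream name: <jr>_<g>."""
    return f"{job_run_id}_{executor_task_id}"


# Hypothetical IDs, shortened for readability.
jr = "jr_0123abcd"
g = "g-4567ef"
print(driver_log_stream(jr))
print(executor_log_stream(jr, g))
```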
The default log groups for Amazon Glue version 2.0 are as follows:
/aws-glue/jobs/logs/output for output
/aws-glue/jobs/logs/error for errors
When a security configuration is provided, the log group names change to:
/aws-glue/jobs/<security configuration>-role/<Role Name>/output
/aws-glue/jobs/<security configuration>-role/<Role Name>/error
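To locate the right log groups from a script, the two naming patterns can be combined in one helper. This is a sketch under the assumption that the placeholders are substituted literally; the function name job_log_groups and the sample configuration and role names are hypothetical.

```python
def job_log_groups(security_configuration=None, role_name=None):
    """Return the output and error log group names for a job."""
    if security_configuration is None:
        base = "/aws-glue/jobs/logs"
    else:
        if not role_name:
            raise ValueError("role_name is required with a security configuration")
        base = f"/aws-glue/jobs/{security_configuration}-role/{role_name}"
    return {"output": f"{base}/output", "error": f"{base}/error"}


print(job_log_groups())
print(job_log_groups("example-sec-config", "ExampleGlueServiceRole"))
```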
On the console, the Logs link points to the output log group and the Error link points to the error log group. When continuous logging is enabled, the Logs link points to the continuous log group, and the Output link points to the output log group.
Logging rules
Note
The default log group name for continuous logging is /aws-glue/jobs/logs-v2.
In Amazon Glue versions 2.0 and later, continuous logging has the same behavior as in Amazon Glue version 1.0:
Default log group: /aws-glue/jobs/logs-v2
Driver log stream: <jr>-driver
Executor log stream: <jr>-<executor ID>
The log group name can be changed by setting --continuous-log-logGroupName
The log stream names can be prefixed by setting --continuous-log-logStreamPrefix
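As a rough sketch, the two optional settings can be collected into job arguments like this. The helper name continuous_logging_arguments and the sample values are hypothetical, and enabling continuous logging itself is outside the scope of this example.

```python
def continuous_logging_arguments(log_group_name=None, stream_prefix=None):
    """Build the optional continuous-logging job arguments.

    Settings left as None are omitted, so the defaults apply.
    """
    arguments = {}
    if log_group_name:
        arguments["--continuous-log-logGroupName"] = log_group_name
    if stream_prefix:
        arguments["--continuous-log-logStreamPrefix"] = stream_prefix
    return arguments


# Hypothetical custom log group and stream prefix.
logging_args = continuous_logging_arguments(
    log_group_name="/example/glue/continuous",
    stream_prefix="nightly",
)
print(logging_args)
```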
Features not supported
The following Amazon Glue features are not supported:
Development endpoints
Amazon Glue versions 2.0 and later do not run on Apache YARN, so YARN settings do not apply
Amazon Glue versions 2.0 and later do not have a Hadoop Distributed File System (HDFS)
Amazon Glue versions 2.0 and later do not use dynamic allocation, hence the ExecutorAllocationManager metrics are not available
For Amazon Glue version 2.0 or later jobs, you specify the number of workers and worker type, but do not specify a maxCapacity.
Amazon Glue versions 2.0 and later do not support s3n out of the box. We recommend using s3 or s3a. If jobs need to use s3n for any reason, you can pass the following additional argument: --conf spark.hadoop.fs.s3n.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem
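The s3n workaround can be added to a job's arguments as sketched below. This assumes the argument is supplied as a key/value pair with --conf as the key, which is how other job parameters are passed; the helper name with_s3n_support is hypothetical. The sketch refuses to overwrite an existing --conf value rather than guess how multiple settings should be combined.

```python
S3N_IMPL_CONF = (
    "spark.hadoop.fs.s3n.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem"
)


def with_s3n_support(job_arguments):
    """Return a copy of the job arguments with the s3n setting added."""
    arguments = dict(job_arguments)  # leave the caller's dict untouched
    if "--conf" in arguments:
        raise ValueError("--conf is already set; merge the settings manually")
    arguments["--conf"] = S3N_IMPL_CONF
    return arguments


s3n_args = with_s3n_support({"--additional-python-modules": "scikit-learn==0.21.3"})
print(s3n_args["--conf"])
```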