Migrating Amazon Glue for Spark jobs to Amazon Glue version 3.0
This topic describes the changes between Amazon Glue versions 0.9, 1.0, 2.0, and 3.0 so that you can migrate your Spark applications and ETL jobs to Amazon Glue 3.0.
To use this feature with your Amazon Glue ETL jobs, choose 3.0 for the Glue version when creating your jobs.
New features supported
This section describes new features and advantages of Amazon Glue version 3.0.
It is based on Apache Spark 3.1.1, which includes optimizations from open-source Spark as well as optimizations developed by the Amazon Glue and EMR services, such as adaptive query execution, vectorized readers, and optimized shuffles and partition coalescing.
Upgraded JDBC drivers for all Glue native sources, including MySQL, Microsoft SQL Server, Oracle, PostgreSQL, and MongoDB, as well as upgraded Spark libraries and dependencies brought in by Spark 3.1.1.
Optimized Amazon S3 access with upgraded EMRFS and enabled Amazon S3 optimized output committers by default.
Optimized Data Catalog access with partition indexes, push down predicates, partition listing, and upgraded Hive metastore client.
Integration with Lake Formation for governed catalog tables with cell-level filtering and data lake transactions.
Improved Spark UI experience with Spark 3.1.1 with new Spark executor memory metrics and Spark structured streaming metrics.
Reduced startup latency improving overall job completion times and interactivity, similar to Amazon Glue 2.0.
Spark jobs are billed in 1-second increments with a 10x lower minimum billing duration—from a 10-minute minimum to a 1-minute minimum, similar to Amazon Glue 2.0.
Actions to migrate to Amazon Glue 3.0
For existing jobs, change the Glue version from the previous version to Glue 3.0 in the job configuration.
In the console, choose Spark 3.1, Python 3 (Glue Version 3.0) or Spark 3.1, Scala 2 (Glue Version 3.0) in Glue version.
In Amazon Glue Studio, choose Glue 3.0 - Supports spark 3.1, Scala 2, Python 3 in Glue version.
In the API, choose 3.0 in the GlueVersion parameter in the UpdateJob API.
For new jobs, choose Glue 3.0 when you create a job.
In the console, choose Spark 3.1, Python 3 (Glue Version 3.0) or Spark 3.1, Scala 2 (Glue Version 3.0) in Glue version.
In Amazon Glue Studio, choose Glue 3.0 - Supports spark 3.1, Scala 2, Python 3 in Glue version.
In the API, choose 3.0 in the GlueVersion parameter in the CreateJob API.
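The API path above can be sketched with boto3. This is a minimal illustration, assuming a hypothetical job name, role ARN, and script location; the actual API call is shown commented out:

```python
def build_create_job_request(job_name, role_arn, script_location):
    """Build a CreateJob request payload that targets Glue 3.0."""
    return {
        "Name": job_name,
        "Role": role_arn,
        "GlueVersion": "3.0",            # target Glue 3.0 (Spark 3.1.1)
        "Command": {
            "Name": "glueetl",            # Spark ETL job
            "ScriptLocation": script_location,
            "PythonVersion": "3",         # Glue 3.0 runs Python 3.7
        },
        # Glue 3.0 jobs specify workers/worker type, not MaxCapacity.
        "NumberOfWorkers": 10,
        "WorkerType": "G.1X",
    }

# Hypothetical names, for illustration only.
request = build_create_job_request(
    "my-etl-job",
    "arn:aws:iam::123456789012:role/GlueJobRole",
    "s3://my-bucket/scripts/job.py",
)
# import boto3
# boto3.client("glue").create_job(**request)
```

For an existing job, the same GlueVersion field goes in the JobUpdate structure of the UpdateJob API instead.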
To view Spark event logs of Amazon Glue 3.0, launch an upgraded Spark history server for Glue 3.0 using CloudFormation or Docker.
Migration checklist
Review this checklist for migration.
Does your job depend on HDFS? If yes, try replacing HDFS with S3.
Search the job script code for file system paths starting with hdfs:// or / as a DFS path.
Check that your default file system is not configured with HDFS. If it is configured explicitly, you need to remove the fs.defaultFS configuration.
Check whether your job contains any dfs.* parameters. If it contains any, you need to verify that it is okay to disable the parameters.
Does your job depend on YARN? If yes, verify the impacts by checking if your job contains the following parameters. If it contains any, you need to verify it is okay to disable the parameters.
spark.yarn.* parameters, for example: spark.yarn.executor.memoryOverhead, spark.yarn.driver.memoryOverhead, spark.yarn.scheduler.reporterThread.maxFailures
yarn.* parameters, for example: yarn.scheduler.maximum-allocation-mb, yarn.nodemanager.resource.memory-mb
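The HDFS and YARN checks above lend themselves to a quick mechanical scan of a job script. The following is a rough sketch; the patterns are simple heuristics, not an exhaustive audit:

```python
import re

# Patterns drawn from the checklist above: HDFS paths/configs and
# YARN-specific parameters that do not apply on Glue 3.0.
HDFS_PATTERNS = [r"hdfs://", r"fs\.defaultFS", r"\bdfs\.\w"]
YARN_PATTERNS = [r"spark\.yarn\.\w", r"\byarn\.\w"]

def find_migration_flags(script_text):
    """Return the sorted list of checklist patterns found in the script."""
    hits = set()
    for pattern in HDFS_PATTERNS + YARN_PATTERNS:
        if re.search(pattern, script_text):
            hits.add(pattern)
    return sorted(hits)

# A hypothetical script fragment to scan.
sample = (
    'conf.set("spark.yarn.executor.memoryOverhead", "2g")\n'
    'path = "hdfs://namenode/data"'
)
print(find_migration_flags(sample))
```

Any hit means the corresponding checklist item needs a closer look before moving the job to Glue 3.0.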
Does your job depend on Spark 2.2.1 or Spark 2.4.3? If yes, verify the impacts by checking if your job uses features changed in Spark 3.1.1.
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-22-to-23
For example, the percentile_approx function, or SparkSession with SparkSession.builder.getOrCreate() when there is an existing SparkContext.
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24
For example, the array_contains function, or the CURRENT_DATE and CURRENT_TIMESTAMP functions with spark.sql.caseSensitive=true.
Do your job's extra jars conflict in Glue 3.0?
From Amazon Glue 0.9/1.0: Extra jars supplied in existing Amazon Glue 0.9/1.0 jobs may bring in classpath conflicts due to upgraded or new dependencies available in Glue 3.0. You can avoid classpath conflicts in Amazon Glue 3.0 with the --user-jars-first Amazon Glue job parameter or by shading your dependencies.
From Amazon Glue 2.0: You can still avoid classpath conflicts in Amazon Glue 3.0 with the --user-jars-first Amazon Glue job parameter or by shading your dependencies.
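As a sketch, the parameter goes in the job's DefaultArguments; the jar path below is hypothetical:

```python
# DefaultArguments fragment for a CreateJob/UpdateJob call.
# The jar S3 path is a placeholder for your own shaded/extra jars.
default_arguments = {
    "--user-jars-first": "true",  # load user jars ahead of Glue's classpath
    "--extra-jars": "s3://my-bucket/jars/my-shaded-deps.jar",
}
```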
Do your jobs depend on Scala 2.11?
Amazon Glue 3.0 uses Scala 2.12 so you need to rebuild your libraries with Scala 2.12 if your libraries depend on Scala 2.11.
Do your job's external Python libraries depend on Python 2.7/3.6?
Use the --additional-python-modules parameter instead of setting the egg/wheel/zip file in the Python library path.
Update the dependent libraries from Python 2.7/3.6 to Python 3.7, as Spark 3.1.1 removed Python 2.7 support.
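A sketch of what that looks like in the job's DefaultArguments; the module names and version pins below are illustrative, not required versions:

```python
# Instead of pointing the Python library path at egg/wheel/zip archives,
# have Glue pip-install the modules at job start.
default_arguments = {
    "--additional-python-modules": "pyarrow==2,awswrangler==2.4.0",
}
```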
Migrating from Amazon Glue 0.9 to Amazon Glue 3.0
Note the following changes when migrating:
Amazon Glue 0.9 uses open-source Spark 2.2.1 and Amazon Glue 3.0 uses EMR-optimized Spark 3.1.1.
Several Spark changes alone may require revision of your scripts to ensure removed features are not being referenced.
For example, Spark 3.1.1 does not enable untyped Scala UDFs, but Spark 2.2 does allow them.
All jobs in Amazon Glue 3.0 will be executed with significantly improved startup times. Spark jobs will be billed in 1-second increments with a 10x lower minimum billing duration since startup latency will go from 10 minutes maximum to 1 minute maximum.
Logging behavior has changed since Amazon Glue 2.0.
Several dependency updates, highlighted in Appendix A: notable dependency upgrades.
Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backwards compatible with Scala 2.11.
Python 3.7 is also the default version used for Python scripts, as Amazon Glue 0.9 was only utilizing Python 2.
Python 2.7 is not supported with Spark 3.1.1.
A new mechanism of installing additional Python modules is available.
Amazon Glue 3.0 does not run on Apache YARN, so YARN settings do not apply.
Amazon Glue 3.0 does not have a Hadoop Distributed File System (HDFS).
Any extra jars supplied in existing Amazon Glue 0.9 jobs may bring in conflicting dependencies, since several dependencies were upgraded in 3.0 from 0.9. You can avoid classpath conflicts in Amazon Glue 3.0 with the --user-jars-first Amazon Glue job parameter.
Amazon Glue 3.0 does not yet support dynamic allocation, hence the ExecutorAllocationManager metrics are not available.
In Amazon Glue version 3.0 jobs, you specify the number of workers and worker type, but do not specify a maxCapacity.
Amazon Glue 3.0 does not yet support machine learning transforms.
Amazon Glue 3.0 does not yet support development endpoints.
Refer to the Spark migration documentation: https://spark.apache.org/docs/latest/sql-migration-guide.html
Migrating from Amazon Glue 1.0 to Amazon Glue 3.0
Note the following changes when migrating:
Amazon Glue 1.0 uses open-source Spark 2.4 and Amazon Glue 3.0 uses EMR-optimized Spark 3.1.1.
Several Spark changes alone may require revision of your scripts to ensure removed features are not being referenced.
For example, Spark 3.1.1 does not enable untyped Scala UDFs, but Spark 2.4 does allow them.
All jobs in Amazon Glue 3.0 will be executed with significantly improved startup times. Spark jobs will be billed in 1-second increments with a 10x lower minimum billing duration since startup latency will go from 10 minutes maximum to 1 minute maximum.
Logging behavior has changed since Amazon Glue 2.0.
Several dependency updates, highlighted in Appendix A: notable dependency upgrades.
Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backwards compatible with Scala 2.11.
Python 3.7 is also the default version used for Python scripts, replacing the Python 2.7 and 3.6 options available in Amazon Glue 1.0.
Python 2.7 is not supported with Spark 3.1.1.
A new mechanism of installing additional Python modules is available.
Amazon Glue 3.0 does not run on Apache YARN, so YARN settings do not apply.
Amazon Glue 3.0 does not have a Hadoop Distributed File System (HDFS).
Any extra jars supplied in existing Amazon Glue 1.0 jobs may bring in conflicting dependencies, since several dependencies were upgraded in 3.0 from 1.0. You can avoid classpath conflicts in Amazon Glue 3.0 with the --user-jars-first Amazon Glue job parameter.
Amazon Glue 3.0 does not yet support dynamic allocation, hence the ExecutorAllocationManager metrics are not available.
In Amazon Glue version 3.0 jobs, you specify the number of workers and worker type, but do not specify a maxCapacity.
Amazon Glue 3.0 does not yet support machine learning transforms.
Amazon Glue 3.0 does not yet support development endpoints.
Refer to the Spark migration documentation: https://spark.apache.org/docs/latest/sql-migration-guide.html
Migrating from Amazon Glue 2.0 to Amazon Glue 3.0
Note the following changes when migrating:
All existing job parameters and major features that exist in Amazon Glue 2.0 will exist in Amazon Glue 3.0.
The EMRFS S3-optimized committer for writing Parquet data into Amazon S3 is enabled by default in Amazon Glue 3.0. However, you can still disable it by setting --enable-s3-parquet-optimized-committer to false.
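For example, opting out via the job's DefaultArguments might look like this sketch; the rest of the job configuration is omitted:

```python
# DefaultArguments fragment that turns the EMRFS S3-optimized
# committer (on by default in Glue 3.0) back off.
default_arguments = {
    "--enable-s3-parquet-optimized-committer": "false",
}
```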
Amazon Glue 2.0 uses open-source Spark 2.4 and Amazon Glue 3.0 uses EMR-optimized Spark 3.1.1.
Several Spark changes alone may require revision of your scripts to ensure removed features are not being referenced.
For example, Spark 3.1.1 does not enable untyped Scala UDFs, but Spark 2.4 does allow them.
Amazon Glue 3.0 also features an updated EMRFS, updated JDBC drivers, and additional optimizations to Spark itself provided by Amazon Glue.
All jobs in Amazon Glue 3.0 will be executed with significantly improved startup times. Spark jobs will be billed in 1-second increments with a 10x lower minimum billing duration since startup latency will go from 10 minutes maximum to 1 minute maximum.
Python 2.7 is not supported with Spark 3.1.1.
Several dependency updates, highlighted in Appendix A: notable dependency upgrades.
Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backwards compatible with Scala 2.11.
Any extra jars supplied in existing Amazon Glue 2.0 jobs may bring in conflicting dependencies, since several dependencies were upgraded in 3.0 from 2.0. You can avoid classpath conflicts in Amazon Glue 3.0 with the --user-jars-first Amazon Glue job parameter.
Amazon Glue 3.0 uses different Spark task parallelism for the driver/executor configuration compared to Amazon Glue 2.0, which improves performance and better utilizes the available resources. Both spark.driver.cores and spark.executor.cores are configured to the number of cores on Amazon Glue 3.0 workers (4 on the Standard and G.1X workers, and 8 on the G.2X worker). These configurations do not change the worker type or hardware for the Amazon Glue job. You can use these configurations to calculate the number of partitions or splits to match the Spark task parallelism in your Spark application.
In general, jobs will see either similar or improved performance compared to Amazon Glue 2.0. If jobs run slower, you can increase the task parallelism by passing the following job argument:
Key: --executor-cores
Value: <desired number of tasks that can run in parallel>
The value should not exceed 2x the number of vCPUs on the worker type, which is 8 on G.1X, 16 on G.2X, 32 on G.4X, and 64 on G.8X. You should exercise caution when updating this configuration, as it could impact job performance: the increased parallelism causes memory and disk pressure and could throttle the source and target systems.
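As a rough illustration of sizing partitions against those per-worker core counts, the helper below multiplies total task slots by a small factor; the 2x default is a common rule of thumb, not a Glue requirement:

```python
# Core counts per worker type, per the parallelism notes above.
CORES_PER_WORKER = {"Standard": 4, "G.1X": 4, "G.2X": 8}

def suggested_partitions(worker_type, num_workers, factor=2):
    """Total Spark task slots times a small factor to keep all cores busy."""
    return CORES_PER_WORKER[worker_type] * num_workers * factor

print(suggested_partitions("G.1X", 10))  # 4 cores * 10 workers * 2 = 80
```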
Amazon Glue 3.0 uses Spark 3.1, which changes the behavior of loading and saving timestamps from/to Parquet files. For more details, see Upgrading from Spark SQL 3.0 to 3.1. We recommend setting the following parameters when reading or writing Parquet data that contains timestamp columns. Setting these parameters can resolve the calendar incompatibility issue that occurs during the Spark 2 to Spark 3 upgrade, for both the Amazon Glue DynamicFrame and the Spark DataFrame. Use the CORRECTED option to read the datetime value as is, and the LEGACY option to rebase the datetime values with regard to the calendar difference during reading.
Key: --conf
Value: spark.sql.legacy.parquet.int96RebaseModeInRead=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=[CORRECTED|LEGACY]
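A small sketch of assembling that chained value as the single --conf job argument, choosing CORRECTED here; swap in LEGACY to rebase across the calendar difference:

```python
mode = "CORRECTED"  # or "LEGACY"

# Glue takes one --conf key whose value chains the remaining
# --conf pairs, as shown in the Key/Value lines above.
conf_value = (
    f"spark.sql.legacy.parquet.int96RebaseModeInRead={mode}"
    f" --conf spark.sql.legacy.parquet.int96RebaseModeInWrite={mode}"
    f" --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead={mode}"
)
default_arguments = {"--conf": conf_value}
```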
Refer to the Spark migration documentation: https://spark.apache.org/docs/latest/sql-migration-guide.html
Appendix A: notable dependency upgrades
The following are dependency upgrades:
Dependency | Version in Amazon Glue 0.9 | Version in Amazon Glue 1.0 | Version in Amazon Glue 2.0 | Version in Amazon Glue 3.0 |
---|---|---|---|---|
Spark | 2.2.1 | 2.4.3 | 2.4.3 | 3.1.1-amzn-0 |
Hadoop | 2.7.3-amzn-6 | 2.8.5-amzn-1 | 2.8.5-amzn-5 | 3.2.1-amzn-3 |
Scala | 2.11 | 2.11 | 2.11 | 2.12 |
Jackson | 2.7.x | 2.7.x | 2.7.x | 2.10.x |
Hive | 1.2 | 1.2 | 1.2 | 2.3.7-amzn-4 |
EMRFS | 2.20.0 | 2.30.0 | 2.38.0 | 2.46.0 |
Json4s | 3.2.x | 3.5.x | 3.5.x | 3.6.6 |
Arrow | N/A | 0.10.0 | 0.10.0 | 2.0.0 |
Amazon Glue Catalog client | N/A | N/A | 1.10.0 | 3.0.0 |
Appendix B: JDBC driver upgrades
The following are JDBC driver upgrades:
Driver | JDBC driver version in past Amazon Glue versions | JDBC driver version in Amazon Glue 3.0 |
---|---|---|
MySQL | 5.1 | 8.0.23 |
Microsoft SQL Server | 6.1.0 | 7.0.0 |
Oracle Databases | 11.2 | 21.1 |
PostgreSQL | 42.1.0 | 42.2.18 |
MongoDB | 2.0.0 | 4.0.0 |