Migrating Amazon Glue jobs to Amazon Glue version 3.0 - Amazon Glue
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China.

Migrating Amazon Glue jobs to Amazon Glue version 3.0

This topic describes the changes between Amazon Glue versions 0.9, 1.0, and 2.0 and Amazon Glue 3.0 so that you can migrate your Spark applications and ETL jobs to Amazon Glue 3.0.

To use this feature with your Amazon Glue ETL jobs, choose 3.0 for the Glue version when creating your jobs.

New features supported

This section describes new features and advantages of Amazon Glue version 3.0.

  • It is based on Apache Spark 3.1.1, which includes optimizations from open-source Spark as well as optimizations developed by the Amazon Glue and Amazon EMR services, such as adaptive query execution, vectorized readers, and optimized shuffles and partition coalescing.

  • Upgraded JDBC drivers for all Glue native sources, including MySQL, Microsoft SQL Server, Oracle, PostgreSQL, and MongoDB, as well as upgraded Spark libraries and dependencies brought in by Spark 3.1.1.

  • Optimized Amazon S3 access with upgraded EMRFS and enabled Amazon S3 optimized output committers by default.

  • Optimized Data Catalog access with partition indexes, pushdown predicates, partition listing, and an upgraded Hive metastore client.

  • Integration with Lake Formation for governed catalog tables with cell-level filtering and data lake transactions.

  • Improved Spark UI experience with Spark 3.1.1, including new Spark executor memory metrics and Spark Structured Streaming metrics.

  • Reduced startup latency improving overall job completion times and interactivity, similar to Amazon Glue 2.0.

  • Spark jobs are billed in 1-second increments with a 10x lower minimum billing duration—from a 10-minute minimum to a 1-minute minimum, similar to Amazon Glue 2.0.
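The billing change above can be sketched as simple arithmetic. This is a minimal illustration of the stated minimums only (1-second increments, 1-minute minimum on Glue 2.0/3.0 versus the earlier 10-minute minimum); it is not an official pricing calculator.

```python
# Sketch of the minimum-billing-duration change described above.
# Glue 2.0/3.0: 1-minute minimum; earlier versions: 10-minute minimum.
def billed_seconds(run_seconds: int, glue_version: str = "3.0") -> int:
    """Return the billable duration in seconds for a Spark job run."""
    minimum = 60 if glue_version in ("2.0", "3.0") else 600
    return max(run_seconds, minimum)

# A 30-second run on Glue 3.0 bills the 1-minute minimum,
# while the same run on Glue 1.0 would bill the 10-minute minimum.
print(billed_seconds(30, "3.0"))   # 60
print(billed_seconds(30, "1.0"))   # 600
print(billed_seconds(90, "3.0"))   # 90
```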

Actions to migrate to Amazon Glue 3.0

For existing jobs, change the Glue version from the previous version to Glue 3.0 in the job configuration.

  • In the console, choose Spark 3.1, Python 3 (Glue Version 3.0) or Spark 3.1, Scala 2 (Glue Version 3.0) in Glue version.

  • In Amazon Glue Studio, choose Glue 3.0 - Supports spark 3.1, Scala 2, Python 3 in Glue version.

  • In the API, choose 3.0 in the GlueVersion parameter in the UpdateJob API.

For new jobs, choose Glue 3.0 when you create a job.

  • In the console, choose Spark 3.1, Python 3 (Glue Version 3.0) or Spark 3.1, Scala 2 (Glue Version 3.0) in Glue version.

  • In Amazon Glue Studio, choose Glue 3.0 - Supports spark 3.1, Scala 2, Python 3 in Glue version.

  • In the API, choose 3.0 in the GlueVersion parameter in the CreateJob API.
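For the API path above, the request might look like the following sketch. The job name, role ARN, and script location are placeholders, not values from this guide; only the GlueVersion and Command settings reflect what this topic describes.

```python
# Hedged sketch: a CreateJob request that pins a new Spark ETL job to Glue 3.0.
import json

create_job_request = {
    "Name": "my-etl-job",  # placeholder job name
    "Role": "arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder role
    "Command": {
        "Name": "glueetl",  # Spark ETL job type
        "ScriptLocation": "s3://my-bucket/scripts/job.py",  # placeholder path
        "PythonVersion": "3",
    },
    "GlueVersion": "3.0",  # the setting this topic describes
}

# With boto3 this request would be submitted as:
#   import boto3
#   boto3.client("glue").create_job(**create_job_request)
print(json.dumps(create_job_request, indent=2))
```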

To view Spark event logs of Amazon Glue 3.0, launch an upgraded Spark history server for Glue 3.0 using CloudFormation or Docker.

Migration checklist

Review this checklist for migration.

  • Does your job depend on HDFS? If yes, try replacing HDFS with S3.

    • Search the job script code for file system paths that start with hdfs:// or with / (a DFS path).

    • Check whether your default file system is configured with HDFS. If fs.defaultFS is set explicitly, remove that configuration.

    • Check whether your job contains any dfs.* parameters. If it does, verify that it is okay to disable them.

  • Does your job depend on YARN? If yes, verify the impacts by checking whether your job contains any of the following parameters. If it does, verify that it is okay to disable them.

    • spark.yarn.*

      For example:

      spark.yarn.executor.memoryOverhead
      spark.yarn.driver.memoryOverhead
      spark.yarn.scheduler.reporterThread.maxFailures
    • yarn.*

      For example:

      yarn.scheduler.maximum-allocation-mb
      yarn.nodemanager.resource.memory-mb
  • Does your job depend on Spark 2.2.1 or Spark 2.4.3? If yes, verify the impacts by checking if your job uses features changed in Spark 3.1.1.

    • https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-22-to-23

      For example, the percentile_approx function, or calling SparkSession.builder.getOrCreate() when an existing SparkContext is present.

    • https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24

      For example, the array_contains function, or the CURRENT_DATE and CURRENT_TIMESTAMP functions with spark.sql.caseSensitive=true.

  • Do your job's extra jars conflict in Glue 3.0?

    • From Amazon Glue 0.9/1.0: Extra jars supplied in existing Amazon Glue 0.9/1.0 jobs may bring in classpath conflicts due to upgraded or new dependencies available in Glue 3.0. You can avoid classpath conflicts in Amazon Glue 3.0 with the --user-jars-first Amazon Glue job parameter or by shading your dependencies.

    • From Amazon Glue 2.0: You can still avoid classpath conflicts in Amazon Glue 3.0 with the --user-jars-first Amazon Glue job parameter or by shading your dependencies.

  • Do your jobs depend on Scala 2.11?

    • Amazon Glue 3.0 uses Scala 2.12, so you need to rebuild your libraries with Scala 2.12 if they depend on Scala 2.11.

  • Do your job's external Python libraries depend on Python 2.7/3.6?

    • Use the --additional-python-modules parameter instead of setting the egg/wheel/zip file in the Python library path.

    • Update the dependent libraries from Python 2.7/3.6 to Python 3.7 as Spark 3.1.1 removed Python 2.7 support.
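The HDFS and YARN checks in the checklist above can be partly automated. The following is a minimal sketch that scans a job script and its Spark configuration for the patterns the checklist calls out; the function name and the sample inputs are illustrative, and you would adapt the patterns to your own jobs.

```python
# Hedged sketch: flag the HDFS/YARN settings named in the migration checklist.
import re

HDFS_PATH = re.compile(r"hdfs://")
SUSPECT_CONF_PREFIXES = ("spark.yarn.", "yarn.", "dfs.")

def checklist_findings(script_text: str, spark_conf: dict) -> list:
    """Return human-readable findings for the HDFS/YARN checklist items."""
    findings = []
    if HDFS_PATH.search(script_text):
        findings.append("script references an hdfs:// path; try replacing it with s3://")
    if spark_conf.get("fs.defaultFS", "").startswith("hdfs"):
        findings.append("fs.defaultFS is set to HDFS; remove that configuration")
    for key in spark_conf:
        if key.startswith(SUSPECT_CONF_PREFIXES):
            findings.append(f"{key}: verify it is okay to disable this parameter")
    return findings

# Illustrative inputs only.
sample_conf = {
    "fs.defaultFS": "hdfs:///",
    "spark.yarn.executor.memoryOverhead": "1g",
}
for finding in checklist_findings("df.write.save('hdfs:///out')", sample_conf):
    print(finding)
```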

Migrating from Amazon Glue 0.9 to Amazon Glue 3.0

Note the following changes when migrating:

  • Amazon Glue 0.9 uses open-source Spark 2.2.1 and Amazon Glue 3.0 uses EMR-optimized Spark 3.1.1.

    • Several Spark changes alone may require revision of your scripts to ensure removed features are not being referenced.

    • For example, Spark 3.1.1 does not enable untyped Scala UDFs, but Spark 2.2 does allow them.

  • All jobs in Amazon Glue 3.0 run with significantly improved startup times. Spark jobs are billed in 1-second increments with a 10x lower minimum billing duration, because startup latency drops from a maximum of 10 minutes to a maximum of 1 minute.

  • Logging behavior has changed since Amazon Glue 2.0.

  • Several dependency updates, highlighted in Appendix A: notable dependency upgrades.

  • Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backwards compatible with Scala 2.11.

  • Python 3.7 is also the default version used for Python scripts, as Amazon Glue 0.9 used only Python 2.

    • Python 2.7 is not supported with Spark 3.1.1.

    • A new mechanism of installing additional Python modules is available.

  • Amazon Glue 3.0 does not run on Apache YARN, so YARN settings do not apply.

  • Amazon Glue 3.0 does not have a Hadoop Distributed File System (HDFS).

  • Any extra jars supplied in existing Amazon Glue 0.9 jobs may bring in conflicting dependencies, because several dependencies were upgraded from 0.9 to 3.0. You can avoid classpath conflicts in Amazon Glue 3.0 with the --user-jars-first Amazon Glue job parameter.

  • Amazon Glue 3.0 does not yet support dynamic allocation, hence the ExecutorAllocationManager metrics are not available.

  • In Amazon Glue version 3.0 jobs, you specify the number of workers and worker type, but do not specify a maxCapacity.

  • Amazon Glue 3.0 does not yet support machine learning transforms.

  • Amazon Glue 3.0 does not yet support development endpoints.
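The new mechanism for installing additional Python modules mentioned above is the --additional-python-modules job parameter. The following sketch shows how an UpdateJob request might carry it when moving a job to Glue 3.0; the module pin is a placeholder, and note that a full JobUpdate also requires fields such as Role and Command, which are omitted here for brevity.

```python
# Hedged sketch: partial UpdateJob payload that switches a job to Glue 3.0
# and installs extra Python modules instead of using egg/zip files.
job_update = {
    "GlueVersion": "3.0",
    "DefaultArguments": {
        # Placeholder module pin; list your own dependencies here.
        "--additional-python-modules": "scikit-learn==0.24.2",
    },
}

# With boto3 this would be applied (with the required Role/Command fields) as:
#   boto3.client("glue").update_job(JobName="my-etl-job", JobUpdate=job_update)
print(job_update["GlueVersion"])
```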

Refer to the Spark migration documentation: https://spark.apache.org/docs/latest/sql-migration-guide.html

Migrating from Amazon Glue 1.0 to Amazon Glue 3.0

Note the following changes when migrating:

  • Amazon Glue 1.0 uses open-source Spark 2.4 and Amazon Glue 3.0 uses EMR-optimized Spark 3.1.1.

    • Several Spark changes alone may require revision of your scripts to ensure removed features are not being referenced.

    • For example, Spark 3.1.1 does not enable Scala-untyped UDFs but Spark 2.4 does allow them.

  • All jobs in Amazon Glue 3.0 run with significantly improved startup times. Spark jobs are billed in 1-second increments with a 10x lower minimum billing duration, because startup latency drops from a maximum of 10 minutes to a maximum of 1 minute.

  • Logging behavior has changed since Amazon Glue 2.0.

  • Several dependency updates, highlighted in Appendix A: notable dependency upgrades.

  • Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backwards compatible with Scala 2.11.

  • Python 3.7 is also the default version used for Python scripts, as Amazon Glue 1.0 used Python 2 or Python 3.6.

    • Python 2.7 is not supported with Spark 3.1.1.

    • A new mechanism of installing additional Python modules is available.

  • Amazon Glue 3.0 does not run on Apache YARN, so YARN settings do not apply.

  • Amazon Glue 3.0 does not have a Hadoop Distributed File System (HDFS).

  • Any extra jars supplied in existing Amazon Glue 1.0 jobs may bring in conflicting dependencies, because several dependencies were upgraded from 1.0 to 3.0. You can avoid classpath conflicts in Amazon Glue 3.0 with the --user-jars-first Amazon Glue job parameter.

  • Amazon Glue 3.0 does not yet support dynamic allocation, hence the ExecutorAllocationManager metrics are not available.

  • In Amazon Glue version 3.0 jobs, you specify the number of workers and worker type, but do not specify a maxCapacity.

  • Amazon Glue 3.0 does not yet support machine learning transforms.

  • Amazon Glue 3.0 does not yet support development endpoints.
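The --user-jars-first parameter mentioned above is passed as a job argument. This sketch shows one plausible shape of the DefaultArguments that load your own (ideally shaded) jars ahead of Glue 3.0's upgraded dependencies; the jar path is a placeholder.

```python
# Hedged sketch: job arguments that prioritize user-supplied jars over
# Glue 3.0's upgraded dependencies to avoid classpath conflicts.
default_arguments = {
    "--user-jars-first": "true",
    # Placeholder S3 path to a shaded jar containing your legacy dependency.
    "--extra-jars": "s3://my-bucket/jars/legacy-client-shaded.jar",
}
print(default_arguments["--user-jars-first"])
```

Shading (relocating your dependency's packages inside your own jar) remains the more robust option when a version conflict persists even with --user-jars-first.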

Refer to the Spark migration documentation: https://spark.apache.org/docs/latest/sql-migration-guide.html

Migrating from Amazon Glue 2.0 to Amazon Glue 3.0

Note the following changes when migrating:

  • All existing job parameters and major features that exist in Amazon Glue 2.0 also exist in Amazon Glue 3.0.

    • The EMRFS S3-optimized committer for writing Parquet data into Amazon S3 is enabled by default in Amazon Glue 3.0. However, you can still disable it by setting --enable-s3-parquet-optimized-committer to false.

  • Amazon Glue 2.0 uses open-source Spark 2.4 and Amazon Glue 3.0 uses EMR-optimized Spark 3.1.1.

    • Several Spark changes alone may require revision of your scripts to ensure removed features are not being referenced.

    • For example, Spark 3.1.1 does not enable Scala-untyped UDFs but Spark 2.4 does allow them.

  • Amazon Glue 3.0 also features an updated EMRFS, updated JDBC drivers, and additional optimizations to Spark itself provided by Amazon Glue.

  • All jobs in Amazon Glue 3.0 run with significantly improved startup times. Spark jobs are billed in 1-second increments with a 10x lower minimum billing duration, because startup latency drops from a maximum of 10 minutes to a maximum of 1 minute.

  • Python 2.7 is not supported with Spark 3.1.1.

  • Several dependency updates, highlighted in Appendix A: notable dependency upgrades.

  • Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backwards compatible with Scala 2.11.

  • Any extra jars supplied in existing Amazon Glue 2.0 jobs may bring in conflicting dependencies, because several dependencies were upgraded from 2.0 to 3.0. You can avoid classpath conflicts in Amazon Glue 3.0 with the --user-jars-first Amazon Glue job parameter.

  • Amazon Glue 3.0 configures Spark task parallelism for the driver and executors differently from Amazon Glue 2.0, which improves performance and better utilizes the available resources. Both spark.driver.cores and spark.executor.cores are set to the number of cores on the worker (4 on the Standard and G.1X workers, and 8 on the G.2X worker). These configurations do not change the worker type or hardware for the Amazon Glue job. You can use them to calculate the number of partitions or splits that match the Spark task parallelism in your Spark application.

  • Amazon Glue 3.0 uses Spark 3.1, which changes the behavior of loading and saving timestamps from and to Parquet files. For more details, see Upgrading from Spark SQL 3.0 to 3.1.

    We recommend setting the following parameters when reading or writing Parquet data that contains timestamp columns. Setting these parameters resolves the calendar incompatibility issue that occurs during the Spark 2 to Spark 3 upgrade, for both the Amazon Glue DynamicFrame and the Spark DataFrame. Use the CORRECTED option to read the datetime values as they are, or the LEGACY option to rebase the datetime values with regard to the calendar difference during reading.

    - Key: --conf
    - Value: spark.sql.legacy.parquet.int96RebaseModeInRead=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=[CORRECTED|LEGACY]
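Assembling the --conf value above by hand is error-prone, so here is a small sketch that builds it from the three configuration keys listed in this topic. Choosing CORRECTED for all three keys is an illustrative default, not a recommendation; pick CORRECTED or LEGACY per your data's history.

```python
# Hedged sketch: build the chained --conf job-argument value for the
# Parquet timestamp rebase settings listed above.
rebase_mode = "CORRECTED"  # or "LEGACY" to rebase across the calendar change

REBASE_KEYS = (
    "spark.sql.legacy.parquet.int96RebaseModeInRead",
    "spark.sql.legacy.parquet.int96RebaseModeInWrite",
    "spark.sql.legacy.parquet.datetimeRebaseModeInRead",
)

# Glue's --conf argument chains extra settings with " --conf " separators.
conf_value = " --conf ".join(f"{key}={rebase_mode}" for key in REBASE_KEYS)
default_arguments = {"--conf": conf_value}
print(default_arguments["--conf"])
```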

Refer to the Spark migration documentation: https://spark.apache.org/docs/latest/sql-migration-guide.html

Appendix A: notable dependency upgrades

The following are dependency upgrades:

Dependency                  Amazon Glue 0.9  Amazon Glue 1.0  Amazon Glue 2.0  Amazon Glue 3.0
Spark                       2.2.1            2.4.3            2.4.3            3.1.1-amzn-0
Hadoop                      2.7.3-amzn-6     2.8.5-amzn-1     2.8.5-amzn-5     3.2.1-amzn-3
Scala                       2.11             2.11             2.11             2.12
Jackson                     2.7.x            2.7.x            2.7.x            2.10.x
Hive                        1.2              1.2              1.2              2.3.7-amzn-4
EMRFS                       2.20.0           2.30.0           2.38.0           2.46.0
Json4s                      3.2.x            3.5.x            3.5.x            3.6.6
Arrow                       N/A              0.10.0           0.10.0           2.0.0
Amazon Glue Catalog client  N/A              N/A              1.10.0           3.0.0

Appendix B: JDBC driver upgrades

The following are JDBC driver upgrades:

Driver                JDBC driver version in past Amazon Glue versions  JDBC driver version in Amazon Glue 3.0
MySQL                 5.1                                               8.0.23
Microsoft SQL Server  6.1.0                                             7.0.0
Oracle Databases      11.2                                              21.1
PostgreSQL            42.1.0                                            42.2.18
MongoDB               2.0.0                                             4.0.0