Amazon Glue Spark shuffle plugin with Amazon S3 - Amazon Glue
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China.

Amazon Glue Spark shuffle plugin with Amazon S3

Shuffling is an important step in a Spark job whenever data is rearranged between partitions. This is required because wide transformations such as join, groupByKey, reduceByKey, and repartition require information from other partitions to complete processing. Spark gathers the required data from each partition and combines it into a new partition. During a shuffle, data is written to disk and transferred across the network. As a result, the shuffle operation is bound to local disk capacity. Spark throws a No space left on device or MetadataFetchFailedException error when there is not enough disk space left on the executor and there is no recovery.

Solution

With Amazon Glue, you can now use Amazon S3 to store Spark shuffle data. Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. This solution disaggregates compute and storage for your Spark jobs, and gives complete elasticity and low-cost shuffle storage, allowing you to run your most shuffle-intensive workloads reliably.

We are introducing a new Cloud Shuffle Storage Plugin for Apache Spark to use Amazon S3. You can turn on Amazon S3 shuffling to run your Amazon Glue jobs reliably without failures if they are known to be bound by the local disk capacity for large shuffle operations. In some cases, shuffling to Amazon S3 is marginally slower than local disk (or EBS) if you have a large number of small partitions or shuffle files written out to Amazon S3.

Using Amazon Glue Spark shuffle manager from the Amazon console

To set up the Amazon Glue Spark shuffle manager using the Amazon Glue console or Amazon Glue Studio when configuring a job: choose the --write-shuffle-files-to-s3 job parameter to turn on Amazon S3 shuffling for the job.

Using Amazon Glue Spark shuffle plugin

The following job parameters turn on and tune the Amazon Glue shuffle manager.

  • --write-shuffle-files-to-s3 — The main flag, which when true enables the Amazon Glue Spark shuffle manager to use Amazon S3 buckets for writing and reading shuffle data. When false, or not specified the shuffle manager is not used.

  • --write-shuffle-spills-to-s3 — (Supported only on Amazon Glue version 2.0). An optional flag that when true allows you to offload spill files to Amazon S3 buckets, which provides additional resiliency to your Spark job. This is only required for large workloads that spill a lot of data to disk. When false, no intermediate spill files are written. This flag is disabled by default.

  • --conf spark.shuffle.glue.s3ShuffleBucket=s3://<shuffle-bucket> — Another optional flag that specifies the Amazon S3 bucket where you write the shuffle files. By default, --TempDir/shuffle-data.

Amazon Glue supports all other shuffle related configurations provided by Spark.

Software binaries for the Cloud Shuffle Storage plugin

You can also download the software binaries of the Cloud Shuffle Storage Plugin for Apache Spark under the Apache 2.0 license and run it on any Spark environment. The new plugin comes with out-of-the box support for Amazon S3, and can also be easily configured to use other forms of cloud storage such as Google Cloud Storage and Microsoft Azure Blob Storage. For more information, see Cloud Shuffle Storage Plugin for Apache Spark.

Notes and limitations

The following are notes or limitations for the Amazon Glue shuffle manager:

  • Make sure the location of the shuffle bucket is in the same Amazon Region in which the job runs.

  • Set the Amazon S3 storage lifecycle policies on the prefix as the shuffle manager does not clean the files after the job is done.

  • You can use this feature if your data is skewed.