Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Amazon Glue for Spark and Amazon Glue for Ray

In Amazon Glue on Apache Spark (Amazon Glue ETL), you can use PySpark to write Python code that handles data at scale. Spark is a familiar solution for this problem, but data engineers with Python-focused backgrounds can find the transition unintuitive. The Spark DataFrame model is not fully "Pythonic"; it reflects the Scala language and Java runtime that Spark is built on.

In Amazon Glue, you can use Python shell jobs to run native Python data integrations. These jobs run on a single Amazon EC2 instance and are limited by the capacity of that instance. This restricts the throughput of the data you can process, and becomes expensive to maintain when dealing with big data.

Amazon Glue for Ray allows you to scale up Python workloads without a substantial investment in learning Spark. You can also take advantage of the scenarios where Ray performs better than Spark. Because Amazon Glue offers both engines, you can use the strengths of each.

Amazon Glue ETL and Amazon Glue for Ray are built on different engines, so they support different features. Check the documentation for each one to determine which features it supports.

What is Amazon Glue for Ray?

Ray is an open-source distributed computation framework that you can use to scale up workloads, with a focus on Python. For more information about Ray, see the Ray website. Amazon Glue Ray jobs and interactive sessions allow you to use Ray within Amazon Glue.

You can use Amazon Glue for Ray to write Python scripts for computations that run in parallel across multiple machines. In Ray jobs and interactive sessions, you can use familiar Python libraries, like pandas, to make your workflows easy to write and run. For more information about Ray datasets, see Ray Datasets in the Ray documentation. For more information about pandas, see the pandas website.
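As a rough illustration of what "computations that run in parallel across multiple machines" can look like, the following sketch splits a list into chunks and submits each chunk as a Ray task. The helper names (split_into_chunks, sum_of_squares, parallel_sum_of_squares) are hypothetical and are not part of Amazon Glue or Ray. Only ray.init, ray.remote, and ray.get are Ray calls. How Ray should be initialized inside an Amazon Glue for Ray job is an assumption here, so check the Ray job documentation before relying on it.

```python
from typing import List


def split_into_chunks(values: List[int], num_chunks: int) -> List[List[int]]:
    """Split values into at most num_chunks contiguous, near-equal chunks."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    size = max(1, -(-len(values) // num_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def sum_of_squares(chunk: List[int]) -> int:
    """Plain Python work that each Ray task performs on one chunk."""
    return sum(v * v for v in chunk)


def parallel_sum_of_squares(values: List[int], num_chunks: int = 4) -> int:
    """Fan the chunks out as Ray tasks and combine the partial results."""
    # Imported lazily so the helpers above can be used without Ray installed.
    import ray

    # Assumption: a default ray.init() is acceptable for a local run; a managed
    # Ray environment may require different arguments.
    ray.init(ignore_reinit_error=True)

    # Wrap the ordinary function as a Ray remote function (task).
    remote_task = ray.remote(sum_of_squares)

    # Each .remote() call returns immediately with a future (object reference).
    futures = [remote_task.remote(chunk)
               for chunk in split_into_chunks(values, num_chunks)]

    # ray.get blocks until every task has finished, then returns the results.
    return sum(ray.get(futures))
```

Calling parallel_sum_of_squares(list(range(10))) on a machine with Ray installed runs the four chunks as separate tasks. The per-chunk function stays ordinary Python, which is the main appeal for teams that do not want to restructure code around Spark.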

When you use Amazon Glue for Ray, you can run your pandas workflows against big data at enterprise scale with only a few lines of code. You can create a Ray job from the Amazon Glue console or the Amazon SDK. You can also open an Amazon Glue interactive session to run your code on a serverless Ray environment. Visual jobs in Amazon Glue Studio are not yet supported.
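For creating a Ray job through the SDK, a minimal sketch with the Python SDK (boto3) might look like the following. The function names are hypothetical. The specific values ("glueray" as the command name, "Ray2.4" as the runtime, Python "3.9", Glue version "4.0", and the "Z.2X" worker type) are assumptions based on commonly documented settings, and may differ by Region or release, so confirm them in Working with Ray jobs in Amazon Glue.

```python
from typing import Any, Dict


def build_ray_job_request(job_name: str, role_arn: str, script_location: str,
                          number_of_workers: int = 5) -> Dict[str, Any]:
    """Build the keyword arguments for a Glue create_job call for a Ray job."""
    if number_of_workers < 1:
        raise ValueError("number_of_workers must be at least 1")
    return {
        "Name": job_name,
        "Role": role_arn,
        # Assumed values below; verify against the current Ray job documentation.
        "GlueVersion": "4.0",
        "WorkerType": "Z.2X",
        "NumberOfWorkers": number_of_workers,
        "Command": {
            "Name": "glueray",        # assumed command name for Ray jobs
            "PythonVersion": "3.9",   # assumed Python version
            "Runtime": "Ray2.4",      # assumed Ray runtime label
            "ScriptLocation": script_location,
        },
    }


def create_ray_job(request: Dict[str, Any]) -> str:
    """Send the request to Amazon Glue and return the created job's name."""
    # Imported lazily so the request can be built and inspected without boto3.
    import boto3

    glue = boto3.client("glue")
    response = glue.create_job(**request)
    return response["Name"]
```

Keeping the request builder separate from the API call makes it possible to review or log the job definition before anything is created in the account.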

Amazon Glue for Ray jobs allow you to run a script on a schedule or in response to an event from Amazon EventBridge. Jobs store log information and monitoring statistics in CloudWatch that enable you to understand the health and reliability of your script. For more information about the Amazon Glue job system, see Working with Ray jobs in Amazon Glue.
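To illustrate running a job on a schedule, the sketch below builds a request for a scheduled Amazon Glue trigger and passes it to the boto3 create_trigger call. The function names, the trigger name, and the cron expression are hypothetical. Whether scheduled triggers behave identically for Ray jobs and Spark jobs is an assumption to confirm in the job documentation.

```python
from typing import Any, Dict


def build_schedule_trigger_request(trigger_name: str, job_name: str,
                                   cron_fields: str) -> Dict[str, Any]:
    """Build the keyword arguments for a scheduled Glue trigger.

    cron_fields holds the six space-separated fields used by Amazon Glue
    schedules (minutes, hours, day-of-month, month, day-of-week, year).
    """
    if len(cron_fields.split()) != 6:
        raise ValueError("expected six cron fields, for example '0 12 * * ? *'")
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_fields})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,  # activate the trigger as soon as it exists
    }


def create_schedule_trigger(request: Dict[str, Any]) -> str:
    """Create the trigger in Amazon Glue and return its name."""
    # Imported lazily so the request can be built without boto3 installed.
    import boto3

    glue = boto3.client("glue")
    response = glue.create_trigger(**request)
    return response["Name"]
```

For example, build_schedule_trigger_request("daily-noon", "example-ray-job", "0 12 * * ? *") describes a trigger that starts the job every day at 12:00 UTC.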

Amazon Glue for Ray interactive sessions (preview) allow you to run snippets of code one after another against the same provisioned resources. You can use this to efficiently prototype and develop scripts, or build your own interactive applications. You can use Amazon Glue interactive sessions from Amazon Glue Studio Notebooks in the Amazon Web Services Management Console. For more information, see Using Notebooks with Amazon Glue Studio and Amazon Glue. You can also use them through a Jupyter kernel, which allows you to run interactive sessions from existing code editing tools that support Jupyter Notebooks, such as VSCode. For more information, see Getting started with Amazon Glue for Ray interactive sessions (preview).

Ray automates the work of scaling Python code by distributing the processing across a cluster of machines that it reconfigures in real time, based on the load. This can lead to improved performance per dollar for certain workloads. With Ray jobs, we have built auto scaling natively into the Amazon Glue job model, so you can fully take advantage of this feature. Ray jobs run on Amazon Graviton, leading to higher overall price performance.

In addition to cost savings, you can use native auto scaling to run Ray workloads without investing time in cluster maintenance, tuning, and administration. You can use familiar open-source libraries out of the box, such as pandas and the Amazon SDK for Pandas. These libraries improve iteration speed while you're developing on Amazon Glue for Ray. When you use Amazon Glue for Ray, you can rapidly develop and run cost-effective data integration workloads.