Amazon Glue for Spark and Amazon Glue for Ray

In Amazon Glue on Apache Spark (Amazon Glue ETL), you can use PySpark to write Python code to handle data at scale. Spark is a familiar solution for this problem, but data engineers with Python-focused backgrounds can find the transition unintuitive. The Spark DataFrame model does not always feel "Pythonic", a consequence of the Scala language and Java runtime that Spark is built on.

In Amazon Glue, you can use Python shell jobs to run native Python data integrations. These jobs run on a single Amazon EC2 instance and are limited by the capacity of that instance. This restricts the throughput of the data you can process, and becomes expensive to maintain when dealing with big data.

Amazon Glue for Ray allows you to scale up Python workloads without a substantial investment in learning Spark. Ray also performs better than Spark in certain scenarios. Because Amazon Glue offers both engines, you can use the strengths of each.

Amazon Glue ETL and Amazon Glue for Ray are different underneath, so they support different features. Please check the documentation to determine supported features.

What is Amazon Glue for Ray?

Ray is an open-source distributed computation framework that you can use to scale up workloads, with a focus on Python. For more information about Ray, see the Ray website. Amazon Glue Ray jobs and interactive sessions allow you to use Ray within Amazon Glue.

You can use Amazon Glue for Ray to write Python scripts for computations that will run in parallel across multiple machines. In Ray jobs and interactive sessions, you can use familiar Python libraries, like pandas, to make your workflows easy to write and run. For more information about Ray datasets, see Ray Datasets in the Ray documentation. For more information about pandas, see the Pandas website.
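The idea of running a Python computation in parallel across machines can be sketched as follows. This is a minimal illustration, not the service's prescribed pattern: the function names (`summarize_chunk`, `merge_summaries`, `run_on_ray`) are hypothetical, and calling `ray.init()` with no address is an assumption that may need adjusting for your environment. Ray is imported inside `run_on_ray` so that the plain-Python helpers can be tried without a cluster.

```python
from typing import Dict, List


def summarize_chunk(chunk: List[float]) -> Dict[str, float]:
    """Plain Python work for one slice of data; nothing Ray-specific here."""
    return {"count": float(len(chunk)), "total": float(sum(chunk))}


def merge_summaries(parts: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine the per-chunk results into one overall summary."""
    return {
        "count": sum(p["count"] for p in parts),
        "total": sum(p["total"] for p in parts),
    }


def run_on_ray(chunks: List[List[float]]) -> Dict[str, float]:
    """Fan the chunks out as Ray tasks and gather the results."""
    import ray  # imported lazily so the helpers above work without Ray installed

    # Assumption: a bare ray.init() is suitable; your environment may
    # require an explicit cluster address instead.
    ray.init(ignore_reinit_error=True)

    # Wrap the ordinary function so each call becomes a remote task.
    remote_summarize = ray.remote(summarize_chunk)
    futures = [remote_summarize.remote(chunk) for chunk in chunks]

    # ray.get blocks until every task has finished.
    return merge_summaries(ray.get(futures))
```

The work itself stays ordinary Python; only `run_on_ray` knows about Ray, which is the part Ray distributes across the cluster.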

When you use Amazon Glue for Ray, you can run your pandas workflows against big data at enterprise scale with only a few lines of code. You can create a Ray job from the Amazon Glue console or the Amazon SDK. You can also open an Amazon Glue interactive session to run your code on a serverless Ray environment. Visual jobs in Amazon Glue Studio are not yet supported.
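As a rough sketch of creating a Ray job through the SDK, the snippet below builds a `create_job` request using the SDK for Python (boto3). The helper names are hypothetical, and the specific values (`GlueVersion` "4.0", `WorkerType` "Z.2X", `Runtime` "Ray2.4", `PythonVersion` "3.9") are assumptions that should be checked against the current Ray job documentation before use. The role ARN and script location are placeholders you supply.

```python
from typing import Any, Dict


def build_ray_job_request(name: str, role_arn: str, script_s3_uri: str,
                          workers: int = 5) -> Dict[str, Any]:
    """Assemble keyword arguments for the Glue create_job call (Ray job)."""
    return {
        "Name": name,
        "Role": role_arn,
        # Assumed values; confirm against the Ray job documentation.
        "GlueVersion": "4.0",
        "WorkerType": "Z.2X",
        "NumberOfWorkers": workers,
        "Command": {
            "Name": "glueray",       # job command type for Ray jobs
            "Runtime": "Ray2.4",     # assumed Ray runtime identifier
            "PythonVersion": "3.9",  # assumed Python version
            "ScriptLocation": script_s3_uri,
        },
    }


def create_ray_job(request: Dict[str, Any]) -> str:
    """Send the request; needs boto3 and valid credentials at call time."""
    import boto3  # imported lazily so the builder works without the SDK

    glue = boto3.client("glue")
    return glue.create_job(**request)["Name"]
```

Keeping the request builder separate from the network call makes it easy to review or log the exact parameters before anything is created.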

Amazon Glue for Ray jobs allow you to run a script on a schedule or in response to an event from Amazon EventBridge. Jobs store log information and monitoring statistics in CloudWatch that enable you to understand the health and reliability of your script. For more information about the Amazon Glue job system, see Working with Ray jobs in Amazon Glue.
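Running a job on a schedule can be sketched with a scheduled trigger. The example below is illustrative only: it assumes the SDK for Python (boto3) and its `create_trigger` operation, and the helper names, trigger name, job name, and cron expression are all hypothetical placeholders.

```python
from typing import Any, Dict


def build_schedule_trigger(trigger_name: str, job_name: str,
                           cron_expression: str) -> Dict[str, Any]:
    """Assemble keyword arguments for a scheduled Glue trigger."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        # Schedules are written as cron(...) with six fields.
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def create_schedule_trigger(request: Dict[str, Any]) -> str:
    """Send the request; needs boto3 and valid credentials at call time."""
    import boto3  # imported lazily so the builder works without the SDK

    return boto3.client("glue").create_trigger(**request)["Name"]
```

For instance, `build_schedule_trigger("nightly", "my-ray-job", "0 2 * * ? *")` describes a trigger that would start the hypothetical job `my-ray-job` at 02:00 UTC each day.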

Ray automates the work of scaling Python code by distributing the processing across a cluster of machines that it reconfigures in real time, based on the load. This can lead to improved performance per dollar for certain workloads. With Ray jobs, we have built auto scaling natively into the Amazon Glue job model, so you can fully take advantage of this feature. Ray jobs run on Amazon Graviton, leading to higher overall price performance.

In addition to cost savings, you can use native auto scaling to run Ray workloads without investing time in cluster maintenance, tuning, and administration. You can use familiar open-source libraries out of the box, such as pandas and the Amazon SDK for pandas. These improve iteration speed while you're developing on Amazon Glue for Ray. With Amazon Glue for Ray, you can rapidly develop and run cost-effective data integration workloads.

