Using Ray Core and Ray Data in Amazon Glue for Ray - Amazon Glue

Using Ray Core and Ray Data in Amazon Glue for Ray

Ray is a framework for scaling Python scripts by distributing work across a cluster. Ray can be applied to many kinds of problems, so it provides libraries that are optimized for specific tasks. In Amazon Glue, we focus on using Ray to transform large datasets. Amazon Glue supports Ray Data and parts of Ray Core to facilitate this task.

What is Ray Core?

The first step in building a distributed application is identifying and defining work that can be performed concurrently. Ray Core contains the parts of Ray that you use to define such tasks. Ray provides reference and quick start information that you can use to learn the tools it provides. For more information, see What is Ray Core? and Ray Core Quick Start. For more information about effectively defining concurrent tasks in Ray, see Tips for first-time users.

Ray tasks and actors

In Amazon Glue for Ray documentation, we might refer to tasks and actors, which are core concepts in Ray.

Ray uses Python functions and classes as the building blocks of a distributed computing system. Just as Python functions and variables are called "methods" and "attributes" when they belong to a class, functions are called "tasks" and classes are called "actors" when Ray uses them to send code to workers. You can identify the functions and classes that Ray might use by the @ray.remote decorator.

Tasks and actors are configurable, they have a lifecycle, and they take up compute resources throughout their life. Code that throws errors can be traced back to a task or actor when you're finding the root cause of problems. Thus, these terms might come up when you're learning how to configure, monitor, or debug Amazon Glue for Ray jobs.

To begin learning how to effectively use tasks and actors to build a distributed application, see Key Concepts in the Ray docs.

Ray Core in Amazon Glue for Ray

Amazon Glue for Ray environments manage cluster formation and scaling, and they collect and visualize logs. Because we manage these concerns, we limit access to and support for the Ray Core APIs that you would use to address them in an open-source cluster.

In the managed Ray2.4 runtime environment, we do not support:

What is Ray Data?

Ray Data is a Ray library that provides a straightforward way to connect to data sources and destinations, handle datasets, and initiate common transforms. For more information about using Ray Data, see Ray Datasets: Distributed Data Preprocessing.

You can use Ray Data or other tools to access your data. For more information on accessing your data in Ray, see Connecting to data in Ray jobs.

Ray Data in Amazon Glue for Ray

Ray Data is supported and provided by default in the managed Ray2.4 runtime environment. For more information about provided modules, see Modules provided with Ray jobs.