Connecting to data in Ray jobs - Amazon Glue

Connecting to data in Ray jobs

Amazon Glue Ray jobs can use a broad array of Python packages that are designed to help you integrate data quickly. We provide a minimal set of dependencies to avoid cluttering your environment. For more information about what is included by default, see Modules provided with Ray jobs.

Note

Amazon Glue extract, transform, and load (ETL) provides the DynamicFrame abstraction to streamline ETL workflows where you resolve schema differences between rows in your dataset. Amazon Glue ETL also provides additional features, namely job bookmarks and grouping of input files. We don't currently provide corresponding features in Ray jobs.

Amazon Glue for Spark provides direct support for connecting to certain data formats, sources, and sinks. In Ray, the Amazon SDK for pandas and current third-party libraries largely cover that need. Consult the documentation for those libraries to understand which capabilities are available.

Amazon Glue for Ray integration with Amazon VPC is not currently available. Resources in Amazon VPC are not accessible without a public route. For more information about using Amazon Glue with Amazon VPC, see Configuring interface VPC endpoints (Amazon PrivateLink) for Amazon Glue.

Common libraries for working with data in Ray

Ray Data – Ray Data provides methods to handle common data formats, sources and sinks. For more information about supported formats and sources in Ray Data, see Input/Output in the Ray Data documentation. Ray Data is an opinionated library, rather than a general-purpose library, for handling datasets.

Ray provides guidance on the use cases where Ray Data might be the best solution for your job. For more information, see Ray use cases in the Ray documentation.
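As an illustration of the Ray Data approach, the following minimal sketch reads Parquet files, applies a row-level transform, and writes the result back. The bucket paths, the `price` and `quantity` columns, and the `add_total` and `run_pipeline` names are hypothetical and are not part of Amazon Glue or Ray. Check the Ray Data documentation for the version of Ray that your job uses before relying on specific calls.

```python
from typing import Any, Dict


def add_total(row: Dict[str, Any]) -> Dict[str, Any]:
    """Row-level transform; `price` and `quantity` are hypothetical columns."""
    return {**row, "total": row["price"] * row["quantity"]}


def run_pipeline(source_uri: str, target_uri: str) -> None:
    """Read Parquet with Ray Data, transform each row, and write Parquet back."""
    # Imported inside the function so add_total can be tested without Ray.
    import ray

    ds = ray.data.read_parquet(source_uri)  # read the source files as a dataset
    ds = ds.map(add_total)                  # each row is passed as a dict
    ds.write_parquet(target_uri)            # write the results as Parquet files


# Example call with hypothetical locations:
# run_pipeline("s3://example-bucket/input/", "s3://example-bucket/output/")
```

Keeping the transform as a plain function on dictionaries makes it easy to unit test outside of a Ray cluster.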

Amazon SDK for pandas (awswrangler) – Amazon SDK for pandas is an Amazon product that delivers clean, tested solutions for reading from and writing to Amazon services when your transformations manage data with pandas DataFrames. For more information about supported formats and sources in the Amazon SDK for pandas, see the API Reference in the Amazon SDK for pandas documentation.

For examples of how to read and write data with the Amazon SDK for pandas, see Quick Start in the Amazon SDK for pandas documentation. The Amazon SDK for pandas doesn't provide transforms for your data. It only supports reading from and writing to data sources.
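The division of labor described above, where the SDK handles input and output and pandas handles the transform, might look like the following sketch. The bucket name, object keys, the `status` column, and the `s3_uri` and `copy_shipped_orders` helpers are hypothetical. The sketch assumes that the job's role can read and write the given Amazon S3 locations.

```python
def s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket name and key parts into an s3:// URI."""
    cleaned = [p.strip("/") for p in parts]
    key = "/".join(p for p in cleaned if p)
    return f"s3://{bucket}/{key}"


def copy_shipped_orders(bucket: str) -> int:
    """Read CSV, filter with pandas, write Parquet; return the row count."""
    # Imported inside the function so s3_uri works without the SDK installed.
    import awswrangler as wr

    # Reading: the SDK returns a DataFrame.
    df = wr.s3.read_csv(path=s3_uri(bucket, "raw", "orders.csv"))

    # Transforming: plain DataFrame code, not an SDK feature.
    shipped = df[df["status"] == "shipped"]

    # Writing: dataset=True treats the path as a prefix for output files.
    wr.s3.to_parquet(
        df=shipped,
        path=s3_uri(bucket, "curated", "orders") + "/",
        dataset=True,
    )
    return len(shipped)


# Example call with a hypothetical bucket:
# copy_shipped_orders("example-bucket")
```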

Modin – Modin is a Python library that implements common pandas operations in a distributable way. For more information about Modin, see the Modin documentation. Modin itself doesn't provide support for reading from and writing to data sources. Instead, it provides distributed implementations of common transforms. Modin is supported by the Amazon SDK for pandas.

When you run Modin and the Amazon SDK for pandas together in a Ray environment, you can perform common ETL tasks with good performance. For more information about using Modin with the Amazon SDK for pandas, see At scale in the Amazon SDK for pandas documentation.
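The following sketch shows one way the two libraries might be combined. It assumes that the Amazon SDK for pandas is installed with its `modin` and `ray` extras, in which case its readers are expected to return Modin DataFrames rather than pandas DataFrames. Confirm this behavior in the At scale documentation for your SDK version. The `customer_id` and `amount` columns and the `total_by_key` and `run_distributed` names are hypothetical.

```python
def total_by_key(df, key: str, value: str):
    """Sum `value` per `key`; written against the pandas API so that it
    accepts either a pandas DataFrame or a Modin DataFrame."""
    return df.groupby(key, as_index=False)[value].sum()


def run_distributed(source_uri: str, target_uri: str) -> None:
    """Read, aggregate, and write using the SDK for input and output."""
    # Imported inside the function so total_by_key can be tested on its own.
    import awswrangler as wr

    # Assumption: with the modin and ray extras installed, this returns a
    # Modin DataFrame, so the aggregation below runs in a distributed way.
    df = wr.s3.read_parquet(path=source_uri)

    # Hypothetical column names.
    totals = total_by_key(df, key="customer_id", value="amount")

    wr.s3.to_parquet(df=totals, path=target_uri, dataset=True)


# Example call with hypothetical locations:
# run_distributed("s3://example-bucket/input/", "s3://example-bucket/totals/")
```

Because the transform uses only pandas-style calls, the same function can be exercised locally with a small pandas DataFrame before it runs at scale.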

Other frameworks – For more information about frameworks that Ray supports, see The Ray Ecosystem in the Ray documentation. We don't provide support for other frameworks in Amazon Glue for Ray.

Connecting to data through the Data Catalog

You can use the Amazon SDK for pandas to manage your data through the Data Catalog in Ray jobs. For more information, see Glue Catalog on the Amazon SDK for pandas website.
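As a rough sketch of working with the Data Catalog through the Amazon SDK for pandas, the following example reads an existing catalog table and writes a copy that is registered as a new table. The bucket, database, and table names, as well as the `catalog_write_args` and `copy_table` helpers, are hypothetical. The sketch assumes that the database already exists in the Data Catalog and that the source table is stored as Parquet.

```python
from typing import Any, Dict


def catalog_write_args(bucket: str, database: str, table: str) -> Dict[str, Any]:
    """Build the wr.s3.to_parquet arguments that register output in the catalog."""
    return {
        "path": f"s3://{bucket}/{database}/{table}/",
        "dataset": True,       # dataset mode is needed to register a table
        "database": database,  # assumed to already exist in the Data Catalog
        "table": table,        # table to create or update with the write
        "mode": "overwrite",   # replace any existing data for this table
    }


def copy_table(bucket: str, database: str, source: str, target: str) -> None:
    """Read one catalog table and write it back as another catalog table."""
    # Imported inside the function so catalog_write_args works without the SDK.
    import awswrangler as wr

    # Resolve the table's location and schema through the Data Catalog.
    df = wr.s3.read_parquet_table(database=database, table=source)

    # Write the data to Amazon S3 and register it as the target table.
    wr.s3.to_parquet(df=df, **catalog_write_args(bucket, database, target))


# Example call with hypothetical names:
# copy_table("example-bucket", "sales", "orders", "orders_copy")
```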