Connecting to data in Ray jobs
Amazon Glue Ray jobs can use a broad array of Python packages that are designed for you to quickly integrate data. We provide a minimal set of dependencies in order to not clutter your environment. For more information about what is included by default, see Modules provided with Ray jobs.
Note
Amazon Glue extract, transform, and load (ETL) provides the DynamicFrame abstraction to streamline ETL workflows where you resolve schema differences between rows in your dataset. Amazon Glue ETL also provides two additional features: job bookmarks and grouping of input files. We don't currently provide corresponding features in Ray jobs.
Amazon Glue for Spark provides direct support for connecting to certain data formats, sources, and sinks. In Ray, the Amazon SDK for pandas and current third-party libraries substantively cover that need. Consult the documentation for those libraries to understand what capabilities are available.
Amazon Glue for Ray integration with Amazon VPC is not currently available. Resources in Amazon VPC will not be accessible without a public route. For more information about using Amazon Glue with Amazon VPC, see Configuring interface VPC endpoints (Amazon PrivateLink) for Amazon Glue.
Common libraries for working with data in Ray
Ray Data – Ray Data provides methods to handle common data formats, sources and sinks. For more information about supported formats and sources in Ray Data, see Input/Output. Ray provides certain guidance around use cases where Ray Data might be the best solution for your job. For more information, see Ray use cases.
Amazon SDK for pandas (awswrangler) – Amazon SDK for pandas is an Amazon product that delivers clean, tested solutions for reading from and writing to Amazon services when your transformations manage data with pandas DataFrames. For more information about supported formats and sources in the Amazon SDK for pandas, see the API Reference. For examples of how to read and write data with the Amazon SDK for pandas, see Quick Start.
Modin – Modin is a Python library that implements common pandas operations in a distributable way. For more information about Modin, see the Modin documentation. When you run Modin and the Amazon SDK for pandas together in a Ray environment, you can perform common ETL tasks with good performance. For more information about using Modin with the Amazon SDK for pandas, see At scale.
Other frameworks – For more information about frameworks that Ray supports, see The Ray Ecosystem.
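As a rough illustration of the libraries listed above, the following sketch reads Parquet data from Amazon S3 once with Ray Data and once with the Amazon SDK for pandas. The bucket name, prefix, and helper function names are hypothetical. The sketch also assumes that ray and awswrangler can be imported in the job environment and that the job's IAM role can read the location.

```python
def s3_uri(bucket: str, prefix: str) -> str:
    """Build an s3:// URI from a bucket name and a key prefix."""
    return f"s3://{bucket}/{prefix.lstrip('/')}"


def read_with_ray_data(uri: str):
    """Read the Parquet files under `uri` into a Ray Dataset."""
    # Imported inside the function so that s3_uri has no third-party dependencies.
    import ray

    return ray.data.read_parquet(uri)


def read_with_sdk_for_pandas(uri: str):
    """Read the Parquet files under `uri` into a DataFrame with awswrangler."""
    import awswrangler as wr

    # dataset=True treats the prefix as a (possibly partitioned) dataset.
    return wr.s3.read_parquet(path=uri, dataset=True)


# Hypothetical location; replace it with a bucket and prefix that you own.
EXAMPLE_URI = s3_uri("amzn-s3-demo-bucket", "/input/events/")
```

In a job script, you might call read_with_ray_data(EXAMPLE_URI) when you want Ray Data's batch-oriented processing, or read_with_sdk_for_pandas(EXAMPLE_URI) when the rest of the transformation works on DataFrames.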
Connecting to data through the Data Catalog
Ray jobs can manage data through the Data Catalog by using the Amazon SDK for pandas. For more information, see Glue Catalog.
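One possible way to work with Data Catalog tables from a Ray job is sketched below with the Amazon SDK for pandas. The database, table, and path values, as well as the helper function names, are hypothetical. The sketch assumes that awswrangler can be imported in the job environment, that the table is stored as Parquet, and that the job's IAM role has the required Data Catalog and Amazon S3 permissions.

```python
def catalog_write_options(path: str, database: str, table: str, mode: str = "append") -> dict:
    """Keyword arguments that ask wr.s3.to_parquet to register its output in the Data Catalog."""
    return {
        "path": path,          # S3 prefix that receives the Parquet files
        "dataset": True,       # required when database and table are passed
        "database": database,  # Data Catalog database name
        "table": table,        # Data Catalog table name
        "mode": mode,          # "append", "overwrite", or "overwrite_partitions"
    }


def read_catalog_table(database: str, table: str):
    """Read a Parquet-backed Data Catalog table into a DataFrame."""
    # Imported inside the function so that the helper above has no third-party dependencies.
    import awswrangler as wr

    return wr.s3.read_parquet_table(database=database, table=table)


def write_catalog_table(df, path: str, database: str, table: str):
    """Write `df` as Parquet and create or update the matching Data Catalog table."""
    import awswrangler as wr

    return wr.s3.to_parquet(df=df, **catalog_write_options(path, database, table))
```

Keeping the write options in a small helper makes it easier to review which catalog table a job updates before any data is written.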