Connecting to data - Amazon Glue

Connecting to data

An Amazon Glue connection is a Data Catalog object that stores login credentials, URI strings, virtual private cloud (VPC) information, and other settings for a particular data store. Amazon Glue crawlers, jobs, and development endpoints use connections to access certain types of data stores. You can use connections for both sources and targets, and you can reuse the same connection across multiple crawlers or extract, transform, and load (ETL) jobs.
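As a rough illustration of what such an object holds, the following sketch builds a `ConnectionInput` payload for a JDBC data store inside a VPC and shows how it could be passed to the `create_connection` call in boto3. The connection name, JDBC URL, secret name, subnet, security group, and Region values are placeholders invented for this example, not values from this guide.

```python
import json


def build_jdbc_connection_input(name, jdbc_url, secret_id, subnet_id,
                                security_group_ids, availability_zone):
    """Build a ConnectionInput dict for a JDBC data store reachable in a VPC."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        # URI string and a reference to credentials kept in Secrets Manager.
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "SECRET_ID": secret_id,
        },
        # VPC information that crawlers and jobs use to reach the data store.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def create_connection(connection_input, region_name):
    """Store the connection in the Data Catalog (needs boto3 and credentials)."""
    import boto3  # imported lazily so the payload can be built without boto3

    glue = boto3.client("glue", region_name=region_name)
    return glue.create_connection(ConnectionInput=connection_input)


# All values below are placeholders for illustration only.
connection_input = build_jdbc_connection_input(
    name="example-postgres-connection",
    jdbc_url="jdbc:postgresql://db.example.internal:5432/sales",
    secret_id="example/postgres/credentials",
    subnet_id="subnet-0123456789abcdef0",
    security_group_ids=["sg-0123456789abcdef0"],
    availability_zone="cn-north-1a",
)
print(json.dumps(connection_input, indent=2))
```

To create the connection for real, you would call `create_connection(connection_input, "cn-north-1")` with credentials that are allowed to call the Glue API. Because the connection is referenced by name, the same object can then be attached to several crawlers or jobs.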

The latest version of the Amazon Glue connections schema provides a unified way to manage data connections across Amazon services and applications, such as Amazon Glue, Amazon Athena, and Amazon SageMaker Unified Studio.

Overview of using connectors and connections

A connection contains the properties that are required to connect to a particular data store. When you create a connection, it is stored in the Amazon Glue Data Catalog. You choose a connector, and then create a connection based on that connector.

For data stores that are not natively supported, you can subscribe to connectors in Amazon Web Services Marketplace and then use those connectors when you create connections. Developers can also create their own connectors, which you can use in the same way.

Note

Connections created using custom or Amazon Web Services Marketplace connectors in Amazon Glue Studio appear in the Amazon Glue console with type set to UNKNOWN.

The following steps describe the overall process of using connectors in Amazon Glue Studio:

  1. Subscribe to a connector in Amazon Web Services Marketplace, or develop your own connector and upload it to Amazon Glue Studio. For more information, see Adding connectors to Amazon Glue Studio.

  2. Review the connector usage information, which you can find on the Usage tab of the connector product page. For example, on the product page for Amazon Glue Connector for Google BigQuery, the Additional Resources section of the Usage tab links to a blog post about using the connector.

  3. Create a connection. You choose which connector to use and provide additional information for the connection, such as login credentials, URI strings, and virtual private cloud (VPC) information. For more information, see Creating connections for connectors.

  4. Create an IAM role for your job. The job assumes the permissions of the IAM role that you specify when you create it. This IAM role must have the necessary permissions to authenticate with, extract data from, and write data to your data stores.

  5. Create an ETL job and configure the data source properties for your ETL job. Provide the connection options and authentication information as instructed by the custom connector provider. For more information, see Authoring jobs with custom connectors.

  6. Customize your ETL job by adding transforms or additional data stores, as described in Starting visual ETL jobs in Amazon Glue Studio.

  7. If you use a connector for the data target, configure the data target properties for your ETL job. Provide the connection options and authentication information as instructed by the custom connector provider. For more information, see Authoring jobs with custom connectors.

  8. Customize the job run environment by configuring job properties, as described in Modify the job properties.

  9. Run the job.
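The steps above use the Amazon Glue Studio console. For readers who script the later steps (attaching a connection to a job, assigning the IAM role, and running the job), the following is a minimal sketch using the boto3 `create_job` and `start_job_run` calls. The job name, role ARN, script location, connection name, and Glue version are placeholders assumed for this example. The source options helper is also hypothetical: the exact option keys depend on the connector provider's instructions.

```python
def build_job_request(job_name, role_arn, script_location, connection_names):
    """Build keyword arguments for the Glue create_job call."""
    return {
        "Name": job_name,
        # IAM role whose permissions the job assumes (step 4).
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        # Connections created in step 3 that the job is allowed to use.
        "Connections": {"Connections": list(connection_names)},
        "GlueVersion": "4.0",  # assumed version, adjust to your environment
    }


def build_connector_source_options(connection_name, table):
    """Hypothetical data source options for a connector-based source (step 5).

    Real option keys come from the connector provider's usage information.
    """
    return {"connectionName": connection_name, "dbTable": table}


def create_and_run_job(job_request, region_name):
    """Create the job and start a run (needs boto3 and credentials)."""
    import boto3  # imported lazily so the request can be built without boto3

    glue = boto3.client("glue", region_name=region_name)
    glue.create_job(**job_request)
    run = glue.start_job_run(JobName=job_request["Name"])
    return run["JobRunId"]


# All values below are placeholders for illustration only.
job_request = build_job_request(
    job_name="example-connector-job",
    role_arn="arn:aws-cn:iam::123456789012:role/ExampleGlueJobRole",
    script_location="s3://example-bucket/scripts/example_job.py",
    connection_names=["example-bigquery-connection"],
)
source_options = build_connector_source_options(
    "example-bigquery-connection", "example_dataset.example_table"
)
```

Keeping the request builders separate from the API call makes it possible to review or unit test the job definition before anything is created in your account.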

Unified connections

With unified connections, you configure a data connection once and reuse it across services for data integration, data analytics, and data science use cases. You can create data connections through the Amazon Glue console or from custom-built applications that use the unified data connectivity APIs. Each connection is based on a connection configuration template that is standardized across services, so Amazon Glue, Amazon SageMaker Unified Studio, and Amazon Athena can share and reuse the same connection when the proper permissions are configured.

Amazon Glue Studio now creates unified connections by default. In the Amazon Glue console, the connection version appears in three places: the connections table on the Connections page, the connection details page, and the connections table on the job details page.

The connection version is visible on Connection details:

Screenshot shows the connections detail on the v2 connection.

The connection version is also visible when viewing all your Connections.

Screenshot shows the connection version in the connections table.

Finally, the connection version is visible in the Job details tab for a job.

Screenshot shows the connection version in the Job details tab.
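Outside the console, the version can likely be read from the connection itself. The following sketch assumes that the `GetConnection` response exposes the version as a `ConnectionSchemaVersion` field on the returned connection. Treat that field name as an assumption to verify against the API reference. The sample response is hand-written for illustration and is not real service output.

```python
def connection_schema_version(get_connection_response):
    """Return the schema version from a GetConnection response, or None.

    Assumes the version is reported as Connection.ConnectionSchemaVersion.
    """
    connection = get_connection_response.get("Connection", {})
    return connection.get("ConnectionSchemaVersion")


def fetch_connection_schema_version(name, region_name):
    """Look up a connection by name (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue", region_name=region_name)
    return connection_schema_version(glue.get_connection(Name=name))


# Hand-written sample response, for illustration only.
sample_response = {
    "Connection": {
        "Name": "example-unified-connection",
        "ConnectionSchemaVersion": 2,
    }
}
version = connection_schema_version(sample_response)
is_unified = version == 2
```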

With version 2 connections, you have the following expanded data connectivity capabilities:

  • Connection type discovery: Support for creating connections using standardized templates. Amazon Glue automatically discovers the connection types available to you, along with the required and optional inputs for a given connection type.

  • Reusability: Connection definitions are reusable across Amazon data processing engines and tools such as Amazon Glue, Amazon Athena, and Amazon SageMaker. Connections now contain AthenaProperties, SparkProperties, and PythonProperties, which let you specify properties for a particular compute environment or service in addition to the common properties stored in ConnectionProperties. For example, Athena now creates connections in Amazon Glue by specifying Athena-specific properties in the AthenaProperties property map.

  • Data preview: Ability to browse metadata and preview data from connected sources.

  • Connector metadata: You can use reusable connections to discover table metadata.

  • Service-linked secrets: You can provide the necessary OAuth, basic, or custom authentication credentials in the CreateConnection request. The CreateConnection API creates a service-linked secret in your account and stores the credentials on your behalf.
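To make the reusability and service-linked secret capabilities more concrete, the following sketch builds a version 2 style `ConnectionInput` that combines common properties with Spark, Athena, and Python property maps and passes basic credentials in the request. The property map names come from the list above. The connection type, every key inside the property maps, and the credential values are hypothetical placeholders, and the shape of `AuthenticationConfiguration` reflects my understanding of the API rather than anything stated in this guide. The `describe_connection_type` helper assumes the boto3 call of that name for connection type discovery.

```python
def build_unified_connection_input(name, connection_type, username, password):
    """Build a ConnectionInput with common and compute-specific properties."""
    return {
        "Name": name,
        "ConnectionType": connection_type,
        # Common properties shared by every service that uses the connection.
        # The keys in all four maps below are hypothetical examples.
        "ConnectionProperties": {"HOST": "db.example.internal", "PORT": "5432"},
        # Properties that apply only to a specific compute environment.
        "SparkProperties": {"example.spark.option": "value"},
        "AthenaProperties": {"example.athena.option": "value"},
        "PythonProperties": {"example.python.option": "value"},
        # Credentials sent in the request; the service stores them in a
        # service-linked secret on your behalf.
        "AuthenticationConfiguration": {
            "AuthenticationType": "BASIC",
            "BasicAuthenticationCredentials": {
                "Username": username,
                "Password": password,
            },
        },
    }


def describe_connection_type(connection_type, region_name):
    """Discover the inputs for a connection type (needs boto3 and credentials)."""
    import boto3  # imported lazily so the payload can be built without boto3

    glue = boto3.client("glue", region_name=region_name)
    return glue.describe_connection_type(ConnectionType=connection_type)


# All values below are placeholders for illustration only.
unified_input = build_unified_connection_input(
    name="example-unified-connection",
    connection_type="POSTGRESQL",
    username="example_user",
    password="example-password",
)
```

In practice you would first inspect the required and optional inputs for the connection type and then fill the property maps accordingly, rather than using the invented keys shown here.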