Configure Amazon Glue connections - Amazon SageMaker
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Configure Amazon Glue connections

To configure data sources for use with the SQL extension, administrators need to create an Amazon Glue connection for each data source. These connections store the configuration details that allow accessing and interacting with the data source.

To create these connections:

  • First, create a JSON file that defines the connection properties for each data source. The JSON file includes details such as the data source identifier, access credentials, and other relevant configuration parameters to access the data sources through the Amazon Glue connections.

  • Then use the Amazon Command Line Interface (Amazon CLI) to create the Amazon Glue connection, passing the JSON file as a parameter. The Amazon CLI command reads the connection details from the JSON file and establishes the appropriate connection.

    Note

    The SQL extension supports creating connections using the Amazon CLI only.

Before creating Amazon Glue connections, ensure that you complete the following steps:

  • Install and configure the Amazon Command Line Interface (Amazon CLI). For more information about how to install and configure the Amazon CLI, see About Amazon CLI version 2. Ensure that the access keys and tokens of the IAM user or role used to configure the Amazon CLI have the required permissions to create Amazon Glue connections. Add a policy that allows the glue:CreateConnection action otherwise.

  • Understand how to use Amazon Secrets Manager. We recommend that you use Secrets Manager to provide connection credentials and any other sensitive information for your data store. For more information on using Secrets Manager to store credentials, see Storing connection credentials in Amazon Secrets Manager.

Create a connection definition JSON file

To create an Amazon Glue connection definition file, create a JSON file to define the connection details on the machine where you have installed and configured the Amazon CLI. For this example, name the file sagemaker-sql-connection.json.

The connection definition file should follow the following general format:

  • Name is the name for the connection.

  • Description is a textual description of the connection.

  • ConnectionType is the type of connection. Choose REDSHIFT, ATHENA, or SNOWFLAKE.

  • ConnectionProperties is a map of key-value pairs for the connection properties, such as the ARN of your Amazon secret, or the name of your database.

{ "ConnectionInput": { "Name": <GLUE_CONNECTION_NAME>, "Description": <GLUE_CONNECTION_DESCRIPTION>, "ConnectionType": "REDSHIFT | ATHENA | SNOWFLAKE", "ConnectionProperties": { "PythonProperties": "{\"aws_secret_arn\": <SECRET_ARN>, \"database\": <...>}" } } }
Note
  • The properties within the ConnectionProperties key consist of stringified key-value pairs. Escape any double quotes used in the keys or values with a backslash (\) character.

  • All properties available in Secrets Manager can also be directly provided through PythonProperties. However, it is not recommended to include sensitive fields such as passwords in PythonProperties. Instead, the preferred approach is to use Secrets Manager.

Connection definition files specific to different data stores can be found in the following sections.

The connection definition files for each data source contain the specific properties and configuration required to connect to those data stores from the SQL extension. Refer to the appropriate section for details on defining connections to that source.

This section provides details on the secret and connection properties in JSON definition files that are specific to Amazon Redshift. Before creating your connection configuration file, we recommend storing your Amazon Redshift access credentials as a secret in Secrets Manager. Alternatively, you can generate temporary database credentials based on permissions granted through an Amazon Identity and Access Management (IAM) permissions policy to manage the access that your users have to your Amazon Redshift database. For more information, see Using IAM authentication to generate database user credentials.

Create a secret for Amazon Redshift access credentials

To store Amazon Redshift information in Amazon Secrets Manager
  1. From the Amazon console, navigate to Secrets Manager.

  2. Choose Store a new secret.

  3. Under Secret type, choose Credentials for Amazon Redshift.

  4. Enter the administrator username and password configured when launching the Amazon Redshift cluster.

  5. Select the Amazon Redshift cluster associated with the secrets.

  6. Name your secret.

  7. The remaining settings can be left at their default values for initial secret creation, or customized if required.

  8. Create the secret and retrieve its ARN.

Configure an Amazon Glue connection for Amazon Redshift

The SQL extension connects to data sources using custom Amazon Glue connections. For general information on creating Amazon Glue connections to connect a data source, see Configure Amazon Glue connections. The following example is a sample Amazon Glue connection definition for connecting to Amazon Redshift.

Before creating a new connection, keep these recommendations in mind:

  • The properties within the PythonProperties key consist of stringified key-value pairs. Escape any double quotes used in the keys or values with a backslash (\) character.

  • In the connection definition file, enter the name and description of the connection, replace the ARN of the secret in aws_secret_arn with the ARN of the secret previously created.

  • Ensure that the database declared by its name in the connection definition above matches the cluster database. You can verify this by going to the cluster details page on Amazon Redshift console, and verifying the database name under Database configurations in Properties section.

  • For additional parameters, see the list of connection properties supported by Amazon Redshift in Amazon Redshift connection parameters.

    Note
    • By default, the SQL extension connector for Python runs all queries in a transaction, unless the auto_commit in connection properties is set to true.

    • You can add all connection parameters, including the database name, to a secret.

{ "ConnectionInput": { "Name": "Redshift connection name", "Description": "Redshift connection description", "ConnectionType": "REDSHIFT", "ConnectionProperties": { "PythonProperties":"{\"aws_secret_arn\": \"arn:aws:secretsmanager:region:account_id:secret:secret_name\", \"database\":\"database_name\", \"database_metadata_current_db_only\": false}" } } }

Once your definition file is updated, follow the steps in Create Amazon Glue connections to create your Amazon Glue connection.

This section provides details on the connection properties in JSON definition files that are specific to Athena.

Configure an Amazon Glue connection for Athena

The SQL extension connects to data sources using custom Amazon Glue connections. For general information on creating Amazon Glue connections to connect a data source, see Configure Amazon Glue connections. The following example is a sample Amazon Glue connection definition for connecting to Athena.

Before creating a new connection, keep these recommendations in mind:

  • The properties within the ConnectionProperties key consist of stringified key-value pairs. Escape any double quotes used in the keys or values with a backslash (\) character.

  • In the connection definition file, enter the name and description of the connection, replace the catalog_name with the name of your catalog, s3_staging_dir with the Amazon S3 URI (Uniform Resource Identifier) of your output directory in your Amazon S3 bucket, and the region_name with the region of your Amazon S3 bucket.

  • For additional parameters, refer to the list of connection properties supported by Athena in Athena connection parameters.

    Note
    • You can add all connection parameters, including the catalog_name or s3_staging_dir, to a secret.

    • If you specify a workgroup, you don't need to specify s3_staging_dir.

{ "ConnectionInput": { "Name": "Athena connection name", "Description": "Athena connection description", "ConnectionType": "ATHENA", "ConnectionProperties": { "PythonProperties": "{\"catalog_name\": \"catalog_name\",\"s3_staging_dir\": \"s3://bucket_name_in_same_region/output_query_results_dir/\", \"region_name\": \"region\"}" } } }

Once your definition file is updated, follow the steps in Create Amazon Glue connections to create your Amazon Glue connection.

This section provides details on the secret and connection properties in JSON definition files that are specific to Snowflake. Before creating your connection configuration file, we recommend storing your Snowflake access credentials as a secret in Secrets Manager.

Create a secret for Snowflake access credentials

To store Amazon Redshift information in Secrets Manager
  1. From the Amazon console, navigate to Secrets Manager.

  2. Choose Store a new secret.

  3. Under Secret type, choose Other type of secret.

  4. In the key-value pair, choose Plaintext, and then copy the following JSON content. Replace the user, password, and account by their values.

    { "user":"snowflake_user", "password":"snowflake_password", "account":"account_id" }
  5. Name the secret.

  6. The remaining settings can be left at their default values for initial secret creation, or customized if required.

  7. Create the secret and retrieve its ARN.

Configure an Amazon Glue connection for Snowflake

The SQL extension connects to data sources using custom Amazon Glue connections. For general information on creating Amazon Glue connections to connect a data source, see Configure Amazon Glue connections. The following example is a sample Amazon Glue connection definition for connecting to Snowflake.

Before creating a new connection, keep these recommendations in mind:

  • The properties within the ConnectionProperties key consist of stringified key-value pairs. Escape any double quotes used in the keys or values with a backslash (\) character.

  • In the connection definition file, enter the name and description of the connection, then replace the ARN of the secret in aws_secret_arn with the ARN of the secret previously created, and your account ID in account.

  • For additional parameters, refer to the list of connection properties supported by Snowflake in Snowflake connection parameters.

    Note

    You can add all connection parameters, including the account, to a secret.

{ "ConnectionInput": { "Name": "Snowflake connection name", "Description": "Snowflake connection description", "ConnectionType": "SNOWFLAKE", "ConnectionProperties": { "PythonProperties": "{\"aws_secret_arn\": \"arn:aws:secretsmanager:region:account_id:secret:secret_name\", \"account\":\"account_id\"}"}" } } }

Once your definition file is updated, follow the steps in Create Amazon Glue connections to create your Amazon Glue connection.

Create Amazon Glue connections

To create an Amazon Glue connection through the Amazon CLI, use your connection definition file and run this Amazon CLI command. Replace the region placeholder with your Amazon Region name and provide the local path to your definition file.

Note

The path to your configuration definition file must be preceded by file://.

aws --region region glue create-connection --cli-input-json file://path_to_file/sagemaker-sql-connection.json

Verify that the Amazon Glue connection was created by running the following command and check for your connection name.

aws --region region glue get-connections

Alternatively, you can update an existing Amazon Glue connection as follows:

  • Modify the Amazon Glue connection definition file as required.

  • Run the following command to update the connection.

    aws --region region glue update-connection --name glue_connection_name --cli-input-json file://path_to_file/sagemaker-sql-connection.json