Creating Amazon RDS zero-ETL integrations with an Amazon SageMaker lakehouse
When you create an Amazon RDS zero-ETL integration with an Amazon SageMaker lakehouse, you specify the source RDS database and the target Amazon SageMaker lakehouse. You can also customize encryption settings and add tags. Amazon RDS creates an integration between the source database and its target. Once the integration is active, any data that you insert into the source database will be replicated into the configured Amazon Glue target.
Prerequisites
Before you create a zero-ETL integration with an Amazon SageMaker lakehouse, you must create a source database and a target Amazon Glue catalog. You also must allow replication into the catalog by adding the database as an authorized integration source.
For instructions to complete each of these steps, see Getting started with Amazon RDS zero-ETL integrations.
Required permissions
Certain IAM permissions are required to create a zero-ETL integration with an Amazon SageMaker lakehouse. At minimum, you need permissions to perform the following actions:
Create zero-ETL integrations for the source RDS database.
View and delete all zero-ETL integrations.
Create inbound integrations into the target Amazon Glue catalog.
Access Amazon S3 buckets used by the Amazon Glue catalog.
Use Amazon KMS keys for encryption if custom encryption is configured.
Register resources with Lake Formation.
Put resource policy on the Amazon Glue catalog to authorize inbound integrations.
The following sample policy demonstrates the least privilege permissions required to create and manage integrations with an Amazon SageMaker lakehouse. You might not need these exact permissions if your user or role has broader permissions, such as an AdministratorAccess managed policy.
Additionally, you must configure a resource policy on the target Amazon Glue catalog to authorize inbound integrations. Use the following Amazon CLI command to apply the resource policy.
aws glue put-resource-policy \
    --policy-in-json '{
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": { "Service": "glue.amazonaws.com" },
          "Action": [ "glue:AuthorizeInboundIntegration" ],
          "Resource": [ "arn:aws:glue:us-east-1:account_id:catalog/catalog_name" ],
          "Condition": { "StringEquals": { "aws:SourceArn": "arn:aws:rds:us-east-1:account_id:db:source_name" } }
        },
        {
          "Effect": "Allow",
          "Principal": { "AWS": "account_id" },
          "Action": [ "glue:CreateInboundIntegration" ],
          "Resource": [ "arn:aws:glue:us-east-1:account_id:catalog/catalog_name" ]
        }
      ]
    }' \
    --region us-east-1
Note
Glue catalog Amazon Resource Names (ARNs) have the following format:
Glue catalog – arn:aws:glue:{region}:{account-id}:catalog/catalog-name
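When the same resource policy has to be applied to more than one catalog, generating the JSON can be less error-prone than editing it by hand. The following is a minimal Python sketch, assuming the policy shape shown in the command above and the ARN format from the note. The function names, their parameters, and the sample values are illustrative and are not part of any Amazon SDK. The sketch also assumes that the source database and the catalog are in the same Region and account.

```python
import json


def glue_catalog_arn(region, account_id, catalog_name, partition="aws"):
    """Build a catalog ARN in the format shown in the note above."""
    return f"arn:{partition}:glue:{region}:{account_id}:catalog/{catalog_name}"


def build_catalog_resource_policy(region, account_id, catalog_name, source_db_name):
    """Return the catalog resource policy as a dict (hypothetical helper).

    Assumption: the source database and the catalog share one Region and account.
    """
    catalog_arn = glue_catalog_arn(region, account_id, catalog_name)
    source_arn = f"arn:aws:rds:{region}:{account_id}:db:{source_db_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Lets the Glue service authorize the integration, but only
                # when the request comes from the named source database.
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": ["glue:AuthorizeInboundIntegration"],
                "Resource": [catalog_arn],
                "Condition": {"StringEquals": {"aws:SourceArn": source_arn}},
            },
            {
                # Lets principals in the account create the inbound integration.
                "Effect": "Allow",
                "Principal": {"AWS": account_id},
                "Action": ["glue:CreateInboundIntegration"],
                "Resource": [catalog_arn],
            },
        ],
    }


if __name__ == "__main__":
    policy = build_catalog_resource_policy(
        "us-east-1", "123456789012", "catalog_name", "source_name"
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be passed as the value of --policy-in-json. Because both statements take the catalog ARN from one helper, the Region in the two Resource entries cannot drift apart.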
Choosing a target Amazon Glue catalog in a different account
If you plan to specify a target Amazon Glue catalog that's in another Amazon Web Services account, you must create a role that allows users in the current account to access resources in the target account. For more information, see Providing access to an IAM user in another Amazon Web Services account that you own.
The role must have the following permissions, which allow the user to view available Amazon Glue catalogs in the target account.
{ "Version":"2012-10-17", "Statement":[ { "Effect":"Allow", "Action":[ "glue:GetCatalog" ], "Resource":[ "*" ] } ] }
The role must have the following trust policy, which specifies the target account ID.
{ "Version":"2012-10-17", "Statement":[ { "Effect":"Allow", "Principal":{ "AWS": "arn:aws:iam::{external-account-id}:root" }, "Action":"sts:AssumeRole" } ] }
For instructions to create the role, see Creating a role using custom trust policies.
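The two documents for the cross-account role differ only in the account ID, so they can be produced from a single input. The following is a minimal Python sketch under that assumption; cross_account_role_documents is a hypothetical helper name, and the 12-digit check reflects the standard account ID format rather than anything specific to zero-ETL integrations.

```python
import json


def cross_account_role_documents(external_account_id):
    """Return (permissions_policy, trust_policy) for the cross-account role."""
    if not (external_account_id.isascii()
            and external_account_id.isdigit()
            and len(external_account_id) == 12):
        raise ValueError("an account ID is expected to be 12 digits")

    # Permissions: allow listing the available Glue catalogs.
    permissions_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["glue:GetCatalog"], "Resource": ["*"]}
        ],
    }
    # Trust: allow principals in the named account to assume the role.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{external_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    return permissions_policy, trust_policy


if __name__ == "__main__":
    permissions, trust = cross_account_role_documents("123456789012")
    print(json.dumps(permissions, indent=2))
    print(json.dumps(trust, indent=2))
```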
Creating zero-ETL integrations with an Amazon SageMaker lakehouse
You can create a zero-ETL integration with an Amazon SageMaker lakehouse using the Amazon Web Services Management Console, the Amazon CLI, or the RDS API.
Important
Zero-ETL integrations with an Amazon SageMaker lakehouse do not support refresh or resync operations. If you encounter issues with an integration after creation, you must delete the integration and create a new one.
By default, RDS for MySQL immediately purges binary log files. Because zero-ETL integrations rely on binary logs to replicate data from the source to the target, the retention period for the source database must be at least one hour. As soon as you create an integration, Amazon RDS checks the binary log file retention period for the selected source database. If the current value is 0 hours, Amazon RDS automatically changes it to 1 hour. Otherwise, the value remains the same.
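The retention rule can be restated as a small function. This Python sketch only illustrates the behavior described above; it is not code that Amazon RDS runs, and the function name is hypothetical.

```python
def retention_after_integration_create(current_hours):
    """Return the binary log retention (in hours) after an integration is created.

    Mirrors the rule described above: a value of 0 becomes 1, and any other
    value is left unchanged.
    """
    if current_hours < 0:
        raise ValueError("retention period cannot be negative")
    return 1 if current_hours == 0 else current_hours
```

For example, a source database with retention set to 0 hours ends up with 1 hour, while a database already set to 24 hours keeps 24.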
To create a zero-ETL integration with an Amazon SageMaker lakehouse
Sign in to the Amazon Web Services Management Console and open the Amazon RDS console at https://console.amazonaws.cn/rds/.
In the left navigation pane, choose Zero-ETL integrations.
Choose Create zero-ETL integration.
For Integration identifier, enter a name for the integration. The name can have up to 63 alphanumeric characters and can include hyphens.
Choose Next.
For Source, select the RDS database where the data will originate from.
Note
RDS notifies you if the DB parameters aren't configured correctly. If you receive this message, you can either choose Fix it for me, or configure them manually. For instructions to fix them manually, see Step 1: Create a custom DB parameter group.
Modifying DB parameters requires a reboot. Before you can create the integration, the reboot must be complete and the new parameter values must be successfully applied to the database.
Once your source database is successfully configured, choose Next.
For Target, do the following:
(Optional) To use a different Amazon Web Services account for the Amazon SageMaker lakehouse target, choose Specify a different account. Then, enter the ARN of an IAM role with permissions to display your Amazon Glue catalogs. For instructions to create the IAM role, see Choosing a target Amazon Glue catalog in a different account.
For Amazon Glue catalog, select the target for replicated data from the source database. You can choose an existing Amazon Glue catalog as the target.
The target IAM role needs describe permissions on the target catalog and must have the following permissions:
{ "Version": "2012-10-17", "Statement": [ { "Sid": "VisualEditor0", "Effect": "Allow", "Action": "glue:GetCatalog", "Resource": [ "arn:aws:glue:region:account-id:catalog/*", "arn:aws:glue:region:account-id:catalog" ] } ] }
The target IAM role must have the following trust relationship:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "glue.amazonaws.com" }, "Action": "sts:AssumeRole" } ] }
You must grant the target IAM role describe permissions for the target Amazon Glue catalog with the Lake Formation administrator role created in Step 3b: Create a target Amazon SageMaker lakehouse.
Note
RDS notifies you if the resource policy or configuration settings for the specified Amazon Glue catalog aren't configured correctly. If you receive this message, you can either choose Fix it for me, or configure them manually.
If your selected source and target are in different Amazon Web Services accounts, then Amazon RDS cannot fix these settings for you. You must navigate to the other account and fix them manually in SageMaker Unified Studio.
Once your target Amazon Glue catalog is configured correctly, choose Next.
(Optional) For Tags, add one or more tags to the integration. For more information, see Tagging Amazon RDS resources.
For Encryption, specify how you want your integration to be encrypted. By default, RDS encrypts all integrations with an Amazon owned key. To choose a customer managed key instead, enable Customize encryption settings and choose a KMS key to use for encryption. For more information, see Encrypting Amazon RDS resources.
Optionally, add an encryption context. For more information, see Encryption context in the Amazon Key Management Service Developer Guide.
Note
Amazon RDS adds the following encryption context pairs in addition to any that you add:
aws:glue:integration:arn - IntegrationArn
aws:servicename:id - glue
This reduces the overall number of pairs that you can add from 8 to 6, and contributes to the overall character limit of the grant constraint. For more information, see Using grant constraints in the Amazon Key Management Service Developer Guide.
Choose Next.
Review your integration settings and choose Create zero-ETL integration.
If creation fails, see Troubleshooting Amazon RDS zero-ETL integrations for troubleshooting steps.
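Two limits mentioned in the steps above can be checked before you submit the form: the integration identifier (up to 63 alphanumeric characters and hyphens) and the number of encryption context pairs you add (at most 6, because Amazon RDS adds 2 of the 8 allowed). The following Python sketch checks only these two stated rules. The function name is hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and the service might enforce additional constraints that are not covered here.

```python
import re

MAX_IDENTIFIER_LENGTH = 63
MAX_USER_CONTEXT_PAIRS = 6  # 8 in total, minus the 2 pairs that Amazon RDS adds

# Assumption: "alphanumeric" means ASCII letters and digits.
_IDENTIFIER = re.compile(r"[A-Za-z0-9-]+")


def check_integration_settings(identifier, encryption_context=None):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    if not identifier:
        problems.append("identifier is empty")
    elif len(identifier) > MAX_IDENTIFIER_LENGTH:
        problems.append("identifier is longer than 63 characters")
    elif not _IDENTIFIER.fullmatch(identifier):
        problems.append("identifier contains characters other than letters, digits, and hyphens")

    if encryption_context and len(encryption_context) > MAX_USER_CONTEXT_PAIRS:
        problems.append("more than 6 encryption context pairs supplied")
    return problems


if __name__ == "__main__":
    print(check_integration_settings("my-sagemaker-integration", {"team": "analytics"}))
```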
The integration has a status of Creating while it's being created, and the target Amazon SageMaker lakehouse has a status of Modifying. During this time, you can't query the catalog or make any configuration changes on it.
When the integration is successfully created, the status of the integration and the target Amazon SageMaker lakehouse both change to Active.
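If you script the wait between Creating and Active, a simple polling loop is usually enough. The following Python sketch assumes a caller-supplied get_status function (hypothetical) that returns the integration's current status as a string by whatever means you use to read it. The timeout and polling interval are arbitrary defaults, and treating every status other than Creating or Active as an error is an assumption of this sketch.

```python
import time


def wait_until_active(get_status, timeout_seconds=1800, poll_seconds=30, sleep=time.sleep):
    """Poll get_status() until it reports Active, or raise on error or timeout."""
    waited = 0
    while True:
        status = get_status().lower()
        if status == "active":
            return status
        if status != "creating":
            # Assumption: anything other than Creating or Active needs attention.
            raise RuntimeError(f"unexpected integration status: {status}")
        if waited >= timeout_seconds:
            raise TimeoutError("integration did not become active in time")
        sleep(poll_seconds)
        waited += poll_seconds
```

The sleep parameter exists so that the loop can be exercised in tests without real delays.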
To prepare a target Amazon Glue catalog for zero-ETL integration using the Amazon CLI, you must first use the create-integration-resource-property command with the following options:
--resource-arn – Specify the ARN of the Amazon Glue catalog that will be the target for the integration.
--target-processing-properties – Specify the ARN of the IAM role to access the target Amazon Glue catalog.
aws glue create-integration-resource-property \
    --region us-east-1 \
    --resource-arn arn:aws:glue:region:account_id:catalog/catalog_name \
    --target-processing-properties '{"RoleArn" : "arn:aws:iam::account_id:role/TargetIamRole"}'
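Quoting the JSON value for --target-processing-properties differs between shells, so one option is to build the argument list in a script and let the JSON encoder handle the quoting. The following Python sketch assembles the same command shown above; resource_property_command is a hypothetical helper, and the sample ARNs are placeholders.

```python
import json
import shlex


def resource_property_command(region, catalog_arn, role_arn):
    """Return the argument list for the create-integration-resource-property call."""
    properties = json.dumps({"RoleArn": role_arn})
    return [
        "aws", "glue", "create-integration-resource-property",
        "--region", region,
        "--resource-arn", catalog_arn,
        "--target-processing-properties", properties,
    ]


if __name__ == "__main__":
    argv = resource_property_command(
        "us-east-1",
        "arn:aws:glue:us-east-1:123456789012:catalog/catalog_name",
        "arn:aws:iam::123456789012:role/TargetIamRole",
    )
    # Print a shell-safe rendering for review. To run the command instead,
    # pass the list to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```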
To create a zero-ETL integration with an Amazon SageMaker lakehouse using the Amazon CLI, use the create-integration command with the following options:
--integration-name – Specify a name for the integration.
--source-arn – Specify the ARN of the RDS database that will be the source for the integration.
--target-arn – Specify the ARN of the Amazon Glue catalog that will be the target for the integration.
For Linux, macOS, or Unix:
aws rds create-integration \
    --integration-name my-sagemaker-integration \
    --source-arn arn:aws:rds:{region}:{account-id}:db:my-db \
    --target-arn arn:aws:glue:{region}:{account-id}:catalog/catalog-name
For Windows:
aws rds create-integration ^
    --integration-name my-sagemaker-integration ^
    --source-arn arn:aws:rds:{region}:{account-id}:db:my-db ^
    --target-arn arn:aws:glue:{region}:{account-id}:catalog/catalog-name
To create a zero-ETL integration with Amazon SageMaker by using the Amazon RDS API, use the CreateIntegration operation with the following parameters:
IntegrationName – Specify a name for the integration.
SourceArn – Specify the ARN of the RDS database that will be the source for the integration.
TargetArn – Specify the ARN of the Amazon Glue catalog that will be the target for the integration.
Note
Catalog names are limited to 19 characters. Ensure your IntegrationName parameter meets this requirement if it will be used as a catalog name.
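The three CreateIntegration parameters can be assembled and sanity-checked before the request is sent. The following Python sketch is illustrative only: create_integration_params and its used_as_catalog_name flag are hypothetical, and the checks cover just the ARN service field and the 19-character note above. The resulting dictionary uses the parameter names listed above, so it could be handed to whichever SDK client you use to call the operation.

```python
def _arn_service(arn):
    """Return the service field of an ARN (arn:partition:service:region:account:resource)."""
    parts = arn.split(":")
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return parts[2]


def create_integration_params(name, source_arn, target_arn, used_as_catalog_name=False):
    """Build the CreateIntegration parameters after a few basic checks."""
    if _arn_service(source_arn) != "rds":
        raise ValueError("SourceArn must refer to an RDS database")
    if _arn_service(target_arn) != "glue":
        raise ValueError("TargetArn must refer to a Glue catalog")
    # Per the note above: catalog names are limited to 19 characters.
    if used_as_catalog_name and len(name) > 19:
        raise ValueError("IntegrationName exceeds the 19-character catalog name limit")
    return {
        "IntegrationName": name,
        "SourceArn": source_arn,
        "TargetArn": target_arn,
    }


if __name__ == "__main__":
    params = create_integration_params(
        "my-integration",
        "arn:aws:rds:us-east-1:123456789012:db:my-db",
        "arn:aws:glue:us-east-1:123456789012:catalog/catalog-name",
        used_as_catalog_name=True,
    )
    print(params)
```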
Encrypting integrations with a customer managed key
If you specify a custom KMS key rather than an Amazon owned key when you create an integration with Amazon SageMaker, the key policy must provide the SageMaker Unified Studio service principal access to the CreateGrant action. In addition, it must allow the current user to perform the DescribeKey and CreateGrant actions.
The following sample policy demonstrates how to provide the required permissions in the key policy. It includes context keys to further reduce the scope of permissions.
{
  "Version": "2012-10-17",
  "Id": "Key policy",
  "Statement": [
    {
      "Sid": "Enables IAM user permissions",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::{account-ID}:root" },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "Allows the Glue service principal to add a grant to an Amazon KMS key",
      "Effect": "Allow",
      "Principal": { "Service": "glue.amazonaws.com" },
      "Action": "kms:CreateGrant",
      "Resource": "*",
      "Condition": {
        "StringEquals": { "kms:EncryptionContext:{context-key}": "{context-value}" },
        "ForAllValues:StringEquals": { "kms:GrantOperations": [ "Decrypt", "GenerateDataKey", "CreateGrant" ] }
      }
    },
    {
      "Sid": "Allows the current user or role to add a grant to a KMS key",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::{account-ID}:role/{role-name}" },
      "Action": "kms:CreateGrant",
      "Resource": "*",
      "Condition": {
        "StringEquals": { "kms:EncryptionContext:{context-key}": "{context-value}", "kms:ViaService": "rds.us-east-1.amazonaws.com" },
        "ForAllValues:StringEquals": { "kms:GrantOperations": [ "Decrypt", "GenerateDataKey", "CreateGrant" ] }
      }
    },
    {
      "Sid": "Allows the current user or role to retrieve information about a KMS key",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::{account-ID}:role/{role-name}" },
      "Action": "kms:DescribeKey",
      "Resource": "*"
    }
  ]
}
For more information, see Creating a key policy in the Amazon Key Management Service Developer Guide.
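The placeholders in the sample key policy repeat across several statements, so filling them in programmatically can help keep the statements consistent. The following Python sketch reproduces the structure of the sample policy above. build_key_policy is a hypothetical helper, and the kms:ViaService value is assumed to follow the rds.{region}.amazonaws.com pattern seen in the sample.

```python
import json

GRANT_OPERATIONS = ["Decrypt", "GenerateDataKey", "CreateGrant"]


def build_key_policy(account_id, role_name, region, context_key, context_value):
    """Return the sample key policy as a dict with the placeholders filled in."""
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    context = {f"kms:EncryptionContext:{context_key}": context_value}
    grant_ops = {"kms:GrantOperations": list(GRANT_OPERATIONS)}
    return {
        "Version": "2012-10-17",
        "Id": "Key policy",
        "Statement": [
            {
                "Sid": "Enables IAM user permissions",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "kms:*",
                "Resource": "*",
            },
            {
                "Sid": "Allows the Glue service principal to add a grant to an Amazon KMS key",
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "kms:CreateGrant",
                "Resource": "*",
                "Condition": {
                    "StringEquals": dict(context),
                    "ForAllValues:StringEquals": grant_ops,
                },
            },
            {
                "Sid": "Allows the current user or role to add a grant to a KMS key",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "kms:CreateGrant",
                "Resource": "*",
                "Condition": {
                    # Assumption: ViaService follows the pattern in the sample.
                    "StringEquals": {**context, "kms:ViaService": f"rds.{region}.amazonaws.com"},
                    "ForAllValues:StringEquals": grant_ops,
                },
            },
            {
                "Sid": "Allows the current user or role to retrieve information about a KMS key",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "kms:DescribeKey",
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_key_policy("123456789012", "my-role", "us-east-1", "team", "analytics"), indent=2))
```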
Next steps
After you successfully create a zero-ETL integration with Amazon SageMaker, you can start adding data to the source RDS database and querying it in your Amazon SageMaker lakehouse. The data will be automatically replicated and made available for analytics and machine learning workloads.