Federate to Databricks Unity Catalog
Amazon Glue Data Catalog federates to Databricks using the OAuth2 credentials of a Databricks service principal. This authentication mechanism allows Amazon Glue Data Catalog to access the metadata of objects (such as catalogs, databases, and tables) in Databricks Unity Catalog, based on the privileges associated with the service principal. To ensure access to the right objects, grant the service principal the permissions it needs in Databricks to read the metadata of these objects.
Catalog federation enables discovery and querying of Iceberg tables in your Databricks Unity Catalog. To read Delta tables, ensure that Iceberg metadata is available for these tables using UniForm. Follow the Databricks tutorial and documentation to create the service principal and grant the associated privileges in your Databricks workspace.
Prerequisites
Before you create a federated catalog in Data Catalog that is governed by Lake Formation, ensure that the following prerequisites are met:
Your IAM principal (user or role) must have the following permissions:
- Lake Formation permissions – lakeformation:RegisterResource, lakeformation:DescribeResource
- Amazon Glue permissions – glue:CreateConnection, glue:CreateCatalog, glue:GetConnection
- Secrets Manager permissions – secretsmanager:CreateSecret, secretsmanager:GetSecretValue
- IAM permissions – iam:CreateRole, iam:AttachRolePolicy, iam:PassRole
You must be a Lake Formation data lake administrator or have the CREATE_CATALOG permission on the Data Catalog.
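The IAM permissions listed above could be collected into a single identity policy. The following is a minimal sketch rather than an official policy: the file name prereq-policy.json is hypothetical, and "Resource": "*" is an assumption made for brevity that you would normally narrow to specific ARNs, particularly for iam:PassRole.

```shell
# Write a sketch identity policy that covers the prerequisite IAM permissions.
# Assumptions: the file name is hypothetical, and "Resource": "*" is used only
# for brevity. Scope it down to specific ARNs before real use.
cat > prereq-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "lakeformation:RegisterResource",
        "lakeformation:DescribeResource",
        "glue:CreateConnection",
        "glue:CreateCatalog",
        "glue:GetConnection",
        "secretsmanager:CreateSecret",
        "secretsmanager:GetSecretValue",
        "iam:CreateRole",
        "iam:AttachRolePolicy",
        "iam:PassRole"
      ],
      "Resource": "*"
    }
  ]
}
EOF
```

You could then create a managed policy from the file with `aws iam create-policy --policy-name <your-policy-name> --policy-document file://prereq-policy.json` and attach it to your IAM principal.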
Create Federated Catalog
Sign in to the console and open the Lake Formation console at https://console.amazonaws.cn/lakeformation/. Choose your preferred Amazon Region in the top-right section of the page.
In the left navigation pane, choose Catalogs.
Choose Create Catalog to open the Create Catalog workflow.
In the Choose data source step, select Databricks from the available options.
In the Set catalog details step, provide three sets of information: catalog details, connection details, and registration details.
In the catalog details container, provide a unique name for your Amazon Glue federated catalog and enter the name of the existing Databricks catalog.
In the connection details container, either choose an existing connection that you have access to or provide the configuration to create a new connection.
New connection configurations include:
Connection Name – A unique name for the Amazon Glue connection object.
Workspace URL – The endpoint URL of your existing Databricks workspace.
Authentication – Specify the authentication configuration that Amazon Glue uses to connect to the remote catalog server. Amazon Glue supports both OAuth2 and Custom authentication.
Token URL – Specify the URL of the remote catalog's identity provider.
OAuth2 Client ID – Specify the client ID of the OAuth2 credential associated with your remote catalog.
Secret – Store and use the OAuth2 client secret using Amazon Secrets Manager, or enter the secret value in the text box. When you enter the secret manually in the console, Amazon Glue creates the secret on your behalf.
Token URL Scope – Specify the OAuth scope for authentication.
Create an IAM role that the Amazon Glue and Lake Formation service principals can use to access the secret in Amazon Secrets Manager and the Amazon S3 locations of the remote Iceberg tables, respectively. Select the IAM role in the registration dropdown. For IAM policy details, refer to steps 2 and 3 in the CLI section that follows.
Select Test Connection to test whether your connection properties and IAM role access are configured correctly. The test connection functionality is not available when you connect to Databricks using Amazon VPC.
Select Next to review your settings.
Select Create Catalog on the review page.
To create the federated catalog with the Amazon CLI instead of the console, complete the following steps.
1. Create an Amazon Secrets Manager secret
The Amazon Glue connector supports two authentication types: OAuth2 and Custom. With OAuth2, use Amazon Secrets Manager to store the client secret of the Databricks service principal. You use this secret later when you create the Amazon Glue connection. With Custom authentication, use Amazon Secrets Manager to store and retrieve the access token.
In the following example, replace <databricks-secret>, <client_secret>, and <region> with your own information.
    aws secretsmanager create-secret \
        --name <databricks-secret> \
        --description "Databricks secret" \
        --secret-string '{ "USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET": "<client_secret>" }' \
        --region <region>
Note
USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET is a reserved keyword that Amazon Glue uses to refer to the client secret value in the secret. Use the same keyword when you create the secret in the Lake Formation console.
2. Create an IAM role that gives the Amazon Glue connection object access to the secret created in the previous step
The Amazon Glue connection object requires access to the Amazon Secrets Manager secret when you use Amazon Secrets Manager to store, retrieve, and refresh your OAuth secret token. The Amazon Glue connection object also requires access to create, describe, and use Amazon VPC network interfaces when you use an Amazon VPC endpoint to restrict connectivity to your Databricks workspace.
Create an IAM policy and attach it to an IAM role. Add the Amazon Glue service principal to the trust policy.
In the following example, replace <your-secrets-manager-ARN>, <your-vpc-id>, and <your-subnet-id1> with your own information, along with the region and account-id fields in the ARNs.
Example IAM Policy
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret",
                    "secretsmanager:PutSecretValue"
                ],
                "Resource": [
                    "<your-secrets-manager-ARN>"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "ec2:CreateNetworkInterface",
                    "ec2:DeleteNetworkInterface",
                    "ec2:DescribeNetworkInterfaces"
                ],
                "Resource": "*",
                "Condition": {
                    "ArnEquals": {
                        "ec2:Vpc": "arn:aws:ec2:region:account-id:vpc/<your-vpc-id>",
                        "ec2:Subnet": ["arn:aws:ec2:region:account-id:subnet/<your-subnet-id1>"]
                    }
                }
            }
        ]
    }
Example Trust Policy
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "glue.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
3. Create an IAM policy that gives Lake Formation read access to the catalog's Amazon S3 location
As the owner of a federated catalog in Data Catalog, you use Lake Formation to grant your data teams coarse-grained table access, fine-grained access (column-level, row-level, and cell-level), and tag-based access. Lake Formation uses an IAM role that gives it access to the underlying Amazon S3 locations of your remote Iceberg tables. This access allows Lake Formation to vend scoped access credentials to the analytics engines that query remote tables.
Create an IAM policy and attach it to an IAM role. Add the Lake Formation service principal to the IAM role's trust policy.
In the following example, replace <your-s3-bucket-N> and <your-kms-key> with your own information.
Example IAM Policy
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject"
                ],
                "Resource": [
                    "arn:aws:s3:::<your-s3-bucket-1>/*",
                    "arn:aws:s3:::<your-s3-bucket-2>/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::<your-s3-bucket-1>",
                    "arn:aws:s3:::<your-s3-bucket-2>"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "kms:Decrypt",
                    "kms:Encrypt"
                ],
                "Resource": [
                    "<your-kms-key>"
                ]
            }
        ]
    }
Example Trust Policy
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {
                    "Service": "lakeformation.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
Note
When you use the Lake Formation console to create a federated catalog, the console uses a single IAM role with both policies attached to complete the setup.
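Each of the two roles described above pairs a permissions policy with a trust policy. One way to create such a role from the CLI is sketched below, assuming the example policies have been saved to local JSON files. The function name, role name, policy name suffix, and file names are all hypothetical, and creating a managed policy this way also requires iam:CreatePolicy, which is not in the prerequisite list above.

```shell
# Sketch: create an IAM role from a trust policy file, create a managed policy
# from a permissions policy file, and attach the policy to the role.
# All names and file paths passed in are hypothetical placeholders.
create_role_with_policy() {
  role_name="$1"     # name for the new IAM role
  trust_file="$2"    # JSON file holding an example trust policy
  policy_file="$3"   # JSON file holding the matching example IAM policy

  # Create the role with the service principal in its trust policy.
  aws iam create-role \
    --role-name "$role_name" \
    --assume-role-policy-document "file://$trust_file" || return 1

  # Create the managed policy and capture its ARN from the response.
  policy_arn=$(aws iam create-policy \
    --policy-name "${role_name}-policy" \
    --policy-document "file://$policy_file" \
    --query 'Policy.Arn' --output text) || return 1

  # Attach the managed policy to the role.
  aws iam attach-role-policy \
    --role-name "$role_name" \
    --policy-arn "$policy_arn"
}
```

For example, `create_role_with_policy <your-glue-role-name> glue-trust.json glue-policy.json` would create the role for the Amazon Glue connection, and a second call with the Lake Formation files would create the role used for Amazon S3 access.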
4. Create an Amazon Glue connection object
Data Catalog supports the connectionType DATABRICKSICEBERGRESTCATALOG for connecting Data Catalog to Databricks. This Amazon Glue connector supports OAuth2 and Custom authentication methods.
The following example uses an OAuth2 authentication configuration to create an Amazon Glue connection. Replace the placeholder values in angle brackets with your information.
    aws glue create-connection \
        --connection-input '{
            "Name": "<your-glue-connection-to-databricks-unity-account>",
            "ConnectionType": "DATABRICKSICEBERGRESTCATALOG",
            "ConnectionProperties": {
                "INSTANCE_URL": "<your-databricks-workspace-catalog-URL>",
                "ROLE_ARN": "<your-IAM-role-for-secrets-and-VPC-access>"
            },
            "AuthenticationConfiguration": {
                "AuthenticationType": "OAUTH2",
                "OAuth2Properties": {
                    "OAuth2GrantType": "CLIENT_CREDENTIALS",
                    "TokenUrl": "<your-internal-or-external-token-server-url>",
                    "OAuth2ClientApplication": {
                        "UserManagedClientApplicationClientId": "<your-client-id>"
                    },
                    "TokenUrlParametersMap": {
                        "Scope": "all-apis"
                    }
                },
                "SecretArn": "arn:aws:secretsmanager:<aws-region>:<your-aws-account-id>:secret:<databricks-secret>"
            }
        }'
5. Register the Amazon Glue connection as a Lake Formation resource
Using the Amazon Glue connection object (created in Step 4) and the IAM role (created in Step 3), you can now register the Amazon Glue connection object as a Lake Formation managed resource.
Replace <your-glue-connector-arn> and <your-IAM-role-ARN-having-LF-access> with your information.
    aws lakeformation register-resource \
        --resource-arn <your-glue-connector-arn> \
        --role-arn <your-IAM-role-ARN-having-LF-access> \
        --with-federation \
        --with-privileged-access
6. Create a federated catalog in Data Catalog
After you create an Amazon Glue connection object and register it with Lake Formation, you can create a federated catalog in the Data Catalog.
Provide a unique name for the federated catalog at <your-federated-catalog-name>, reference the catalog in Databricks at <catalog-name-in-Databricks>, and enter the name of the connection created earlier at <your-glue-connection-name>.
    aws glue create-catalog \
        --name <your-federated-catalog-name> \
        --catalog-input '{
            "FederatedCatalog": {
                "Identifier": "<catalog-name-in-Databricks>",
                "ConnectionName": "<your-glue-connection-name>"
            },
            "CreateTableDefaultPermissions": [],
            "CreateDatabaseDefaultPermissions": []
        }'
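Once the catalog exists, it can help to read back each object that the CLI steps created. The following is a sketch under stated assumptions: the function name is hypothetical, and the exact catalog ID format that your account expects (for example, whether it needs an account ID prefix) is not confirmed by this page, so pass whatever value identifies your federated catalog.

```shell
# Sketch: read back the connection, the Lake Formation registration, and the
# federated catalog, then list the databases federated from Databricks.
# The function name is hypothetical. The catalog ID format is an assumption
# and may need an account ID prefix in your environment.
verify_federation() {
  connection_name="$1"   # name of the Amazon Glue connection
  connection_arn="$2"    # ARN that was registered with Lake Formation
  catalog_id="$3"        # identifier of the federated catalog

  aws glue get-connection --name "$connection_name" &&
  aws lakeformation describe-resource --resource-arn "$connection_arn" &&
  aws glue get-catalog --catalog-id "$catalog_id" &&
  aws glue get-databases --catalog-id "$catalog_id"
}
```

A call such as `verify_federation <your-glue-connection-name> <your-glue-connector-arn> <your-federated-catalog-name>` stops at the first command that fails, which points to the step that needs attention.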
Considerations when integrating with Databricks
- When you drop resources (such as databases and tables) in Databricks, Lake Formation does not automatically revoke the permissions granted on the corresponding federated resources. To remove access, explicitly revoke the previously granted permissions on the federated resource using Lake Formation.
- You can query Iceberg tables stored in Amazon S3 using this integration. With any other table format or object storage, you can federate metadata in remote catalogs to Amazon Glue and list its databases and tables, but query operations such as SELECT ColumnFoo from TableBar fail with the error 'Failed to read Apache Iceberg table. Object storage location is not supported.'
- You can reuse the same Amazon Glue connection to create multiple federated catalogs. Deleting a catalog does not delete the associated connection object. To delete a connection object, first ensure that all associated catalogs are deleted, then use the Amazon CLI aws glue delete-connection command.
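The first and last considerations above translate into CLI calls along these lines. This is a sketch, not a prescribed procedure: the function names are hypothetical, SELECT stands in for whichever permission you actually granted, and the catalog ID format is an assumption that may differ in your account.

```shell
# Sketch: revoke a SELECT grant that outlived a table dropped in Databricks.
# SELECT is only an example permission; revoke whatever you granted.
revoke_table_select() {
  principal_arn="$1"   # IAM principal that holds the grant
  catalog_id="$2"      # identifier of the federated catalog (format assumed)
  database="$3"        # database name as shown in Data Catalog
  table="$4"           # table name as shown in Data Catalog

  aws lakeformation revoke-permissions \
    --principal "DataLakePrincipalIdentifier=$principal_arn" \
    --permissions SELECT \
    --resource "{\"Table\":{\"CatalogId\":\"$catalog_id\",\"DatabaseName\":\"$database\",\"Name\":\"$table\"}}"
}

# Sketch: delete a federated catalog first, then the connection it used.
# Only delete the connection once no other catalog depends on it.
delete_catalog_then_connection() {
  catalog_id="$1"
  connection_name="$2"

  aws glue delete-catalog --catalog-id "$catalog_id" &&
  aws glue delete-connection --connection-name "$connection_name"
}
```

Because a connection can back several catalogs, you would call `delete_catalog_then_connection` only for the last catalog that uses the connection and run `aws glue delete-catalog` on its own for the others.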