Connecting the Data Catalog to an external Hive metastore
To connect the Amazon Glue Data Catalog to a Hive metastore, you need to deploy an Amazon SAM application called GlueDataCatalogFederation-HiveMetastore.
The Amazon SAM application creates the connection for the Hive metastore behind Amazon API Gateway using a Lambda function. It takes the uniform resource identifier (URI) of the Hive metastore as user input and connects the external Hive metastore to the Data Catalog. When a user runs a query on Hive tables, the Data Catalog calls the API Gateway endpoint, which invokes the Lambda function to retrieve the metadata of the Hive tables.
To connect the Data Catalog to the Hive metastore and set up permissions
- Deploy the Amazon SAM application.
Sign in to the Amazon Web Services Management Console and open the Amazon Serverless Application Repository.
In the navigation pane, choose Available applications.
Choose Public applications.
Select the option Show apps that create custom IAM roles or resource policies.
In the search box, enter the name GlueDataCatalogFederation-HiveMetastore.
Choose the GlueDataCatalogFederation-HiveMetastore application.
Under Application Settings, enter the following minimum required settings for your Lambda function:
Application name - A name for your Amazon SAM application.
GlueConnectionName - A name for the connection.
HiveMetastoreURIs - The URI of your Hive metastore host.
LambdaMemory - The amount of Lambda memory in MB, from 128 to 10240. The default is 1024.
LambdaTimeout - The maximum Lambda invocation runtime in seconds. The default is 30.
VPCSecurityGroupIds and VPCSubnetIds - Information for the VPC where the Hive metastore exists.
Select I acknowledge that this app creates custom IAM roles and resource policies. For more information, choose the Info link.
At the bottom right of the Application settings section, choose Deploy. When the deployment is complete, the Lambda function appears in the Resources section in the Lambda console.
The application is deployed to Lambda. Its name is prefixed with serverlessrepo- to indicate that the application was deployed from the Amazon Serverless Application Repository. Selecting the application opens the Resources page, which lists each deployed resource. The resources include the Lambda function that allows communication between the Data Catalog and the Hive metastore, the Amazon Glue connection, and other resources that are needed for the database federation.
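The console settings above can also be expressed programmatically. The following is a minimal sketch, assuming boto3 and its Serverless Application Repository client; it is not the documented deployment path. The application ARN is left as a placeholder, the required capabilities are an assumption, and passing the VPC lists as comma-separated strings is an assumption about how the template declares those parameters.

```python
from typing import Dict, List

# Placeholder: copy the real application ARN from the application's page in the
# Amazon Serverless Application Repository. It is deliberately not filled in.
APPLICATION_ID = "<application-arn>"


def build_parameter_overrides(
    connection_name: str,
    metastore_uri: str,
    security_group_ids: List[str],
    subnet_ids: List[str],
    lambda_memory_mb: int = 1024,
    lambda_timeout_s: int = 30,
) -> List[Dict[str, str]]:
    """Build a ParameterOverrides list from the settings listed above."""
    # The documented LambdaMemory range is 128 to 10240 MB.
    if not 128 <= lambda_memory_mb <= 10240:
        raise ValueError("LambdaMemory must be between 128 and 10240 MB")
    settings = {
        "GlueConnectionName": connection_name,
        "HiveMetastoreURIs": metastore_uri,
        "LambdaMemory": str(lambda_memory_mb),
        "LambdaTimeout": str(lambda_timeout_s),
        # Assumption: list parameters are passed as comma-separated strings.
        "VPCSecurityGroupIds": ",".join(security_group_ids),
        "VPCSubnetIds": ",".join(subnet_ids),
    }
    return [{"Name": name, "Value": value} for name, value in settings.items()]


def create_change_set(client, stack_name: str, overrides: List[Dict[str, str]]):
    """Create a change set; `client` is expected to be boto3.client("serverlessrepo")."""
    return client.create_cloud_formation_change_set(
        ApplicationId=APPLICATION_ID,
        StackName=stack_name,
        # Assumption: the app needs these capabilities because it creates
        # custom IAM roles and resource policies.
        Capabilities=["CAPABILITY_IAM", "CAPABILITY_RESOURCE_POLICY"],
        ParameterOverrides=overrides,
    )
```

A call such as `build_parameter_overrides("hive-conn", "thrift://metastore.example.internal:9083", ["sg-0123"], ["subnet-a"])` uses hypothetical names throughout. Note that creating a change set does not deploy anything by itself; the change set still has to be executed through CloudFormation.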
- Create a federated database in the Data Catalog.
After you've created a connection to the Hive metastore, you can create federated databases in the Data Catalog that point to the external Hive metastore databases. You need to create a corresponding database in the Data Catalog for every Hive metastore database that you're connecting to the Data Catalog.
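As a rough illustration of this step, the sketch below builds the input for one federated database and passes it to the Glue `create_database` call through boto3. It assumes the `FederatedDatabase` identifier is the name of the Hive metastore database; that assumption, and every name used here, is hypothetical and should be checked against your own setup.

```python
from typing import Any, Dict


def build_federated_database_input(
    local_name: str, hive_database: str, connection_name: str
) -> Dict[str, Any]:
    """Describe one Data Catalog database that points at one Hive database."""
    return {
        # Local name that users reference in queries.
        "Name": local_name,
        "FederatedDatabase": {
            # Assumption: the identifier is the Hive metastore database name.
            "Identifier": hive_database,
            # The GlueConnectionName chosen when the application was deployed.
            "ConnectionName": connection_name,
        },
    }


def create_federated_database(glue_client, database_input: Dict[str, Any]):
    """Create the database; `glue_client` is expected to be boto3.client("glue")."""
    return glue_client.create_database(DatabaseInput=database_input)
```

Because each Hive metastore database needs its own Data Catalog database, a caller would typically loop over the Hive database names and call these helpers once per name.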
- View tables in the federated database.
After you've created the federated database, you can view the list of tables in your Hive metastore using the Lake Formation console or the Amazon CLI.
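Besides the console and the CLI, the table list can be read through the Glue API. This is a minimal sketch assuming boto3 and its `get_tables` paginator; the helper name is hypothetical.

```python
from typing import Iterator


def iter_table_names(glue_client, database_name: str) -> Iterator[str]:
    """Yield table names in a database; `glue_client` is boto3.client("glue")."""
    paginator = glue_client.get_paginator("get_tables")
    # Each page carries a "TableList" of table descriptions.
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page.get("TableList", []):
            yield table["Name"]
```

For example, `list(iter_table_names(boto3.client("glue"), "fed_glue_db"))` would collect the names for a federated database called `fed_glue_db`, assuming that database exists in your account.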
- Grant permissions.
After you’ve created the database, you can grant permissions to other IAM users and roles in your account or to external Amazon Web Services accounts and organizations. You cannot grant write data permissions (insert, delete) or metadata permissions (alter, drop, create) on federated databases. For more information on granting permissions, see Managing Lake Formation permissions.
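The restriction above can be checked before a grant is sent. The sketch below assumes boto3 and the Lake Formation `grant_permissions` call; mapping "create" to the `CREATE_TABLE` keyword is an assumption, and the principal, database, and table names are hypothetical.

```python
from typing import Any, Dict, Iterable

# Write and metadata permissions that cannot be granted on federated databases.
# Assumption: "create" corresponds to the CREATE_TABLE permission keyword.
BLOCKED_PERMISSIONS = frozenset({"INSERT", "DELETE", "ALTER", "DROP", "CREATE_TABLE"})


def build_table_grant(
    principal_arn: str,
    database_name: str,
    table_name: str,
    permissions: Iterable[str] = ("SELECT", "DESCRIBE"),
) -> Dict[str, Any]:
    """Build keyword arguments for a read-only grant on one federated table."""
    requested = [p.upper() for p in permissions]
    blocked = sorted(set(requested) & BLOCKED_PERMISSIONS)
    if blocked:
        raise ValueError(f"not grantable on a federated database: {blocked}")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database_name, "Name": table_name}},
        "Permissions": requested,
    }


def grant(lakeformation_client, request: Dict[str, Any]):
    """Send the grant; the client is expected to be boto3.client("lakeformation")."""
    return lakeformation_client.grant_permissions(**request)
```

Rejecting the blocked permissions locally is a design choice for earlier feedback; the service remains the authority on what can be granted.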
- Query the federated databases.
After you grant permissions, users can sign in and start querying the federated database using Athena and Amazon Redshift. Users can now use the local database name to reference the Hive database in SQL queries.
Example Amazon Athena query syntax
Replace fed_glue_db with the local database name that you created earlier.
SELECT * FROM fed_glue_db.customers LIMIT 10;
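The same query can be submitted from code. The following is a minimal sketch assuming boto3 and the Athena `start_query_execution` call; the helper names, the identifier check, and the S3 output location are hypothetical.

```python
import re

# Conservative identifier check so names can be placed in the query text safely.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_preview_query(database: str, table: str, limit: int = 10) -> str:
    """Build a SELECT that previews rows from a table in the federated database."""
    for name in (database, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"SELECT * FROM {database}.{table} LIMIT {limit}"


def start_query(athena_client, query: str, output_location: str):
    """Submit the query; `athena_client` is expected to be boto3.client("athena")."""
    return athena_client.start_query_execution(
        QueryString=query,
        # Hypothetical S3 URI where Athena writes the query results.
        ResultConfiguration={"OutputLocation": output_location},
    )
```

The call returns a query execution ID rather than rows; results are fetched separately once the query finishes.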