Metastore configuration for EMR Serverless
A Hive metastore is a centralized location that stores structural information about your tables, including schemas, partition names, and data types. With EMR Serverless, you can persist this table metadata in a metastore that your jobs can access.
You have two options for a Hive metastore:
- The Amazon Glue Data Catalog
- An external Apache Hive metastore
Using the Amazon Glue Data Catalog as a metastore
You can configure your Spark and Hive jobs to use the Amazon Glue Data Catalog as their metastore. We recommend this configuration when you require a persistent metastore or a metastore shared by different applications, services, or Amazon Web Services accounts. For more information about the Data Catalog, see Populating the Amazon Glue Data Catalog. For information about Amazon Glue pricing, see Amazon Glue pricing.
You can configure your EMR Serverless job to use the Amazon Glue Data Catalog either in the same Amazon Web Services account as your application, or in a different Amazon Web Services account.
Configure the Amazon Glue Data Catalog
To configure the Data Catalog, choose the type of EMR Serverless application that you want to use.
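As a minimal sketch of how the configuration differs between the two application types, the following Python builds the request fragments that point a job at the Data Catalog. It assumes the Glue client factory class and the `hive.metastore.client.factory.class` property described in the Amazon EMR documentation. The function names and the S3 entry point are hypothetical placeholders, and nothing here calls an Amazon Web Services API.

```python
# Sketch only: builds request fragments and makes no API calls.
# Assumption: this is the Glue client factory class used by EMR for the Data Catalog.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def spark_job_driver(entry_point: str) -> dict:
    """Return a jobDriver fragment for a Spark application (hypothetical helper)."""
    return {
        "sparkSubmit": {
            "entryPoint": entry_point,
            # Spark passes the Hive property through the spark.hadoop. prefix.
            "sparkSubmitParameters": (
                "--conf spark.hadoop.hive.metastore.client.factory.class="
                + GLUE_FACTORY
            ),
        }
    }


def hive_configuration_overrides() -> dict:
    """Return a configurationOverrides fragment for a Hive application (hypothetical helper)."""
    return {
        "applicationConfiguration": [
            {
                "classification": "hive-site",
                "properties": {
                    "hive.metastore.client.factory.class": GLUE_FACTORY,
                },
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder script location; replace with your own.
    print(spark_job_driver("s3://amzn-s3-demo-bucket/scripts/job.py"))
    print(hive_configuration_overrides())
```

For Spark the setting travels in the submit parameters, while for Hive it goes in a `hive-site` classification, which is why the two helpers return different parts of the job run request.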
Configure cross-account access for EMR Serverless and Amazon Glue Data Catalog
To set up cross-account access for EMR Serverless, you must first sign in to the following Amazon Web Services accounts:
- AccountA – An Amazon Web Services account where you have created an EMR Serverless application.
- AccountB – An Amazon Web Services account that contains an Amazon Glue Data Catalog that you want your EMR Serverless job runs to access.
- Make sure an administrator or other authorized identity in AccountB attaches a resource policy to the Data Catalog in AccountB. This policy grants AccountA specific cross-account permissions to perform operations on resources in the AccountB catalog.

  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "AWS": [
            "arn:aws:iam::AccountA:role/job-runtime-role-A"
          ]
        },
        "Action": [
          "glue:GetDatabase",
          "glue:CreateDatabase",
          "glue:GetDatabases",
          "glue:CreateTable",
          "glue:GetTable",
          "glue:UpdateTable",
          "glue:DeleteTable",
          "glue:GetTables",
          "glue:GetPartition",
          "glue:GetPartitions",
          "glue:CreatePartition",
          "glue:BatchCreatePartition",
          "glue:GetUserDefinedFunctions"
        ],
        "Resource": ["arn:aws:glue:region:AccountB:catalog"]
      }
    ]
  }

- Add an IAM policy to the EMR Serverless job runtime role in AccountA so that the role can access Data Catalog resources in AccountB.

  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "glue:GetDatabase",
          "glue:CreateDatabase",
          "glue:GetDatabases",
          "glue:CreateTable",
          "glue:GetTable",
          "glue:UpdateTable",
          "glue:DeleteTable",
          "glue:GetTables",
          "glue:GetPartition",
          "glue:GetPartitions",
          "glue:CreatePartition",
          "glue:BatchCreatePartition",
          "glue:GetUserDefinedFunctions"
        ],
        "Resource": ["arn:aws:glue:region:AccountB:catalog"]
      }
    ]
  }

- Start your job run. This step is slightly different depending on AccountA's EMR Serverless application type.
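The difference between application types in the last step can be sketched as follows. This example assumes that the AccountB catalog is selected with the `hive.metastore.glue.catalogid` property, passed through Spark submit parameters for Spark applications and through a `hive-site` classification for Hive applications, as the Amazon EMR documentation describes. The helper names and the 12-digit account ID are hypothetical placeholders, and the code only builds request fragments.

```python
# Sketch only: builds request fragments and makes no API calls.
# Assumption: this is the Glue client factory class used by EMR for the Data Catalog.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def _check_account_id(account_id: str) -> str:
    """Reject anything that is not a 12-digit account ID."""
    if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
        raise ValueError(f"expected a 12-digit account ID, got {account_id!r}")
    return account_id


def cross_account_spark_parameters(catalog_account_id: str) -> str:
    """Return Spark submit parameters that target the catalog in AccountB."""
    catalog_id = _check_account_id(catalog_account_id)
    return " ".join(
        [
            "--conf spark.hadoop.hive.metastore.client.factory.class=" + GLUE_FACTORY,
            "--conf spark.hadoop.hive.metastore.glue.catalogid=" + catalog_id,
        ]
    )


def cross_account_hive_site(catalog_account_id: str) -> dict:
    """Return a hive-site classification that targets the catalog in AccountB."""
    catalog_id = _check_account_id(catalog_account_id)
    return {
        "classification": "hive-site",
        "properties": {
            "hive.metastore.client.factory.class": GLUE_FACTORY,
            "hive.metastore.glue.catalogid": catalog_id,
        },
    }


if __name__ == "__main__":
    # Placeholder account ID standing in for AccountB.
    print(cross_account_spark_parameters("123456789012"))
    print(cross_account_hive_site("123456789012"))
```

For a Spark application, the returned string would go in the `sparkSubmitParameters` field of the job driver. For a Hive application, the returned dictionary would go in the `applicationConfiguration` list of the configuration overrides. Validating the account ID up front is a design choice of this sketch, intended to catch a placeholder such as `AccountB` before the job run is submitted.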
Considerations when using the Amazon Glue Data Catalog
You can add auxiliary JARs with ADD JAR in your Hive scripts. For additional considerations, see Considerations when using Amazon Glue Data Catalog.