Configuring an external metastore for Hive

By default, Hive records metastore information in a MySQL database on the primary node's file system. The metastore describes each table and the underlying data on which it is built, including partition names, data types, and so on. When a cluster terminates, all cluster nodes shut down, including the primary node, and local data is lost because node file systems use ephemeral storage. If you need the metastore to persist, you must create an external metastore that exists outside the cluster.

You have two options for an external metastore:

Amazon Glue Data Catalog (Amazon EMR release 5.8.0 or later only). For more information, see Using the Amazon Glue Data Catalog as the metastore for Hive.

Amazon RDS or Amazon Aurora. For more information, see Using an external MySQL database or Amazon Aurora.
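
As a rough illustration of how the two options differ, the following Python sketch builds a hive-site configuration classification for each one. It is a minimal sketch, not a definitive setup: the helper function names, the host, port, user name, password, and the database name hive are hypothetical placeholders, and the property names and driver class are assumptions that should be checked against the two linked topics before use.

```python
import json

# Assumption: client factory class used to point Hive at the Glue Data Catalog.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore."
    "AWSGlueDataCatalogHiveClientFactory"
)


def glue_metastore_config():
    """Return a hive-site classification that selects the Glue Data Catalog."""
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                "hive.metastore.client.factory.class": GLUE_FACTORY,
            },
        }
    ]


def mysql_metastore_config(host, port, user, password):
    """Return a hive-site classification for an external MySQL-compatible
    database (Amazon RDS or Amazon Aurora). All arguments are placeholders."""
    # Assumption: the metastore database is named "hive".
    url = f"jdbc:mysql://{host}:{port}/hive?createDatabaseIfNotExist=true"
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                "javax.jdo.option.ConnectionURL": url,
                # Assumption: MariaDB JDBC driver class name.
                "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
                "javax.jdo.option.ConnectionUserName": user,
                "javax.jdo.option.ConnectionPassword": password,
            },
        }
    ]


# Print the JSON that could be saved to a file and supplied at cluster creation.
print(json.dumps(mysql_metastore_config("hostname", 3306, "username", "password"), indent=2))
```

Either list can be serialized to JSON and supplied as the cluster's configuration when you create the cluster. Avoid storing a real password in source code.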

Note

If you're using Hive 3 and encounter too many connections to the Hive metastore, either set the parameter datanucleus.connectionPool.maxPoolSize to a smaller value or increase the number of connections that the database server can handle. The increased number of connections is due to the way Hive computes the maximum number of JDBC connections. To calculate the optimal value for performance, see Hive Configuration Properties.
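
The note above can be sketched in Python as follows. This is a hedged illustration only: it assumes that each metastore instance, and each HiveServer2 instance with an embedded metastore, opens two connection pools of the configured size, which is our reading of Hive Configuration Properties and should be verified there. The function names and the example numbers are hypothetical.

```python
def estimated_connections(pool_size, metastore_instances, hs2_embedded_instances=0):
    """Estimate the peak JDBC connections opened against the metastore database.

    Assumption: every instance uses two pools, each holding up to
    pool_size connections.
    """
    return 2 * pool_size * (metastore_instances + hs2_embedded_instances)


def pool_size_override(max_pool_size):
    """Return a hive-site classification that lowers the pool size.

    Property values in a configuration classification are strings.
    """
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                "datanucleus.connectionPool.maxPoolSize": str(max_pool_size),
            },
        }
    ]


# Hypothetical example: one metastore instance plus one HiveServer2 instance
# with an embedded metastore, before and after lowering the pool size.
print(estimated_connections(10, 1, 1))
print(estimated_connections(5, 1, 1))
print(pool_size_override(5))
```

Comparing the estimate with the connection limit of the database server shows whether lowering the pool size is enough, or whether the limit on the database side also needs to be raised.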

