MSCK Optimization
Hive stores a list of partitions for each table in its metastore. However, when
partitions are directly added to or removed from the file system, the Hive metastore
is unaware of these changes. The MSCK command updates the partition metadata in the
metastore to match the file system. The syntax is:
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
Hive implements this command as follows:
1. Hive retrieves all the partitions for the table from the metastore. From the partition paths that do not exist in the file system, it creates a list of partitions to drop from the metastore.
2. Hive gathers the partition paths present in the file system, compares them with the list of partitions from the metastore, and generates a list of partitions that need to be added to the metastore.
3. Hive updates the metastore using ADD, DROP, or SYNC mode.
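The steps above can be sketched in a few lines of Python. This is only an illustrative model, not Hive's actual implementation: the function name `plan_msck`, its parameters, and the use of plain path strings are all assumptions made for the example. The `exists` callable stands in for the file system's exists API call.

```python
from typing import Callable, Iterable, List, Tuple


def plan_msck(
    metastore_paths: Iterable[str],
    fs_paths: Iterable[str],
    exists: Callable[[str], bool],
    mode: str,
) -> Tuple[List[str], List[str]]:
    """Return (to_add, to_drop) partition paths for the given mode.

    Hypothetical helper for illustration only; names are not Hive's.
    """
    if mode not in ("ADD", "DROP", "SYNC"):
        raise ValueError(f"unknown mode: {mode}")
    metastore = list(metastore_paths)

    # Step 1: one exists() call per metastore partition.
    to_drop = [p for p in metastore if not exists(p)]

    # Step 2: paths found on the file system that the metastore lacks.
    known = set(metastore)
    to_add = [p for p in fs_paths if p not in known]

    # Step 3: keep only the part of the plan that the mode asks for.
    if mode == "ADD":
        return to_add, []
    if mode == "DROP":
        return [], to_drop
    return to_add, to_drop
```

For example, with metastore partitions `["t/dt=1", "t/dt=2"]` and file system partitions `["t/dt=2", "t/dt=3"]`, SYNC mode in this sketch plans to add `t/dt=3` and drop `t/dt=1`.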
Note
When there are many partitions in the metastore, the step that checks whether a
partition exists in the file system takes a long time to run because the file
system's exists API call must be made for each partition.
In Amazon EMR 6.5.0, Hive introduced a flag called hive.emr.optimize.msck.fs.check.
When enabled, this flag causes Hive to check for the presence of a partition in the
list of partition paths from the file system that is generated in step 2 above, instead
of making file system API calls. In Amazon EMR 6.8.0, Hive enabled this optimization by
default, eliminating the need to set the flag hive.emr.optimize.msck.fs.check.
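The idea behind the optimization can be modeled as a set lookup. The following is a minimal sketch under stated assumptions, not the code Hive runs: the function name `missing_partitions_optimized` and the plain-string paths are hypothetical. It shows how the partitions to drop can be derived from the file system listing that was already gathered, with no per-partition exists call.

```python
from typing import Iterable, List


def missing_partitions_optimized(
    metastore_paths: Iterable[str], fs_paths: Iterable[str]
) -> List[str]:
    """Find metastore partitions that are absent from the file system.

    Hypothetical helper: reuses the file system listing as a set, so each
    check is an in-memory lookup rather than a file system API call.
    """
    fs_listing = set(fs_paths)
    return [p for p in metastore_paths if p not in fs_listing]
```

In this model the cost of the check no longer grows with the number of file system calls, because the listing is fetched once and every metastore partition is tested against it in memory.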