Using Apache Hudi with EMR Serverless
To use Apache Hudi with EMR Serverless applications
- Set the required Spark properties in the corresponding Spark job run.
  spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar
  spark.serializer=org.apache.spark.serializer.KryoSerializer
- To sync a Hudi table to the configured catalog, designate either the Amazon Glue Data Catalog as your metastore, or configure an external metastore. EMR Serverless supports hms as the sync mode for Hive tables for Hudi workloads, and enables this sync mode by default. To learn more about how to set up your metastore, see Metastore configuration.

  Important
  EMR Serverless doesn't support HIVEQL or JDBC as sync mode options for Hive tables to handle Hudi workloads. To learn more, see Sync modes.

  When you use the Amazon Glue Data Catalog as your metastore, you can specify the following configuration properties for your Hudi job.
  --conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
  --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
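
As a minimal sketch, the properties above could be assembled into the Spark submit parameters of a job run as follows. The helper names (`hudi_spark_conf`, `to_spark_submit_parameters`, `build_job_driver`) and the S3 entry point are hypothetical, and the sketch assumes the parameters are passed as a single space-separated string of `--conf` flags. It only builds the request payload and doesn't call any AWS API.

```python
import json

HUDI_BUNDLE_JAR = "/usr/lib/hudi/hudi-spark-bundle.jar"
GLUE_CLIENT_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def hudi_spark_conf(use_glue_catalog=True):
    """Return the Spark properties listed on this page as a dict."""
    conf = {
        "spark.jars": HUDI_BUNDLE_JAR,
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    }
    if use_glue_catalog:
        # Only needed when the Glue Data Catalog is the metastore.
        conf["spark.hadoop.hive.metastore.client.factory.class"] = GLUE_CLIENT_FACTORY
    return conf


def to_spark_submit_parameters(conf):
    """Join the properties into one space-separated string of --conf flags."""
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())


def build_job_driver(entry_point, conf):
    """Build a job driver payload for a Spark job run (hypothetical helper)."""
    return {
        "sparkSubmit": {
            "entryPoint": entry_point,
            "sparkSubmitParameters": to_spark_submit_parameters(conf),
        }
    }


if __name__ == "__main__":
    # Placeholder script location; replace with your own entry point.
    driver = build_job_driver(
        "s3://amzn-s3-demo-bucket/scripts/hudi_job.py", hudi_spark_conf()
    )
    print(json.dumps(driver, indent=2))
```

With boto3, a payload like this could be passed as the `jobDriver` argument of `start_job_run` on an `emr-serverless` client, together with your own application ID and execution role ARN.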
To learn more about Apache Hudi releases of Amazon EMR, see Hudi release history.
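
Although EMR Serverless enables the hms sync mode by default, a job can also state it explicitly in the Hudi write options. The following is a sketch under assumptions, not a definitive configuration: the helper name `hudi_hive_sync_options` and the table, database, and field names are hypothetical, while the option keys are standard Hudi datasource settings. The helper rejects the sync modes that this page lists as unsupported.

```python
UNSUPPORTED_SYNC_MODES = {"hiveql", "jdbc"}


def hudi_hive_sync_options(table_name, database, record_key,
                           precombine_field, partition_field=None,
                           sync_mode="hms"):
    """Build Hudi write options that sync the table to the metastore."""
    if sync_mode.lower() in UNSUPPORTED_SYNC_MODES:
        raise ValueError(
            f"sync mode {sync_mode!r} is not supported on EMR Serverless; use 'hms'"
        )
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.mode": sync_mode,
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table_name,
    }
    if partition_field:
        # Partition the data and expose the same field to the catalog.
        options["hoodie.datasource.write.partitionpath.field"] = partition_field
        options["hoodie.datasource.hive_sync.partition_fields"] = partition_field
    return options


if __name__ == "__main__":
    # Hypothetical table and field names, for illustration only.
    opts = hudi_hive_sync_options("trips", "default", "trip_id", "ts",
                                  partition_field="city")
    for key, value in opts.items():
        print(f"{key}={value}")
```

In a PySpark job, these options would typically be applied with `df.write.format("hudi").options(**opts).mode("append").save(path)`, where `path` is the table's storage location.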