Using Apache Hudi with EMR Serverless

To use Apache Hudi with EMR Serverless applications
  1. Set the required Spark properties in the corresponding Spark job run.

    spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar
    spark.serializer=org.apache.spark.serializer.KryoSerializer
  2. To sync a Hudi table to the configured catalog, either designate the Amazon Glue Data Catalog as your metastore or configure an external metastore. For Hudi workloads, EMR Serverless supports hms as the sync mode for Hive tables and uses it by default. To learn more about how to set up your metastore, see Metastore configuration.

    Important

    EMR Serverless doesn't support HIVEQL or JDBC as Hive sync modes for Hudi workloads. To learn more, see Sync modes.

    When you use the Amazon Glue Data Catalog as your metastore, you can specify the following configuration properties for your Hudi job.

    --conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
    --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
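The two steps above can be sketched in Python as follows. This is a minimal illustration, not an official sample: the helper names (`build_spark_submit_parameters`, `build_job_driver`, `hudi_hms_sync_options`) are hypothetical, and the table and database names are placeholders. It assembles the Spark properties into a single space-separated string suitable for the `sparkSubmitParameters` field of a job run, and builds Hudi writer options that select the hms sync mode explicitly using Hudi's `hoodie.datasource.hive_sync.*` settings.

```python
# Property values taken from the procedure above.
HUDI_BUNDLE_JAR = "/usr/lib/hudi/hudi-spark-bundle.jar"
KRYO_SERIALIZER = "org.apache.spark.serializer.KryoSerializer"
GLUE_CLIENT_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def build_spark_submit_parameters(use_glue_catalog=True):
    """Return the space-separated --conf string for a Hudi job run."""
    conf = {
        "spark.jars": HUDI_BUNDLE_JAR,
        "spark.serializer": KRYO_SERIALIZER,
    }
    if use_glue_catalog:
        # Only needed when the Glue Data Catalog is the metastore.
        conf["spark.hadoop.hive.metastore.client.factory.class"] = GLUE_CLIENT_FACTORY
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())


def build_job_driver(entry_point, use_glue_catalog=True):
    """Build a jobDriver dictionary in the shape used for Spark job runs."""
    return {
        "sparkSubmit": {
            "entryPoint": entry_point,
            "sparkSubmitParameters": build_spark_submit_parameters(use_glue_catalog),
        }
    }


def hudi_hms_sync_options(table_name, database):
    """Hudi writer options that sync the table to the catalog in hms mode."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.mode": "hms",  # HIVEQL and JDBC are not supported
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table_name,
    }


if __name__ == "__main__":
    print(build_spark_submit_parameters())
    # Submitting the job with boto3 might look like this (all IDs are placeholders):
    #   import boto3
    #   client = boto3.client("emr-serverless")
    #   client.start_job_run(
    #       applicationId="<application-id>",
    #       executionRoleArn="<execution-role-arn>",
    #       jobDriver=build_job_driver("s3://<bucket>/<script>.py"),
    #   )
```

Inside the Spark job itself, the dictionary returned by `hudi_hms_sync_options` could be passed to a DataFrame writer, for example `df.write.format("hudi").options(**opts)`, together with the record key and other write settings that your table requires.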

To learn more about the Apache Hudi versions included in Amazon EMR releases, see Hudi release history.