Use an Iceberg cluster with Hive - Amazon EMR
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Use an Iceberg cluster with Hive

With Amazon EMR releases 6.9.0 and higher, you can use Iceberg with a Hive cluster without having to perform the setup steps that are required for Open Source Iceberg Hive Integration. For Amazon EMR versions 6.8.0 and earlier, you can use a bootstrap action to install iceberg-hive-runtime jar for configuring Hive for Iceberg support.

Amazon EMR 6.9.0 includes all features for Hive 3.1.3 integration with Iceberg 0.14.1 and also includes Amazon EMR added features such as auto selection of supported execution engines at runtime (Amazon EMR on EKS 6.9.0).

Create an Iceberg cluster

You can create a cluster with Iceberg installed using the Amazon Web Services Management Console, the Amazon CLI or the Amazon EMR API. In this tutorial, you use the Amazon CLI to work with Iceberg on an Amazon EMR cluster. To use the console to create a cluster with Iceberg installed, follow the steps in Build an Iceberg data lake using Amazon Athena, Amazon EMR, and Amazon Glue.

To use Iceberg on Amazon EMR with the Amazon CLI, first create a cluster using the steps below. For information on specifying the Iceberg classification using the Amazon CLI or the Java SDK, see Supply a configuration using the Amazon CLI when you create a cluster or Supply a configuration using the Java SDK when you create a cluster. Create a file named configurations.json with the following content:

[{ "Classification":"iceberg-defaults", "Properties":{"iceberg.enabled":"true"} }]

Next, create a cluster with the following configuration, replacing the example Amazon S3 bucket path and the subnet ID with your own:

aws emr create-cluster --release-label emr-6.9.0 \ --applications Name=Hive \ --configurations file://iceberg_configurations.json \ --region us-east-1 \ --name My_hive_Iceberg_Cluster \ --log-uri s3://DOC-EXAMPLE-BUCKET/ \ --instance-type m5.xlarge \ --instance-count 2 \ --service-role EMR_DefaultRole \ --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetId=subnet-1234567890abcdef

A Hive Iceberg cluster does the following things:

  • Loads the Iceberg Hive runtime jar in Hive and enables Iceberg-related configuration for the Hive engine.

  • Enables Amazon EMR Hive’s dynamic execution engine selection to prevent users from setting supported execution engine for Iceberg compatibility.

Note

Hive Iceberg clusters do not currently support Amazon Glue Data Catalog. The default Iceberg catalog is HiveCatalog, which corresponds to the metastore configured for the Hive environment. For more information on catalog management, see Using HCatalog in the Apache Hive documentation.

Feature support

Amazon EMR 6.9.0 supports Hive 3.1.3 and Iceberg 0.14.1. The feature support is limited to Iceberg-compatible features for Hive 3.1.2 and 3.1.3. The following commands are supported:

  • With Amazon EMR releases 6.9.0 to 6.12.x, you must include the libfb303 jar in the Hive auxlib directory. Use the following command to include it:

    sudo /usr/bin/ln -sf /usr/lib/hive/lib/libfb303-*.jar /usr/lib/hive/auxlib/libfb303.jar

    With Amazon EMR releases 6.13 and higher, the libfb303 jar is automatically symlinked to the Hive auxlib directory.

  • Creating a table

    • Non-partitioned table – External tables in Hive can be created by providing the storage handler as follows:

      CREATE EXTERNAL TABLE x (i int) STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
    • Partitioned table – External partitioned tables in Hive can be created as follows:

      CREATE EXTERNAL TABLE x (i int) PARTITIONED BY (j int) STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
    Note

    The STORED AS file format of ORC/AVRO/PARQUET is not supported in Hive 3. The default and only option is Parquet.

  • Dropping a table – The DROP TABLE command is used to drop tables, such as in the following example:

    DROP TABLE [IF EXISTS] table_name [PURGE];
  • Reading a tableSELECT statements can be used to read Iceberg tables in Hive, such as in the following example. Supported execution engines are MR and Tez.

    SELECT * FROM table_name

    For information on Hive’s select syntax, see LanguageManual Select. For information on select statements with Iceberg tables in Hive, see Apache Iceberg Select.

  • Inserting into a table – HiveQL‘s INSERT INTO statement works on Iceberg tables with support for the Map Reduce execution engine only. Amazon EMR users don’t need to explicitly set the execution engine because Amazon EMR Hive selects the engine for Iceberg Tables at runtime.

    • Single table insert into – Example:

      INSERT INTO table_name VALUES ('a', 1); INSERT INTO table_name SELECT...;
    • Multi-table insert into – Non-atomic multi-table insert into statements are supported. Example:

      FROM source INSERT INTO table_1 SELECT a, b INSERT INTO table_2 SELECT c,d;