
Considerations for using Hive on Amazon EMR 4.x

This section covers differences to consider when using Hive version 1.0.0 on Amazon EMR 4.x release versions, as compared to Hive 2.x on Amazon EMR 5.x release versions.

ACID transactions not supported

Hive on Amazon EMR 4.x release versions does not support ACID transactions on Hive data stored in Amazon S3. If you try to create a transactional table in Amazon S3, an exception occurs.
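
As an illustration only, a statement of the following shape is the kind that fails on 4.x release versions. The table name, columns, and the s3://amzn-s3-demo-bucket/ path are hypothetical placeholders, not taken from this guide. The bucketed ORC layout and the transactional table property reflect the usual Hive requirements for ACID tables.

```sql
-- Hypothetical example: table name, columns, and S3 path are placeholders.
-- On Amazon EMR 4.x, creating a transactional table in Amazon S3 like this
-- raises an exception, because ACID transactions on S3 data are not supported.
CREATE TABLE acid_demo (id INT, val STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
LOCATION 's3://amzn-s3-demo-bucket/acid_demo/'
TBLPROPERTIES ('transactional' = 'true');
```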

Reading and writing to tables in Amazon S3

Hive on Amazon EMR 4.x release versions can write directly to Amazon S3 without using temporary files. This improves performance, but as a consequence you cannot read from and write to the same table in Amazon S3 within the same Hive statement. A workaround is to create and use a temporary table in HDFS.

The following example shows how to use multiple Hive statements to update a table in Amazon S3. The statements create a temporary table in HDFS named tmp based on a table in Amazon S3 named my_s3_table. The table in Amazon S3 is then updated with the contents of the temporary table.

CREATE TEMPORARY TABLE tmp LIKE my_s3_table;
INSERT OVERWRITE TABLE tmp SELECT ....;
INSERT OVERWRITE TABLE my_s3_table SELECT * FROM tmp;

Log4j vs. Log4j 2

Hive on Amazon EMR 4.x release versions uses Log4j. Beginning with version 5.0.0, Log4j 2 is the default. These versions may require different logging configurations. See Apache Log4j 2 for details.

MapReduce is the default execution engine

Hive on Amazon EMR 4.x release versions uses MapReduce as the default execution engine. Beginning with Amazon EMR version 5.0.0, Tez is the default, which provides improved performance for most workflows.
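
For reference, the engine is controlled by the standard Hive property hive.execution.engine. The following minimal sketch shows how a session might check the current value and pin it explicitly. It assumes an interactive Hive session and is not specific to any particular cluster configuration.

```sql
-- Print the execution engine in effect for this session.
SET hive.execution.engine;

-- Explicitly select MapReduce (the 4.x default) for this session.
SET hive.execution.engine=mr;
```

Pinning the value to mr may help when a script written for 4.x needs to run the same way on a 5.x cluster, where Tez is the default.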

Hive authorization

Hive on Amazon EMR 4.x release versions supports Hive authorization for HDFS but not for EMRFS and Amazon S3. Amazon EMR clusters run with authorization disabled by default.

Hive file merge behavior with Amazon S3

Hive on Amazon EMR 4.x release versions merges small files at the end of a map-only job if hive.merge.mapfiles is true. When the final output path is in HDFS, a merge is triggered only if the average output size of the job is less than the hive.merge.smallfiles.avgsize setting. When the output path is in Amazon S3, however, the hive.merge.smallfiles.avgsize parameter is ignored, and the merge task is always triggered if hive.merge.mapfiles is set to true.
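
The two settings named above can be set per session. The following is a sketch with illustrative values only; the threshold shown is an assumption chosen for the example, not a recommendation from this guide.

```sql
-- Merge small files at the end of map-only jobs.
SET hive.merge.mapfiles=true;

-- Average output file size threshold, in bytes (illustrative value).
-- Honored when the final output path is in HDFS; ignored when it is in Amazon S3.
SET hive.merge.smallfiles.avgsize=16000000;

-- To avoid the always-triggered merge task for S3 output, turn merging off instead:
-- SET hive.merge.mapfiles=false;
```

Because the size threshold has no effect for Amazon S3 output, hive.merge.mapfiles is the only one of these two settings that changes whether the merge task runs in that case.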