

# Using Hive configurations when you run EMR Serverless jobs
<a name="jobs-hive"></a>

You can run Hive jobs on an application with the `type` parameter set to `HIVE`. Jobs must be compatible with the Hive version that ships with the application's Amazon EMR release version. For example, when you run jobs on an application with Amazon EMR release 6.6.0, your job must be compatible with Apache Hive 3.1.2. For information on the application versions for each release, refer to [Amazon EMR Serverless release versions](release-versions.md).

## Hive job parameters
<a name="hive-params"></a>

When you use the [`StartJobRun` API](https://docs.amazonaws.cn/emr-serverless/latest/APIReference/API_StartJobRun.html) to run a Hive job, specify the following parameters.

**Topics**
+ [Hive job runtime role](#hive-defaults-executionRoleArn)
+ [Hive job driver parameter](#hive-defaults-jobDriver)
+ [Hive configuration override parameter](#hive-defaults-configurationOverrides)

### Hive job runtime role
<a name="hive-defaults-executionRoleArn"></a>

Use **`executionRoleArn`** to specify the ARN for the IAM role that your application uses to execute Hive jobs. This role must contain the following permissions:
+ Read from S3 buckets or other data sources where your data resides
+ Read from S3 buckets or prefixes where your Hive query file and init query file reside
+ Read and write to S3 buckets where your Hive Scratch directory and Hive Metastore warehouse directory reside
+ Write to S3 buckets where you intend to write your final output
+ Write logs to an S3 bucket or prefix that `S3MonitoringConfiguration` specifies
+ Access to KMS keys if you use KMS keys to encrypt data in your S3 bucket
+ Access to the Amazon Glue Data Catalog

If your Hive job reads or writes data to or from other data sources, specify the appropriate permissions in this IAM role. If you don't provide these permissions to the IAM role, your job might fail. For more information, refer to [Job runtime roles for Amazon EMR Serverless](security-iam-runtime-role.md).
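
The permissions in the preceding list can be sketched as an IAM policy document. The following Python example is a minimal illustration, not an official policy: the bucket names, the KMS key ARN, the statement IDs, and the selection of Glue actions are all assumptions that you would replace or narrow for your own job. The `partition` argument is also an assumption of this sketch; ARNs in the China Regions use `aws-cn` instead of `aws`.

```python
import json


def build_hive_runtime_policy(data_bucket, job_bucket, kms_key_arn=None, partition="aws"):
    """Return an IAM policy document (as a dict) for a Hive job runtime role.

    data_bucket: hypothetical bucket that holds input data (read-only).
    job_bucket:  hypothetical bucket that holds query files, the scratch and
                 warehouse directories, final output, and logs (read/write).
    """

    def bucket_arns(bucket):
        # The bucket ARN covers s3:ListBucket; the /* ARN covers object actions.
        return [f"arn:{partition}:s3:::{bucket}", f"arn:{partition}:s3:::{bucket}/*"]

    statements = [
        {
            "Sid": "ReadInputData",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": bucket_arns(data_bucket),
        },
        {
            "Sid": "ReadWriteJobBucket",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket", "s3:PutObject", "s3:DeleteObject"],
            "Resource": bucket_arns(job_bucket),
        },
        {
            # Assumed subset of Glue actions; "*" is broad and is best narrowed.
            "Sid": "GlueDataCatalogAccess",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:CreateTable",
                "glue:UpdateTable",
            ],
            "Resource": "*",
        },
    ]
    if kms_key_arn:
        # Only needed when the S3 data is encrypted with a KMS key.
        statements.append(
            {
                "Sid": "UseKmsKey",
                "Effect": "Allow",
                "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
                "Resource": kms_key_arn,
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


policy = build_hive_runtime_policy("amzn-s3-demo-data-bucket", "amzn-s3-demo-bucket")
print(json.dumps(policy, indent=2))
```

If your job touches other data sources, add statements for them in the same way.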

### Hive job driver parameter
<a name="hive-defaults-jobDriver"></a>

Use **`jobDriver`** to provide input to the job. The job driver parameter accepts only one value for the job type that you want to run. When you specify `hive` as the job type, EMR Serverless passes a Hive query to the `jobDriver` parameter. Hive jobs have the following parameters:
+ **`query`** – This is the reference in Amazon S3 to the Hive query file that you want to run.
+ **`parameters`** – These are the additional Hive configuration properties that you want to override. To override properties, pass them to this parameter as `--hiveconf {{property=value}}`. To override variables, pass them to this parameter as `--hivevar {{key=value}}`.
+ **`initQueryFile`** – This is the init Hive query file. Hive runs this file prior to your query and can use it to initialize tables.
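
To make the shape of these three fields concrete, the following Python sketch assembles a `jobDriver` structure and joins `--hiveconf` and `--hivevar` pairs into the single `parameters` string. The helper name `build_hive_job_driver`, the S3 paths, and the `run_date` variable are hypothetical and exist only for illustration.

```python
def build_hive_job_driver(query, init_query_file=None, hiveconf=None, hivevar=None):
    """Build the jobDriver structure for a Hive job run.

    hiveconf and hivevar are plain dicts. This sketch doesn't quote values,
    so it assumes that property names and values contain no spaces.
    """
    flags = []
    for name, value in (hiveconf or {}).items():
        flags.append(f"--hiveconf {name}={value}")
    for name, value in (hivevar or {}).items():
        flags.append(f"--hivevar {name}={value}")

    hive = {"query": query}
    if init_query_file:
        hive["initQueryFile"] = init_query_file
    if flags:
        hive["parameters"] = " ".join(flags)
    return {"hive": hive}


job_driver = build_hive_job_driver(
    query="s3://amzn-s3-demo-bucket/emr-serverless-hive/query/hive-query.ql",
    init_query_file="s3://amzn-s3-demo-bucket/emr-serverless-hive/query/init.ql",
    hiveconf={"hive.exec.dynamic.partition.mode": "nonstrict"},
    hivevar={"run_date": "2024-01-01"},
)
print(job_driver["hive"]["parameters"])
```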

### Hive configuration override parameter
<a name="hive-defaults-configurationOverrides"></a>

Use **`configurationOverrides`** to override monitoring-level and application-level configuration properties. This parameter accepts a JSON object with the following two fields:
+ **`monitoringConfiguration`** – Use this field to specify the Amazon S3 URL (`s3MonitoringConfiguration`) where you want the EMR Serverless job to store logs of your Hive job. Make sure that you create this bucket with the same Amazon Web Services account that hosts your application, and in the same Amazon Web Services Region where your job is running.
+ **`applicationConfiguration`** – You can provide a configuration object in this field to override the default configurations for applications. You can use a shorthand syntax to provide the configuration, or you can reference the configuration object in a JSON file. Configuration objects consist of a classification, properties, and optional nested configurations. Properties consist of the settings that you want to override in that file. You can specify multiple classifications for multiple applications in a single JSON object.
**Note**  
Available configuration classifications vary by specific EMR Serverless release. For example, classifications for custom Log4j `spark-driver-log4j2` and `spark-executor-log4j2` are only available with releases 6.8.0 and higher.
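
The following Python sketch shows one way to assemble a `configurationOverrides` object with both fields. The helper name, the log prefix, and the property values are hypothetical; the `hive-site` classification is the one that the example at the end of this topic uses.

```python
import json


def build_configuration_overrides(log_uri, hive_site_properties):
    """Combine a monitoring configuration with hive-site overrides."""
    return {
        "monitoringConfiguration": {
            # S3 location where EMR Serverless stores the job logs.
            "s3MonitoringConfiguration": {"logUri": log_uri}
        },
        "applicationConfiguration": [
            {
                "classification": "hive-site",
                # Property values are passed as strings, so convert numbers.
                "properties": {
                    name: str(value) for name, value in hive_site_properties.items()
                },
            }
        ],
    }


overrides = build_configuration_overrides(
    log_uri="s3://amzn-s3-demo-bucket/emr-serverless-hive/logs",
    hive_site_properties={"hive.driver.cores": 2, "hive.tez.container.size": 4096},
)
print(json.dumps(overrides, indent=2))
```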

If you pass the same configuration in an application override and in Hive parameters, the Hive parameters take priority. The following list ranks configurations from highest priority to lowest priority.
+ Configuration that you provide as part of Hive parameters with `--hiveconf {{property=value}}`.
+ Configuration that you provide as part of your application overrides when you start a job.
+ Configuration that you provide as part of your `runtimeConfiguration` when you create an application.
+ Optimized configurations that Amazon EMR assigns for the release.
+ Default open-source configurations for the application.
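
The ranking above behaves like a layered merge in which higher-priority layers overwrite lower ones. The following Python sketch only models that ordering; it isn't how EMR Serverless implements it, and every property value shown is illustrative rather than an actual default.

```python
def resolve_hive_config(open_source_defaults, emr_optimized, application_runtime,
                        job_overrides, hiveconf):
    """Merge configuration layers from lowest to highest priority."""
    effective = {}
    # Later layers win, so the list runs from lowest to highest priority.
    for layer in (open_source_defaults, emr_optimized, application_runtime,
                  job_overrides, hiveconf):
        effective.update(layer)
    return effective


effective = resolve_hive_config(
    open_source_defaults={"hive.log.level": "INFO"},
    emr_optimized={"hive.tez.container.size": "6144"},
    application_runtime={"hive.tez.container.size": "8192"},
    job_overrides={"hive.tez.container.size": "4096", "hive.log.level": "WARN"},
    hiveconf={"hive.log.level": "DEBUG"},
)
print(effective)
```

In this sketch, `hive.log.level` resolves to the `--hiveconf` value, and `hive.tez.container.size` resolves to the job-level override because no Hive parameter sets it.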

For more information on declaring configurations at the application level, and overriding configurations during job run, refer to [Default application configuration for EMR Serverless](default-configs.md).

## Hive job properties
<a name="hive-defaults"></a>

The following table lists the mandatory properties that you must configure when you submit a Hive job.


**Mandatory Hive job properties**  

| Setting | Description | 
| --- | --- | 
| hive.exec.scratchdir | The Amazon S3 location where EMR Serverless creates temporary files during the Hive job execution.  | 
| hive.metastore.warehouse.dir | The Amazon S3 location of databases for managed tables in Hive. | 
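
Because a job needs both of these properties, it can help to check for them before you submit. The following Python sketch is a hypothetical client-side check that inspects only the `hive-site` classification of a `configurationOverrides` object. It assumes that you set the properties at the job level, so it doesn't detect values that come from the application's `runtimeConfiguration` or from `--hiveconf`.

```python
MANDATORY_HIVE_PROPERTIES = ("hive.exec.scratchdir", "hive.metastore.warehouse.dir")


def missing_mandatory_properties(configuration_overrides):
    """Return the mandatory Hive properties that the overrides don't set."""
    hive_site = {}
    for config in configuration_overrides.get("applicationConfiguration", []):
        if config.get("classification") == "hive-site":
            hive_site.update(config.get("properties", {}))
    # A property counts as missing when it's absent or empty.
    return [name for name in MANDATORY_HIVE_PROPERTIES if not hive_site.get(name)]


overrides = {
    "applicationConfiguration": [
        {
            "classification": "hive-site",
            "properties": {
                "hive.exec.scratchdir": "s3://amzn-s3-demo-bucket/emr-serverless-hive/hive/scratch"
            },
        }
    ]
}
print(missing_mandatory_properties(overrides))
```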

The following table lists the optional Hive properties and their default values that you can override when you submit a Hive job.


**Optional Hive properties and default values**  

| Setting | Description | Default value | 
| --- | --- | --- | 
| fs.s3.customAWSCredentialsProvider | The Amazon Credentials provider you want to use.  | com.amazonaws.auth.DefaultAWSCredentialsProviderChain | 
| fs.s3a.aws.credentials.provider | The Amazon Credentials provider you want to use with an S3A file system. | com.amazonaws.auth.DefaultAWSCredentialsProviderChain | 
| hive.auto.convert.join | Option that turns on auto-conversion of common joins into mapjoins, based on the input file size. | TRUE | 
| hive.auto.convert.join.noconditionaltask | Option that turns on optimization when Hive converts a common join into a mapjoin based on the input file size. | TRUE | 
| hive.auto.convert.join.noconditionaltask.size | A join converts directly to a mapjoin below this size. | Optimal value is calculated based on Tez task memory | 
| hive.cbo.enable | Option that turns on cost-based optimizations with the Calcite framework. | TRUE | 
| hive.cli.tez.session.async | Option to start a background Tez session while your Hive query compiles. When set to false, Tez AM launches after your Hive query compiles. | TRUE | 
| hive.compute.query.using.stats | Option that activates Hive to answer certain queries with statistics stored in the metastore. For basic statistics, set hive.stats.autogather to TRUE. For a more advanced collection of queries, run analyze table queries. | TRUE | 
| hive.default.fileformat | The default file format for CREATE TABLE statements. You can explicitly override this if you specify STORED AS [FORMAT] in your CREATE TABLE command. | TEXTFILE | 
| hive.driver.cores | The number of cores to use for the Hive driver process. | 2 | 
| hive.driver.disk | The disk size for the Hive driver. | 20G | 
| hive.driver.disk.type | The disk type for the Hive driver. | Standard | 
| hive.tez.disk.type | The disk type for the Tez workers. | Standard | 
| hive.driver.memory | The amount of memory to use per Hive driver process. The Hive CLI and Tez Application Master share this memory equally with 20% of headroom.  | 6G | 
| hive.emr-serverless.launch.env.[{{KEY}}] | Option to set the {{KEY}} environment variable in all Hive-specific processes, such as your Hive driver, Tez AM, and Tez task. |  | 
| hive.exec.dynamic.partition | Option that turns on dynamic partitions in DML/DDL. | TRUE | 
| hive.exec.dynamic.partition.mode | Option that specifies whether you want to use strict mode or non-strict mode. In strict mode, specify at least one static partition in case you accidentally overwrite all partitions. In non-strict mode, all partitions are allowed to be dynamic. | strict | 
| hive.exec.max.dynamic.partitions | The maximum number of dynamic partitions that Hive creates in total. | 1000 | 
| hive.exec.max.dynamic.partitions.pernode | Maximum number of dynamic partitions that Hive creates in each mapper and reducer node. | 100 | 
| hive.exec.orc.split.strategy | Expects one of the following values: BI, ETL, or HYBRID. This isn’t a user-level configuration. BI specifies that you want to spend less time in split generation as opposed to query execution. ETL specifies that you want to spend more time in split generation. HYBRID specifies a choice of the preceding strategies based on heuristics. | HYBRID | 
| hive.exec.reducers.bytes.per.reducer | The size per reducer. The default is 256 MB. If the input size is 1G, the job uses 4 reducers. | 256000000 | 
| hive.exec.reducers.max | The maximum number of reducers. | 256 | 
| hive.exec.stagingdir | The name of the directory that stores temporary files that Hive creates inside table locations and in the scratch directory location specified in the hive.exec.scratchdir property. | .hive-staging | 
| hive.fetch.task.conversion | Expects one of the following values: NONE, MINIMAL, or MORE. Hive can convert select queries to a single FETCH task. This minimizes latency. | MORE | 
| hive.groupby.position.alias | Option that causes Hive to use a column position alias in GROUP BY statements. | FALSE | 
| hive.input.format | The default input format. Set to HiveInputFormat if you encounter problems with CombineHiveInputFormat. | org.apache.hadoop.hive.ql.io.CombineHiveInputFormat | 
| hive.log.explain.output | Option that turns on explanations of extended output for any query in your Hive log. | FALSE | 
| hive.log.level | The Hive logging level. | INFO | 
| hive.mapred.reduce.tasks.speculative.execution | Option that turns on speculative launch for reducers. Only supported with Amazon EMR 6.10.x and lower. | TRUE | 
| hive.max-task-containers | The maximum number of concurrent containers. The configured mapper memory is multiplied by this value to determine available memory that split computation and task preemption use. | 1000 | 
| hive.merge.mapfiles | Option that causes small files to merge at the end of a map-only job. | TRUE | 
| hive.merge.size.per.task | The size of merged files at the end of the job. | 256000000 | 
| hive.merge.tezfiles | Option that turns on a merge of small files at the end of a Tez DAG. | FALSE | 
| hive.metastore.client.factory.class | The name of the factory class that produces objects that implement the IMetaStoreClient interface. | com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory | 
| hive.metastore.glue.catalogid | If the Amazon Glue Data Catalog acts as a metastore but runs in a different Amazon Web Services account than the jobs, the ID of the Amazon Web Services account where the jobs are running. | NULL | 
| hive.metastore.uris | The thrift URI that the metastore client uses to connect to remote metastore. | NULL | 
| hive.optimize.ppd | Option that turns on predicate pushdown. | TRUE | 
| hive.optimize.ppd.storage | Option that turns on predicate pushdown to storage handlers. | TRUE | 
| hive.orderby.position.alias | Option that causes Hive to use a column position alias in ORDER BY statements. | TRUE | 
| hive.prewarm.enabled | Option that turns on container prewarm for Tez. | FALSE | 
| hive.prewarm.numcontainers | The number of containers to pre-warm for Tez. | 10 | 
| hive.stats.autogather | Option that causes Hive to gather basic statistics automatically during the INSERT OVERWRITE command. | TRUE | 
| hive.stats.fetch.column.stats | Option that turns off the fetch of column statistics from the metastore. A fetch of column statistics can be expensive when the number of columns is high. | FALSE | 
| hive.stats.gather.num.threads | The number of threads that the partialscan and noscan analyze commands use for partitioned tables. This only applies to file formats that implement StatsProvidingRecordReader (like ORC). | 10 | 
| hive.strict.checks.cartesian.product | Option that turns on strict Cartesian join checks. These checks disallow a Cartesian product (a cross join). | FALSE | 
| hive.strict.checks.type.safety | Option that turns on strict type safety checks and turns off comparison of bigint with both string and double. | TRUE | 
| hive.support.quoted.identifiers | Expects value of NONE or COLUMN. NONE implies only alphanumeric and underscore characters are valid in identifiers. COLUMN implies column names can contain any character. | COLUMN | 
| hive.tez.auto.reducer.parallelism | Option that turns on the Tez auto-reducer parallelism feature. Hive still estimates data sizes and sets parallelism estimates. Tez samples the output sizes of source vertices and adjusts the estimates at runtime as necessary. | TRUE | 
| hive.tez.container.size | The amount of memory to use per Tez task process. | 6144 | 
| hive.tez.cpu.vcores | The number of cores to use for each Tez task. | 2 | 
| hive.tez.disk.size | The disk size for each task container. | 20G | 
| hive.tez.input.format | The input format for splits generation in the Tez AM. | org.apache.hadoop.hive.ql.io.HiveInputFormat | 
| hive.tez.min.partition.factor | Lower limit of reducers that Tez specifies when you turn on auto-reducer parallelism. | 0.25 | 
| hive.vectorized.execution.enabled | Option that turns on vectorized mode of query execution. | TRUE | 
| hive.vectorized.execution.reduce.enabled | Option that turns on vectorized mode of a query execution's reduce-side.  | TRUE | 
| javax.jdo.option.ConnectionDriverName | The driver class name for a JDBC metastore. | org.apache.derby.jdbc.EmbeddedDriver | 
| javax.jdo.option.ConnectionPassword | The password associated with a metastore database. | NULL | 
| javax.jdo.option.ConnectionURL | The JDBC connect string for a JDBC metastore. | jdbc:derby:;databaseName=metastore\_db;create=true | 
| javax.jdo.option.ConnectionUserName | The user name associated with a metastore database. | NULL | 
| mapreduce.input.fileinputformat.split.maxsize | The maximum size of a split during split computation when your input format is org.apache.hadoop.hive.ql.io.CombineHiveInputFormat. A value of 0 indicates no limit. | 0 | 
| tez.am.dag.cleanup.on.completion | Option that turns on cleanup of shuffle data when DAG completes. | TRUE | 
| tez.am.emr-serverless.launch.env.[{{KEY}}] | Option to set the {{KEY}} environment variable in the Tez AM process. For Tez AM, this value overrides the hive.emr-serverless.launch.env.[{{KEY}}] value. |  | 
| tez.am.log.level | The root logging level that EMR Serverless passes to the Tez app primary. | INFO | 
| tez.am.sleep.time.before.exit.millis | EMR Serverless should push ATS events after this period of time (in milliseconds) following an AM shutdown request. | 0 | 
| tez.am.speculation.enabled | Option that causes speculative launch of slower tasks. This can help reduce job latency when some tasks are running slower due to bad or slow machines. Only supported with Amazon EMR 6.10.x and lower. | FALSE | 
| tez.am.task.max.failed.attempts | The maximum number of attempts that can fail for a particular task before the task fails. This number doesn't count manually terminated attempts. | 3 | 
| tez.am.vertex.cleanup.height | A distance at which, if all dependent vertices are complete, Tez AM will delete vertex shuffle data. This feature is turned off when the value is 0. Amazon EMR versions 6.8.0 and later support this feature. | 0 | 
| tez.client.asynchronous-stop | Option that causes EMR Serverless to push ATS events before it ends the Hive driver. | FALSE | 
| tez.grouping.max-size | The upper size limit (in bytes) of a grouped split. This limit prevents excessively large splits. | 1073741824 | 
| tez.grouping.min-size | The lower size limit (in bytes) of a grouped split. This limit prevents too many small splits. | 16777216 | 
| tez.runtime.io.sort.mb | The size of the sort buffer when Tez sorts the output. | Optimal value is calculated based on Tez task memory | 
| tez.runtime.unordered.output.buffer.size-mb | The size of the buffer to use if Tez doesn't write directly to disk. | Optimal value is calculated based on Tez task memory | 
| tez.shuffle-vertex-manager.max-src-fraction | The fraction of source tasks that must complete before EMR Serverless schedules all tasks for the current vertex (in case of a ScatterGather connection). The number of tasks ready for scheduling on the current vertex scales linearly between min-fraction and max-fraction. This defaults to the default value or tez.shuffle-vertex-manager.min-src-fraction, whichever is greater. | 0.75 | 
| tez.shuffle-vertex-manager.min-src-fraction | The fraction of source tasks that must complete before EMR Serverless schedules tasks for the current vertex (in case of a ScatterGather connection). | 0.25 | 
| tez.task.emr-serverless.launch.env.[{{KEY}}] | Option to set the {{KEY}} environment variable in the Tez task process. For Tez tasks, this value overrides the hive.emr-serverless.launch.env.[{{KEY}}] value. |  | 
| tez.task.log.level | The root logging level that EMR Serverless passes to the Tez tasks. | INFO | 
| tez.yarn.ats.event.flush.timeout.millis | The maximum amount of time that AM should wait for events to be flushed before shutting down. | 300000 | 
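
The description of `hive.exec.reducers.bytes.per.reducer` implies a simple calculation: divide the input size by the size per reducer, round up, and cap the result at `hive.exec.reducers.max`. The following Python sketch reproduces that arithmetic under two assumptions: "1G" means 1,000,000,000 bytes, and no other setting changes the result. The reducer count that Hive and Tez actually choose can differ, for example when `hive.tez.auto.reducer.parallelism` adjusts estimates at runtime.

```python
def estimate_reducers(input_bytes, bytes_per_reducer=256_000_000, max_reducers=256):
    """Simplified reducer estimate from the input size and two table defaults."""
    if bytes_per_reducer <= 0:
        raise ValueError("bytes_per_reducer must be positive")
    needed = -(-input_bytes // bytes_per_reducer)  # ceiling division on integers
    # Use at least one reducer and never more than the configured maximum.
    return max(1, min(max_reducers, needed))


print(estimate_reducers(1_000_000_000))    # 1G of input with the defaults
print(estimate_reducers(100_000_000_000))  # large input, capped by max_reducers
```

With the default values, the first call returns 4, which matches the example in the table, and the second call returns the cap of 256.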

## Hive job examples
<a name="hive-examples"></a>

The following code example shows how to run a Hive query with the `StartJobRun` API.

```
aws emr-serverless start-job-run \
    --application-id {{application-id}} \
    --execution-role-arn {{job-role-arn}} \
    --job-driver '{
        "hive": {
            "query": "s3://{{amzn-s3-demo-bucket}}/emr-serverless-hive/query/hive-query.ql",
            "parameters": "--hiveconf hive.log.explain.output=false"
        }
    }' \
    --configuration-overrides '{
        "applicationConfiguration": [{
            "classification": "hive-site",
            "properties": {
                "hive.exec.scratchdir": "s3://{{amzn-s3-demo-bucket}}/emr-serverless-hive/hive/scratch",
                "hive.metastore.warehouse.dir": "s3://{{amzn-s3-demo-bucket}}/emr-serverless-hive/hive/warehouse",
                "hive.driver.cores": "2",
                "hive.driver.memory": "4g",
                "hive.tez.container.size": "4096",
                "hive.tez.cpu.vcores": "1"
            }
        }]
    }'
```
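
The same request can also be expressed with Boto3, whose `emr-serverless` client exposes a `start_job_run` method that takes these fields as keyword arguments. The following Python sketch mirrors the preceding CLI call. The helper names `build_start_job_run_request` and `submit`, the application ID, and the role ARN are hypothetical, and `submit` needs valid credentials and an existing application, so it's defined here but not called.

```python
def build_start_job_run_request(application_id, execution_role_arn, bucket):
    """Build keyword arguments that mirror the CLI example above."""
    prefix = f"s3://{bucket}/emr-serverless-hive"
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {
            "hive": {
                "query": f"{prefix}/query/hive-query.ql",
                "parameters": "--hiveconf hive.log.explain.output=false",
            }
        },
        "configurationOverrides": {
            "applicationConfiguration": [
                {
                    "classification": "hive-site",
                    "properties": {
                        "hive.exec.scratchdir": f"{prefix}/hive/scratch",
                        "hive.metastore.warehouse.dir": f"{prefix}/hive/warehouse",
                        "hive.driver.cores": "2",
                        "hive.driver.memory": "4g",
                        "hive.tez.container.size": "4096",
                        "hive.tez.cpu.vcores": "1",
                    },
                }
            ]
        },
    }


def submit(request):
    """Start the job run and return its ID. Requires Boto3 and credentials."""
    import boto3  # third-party; imported here so the builder has no dependencies

    client = boto3.client("emr-serverless")
    response = client.start_job_run(**request)
    return response["jobRunId"]


request = build_start_job_run_request(
    application_id="application-id",  # placeholder, as in the CLI example
    execution_role_arn="arn:aws:iam::111122223333:role/example-job-role",
    bucket="amzn-s3-demo-bucket",
)
# job_run_id = submit(request)  # uncomment to start a real job run
```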

You can find additional examples of how to run Hive jobs in the [EMR Serverless Samples](https://github.com/aws-samples/emr-serverless-samples/tree/main/examples/hive) GitHub repository.