Hive application specifics for earlier AMI versions of Amazon EMR
Log files
With Amazon EMR AMI versions 2.x and 3.x, Hive logs are saved to
/mnt/var/log/apps/
. To support concurrent versions of Hive, the version of Hive that you run determines the log file name, as shown in the following table.
Hive version | Log file name |
---|---|
0.13.1 | hive.log. Note: Beginning with this version, Amazon EMR uses an unversioned file name, hive.log. |
0.11.0 | hive_0110.log. Note: Minor versions of Hive 0.11.0, such as 0.11.0.1, share the same log file location as Hive 0.11.0. |
0.8.1 | hive_081.log. Note: Minor versions of Hive 0.8.1, such as Hive 0.8.1.1, share the same log file location as Hive 0.8.1. |
0.7.1 | hive_07_1.log. Note: Minor versions of Hive 0.7.1, such as Hive 0.7.1.3 and Hive 0.7.1.4, share the same log file location as Hive 0.7.1. |
0.7 | hive_07.log |
0.5 | hive_05.log |
0.4 | hive.log |
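For example, to follow the live log for Hive 0.11.0, you might run something like the following (a minimal sketch, assuming you have an SSH session on the master node; the file name comes from the table above):

# Follow the version-specific Hive log on the master node.
tail -f /mnt/var/log/apps/hive_0110.log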
Split input functionality
To implement split input functionality using Hive versions earlier than 0.13.1 (Amazon EMR AMI versions earlier than 3.11.0), use the following:
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveCombineSplitsInputFormat;
hive> set mapred.min.split.size=100000000;
This functionality was deprecated with Hive 0.13.1. To get the same split input format functionality in Amazon EMR AMI versions 3.11.0 and later, use the following:
set hive.hadoop.supports.splittable.combineinputformat=true;
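Rather than setting the property interactively, you can also apply it for a single script run (a sketch; script.q is a placeholder for your own script):

# Apply the combine-split setting for one invocation of a Hive script (Hive 0.13.1 and later).
hive --hiveconf hive.hadoop.supports.splittable.combineinputformat=true -f script.q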
Thrift service ports
Thrift is an RPC framework that defines a compact binary serialization format used to persist data structures for later analysis. Normally, Hive configures the server to operate on the following ports.
Hive version | Port number |
---|---|
Hive 0.13.1 | 10000 |
Hive 0.11.0 | 10004 |
Hive 0.8.1 | 10003 |
Hive 0.7.1 | 10002 |
Hive 0.7 | 10001 |
Hive 0.5 | 10000 |
For more information about Thrift services, see http://wiki.apache.org/thrift/.
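To confirm that the Thrift service is reachable, you could probe the port from the master node (a sketch, shown for the Hive 0.13.1 port; substitute the port for your Hive version from the table above):

# Check that the Hive Thrift service is accepting connections on port 10000.
nc -zv localhost 10000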
Use Hive to recover partitions
Amazon EMR includes a statement in the Hive query language that recovers the partitions of a table from table data located in Amazon S3. The following example shows this.
CREATE EXTERNAL TABLE raw_impression (json string)
PARTITIONED BY (dt string)
LOCATION 's3://elastic-mapreduce/samples/hive-ads/tables/impressions';
ALTER TABLE raw_impression RECOVER PARTITIONS;
The partition directories and data must be at the location specified in the table definition and must be named according to the Hive convention: for example, dt=2009-01-01.
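As an illustration, a layout that RECOVER PARTITIONS can discover might look like the following listing (a sketch; the partition names shown are hypothetical):

# List the partition directories under the table location.
aws s3 ls s3://elastic-mapreduce/samples/hive-ads/tables/impressions/
#   PRE dt=2009-01-01/
#   PRE dt=2009-01-02/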
Note
After Hive 0.13.1 this capability is supported natively using msck repair table, and recover partitions is therefore not supported. For more information, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
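On those later versions, the equivalent of the statement above can be run like this (a minimal sketch; raw_impression is the table from the preceding example):

hive -e "MSCK REPAIR TABLE raw_impression;"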
Pass a Hive variable to a script
To pass a variable into a Hive step using the AWS CLI, type the following command, replacing myKey with the name of your EC2 key pair and amzn-s3-demo-bucket with your bucket name. In this example, SAMPLE is a variable value preceded by the -d switch. This variable is defined in the Hive script as ${SAMPLE}.
Note
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).
aws emr create-cluster --name "Test cluster" --ami-version 3.9 \
--applications Name=Hue Name=Hive Name=Pig \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-type m3.xlarge --instance-count 3 \
--steps Type=Hive,Name="Hive Program",ActionOnFailure=CONTINUE,\
Args=[-f,s3://elasticmapreduce/samples/hive-ads/libs/response-time-stats.q,-d,\
INPUT=s3://elasticmapreduce/samples/hive-ads/tables,-d,OUTPUT=s3://amzn-s3-demo-bucket/hive-ads/output/,\
-d,SAMPLE=s3://elasticmapreduce/samples/hive-ads/]
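Inside the script, Hive substitutes the variable wherever ${SAMPLE} appears. A hypothetical fragment (not the actual contents of response-time-stats.q) might look like the following, written here as a shell heredoc:

# Create a hypothetical script that references the SAMPLE variable.
cat > example.q <<'EOF'
CREATE EXTERNAL TABLE example_table (json string)
LOCATION '${SAMPLE}tables/impressions';
EOF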
Specify an external metastore location
The following procedure shows you how to override the default configuration values for the Hive metastore location and start a cluster using the reconfigured metastore location.
To create a metastore located outside of the EMR cluster
-
Create a MySQL or Aurora database using Amazon RDS.
For information about how to create an Amazon RDS database, see Getting started with Amazon RDS.
-
Modify your security groups to allow JDBC connections between your database and the ElasticMapReduce-Master security group.
For information about how to modify your security groups for access, see Amazon RDS security groups in the Amazon RDS User Guide.
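For example, you might authorize inbound MySQL traffic from the master security group with the AWS CLI (a sketch; both security group IDs are placeholders for your own):

# Allow the EMR master security group to reach the database on port 3306.
aws ec2 authorize-security-group-ingress \
    --group-id sg-database-group-id \
    --protocol tcp --port 3306 \
    --source-group sg-elasticmapreduce-master-group-id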
-
Set the JDBC configuration values in hive-site.xml:
-
Create a hive-site.xml configuration file containing the following:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mariadb://hostname:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
    <description>Password to use against metastore database</description>
  </property>
</configuration>
hostname is the DNS address of the Amazon RDS instance running the database. username and password are the credentials for your database. For more information about connecting to MySQL and Aurora database instances, see Connecting to a DB instance running the MySQL database engine and Connecting to an Aurora DB cluster in the Amazon RDS User Guide. The JDBC drivers are installed by Amazon EMR.
Note
The value property should not contain any spaces or carriage returns. It should appear all on one line.
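Before launching the cluster, you can check that the database accepts the credentials in hive-site.xml, for example with the MySQL client from any host the security group allows (a sketch; hostname and the password are your own values):

# Verify connectivity and credentials against the metastore database.
mysql -h hostname -P 3306 -u hive -p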
-
Save your hive-site.xml file to a location on Amazon S3, such as s3://amzn-s3-demo-bucket/hive-site.xml.
-
-
Create a cluster, specifying the Amazon S3 location of the customized hive-site.xml file. The following example demonstrates an AWS CLI command that does this.
Note
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).
aws emr create-cluster --name "Test cluster" --ami-version 3.10 \
--applications Name=Hue Name=Hive Name=Pig \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-type m3.xlarge --instance-count 3 \
--bootstrap-actions Name="Install Hive Site Configuration",\
Path="s3://region.elasticmapreduce/libs/hive/hive-script",\
Args=["--base-path","s3://elasticmapreduce/libs/hive","--install-hive-site",\
"--hive-site=s3://amzn-s3-demo-bucket/hive-site.xml","--hive-versions","latest"]
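Once the cluster is running, you could verify that the bootstrap action installed your configuration (a sketch; the path shown is the conventional Hive configuration location on these AMI versions and is an assumption here):

# From an SSH session on the master node, inspect the installed configuration.
cat /home/hadoop/hive/conf/hive-site.xml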
Connect to Hive using JDBC
Connecting to Hive via JDBC requires you to download a JDBC driver and install a SQL client. The following example demonstrates using SQL Workbench/J to connect to Hive using JDBC.
To download JDBC drivers
-
Download and extract the drivers appropriate to the versions of Hive that you want to access. The Hive version differs depending on the AMI that you choose when you create an Amazon EMR cluster.
-
Hive 0.13.1 JDBC drivers: https://amazon-odbc-jdbc-drivers.s3.amazonaws.com/public/AmazonHiveJDBC_1.0.4.1004.zip
-
Hive 0.11.0 JDBC drivers: https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc/0.11.0
-
Hive 0.8.1 JDBC drivers: https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc/0.8.1
-
-
Install SQL Workbench/J. For more information, see Installing and starting SQL Workbench/J in the SQL Workbench/J User's Manual.
-
Create an SSH tunnel to the cluster master node. The port for the connection differs depending on the version of Hive. Example commands are provided in the tables below for Linux ssh users and for PuTTY users on Windows.

Linux SSH commands

Hive version | Command |
---|---|
0.13.1 | ssh -o ServerAliveInterval=10 -i path-to-key-file -N -L 10000:localhost:10000 hadoop@master-public-dns-name |
0.11.0 | ssh -o ServerAliveInterval=10 -i path-to-key-file -N -L 10004:localhost:10004 hadoop@master-public-dns-name |
0.8.1 | ssh -o ServerAliveInterval=10 -i path-to-key-file -N -L 10003:localhost:10003 hadoop@master-public-dns-name |
0.7.1 | ssh -o ServerAliveInterval=10 -i path-to-key-file -N -L 10002:localhost:10002 hadoop@master-public-dns-name |
0.7 | ssh -o ServerAliveInterval=10 -i path-to-key-file -N -L 10001:localhost:10001 hadoop@master-public-dns-name |
0.5 | ssh -o ServerAliveInterval=10 -i path-to-key-file -N -L 10000:localhost:10000 hadoop@master-public-dns-name |

Windows PuTTY tunnel settings

Hive version | Tunnel settings |
---|---|
0.13.1 | Source port: 10000; Destination: master-public-dns-name:10000 |
0.11.0 | Source port: 10004; Destination: master-public-dns-name:10004 |
0.8.1 | Source port: 10003; Destination: master-public-dns-name:10003 |
-
Add the JDBC driver to SQL Workbench.
-
In the Select Connection Profile dialog box, choose Manage Drivers.
-
Choose the Create a new entry (blank page) icon.
-
In the Name field, type Hive JDBC.
-
For Library, click the Select the JAR file(s) icon.
-
Select JAR files as shown in the following table.
Hive driver version | JAR files to add |
---|---|
0.13.1 | hive_metastore.jar, hive_service.jar, HiveJDBC3.jar, libfb303-0.9.0.jar, libthrift-0.9.0.jar, log4j-1.2.14.jar, ql.jar, slf4j-api-1.5.8.jar, slf4j-log4j12-1.5.8.jar, TCLIServiceClient.jar |
0.11.0 | hadoop-core-1.0.3.jar, hive-exec-0.11.0.jar, hive-jdbc-0.11.0.jar, hive-metastore-0.11.0.jar, hive-service-0.11.0.jar, libfb303-0.9.0.jar, commons-logging-1.0.4.jar, slf4j-api-1.6.1.jar |
0.8.1 | hadoop-core-0.20.205.jar, hive-exec-0.8.1.jar, hive-jdbc-0.8.1.jar, hive-metastore-0.8.1.jar, hive-service-0.8.1.jar, libfb303-0.7.0.jar, libthrift-0.7.0.jar, log4j-1.2.15.jar, slf4j-api-1.6.1.jar, slf4j-log4j12-1.6.1.jar |
0.7.1 | hadoop-0.20-core.jar, hive-exec-0.7.1.jar, hive-jdbc-0.7.1.jar, hive-metastore-0.7.1.jar, hive-service-0.7.1.jar, libfb303.jar, commons-logging-1.0.4.jar, slf4j-api-1.6.1.jar, slf4j-log4j12-1.6.1.jar |
0.7 | hadoop-0.20-core.jar, hive-exec-0.7.0.jar, hive-jdbc-0.7.0.jar, hive-metastore-0.7.0.jar, hive-service-0.7.0.jar, libfb303.jar, commons-logging-1.0.4.jar, slf4j-api-1.5.6.jar, slf4j-log4j12-1.5.6.jar |
0.5 | hadoop-0.20-core.jar, hive-exec-0.5.0.jar, hive-jdbc-0.5.0.jar, hive-metastore-0.5.0.jar, hive-service-0.5.0.jar, libfb303.jar, log4j-1.2.15.jar, commons-logging-1.0.4.jar |
-
In the Please select one driver dialog box, select a driver according to the following table and click OK.
Hive version | Driver classname |
---|---|
0.13.1 | com.amazon.hive.jdbc3.HS2Driver |
0.11.0 | org.apache.hadoop.hive.jdbc.HiveDriver |
0.8.1 | org.apache.hadoop.hive.jdbc.HiveDriver |
0.7.1 | org.apache.hadoop.hive.jdbc.HiveDriver |
0.7 | org.apache.hadoop.hive.jdbc.HiveDriver |
0.5 | org.apache.hadoop.hive.jdbc.HiveDriver |
-
-
When you return to the Select Connection Profile dialog box, verify that the Driver field is set to Hive JDBC and provide the JDBC connection string in the URL field according to the following table.
Hive version | JDBC connection string |
---|---|
0.13.1 | jdbc:hive2://localhost:10000/default |
0.11.0 | jdbc:hive://localhost:10004/default |
0.8.1 | jdbc:hive://localhost:10003/default |
If your cluster uses AMI version 3.3.1 or later, in the Select Connection Profile dialog box, type
hadoop
in the Username field.
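As a quick connectivity check outside SQL Workbench/J, Hive 0.13.1 also ships the Beeline client, which accepts the same HiveServer2 connection string (a sketch, run on a machine with the SSH tunnel from the earlier step active):

# Connect to HiveServer2 through the tunnel using Beeline (Hive 0.13.1).
beeline -u jdbc:hive2://localhost:10000/default -n hadoop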