Customizing cluster and application configuration with earlier AMI versions of Amazon EMR
Amazon EMR release version 4.0.0 introduced a simplified method of configuring applications using configuration classifications. For more information, see Configure applications. When using an AMI version, you configure applications using bootstrap actions along with arguments that you pass. For example, the `configure-hadoop` and `configure-daemons` bootstrap actions set Hadoop and YARN-specific environment properties like `--namenode-heap-size`. In more recent versions, these are configured using the `hadoop-env` and `yarn-env` configuration classifications. For bootstrap actions that perform common configurations, see the emr-bootstrap-actions repository on GitHub.
The following tables map bootstrap actions to configuration classifications in more recent Amazon EMR release versions.
Affected application file name | AMI version bootstrap action | Configuration classification |
---|---|---|
core-site.xml | configure-hadoop -c | core-site |
log4j.properties | configure-hadoop -l | hadoop-log4j |
hdfs-site.xml | configure-hadoop -s | hdfs-site |
n/a | n/a | hdfs-encryption-zones |
mapred-site.xml | configure-hadoop -m | mapred-site |
yarn-site.xml | configure-hadoop -y | yarn-site |
httpfs-site.xml | configure-hadoop -t | httpfs-site |
capacity-scheduler.xml | configure-hadoop -z | capacity-scheduler |
yarn-env.sh | configure-daemons --resourcemanager-opts | yarn-env |
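As a minimal sketch of how these mappings are used on release 4.0.0 and later, the following replaces `configure-hadoop -c` (which edited core-site.xml) with a core-site classification supplied at cluster creation. The property and value shown are illustrative placeholders, not recommended settings:

```bash
# Illustrative sketch only: set a core-site.xml property through the
# core-site classification instead of the configure-hadoop -c bootstrap
# action. io.file.buffer.size is a standard Hadoop property; the value
# here is an arbitrary example.
aws emr create-cluster --name "Classification example" \
  --release-label emr-4.0.0 \
  --applications Name=Hadoop \
  --use-default-roles --ec2-attributes KeyName=myKey \
  --instance-type m3.xlarge --instance-count 3 \
  --configurations '[{"Classification":"core-site","Properties":{"io.file.buffer.size":"65536"}}]'
```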
Affected application file name | AMI version bootstrap action | Configuration classification |
---|---|---|
hive-env.sh | n/a | hive-env |
hive-site.xml | hive-script --install-hive-site ${MY_HIVE_SITE_FILE} | hive-site |
hive-exec-log4j.properties | n/a | hive-exec-log4j |
hive-log4j.properties | n/a | hive-log4j |
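Where an AMI-version cluster installed a custom hive-site.xml with `hive-script --install-hive-site`, a 4.x+ cluster can set the same properties through the hive-site classification. A minimal sketch, assuming a placeholder warehouse location:

```bash
# Sketch: write the hive-site classification to a file and reference it at
# cluster creation. The S3 path is a placeholder.
cat > hive-config.json <<'EOF'
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.warehouse.dir": "s3://amzn-s3-demo-bucket/hive-warehouse"
    }
  }
]
EOF
aws emr create-cluster --release-label emr-4.0.0 --applications Name=Hive \
  --use-default-roles --ec2-attributes KeyName=myKey \
  --instance-type m3.xlarge --instance-count 3 \
  --configurations file://./hive-config.json
```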
Affected application file name | AMI version bootstrap action | Configuration classification |
---|---|---|
emrfs-site.xml | configure-hadoop -e | emrfs-site |
n/a | s3get -s s3://custom-provider.jar -d /usr/share/aws/emr/auxlib/ | emrfs-site (with new setting fs.s3.cse.encryptionMaterialsProvider.uri) |
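The second row reflects that a custom encryption materials provider no longer needs to be copied into /usr/share/aws/emr/auxlib/ with s3get; instead, emrfs-site can point at it directly. A sketch with placeholder bucket and class names:

```bash
# Sketch: configure a custom EncryptionMaterialsProvider through emrfs-site.
# The JAR location and provider class name are placeholders.
cat > emrfs-config.json <<'EOF'
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.cse.enabled": "true",
      "fs.s3.cse.encryptionMaterialsProvider.uri": "s3://amzn-s3-demo-bucket/custom-provider.jar",
      "fs.s3.cse.encryptionMaterialsProvider": "com.example.MyEncryptionMaterialsProvider"
    }
  }
]
EOF
# Pass to create-cluster with: --configurations file://./emrfs-config.json
```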
For a list of all classifications, see Configure applications.
Application environment variables
When using an AMI version, a `hadoop-user-env.sh` script is used along with the `configure-daemons` bootstrap action to configure the Hadoop environment. The script includes the following actions:

```bash
#!/bin/bash
export HADOOP_USER_CLASSPATH_FIRST=true;
echo "HADOOP_CLASSPATH=/path/to/my.jar" >> /home/hadoop/conf/hadoop-user-env.sh
```
In Amazon EMR release 4.x, you do the same using the `hadoop-env` configuration classification, as shown in the following example:

```json
[
  {
    "Classification": "hadoop-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "HADOOP_USER_CLASSPATH_FIRST": "true",
          "HADOOP_CLASSPATH": "/path/to/my.jar"
        }
      }
    ]
  }
]
```
As another example, using `configure-daemons` and passing `--namenode-heap-size=2048` and `--namenode-opts=-XX:GCTimeRatio=19` is equivalent to the following configuration classifications.

```json
[
  {
    "Classification": "hadoop-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "HADOOP_NAMENODE_HEAPSIZE": "2048",
          "HADOOP_NAMENODE_OPTS": "-XX:GCTimeRatio=19"
        }
      }
    ]
  }
]
```
Other application environment variables are no longer defined in `/home/hadoop/.bashrc`. Instead, they are primarily set in `/etc/default` files per component or application, such as `/etc/default/hadoop`. Wrapper scripts in `/usr/bin/` installed by application RPMs may also set additional environment variables before invoking the actual bin script.
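To see where these variables now live, you can inspect the files directly on a 4.x+ cluster node. A minimal sketch; the exact set of files varies by release and installed applications:

```bash
ls /etc/default/                 # one environment file per component or application
cat /etc/default/hadoop          # Hadoop-wide environment settings
head -n 20 /usr/bin/hadoop       # wrapper script that may set variables before exec
```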
Service ports
When using an AMI version, some services use custom ports.
Setting | AMI version 3.x | Open-source default |
---|---|---|
fs.default.name | hdfs://emrDeterminedIP:9000 | default (hdfs://emrDeterminedIP:8020) |
dfs.datanode.address | 0.0.0.0:9200 | default (0.0.0.0:50010) |
dfs.datanode.http.address | 0.0.0.0:9102 | default (0.0.0.0:50075) |
dfs.datanode.https.address | 0.0.0.0:9402 | default (0.0.0.0:50475) |
dfs.datanode.ipc.address | 0.0.0.0:9201 | default (0.0.0.0:50020) |
dfs.http.address | 0.0.0.0:9101 | default (0.0.0.0:50070) |
dfs.https.address | 0.0.0.0:9202 | default (0.0.0.0:50470) |
dfs.secondary.http.address | 0.0.0.0:9104 | default (0.0.0.0:50090) |
yarn.nodemanager.address | 0.0.0.0:9103 | default (${yarn.nodemanager.hostname}:0) |
yarn.nodemanager.localizer.address | 0.0.0.0:9033 | default (${yarn.nodemanager.hostname}:8040) |
yarn.nodemanager.webapp.address | 0.0.0.0:9035 | default (${yarn.nodemanager.hostname}:8042) |
yarn.resourcemanager.address | emrDeterminedIP:9022 | default (${yarn.resourcemanager.hostname}:8032) |
yarn.resourcemanager.admin.address | emrDeterminedIP:9025 | default (${yarn.resourcemanager.hostname}:8033) |
yarn.resourcemanager.resource-tracker.address | emrDeterminedIP:9023 | default (${yarn.resourcemanager.hostname}:8031) |
yarn.resourcemanager.scheduler.address | emrDeterminedIP:9024 | default (${yarn.resourcemanager.hostname}:8030) |
yarn.resourcemanager.webapp.address | 0.0.0.0:9026 | default (${yarn.resourcemanager.hostname}:8088) |
yarn.web-proxy.address | emrDeterminedIP:9046 | default (no-value) |
yarn.resourcemanager.hostname | 0.0.0.0 (default) | emrDeterminedIP |
Note
The `emrDeterminedIP` is an IP address that is generated by Amazon EMR.
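To confirm the address a daemon is actually using on a running cluster, you can query the live configuration. A sketch, run on the master node:

```bash
# Expect the hdfs://...:9000 form on an AMI version and the open-source
# :8020 default on 4.x+ releases.
hdfs getconf -confKey fs.default.name
```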
Users
When using an AMI version, the user `hadoop` runs all processes and owns all files. In Amazon EMR release version 4.0.0 and later, users exist at the application and component level.
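One way to observe the difference, as a sketch run on a cluster node: on an AMI version the daemon processes are owned by `hadoop`, while on 4.0.0 and later they typically run as application-level users such as `hdfs` and `yarn`.

```bash
# Show the owner of the NameNode process; the bracket keeps grep from
# matching its own command line.
ps aux | grep '[N]ameNode'
```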
Installation sequence, installed artifacts, and log file locations
When using an AMI version, application artifacts and their configuration directories are installed in the `/home/hadoop/application` directory, where application is the name of the application. For example, if you installed Hive, the directory would be `/home/hadoop/hive`. In Amazon EMR release 4.0.0 and later, application artifacts are installed in the `/usr/lib/application` directory. When using an AMI version, log files are found in various places. The table below lists the locations.
Daemon or application | Directory location |
---|---|
instance-state | node/instance-id/instance-state/ |
hadoop-hdfs-namenode | daemons/instance-id/hadoop-hadoop-namenode.log |
hadoop-hdfs-datanode | daemons/instance-id/hadoop-hadoop-datanode.log |
hadoop-yarn (ResourceManager) | daemons/instance-id/yarn-hadoop-resourcemanager |
hadoop-yarn (Proxy Server) | daemons/instance-id/yarn-hadoop-proxyserver |
mapred-historyserver | daemons/instance-id/ |
httpfs | daemons/instance-id/httpfs.log |
hive-server | node/instance-id/hive-server/hive-server.log |
hive-metastore | node/instance-id/apps/hive.log |
Hive CLI | node/instance-id/apps/hive.log |
YARN applications user logs and container logs | task-attempts/ |
Mahout | N/A |
Pig | N/A |
spark-historyserver | N/A |
mapreduce job history files | jobs/ |
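The locations in this table are relative to the cluster's log destination. If the cluster was launched with logging enabled, a sketch of browsing the daemon logs (the bucket, prefix, and cluster ID below are placeholders):

```bash
# Logs are pushed to the S3 location given by --log-uri at cluster creation.
aws s3 ls s3://amzn-s3-demo-bucket/logs/j-2AXXXXXXGAPLF/daemons/ --recursive
```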
Command runner
When using an AMI version, many scripts or programs, like `/home/hadoop/contrib/streaming/hadoop-streaming.jar`, are not placed on the shell login path environment, so you need to specify the full path when you use a jar file such as command-runner.jar or script-runner.jar to execute the scripts. The `command-runner.jar` is located on the AMI, so there is no need to know a full URI, as was the case with `script-runner.jar`.
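As a sketch of the difference when submitting steps on a 4.x+ cluster: `command-runner.jar` is referenced by name alone, while `script-runner.jar` needs its full S3 URI. The cluster ID, script location, and region below are placeholders:

```bash
# command-runner.jar: no path or URI needed.
aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps \
  Type=CUSTOM_JAR,Name="Command runner example",Jar=command-runner.jar,Args=["echo","hello"]

# script-runner.jar: referenced by its full, region-specific S3 URI.
aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps \
  Type=CUSTOM_JAR,Name="Script runner example",Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://amzn-s3-demo-bucket/my-script.sh"]
```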
Replication factor
The replication factor lets you configure when to start a Hadoop JVM. You can start a new Hadoop JVM for every task, which provides better task isolation, or you can share JVMs between tasks, providing lower framework overhead. If you are processing many small files, it makes sense to reuse the JVM many times to amortize the cost of start-up. However, if each task takes a long time or processes a large amount of data, then you might choose to not reuse the JVM to ensure that all memory is freed for subsequent tasks. When using an AMI version, you can customize the replication factor using the `configure-hadoop` bootstrap action to set the `mapred.job.reuse.jvm.num.tasks` property.
The following example demonstrates setting the JVM reuse factor for infinite JVM reuse.
Note
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).
aws emr create-cluster --name "
Test cluster
" --ami-version3.11.0
\ --applications Name=Hue
Name=Hive
Name=Pig
\ --use-default-roles --ec2-attributes KeyName=myKey
\ --instance-groups InstanceGroupType=MASTER
,InstanceCount=1
,InstanceType=m3.xlarge
\ InstanceGroupType=CORE
,InstanceCount=2
,InstanceType=m3.xlarge
\ --bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop
,\ Name="Configuring infinite JVM reuse"
,Args=["-m","mapred.job.reuse.jvm.num.tasks=-1"
]
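For comparison, on release 4.x and later, mapred-site.xml properties are set through the mapred-site classification rather than `configure-hadoop -m`. Note that Hadoop 2, which backs the 4.x releases, dropped MapReduce task JVM reuse, so this particular property has no direct 4.x equivalent; the following sketch only illustrates the pattern, using an arbitrary mapred-site property:

```bash
# Sketch of the 4.x+ pattern: a mapred-site classification in place of
# "configure-hadoop -m key=value". The property shown is illustrative.
aws emr create-cluster --name "Test cluster" --release-label emr-4.0.0 \
--applications Name=Hue Name=Hive Name=Pig \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--configurations '[{"Classification":"mapred-site","Properties":{"mapreduce.map.speculative":"false"}}]'
```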