Creating a cluster with earlier AMI versions of Amazon EMR
Amazon EMR 2.x and 3.x releases are referenced by AMI version. With
Amazon EMR release 4.0.0 and later, releases are referenced by release version, using a
release label such as emr-5.11.0
. This change is
most apparent when you create a cluster using the Amazon CLI or programmatically.
When you use the Amazon CLI to create a cluster using an AMI release version, use the
--ami-version
option, for example, --ami-version 3.11.0
.
Many options, features, and applications introduced in Amazon EMR 4.0.0 and later are not
available when you specify an --ami-version
. For more information, see
create-cluster in the
Amazon CLI Command Reference.
The following example Amazon CLI command launches a cluster using an AMI version.
Note
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).
aws emr create-cluster --name "
Test cluster
" --ami-version3.11.0
\ --applications Name=Hue
Name=Hive
Name=Pig
\ --use-default-roles --ec2-attributes KeyName=myKey
\ --instance-groups InstanceGroupType=MASTER
,InstanceCount=1
,\ InstanceType=m3.xlarge
InstanceGroupType=CORE
,InstanceCount=2
,\ InstanceType=m3.xlarge
--bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop
,\ Name="Configuring infinite JVM reuse"
,Args=["-m","mapred.job.reuse.jvm.num.tasks=-1"
]
Programmatically, all Amazon EMR release versions use the RunJobFlowRequest
action in the EMR API to create clusters. The following example Java code creates a
cluster using AMI release version 3.11.0.
RunJobFlowRequest request = new RunJobFlowRequest() .withName("AmiVersion Cluster") .withAmiVersion("3.11.0") .withInstances(new JobFlowInstancesConfig() .withEc2KeyName("myKeyPair") .withInstanceCount(1) .withKeepJobFlowAliveWhenNoSteps(true) .withMasterInstanceType("m3.xlarge") .withSlaveInstanceType("m3.xlarge");
The following RunJobFlowRequest
call uses a release label instead:
RunJobFlowRequest request = new RunJobFlowRequest() .withName("ReleaseLabel Cluster") .withReleaseLabel("
emr-7.5.0
") .withInstances(new JobFlowInstancesConfig() .withEc2KeyName("myKeyPair") .withInstanceCount(1) .withKeepJobFlowAliveWhenNoSteps(true) .withMasterInstanceType("m3.xlarge") .withSlaveInstanceType("m3.xlarge");
Configuring cluster size
When your cluster runs, Hadoop determines the number of mapper and reducer tasks needed to process the data. Larger clusters should have more tasks for better resource use and shorter processing time. Typically, an EMR cluster remains the same size during the entire cluster; you set the number of tasks when you create the cluster. When you resize a running cluster, you can vary the processing during the cluster execution. Therefore, instead of using a fixed number of tasks, you can vary the number of tasks during the life of the cluster. There are two configuration options to help set the ideal number of tasks:
-
mapred.map.tasksperslot
-
mapred.reduce.tasksperslot
You can set both options in the mapred-conf.xml
file. When you submit
a job to the cluster, the job client checks the current total number of map and
reduce slots available clusterwide. The job client then uses the following equations
to set the number of tasks:
-
mapred.map.tasks
=mapred.map.tasksperslot
* map slots in cluster -
mapred.reduce.tasks
=mapred.reduce.tasksperslot
* reduce slots in cluster
The job client only reads the tasksperslot
parameter if the number of
tasks is not configured. You can override the number of tasks at any time, either
for all clusters via a bootstrap action or individually per job by adding a step to
change the configuration.
Amazon EMR withstands task node failures and continues cluster execution even if a task node becomes unavailable. Amazon EMR automatically provisions additional task nodes to replace those that fail.
You can have a different number of task nodes for each cluster step. You can also add a step to a running cluster to modify the number of task nodes. Because all steps are guaranteed to run sequentially by default, you can specify the number of running task nodes for any step.