

# Create or run a Hadoop application


**Topics**
+ [Build binaries using Amazon EMR](emr-build-binaries.md)
+ [Process data with streaming](UseCase_Streaming.md)
+ [Process data with a custom JAR](UseCase_CustomJar.md)

# Build binaries using Amazon EMR


You can use Amazon EMR as a build environment to compile programs for use in your cluster. Programs that you use with Amazon EMR must be compiled on a system running the same version of Linux used by Amazon EMR. For a 32-bit version, compile on a 32-bit machine or with 32-bit cross-compilation options turned on. For a 64-bit version, compile on a 64-bit machine or with 64-bit cross-compilation options turned on. For more information about EC2 instance versions, see [Plan and configure EC2 instances](https://docs.amazonaws.cn/emr/latest/ManagementGuide/emr-plan-ec2-instances.html) in the *Amazon EMR Management Guide*. Supported programming languages include C++, Python, and C#. 

The following table outlines the steps involved to build and test your application using Amazon EMR.


**Process for building a module**  

| Step | Action |
| --- | --- |
| 1 | Connect to the master node of your cluster. |
| 2 | Copy source files to the master node. |
| 3 | Build binaries with any necessary optimizations. |
| 4 | Copy binaries from the master node to Amazon S3. |

The details for each of these steps are covered in the sections that follow. 

**To connect to the master node of the cluster**
+ Follow the instructions at [Connect to the master node using SSH](https://docs.amazonaws.cn/emr/latest/ManagementGuide/emr-connect-master-node-ssh.html) in the *Amazon EMR Management Guide*.

**To copy source files to the master node**

1. Put your source files in an Amazon S3 bucket. To learn how to create buckets and how to move data into Amazon S3, see the [Amazon Simple Storage Service User Guide](https://docs.amazonaws.cn/AmazonS3/latest/userguide/).

1. Create a folder on your Hadoop cluster for your source files by entering a command similar to the following:

   ```
   mkdir SourceFiles
   ```

1. Copy your source files from Amazon S3 to the master node by typing a command similar to the following:

   ```
   hadoop fs -get s3://amzn-s3-demo-bucket/SourceFiles SourceFiles
   ```

**Build binaries with any necessary optimizations**  
How you build your binaries depends on many factors. Follow the instructions for your specific build tools to set up and configure your environment. You can use Hadoop system specification commands to obtain cluster information and determine how to install your build environment.

**To identify system specifications**
+ Use the following commands to verify the architecture you are using to build your binaries.

  1. To view the version of Debian, enter the following command:

     ```
     master$ cat /etc/issue
     ```

     The output looks similar to the following.

     ```
     Debian GNU/Linux 5.0
     ```

  1. To view the public DNS name and processor size, enter the following command:

     ```
     master$ uname -a
     ```

     The output looks similar to the following.

     ```
     Linux domU-12-31-39-17-29-39.compute-1.internal 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:34:28 EST 2008 x86_64 GNU/Linux
     ```

  1. To view the processor speed, enter the following command:

     ```
     master$ cat /proc/cpuinfo
     ```

     The output looks similar to the following.

     ```
     processor : 0
     vendor_id : GenuineIntel
     model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
     flags : fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr cda lahf_lm
     ...
     ```

Once your binaries are built, you can copy the files to Amazon S3.

**To copy binaries from the master node to Amazon S3**
+ Type the following command to copy the binaries to your Amazon S3 bucket:

  ```
  hadoop fs -put BinaryFiles s3://amzn-s3-demo-bucket/BinaryDestination
  ```

# Process data with streaming


Hadoop streaming is a utility that comes with Hadoop that enables you to develop MapReduce executables in languages other than Java. Streaming is implemented in the form of a JAR file, so you can run it from the Amazon EMR API or command line just like a standard JAR file. 

This section describes how to use streaming with Amazon EMR. 

**Note**  
Apache Hadoop streaming is an independent tool. As such, not all of its functions and parameters are described here. For more information about Hadoop streaming, go to [http://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html](http://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html).

## Using the Hadoop streaming utility


This section describes how to use Hadoop's streaming utility.


**Hadoop process**  

| Step | Action |
| --- | --- |
| 1 |  Write your mapper and reducer executables in the programming language of your choice. Follow the directions in Hadoop's documentation to write your streaming executables. The programs should read their input from standard input and output data through standard output. By default, each line of input/output represents a record, and the first tab on each line separates the key from the value.  | 
| 2 |  Test your executables locally and upload them to Amazon S3.  | 
| 3 |  Use the Amazon EMR command line interface or Amazon EMR console to run your application.  | 

Each mapper script launches as a separate process in the cluster. Each reducer executable turns the output of the mapper executable into the data output by the job flow.
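
The default record convention can be sketched in a few lines of Python; `split_record` is a hypothetical helper for illustration, not part of Hadoop or Amazon EMR:

```python
def split_record(line):
    """Split one streaming record into (key, value) at the first tab.

    Hadoop streaming's default convention: everything before the first
    tab is the key, the remainder is the value; a line with no tab
    becomes a key with an empty value.
    """
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value

# One record per line; the first tab separates the key from the value.
print(split_record("the\t1"))    # ('the', '1')
print(split_record("a\tb\tc"))   # ('a', 'b\tc')
```

A reducer receiving such records groups them by key before combining the values.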

The `input`, `output`, `mapper`, and `reducer` parameters are required by most streaming applications. The following table describes these and other, optional parameters.


| Parameter | Description | Required | 
| --- | --- | --- | 
| -input |  Location on Amazon S3 of the input data. Type: String Default: None Constraint: URI. If no protocol is specified then it uses the cluster's default file system.   | Yes | 
| -output |  Location on Amazon S3 where Amazon EMR uploads the processed data. Type: String Default: None Constraint: URI. If a location is not specified, Amazon EMR uploads the data to the location specified by `input`.  | Yes | 
| -mapper |  Name of the mapper executable. Type: String Default: None  | Yes | 
| -reducer |  Name of the reducer executable. Type: String Default: None  | Yes | 
| -cacheFile |  An Amazon S3 location containing files for Hadoop to copy into your local working directory (primarily to improve performance). Type: String Default: None Constraints: [URI]#[symlink name to create in working directory]   | No | 
| -cacheArchive |  JAR file to extract into the working directory. Type: String Default: None Constraints: [URI]#[symlink directory name to create in working directory]   | No | 
| -combiner |  Combines results Type: String Default: None Constraints: Java class name  | No | 
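
Taken together, the required parameters assemble into an invocation along the following lines. This is a sketch only: the streaming JAR path and the bucket, folder, and script names are placeholders to replace with your own, and the *aggregate* reducer refers to Hadoop's built-in Aggregate library.

```
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
  -files s3://amzn-s3-demo-bucket/scripts/myMapper.py \
  -mapper myMapper.py \
  -reducer aggregate \
  -input s3://amzn-s3-demo-bucket/input \
  -output s3://amzn-s3-demo-bucket/output
```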

The following code sample is a mapper executable written in Python. This script is part of the WordCount sample application.

```
#!/usr/bin/python
import sys

def main(argv):
  line = sys.stdin.readline()
  try:
    while line:
      line = line.rstrip()
      words = line.split()
      for word in words:
        print("LongValueSum:" + word + "\t" + "1")
      line = sys.stdin.readline()
  except EOFError:
    return None

if __name__ == "__main__":
  main(sys.argv)
```
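
The mapper above emits `LongValueSum:word<tab>1` records, which the built-in *aggregate* reducer sums per key. For illustration only, a stand-alone reducer performing the same summation could look like the following sketch; it assumes its input is already sorted by key, as Hadoop guarantees, and in a real streaming step the lines would arrive on standard input rather than from a list:

```python
def reduce_sorted(lines):
    """Sum the counts in a key-sorted stream of 'key<tab>count'
    records, yielding one 'key<tab>total' record per distinct key."""
    current_key, total = None, 0
    for line in lines:
        key, _, count = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                yield current_key + "\t" + str(total)
            current_key, total = key, 0
        total += int(count)
    if current_key is not None:
        yield current_key + "\t" + str(total)

# Sorted intermediate records, as the shuffle phase would deliver them.
print(list(reduce_sorted(["cat\t1", "the\t1", "the\t1"])))
# ['cat\t1', 'the\t2']
```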

# Submit a streaming step


This section covers the basics of submitting a streaming step to a cluster. A streaming application reads input from standard input and then runs a script or executable (called a mapper) against each input. The result from each of the inputs is saved locally, typically on a Hadoop Distributed File System (HDFS) partition. After all the input is processed by the mapper, a second script or executable (called a reducer) processes the mapper results. The results from the reducer are sent to standard output. You can chain together a series of streaming steps, where the output of one step becomes the input of another step. 

The mapper and the reducer can each be referenced as a file or you can supply a Java class. You can implement the mapper and reducer in any of the supported languages, including Ruby, Perl, Python, PHP, or Bash.
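
The whole pass can be simulated locally before you submit anything to a cluster. The following Python sketch (an illustration, not an Amazon EMR API) chains a word-count mapper and a summing reducer with a sort in between, which is roughly what a streaming step does at scale:

```python
def run_streaming(lines, mapper, reducer):
    """Simulate one streaming step: map each input line, sort the
    intermediate records by key (as Hadoop's shuffle phase does),
    then reduce. mapper and reducer take and return lists of lines."""
    return reducer(sorted(mapper(lines)))

def word_count_mapper(lines):
    # Emit one 'word<tab>1' record per word, WordCount-style.
    return [word + "\t1" for line in lines for word in line.split()]

def sum_reducer(lines):
    # Sum the counts per key; sorted input keeps equal keys adjacent.
    totals = {}
    for line in lines:
        key, _, count = line.partition("\t")
        totals[key] = totals.get(key, 0) + int(count)
    return [k + "\t" + str(v) for k, v in sorted(totals.items())]

print(run_streaming(["the cat", "the dog"], word_count_mapper, sum_reducer))
# ['cat\t1', 'dog\t1', 'the\t2']
```

To chain streaming steps, you would feed this output back in as the next step's input lines.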

## Submit a streaming step using the console


This example describes how to use the Amazon EMR console to submit a streaming step to a running cluster.

**To submit a streaming step**

1. Open the Amazon EMR console at [https://console.amazonaws.cn/emr](https://console.amazonaws.cn/emr/).

1. In the **Cluster List**, select the name of your cluster.

1. Scroll to the **Steps** section and expand it, then choose **Add step**.

1. In the **Add Step** dialog box:
   + For **Step type**, choose **Streaming program**.
   + For **Name**, accept the default name (Streaming program) or type a new name.
   + For **Mapper**, type or browse to the location of your mapper class in Hadoop, or an S3 bucket where the mapper executable, such as a Python program, resides. The path value must be in the form *BucketName*/*path*/*MapperExecutable*.
   + For **Reducer**, type or browse to the location of your reducer class in Hadoop, or an S3 bucket where the reducer executable, such as a Python program, resides. The path value must be in the form *BucketName*/*path*/*ReducerExecutable*. Amazon EMR supports the special *aggregate* keyword. For more information, see the Aggregate library supplied by Hadoop.
   + For **Input S3 location**, type or browse to the location of your input data. 
   + For **Output S3 location**, type or browse to the name of your Amazon S3 output bucket.
   + For **Arguments**, leave the field blank.
   + For **Action on failure**, accept the default option (**Continue**).

1. Choose **Add**. The step appears in the console with a status of Pending. 

1. The status of the step changes from Pending to Running to Completed as the step runs. To update the status, choose the **Refresh** icon above the Actions column. 

## Amazon CLI


These examples demonstrate how to use the Amazon CLI to create a cluster and submit a Streaming step. 

**To create a cluster and submit a streaming step using the Amazon CLI**
+ To create a cluster and submit a streaming step using the Amazon CLI, type the following command and replace *myKey* with the name of your EC2 key pair. Note that your argument for `--files` should be the Amazon S3 path to your script's location, and the arguments for `-mapper` and `-reducer` should be the names of the respective script files.

  ```
  aws emr create-cluster --name "Test cluster" --release-label emr-7.12.0 --applications Name=Hue Name=Hive Name=Pig --use-default-roles \
  --ec2-attributes KeyName=myKey --instance-type m5.xlarge --instance-count 3 \
  --steps Type=STREAMING,Name="Streaming Program",ActionOnFailure=CONTINUE,Args=[--files,pathtoscripts,-mapper,mapperscript,-reducer,reducerscript,aggregate,-input,pathtoinputdata,-output,pathtooutputbucket]
  ```
**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace them with a caret (^).

  When you specify the instance count without using the `--instance-groups` parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes use the instance type specified in the command.
**Note**  
If you have not previously created the default Amazon EMR service role and EC2 instance profile, type `aws emr create-default-roles` to create them before typing the `create-cluster` subcommand.

  For more information on using Amazon EMR commands in the Amazon CLI, see [https://docs.amazonaws.cn/cli/latest/reference/emr](https://docs.amazonaws.cn/cli/latest/reference/emr).

# Process data with a custom JAR


A custom JAR runs a compiled Java program that you can upload to Amazon S3. You should compile the program against the version of Hadoop you want to launch, and submit a `CUSTOM_JAR` step to your Amazon EMR cluster. For more information about how to compile a JAR file, see [Build binaries using Amazon EMR](emr-build-binaries.md).

For more information about building a Hadoop MapReduce application, see the [MapReduce Tutorial](http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html) in the Apache Hadoop documentation.

**Topics**
+ [Submit a custom JAR step](emr-launch-custom-jar-cli.md)

# Submit a custom JAR step


This section covers the basics of submitting a custom JAR step in Amazon EMR. Submitting a custom JAR step enables you to write a script to process your data with the Java programming language. 

## Submit a custom JAR step with the console


This example describes how to use the Amazon EMR console to submit a custom JAR step to a running cluster.

**To submit a custom JAR step with the console**

1. Open the Amazon EMR console at [https://console.amazonaws.cn/emr](https://console.amazonaws.cn/emr/).

1. In the **Cluster List**, select the name of your cluster.

1. Scroll to the **Steps** section and expand it, then choose **Add step**.

1. In the **Add Step** dialog:
   + For **Step type**, choose **Custom JAR**.
   + For **Name**, accept the default name (Custom JAR) or type a new name.
   + For **JAR S3 location**, type or browse to the location of your JAR file. The JAR location can be a path in Amazon S3 or a fully qualified Java class in the classpath. 
   + For **Arguments**, type any required arguments as space-separated strings or leave the field blank.
   + For **Action on failure**, accept the default option (**Continue**).

1. Choose **Add**. The step appears in the console with a status of Pending. 

1. The status of the step changes from Pending to Running to Completed as the step runs. To update the status, choose the **Refresh** icon above the Actions column. 

## Launching a cluster and submitting a custom JAR step with the Amazon CLI


**To launch a cluster and submit a custom JAR step with the Amazon CLI**

To launch a cluster and submit a custom JAR step with the Amazon CLI, type the `create-cluster` subcommand with the `--steps` parameter.
+ To launch a cluster and submit a custom JAR step, type the following command, replace *myKey* with the name of your EC2 key pair, and replace *amzn-s3-demo-bucket* with your bucket name.

  ```
  aws emr create-cluster --name "Test cluster" --release-label emr-7.12.0 \
  --applications Name=Hue Name=Hive Name=Pig --use-default-roles \
  --ec2-attributes KeyName=myKey --instance-type m5.xlarge --instance-count 3 \
  --steps Type=CUSTOM_JAR,Name="Custom JAR Step",ActionOnFailure=CONTINUE,Jar=pathtojarfile,Args=["pathtoinputdata","pathtooutputbucket","arg1","arg2"]
  ```
**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace them with a caret (^).

  When you specify the instance count without the `--instance-groups` parameter, a single primary node launches, and the remaining instances launch as core nodes. All nodes use the instance type that you specify in the command.
**Note**  
If you have not previously created the default Amazon EMR service role and EC2 instance profile, type `aws emr create-default-roles` to create them before typing the `create-cluster` subcommand.

  For more information on using Amazon EMR commands in the Amazon CLI, see [https://docs.amazonaws.cn/cli/latest/reference/emr](https://docs.amazonaws.cn/cli/latest/reference/emr).

## Third-party dependencies


Sometimes you may need to include additional JARs in the MapReduce classpath for your program to use. You have two options for doing this:
+ Include `--libjars s3://URI_to_JAR` in the step options for the procedure in [Launching a cluster and submitting a custom JAR step with the Amazon CLI](#emr-dev-create-jar-cli).
+ Launch the cluster with a modified `mapreduce.application.classpath` setting in `mapred-site.xml`. Use the `mapred-site` configuration classification. To create the cluster with the step using Amazon CLI, this would look like the following:

  ```
  aws emr create-cluster --release-label emr-7.12.0 \
  --applications Name=Hue Name=Hive Name=Pig --use-default-roles \
  --instance-type m5.xlarge --instance-count 2  --ec2-attributes KeyName=myKey \
  --steps Type=CUSTOM_JAR,Name="Custom JAR Step",ActionOnFailure=CONTINUE,Jar=pathtojarfile,Args=["pathtoinputdata","pathtooutputbucket","arg1","arg2"] \
  --configurations https://s3.amazonaws.com/amzn-s3-demo-bucket/myfolder/myConfig.json
  ```

  `myConfig.json`:

  ```
   [
     {
       "Classification": "mapred-site",
       "Properties": {
         "mapreduce.application.classpath": "path1,path2"
       }
     }
   ]
  ```

  The comma-separated list of paths should be appended to the JVM classpath for each task.