Ways to compress the output of your Amazon EMR cluster

There are different ways to compress output that results from data processing. The compression tools you use depend on properties of your data. Compression can improve performance when you transfer large amounts of data.

Output data compression

Output data compression compresses the final output of your Hadoop job. If you use TextOutputFormat, the result is a gzipped text file. If you write to SequenceFiles, the result is a SequenceFile that is compressed internally. To enable output compression, set the configuration setting mapred.output.compress to true.

If you are running a streaming job, you can enable output compression by passing the following argument to the streaming job.

-jobconf mapred.output.compress=true

You can also use a bootstrap action to automatically compress all job outputs. The following example shows how to do that with the Ruby client.

--bootstrap-actions s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-s,mapred.output.compress=true"

Finally, if you are writing a Custom Jar, you can enable output compression with the following line when you create your job.

FileOutputFormat.setCompressOutput(conf, true);
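
Because TextOutputFormat output is written as gzipped text when output compression is enabled, anything that reads the results must decompress them first. The following is a minimal sketch of that round trip using only the standard java.util.zip classes, not the Hadoop API. The class and method names (GzipTextOutputSketch, gzipLines, gunzipLines) are hypothetical, and an in-memory byte array stands in for an output part file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class; it is not part of Hadoop or Amazon EMR.
final class GzipTextOutputSketch {

    // Writes each record as one newline-terminated line into a gzip stream,
    // which is the general shape of a gzipped text output file.
    static byte[] gzipLines(List<String> lines) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            for (String line : lines) {
                gzip.write((line + "\n").getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed here, so the trailer has been written.
        return buffer.toByteArray();
    }

    // Decompresses gzip bytes and returns the text lines they contain.
    static List<String> gunzipLines(byte[] compressed) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up, highly repetitive records; real data may compress less well.
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            records.add("key" + (i % 10) + "\tvalue");
        }
        byte[] compressed = gzipLines(records);
        System.out.println("compressed bytes: " + compressed.length);
        System.out.println("round trip ok: " + gunzipLines(compressed).equals(records));
    }
}
```

How much the data shrinks depends on the properties of your data, so measure with a representative sample before relying on a particular ratio.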

Intermediate data compression

If your job shuffles a significant amount of data from the mappers to the reducers, you can see a performance improvement by enabling intermediate compression. Intermediate compression compresses the map output and decompresses it when it arrives on the core node. The configuration setting is mapred.compress.map.output. You can enable it in the same ways as output compression.

When writing a Custom Jar, use the following line:

conf.setCompressMapOutput(true);
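
Both compression settings are ordinary Hadoop configuration properties, so a streaming job can take them in the same -jobconf form shown earlier for mapred.output.compress. The sketch below assumes that form also applies to mapred.compress.map.output, as the text above suggests. The helper class CompressionArgsSketch and its methods are hypothetical and only assemble the argument strings; they do not launch a job.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Hadoop or the EMR tools.
final class CompressionArgsSketch {

    // Property names taken from the text above.
    static final String OUTPUT_COMPRESS = "mapred.output.compress";
    static final String MAP_OUTPUT_COMPRESS = "mapred.compress.map.output";

    // Collects only the settings that are switched on, in a stable order.
    static Map<String, String> compressionSettings(boolean compressOutput,
                                                   boolean compressMapOutput) {
        Map<String, String> settings = new LinkedHashMap<>();
        if (compressOutput) {
            settings.put(OUTPUT_COMPRESS, "true");
        }
        if (compressMapOutput) {
            settings.put(MAP_OUTPUT_COMPRESS, "true");
        }
        return settings;
    }

    // Turns each setting into a "-jobconf key=value" argument pair.
    static List<String> streamingArgs(Map<String, String> settings) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> entry : settings.entrySet()) {
            args.add("-jobconf");
            args.add(entry.getKey() + "=" + entry.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        // Prints: -jobconf mapred.output.compress=true -jobconf mapred.compress.map.output=true
        System.out.println(String.join(" ", streamingArgs(compressionSettings(true, true))));
    }
}
```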

Using the Snappy library with Amazon EMR

Snappy is a compression and decompression library that is optimized for speed. It is available on Amazon EMR AMIs version 2.0 and later and is used as the default for intermediate compression. For more information about Snappy, go to http://code.google.com/p/snappy/.
