Automate recurring Amazon EMR clusters with Amazon Data Pipeline

Note

Amazon Data Pipeline is no longer available to new customers. Existing customers of Amazon Data Pipeline can continue to use the service as normal.

Amazon Data Pipeline is a service that automates the movement and transformation of data. You can use it to schedule moving input data into Amazon S3 and to schedule launching clusters to process that data. For example, if a web server records traffic logs and you want to run a cluster every week to analyze that traffic data, you can use Amazon Data Pipeline to schedule those clusters. Amazon Data Pipeline workflows are data-driven, so one task (launching the cluster) can depend on another task (moving the input data to Amazon S3). The service also provides built-in retry functionality for tasks that fail.
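The weekly log-analysis scenario above can be sketched as a pipeline definition. The following Python snippet is a minimal illustration, not a tested production definition: it only builds the definition as a dictionary and prints it as JSON, without calling any AWS API. The bucket names, the log source path, the instance types, the release label, the retry count, and the step string are all hypothetical placeholders, and the object types and field names should be checked against the Amazon Data Pipeline Developer Guide before use.

```python
import json


def build_weekly_pipeline(log_source, input_prefix, start_time):
    """Return a pipeline definition (as a dict) for a weekly cluster run.

    All values passed in or hard-coded here are illustrative assumptions.
    """
    schedule = {
        "id": "WeeklySchedule",
        "type": "Schedule",
        "period": "1 weeks",          # run once per week
        "startDateTime": start_time,  # for example "2024-01-01T00:00:00"
    }
    staging_host = {
        "id": "StagingHost",
        "type": "Ec2Resource",
        "instanceType": "t2.micro",   # hypothetical instance type
        "terminateAfter": "1 hours",
        "schedule": {"ref": "WeeklySchedule"},
    }
    # First task: move the input data into Amazon S3.
    stage_logs = {
        "id": "StageLogs",
        "type": "ShellCommandActivity",
        "command": f"aws s3 cp {log_source} {input_prefix} --recursive",
        "runsOn": {"ref": "StagingHost"},
        "schedule": {"ref": "WeeklySchedule"},
        "maximumRetries": "3",        # retry a failed copy (assumed value)
    }
    cluster = {
        "id": "WeeklyCluster",
        "type": "EmrCluster",
        "releaseLabel": "emr-6.15.0",        # hypothetical release label
        "masterInstanceType": "m5.xlarge",   # hypothetical instance types
        "coreInstanceType": "m5.xlarge",
        "coreInstanceCount": "2",
        "terminateAfter": "2 hours",
        "schedule": {"ref": "WeeklySchedule"},
    }
    # Second task: launch the cluster step only after the data has been staged.
    analyze = {
        "id": "AnalyzeTraffic",
        "type": "EmrActivity",
        "runsOn": {"ref": "WeeklyCluster"},
        "dependsOn": {"ref": "StageLogs"},
        # Placeholder step: a JAR location followed by its arguments.
        "step": f"s3://amzn-s3-demo-bucket/jars/analyze-traffic.jar,{input_prefix}",
        "schedule": {"ref": "WeeklySchedule"},
        "maximumRetries": "2",
    }
    return {"objects": [schedule, staging_host, stage_logs, cluster, analyze]}


if __name__ == "__main__":
    definition = build_weekly_pipeline(
        "/var/log/webserver/",
        "s3://amzn-s3-demo-bucket/traffic-logs/",
        "2024-01-01T00:00:00",
    )
    print(json.dumps(definition, indent=2))
```

In this sketch, the `dependsOn` reference on the cluster activity expresses the dependency between the two tasks, and `maximumRetries` stands in for the retry behavior described above.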

For more information about Amazon Data Pipeline, see the Amazon Data Pipeline Developer Guide, especially the tutorials regarding Amazon EMR.
