Data preparation using Amazon EMR

Important

Amazon SageMaker Studio and Amazon SageMaker Studio Classic are two of the machine learning environments that you can use to interact with SageMaker AI.

If your domain was created after November 30, 2023, Studio is your default experience.

If your domain was created before November 30, 2023, Amazon SageMaker Studio Classic is your default experience. To use Studio when Studio Classic is your default experience, see Migration from Amazon SageMaker Studio Classic.

When you migrate from Amazon SageMaker Studio Classic to Amazon SageMaker Studio, there is no loss in feature availability. Studio Classic also exists as an application within Amazon SageMaker Studio to help you run your legacy machine learning workflows.

Amazon SageMaker Studio and Studio Classic come with built-in integration with Amazon EMR. Within JupyterLab and Studio Classic notebooks, data scientists and data engineers can discover and connect to existing Amazon EMR clusters, then interactively explore, visualize, and prepare large-scale data for machine learning using Apache Spark, Apache Hive, or Presto. With a single click, they can access the Spark UI to monitor the status and metrics of their Spark jobs without leaving their notebook.
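Cluster discovery in Studio happens through the UI, but the same idea can be illustrated outside the notebook. The following is a minimal sketch, not Studio's own implementation: it assumes a boto3 Amazon EMR client is passed in, and the helper name `list_active_clusters` and the choice of `RUNNING` and `WAITING` as the "connectable" states are assumptions made for illustration.

```python
from typing import Any, Dict, List

# Assumption for this sketch: only clusters in these states are worth connecting to.
ACTIVE_STATES = ["RUNNING", "WAITING"]


def list_active_clusters(emr_client: Any) -> List[Dict[str, str]]:
    """Return the ID, name, and state of every cluster in an active state.

    `emr_client` is expected to behave like boto3.client("emr").
    """
    clusters: List[Dict[str, str]] = []
    # list_clusters is paginated, so walk every page of results.
    paginator = emr_client.get_paginator("list_clusters")
    for page in paginator.paginate(ClusterStates=ACTIVE_STATES):
        for cluster in page.get("Clusters", []):
            clusters.append(
                {
                    "id": cluster["Id"],
                    "name": cluster["Name"],
                    "state": cluster["Status"]["State"],
                }
            )
    return clusters
```

With boto3 installed and credentials configured, passing `boto3.client("emr")` to `list_active_clusters` would return the clusters visible to the caller's IAM identity in the client's Region.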

Administrators can create Amazon CloudFormation templates that define Amazon EMR clusters and make those templates available in the Amazon Service Catalog for Studio and Studio Classic users to launch. Data scientists can then choose a predefined template to self-provision an Amazon EMR cluster directly from their Studio environment. Administrators can further parameterize the templates so that users choose certain cluster settings from a set of predefined values. For example, users can specify the number of core nodes or select a node's instance type from a dropdown menu.
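To make the idea of a parameterized template concrete, here is a minimal sketch that builds a CloudFormation template as a Python dictionary. The function name `build_cluster_template`, the parameter names, the release label, the primary node instance type, and the default EMR role names are all placeholder assumptions. A template intended for Studio may need additional settings that are covered in the template configuration instructions and are not shown here.

```python
import json
from typing import Any, Dict, List


def build_cluster_template(
    core_node_choices: List[int], instance_type_choices: List[str]
) -> Dict[str, Any]:
    """Build a CloudFormation template whose parameters are limited to allowed values."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative parameterized Amazon EMR cluster template",
        "Parameters": {
            # AllowedValues restricts users to the choices the administrator predefined.
            "CoreInstanceCount": {
                "Type": "Number",
                "Default": core_node_choices[0],
                "AllowedValues": core_node_choices,
            },
            "CoreInstanceType": {
                "Type": "String",
                "Default": instance_type_choices[0],
                "AllowedValues": instance_type_choices,
            },
        },
        "Resources": {
            "EMRCluster": {
                "Type": "AWS::EMR::Cluster",
                "Properties": {
                    "Name": {"Ref": "AWS::StackName"},
                    "ReleaseLabel": "emr-6.15.0",  # placeholder release
                    "Applications": [{"Name": "Spark"}],
                    "JobFlowRole": "EMR_EC2_DefaultRole",  # placeholder role
                    "ServiceRole": "EMR_DefaultRole",  # placeholder role
                    "Instances": {
                        "MasterInstanceGroup": {
                            "InstanceCount": 1,
                            "InstanceType": "m5.xlarge",  # placeholder type
                        },
                        "CoreInstanceGroup": {
                            "InstanceCount": {"Ref": "CoreInstanceCount"},
                            "InstanceType": {"Ref": "CoreInstanceType"},
                        },
                    },
                },
            }
        },
    }


template = build_cluster_template([2, 4, 8], ["m5.xlarge", "m5.2xlarge"])
template_json = json.dumps(template, indent=2)
```

Because each parameter carries an `AllowedValues` list, a launch form generated from the template can offer only the administrator's predefined choices, while the security and networking settings stay fixed in the template body.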

Using Amazon CloudFormation, administrators can control the organizational, security, and networking setup of Amazon EMR clusters. Data scientists and data engineers can then customize those templates for their workloads to create on-demand Amazon EMR clusters directly from Studio and Studio Classic without setting up complex configurations. Users can terminate Amazon EMR clusters after use.
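Terminating a cluster after use can also be expressed against the Amazon EMR API. The sketch below is an illustration only, not the mechanism Studio uses: the helper name `terminate_cluster` is hypothetical, and it assumes a boto3 EMR client, a cluster without termination protection, and IAM permissions that allow the call.

```python
from typing import Any


def terminate_cluster(emr_client: Any, cluster_id: str) -> str:
    """Request termination of one cluster and return the state reported afterwards.

    `emr_client` is expected to behave like boto3.client("emr").
    """
    # Termination is asynchronous: this call only submits the request.
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
    # Read back the cluster state so the caller can see what EMR reports.
    response = emr_client.describe_cluster(ClusterId=cluster_id)
    return response["Cluster"]["Status"]["State"]
```

Because termination is asynchronous, the returned state reflects only what the service reports immediately after the request; a caller that needs to confirm shutdown would have to check the state again later.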

If you are an administrator:

Ensure that you have enabled communication between Studio or Studio Classic and Amazon EMR clusters. For instructions, see the Configure network access for your Amazon EMR cluster section. Once this communication is enabled, you can:

- Configure Amazon EMR CloudFormation templates in the Service Catalog
- Configure listing Amazon EMR clusters

If you are a data scientist or data engineer, you can:

