
EMR Studio considerations

Considerations

Consider the following when you work with EMR Studio:

  • EMR Studio is available in the following Amazon Web Services Regions:

    • US East (Ohio) (us-east-2)

    • US East (N. Virginia) (us-east-1)

    • US West (N. California) (us-west-1)

    • US West (Oregon) (us-west-2)

    • Africa (Cape Town) (af-south-1)

    • Asia Pacific (Hong Kong) (ap-east-1)

    • Asia Pacific (Jakarta) (ap-southeast-3)*

    • Asia Pacific (Mumbai) (ap-south-1)

    • Asia Pacific (Osaka) (ap-northeast-3)*

    • Asia Pacific (Seoul) (ap-northeast-2)

    • Asia Pacific (Singapore) (ap-southeast-1)

    • Asia Pacific (Sydney) (ap-southeast-2)

    • Asia Pacific (Tokyo) (ap-northeast-1)

    • Canada (Central) (ca-central-1)

    • Europe (Frankfurt) (eu-central-1)

    • Europe (Ireland) (eu-west-1)

    • Europe (London) (eu-west-2)

    • Europe (Milan) (eu-south-1)

    • Europe (Paris) (eu-west-3)

    • Europe (Spain) (eu-south-2)

    • Europe (Stockholm) (eu-north-1)

    • Middle East (UAE) (me-central-1)*

    • South America (São Paulo) (sa-east-1)

    • Amazon GovCloud (US-East) (us-gov-east-1)

    • Amazon GovCloud (US-West) (us-gov-west-1)

    * The Spark UI isn't supported in these Regions.

  • To let users provision new EMR clusters running on Amazon EC2 for a Workspace, you can associate an EMR Studio with a set of cluster templates. Administrators can define cluster templates with Service Catalog and can choose which cluster templates, if any, a user or group can access within a Studio.

  • To define access permissions to notebook files stored in Amazon S3 or to read secrets from Amazon Secrets Manager, use the Amazon EMR service role. Session policies aren't supported with these permissions.

  • You can create multiple EMR Studios to control access to EMR clusters in different VPCs.

  • Use the Amazon CLI to set up Amazon EMR on EKS clusters. You can then use the Studio interface to attach clusters to Workspaces with a managed endpoint to run notebook jobs.

  • There are additional considerations when you use trusted identity propagation with Amazon EMR that also apply to EMR Studio. For more information, see Considerations and limitations for Amazon EMR with the Identity Center integration.

  • EMR Studio doesn't support the following Python magic commands:

    • %alias

    • %alias_magic

    • %automagic

    • %macro

    • %%js

    • %%javascript

    • Modifying proxy_user using %configure

    • Modifying KERNEL_USERNAME using %env or %set_env

  • Amazon EMR on EKS clusters don't support SparkMagic commands for EMR Studio.

  • To write multi-line Scala statements in notebook cells, make sure that all lines except the last end with a period. The following example uses the correct syntax for multi-line Scala statements.

    val df = spark.sql("SELECT * FROM table_name").
    filter("col1=='value'").
    limit(50)
  • To augment security for the off-console applications that you might use with Amazon EMR, the application hosting domains are registered in the Public Suffix List (PSL). Examples of these hosting domains include emrstudio-prod.us-east-1.amazonaws.com, emrnotebooks-prod.us-east-1.amazonaws.com, and emrappui-prod.us-east-1.amazonaws.com. For further security, if you ever need to set sensitive cookies in the default domain name, we recommend that you use cookies with a __Host- prefix. This helps to defend your domain against cross-site request forgery (CSRF) attempts. For more information, see the Set-Cookie page in the Mozilla Developer Network documentation.
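As an illustration of the __Host- prefix convention described in the preceding item, the following sketch builds and checks a Set-Cookie header. The attribute requirements (Secure, Path=/, and no Domain attribute) come from the Set-Cookie specification rather than from EMR Studio itself, and the cookie name and value are hypothetical.

```python
# Sketch: build and validate a Set-Cookie header that uses the __Host- prefix.
# Per the Set-Cookie rules, a __Host- cookie must be Secure, must set Path=/,
# and must NOT set a Domain attribute. The name and value are hypothetical.

def host_prefixed_cookie(name, value):
    """Return a Set-Cookie header value for a __Host- prefixed cookie."""
    return f"__Host-{name}={value}; Secure; Path=/; HttpOnly; SameSite=Lax"

def is_valid_host_cookie(header):
    """Check the attribute requirements for the __Host- prefix."""
    cookie_name = header.split("=", 1)[0]
    attrs = [part.strip().lower() for part in header.split(";")[1:]]
    return (
        cookie_name.startswith("__Host-")
        and "secure" in attrs
        and "path=/" in attrs
        and not any(a.startswith("domain=") for a in attrs)
    )

header = host_prefixed_cookie("session", "abc123")
print(is_valid_host_cookie(header))  # True
```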

Known issues

  • An EMR Studio that uses IAM Identity Center with trusted identity propagation enabled can only associate with EMR clusters that also use trusted identity propagation.

  • Make sure you deactivate proxy management tools such as FoxyProxy or SwitchyOmega in the browser before you create a Studio. Active proxies can cause errors when you choose Create Studio, and result in a Network Failure error message.

  • Kernels that run on Amazon EMR on EKS clusters can fail to start due to timeout issues. If you encounter an error or issue starting the kernel, close the notebook file, shut down the kernel, and then reopen the notebook file.

  • The Restart kernel operation doesn't work as expected when you use an Amazon EMR on EKS cluster. After you select Restart kernel, refresh the Workspace for the restart to take effect.

  • If a Workspace isn't attached to a cluster, an error message appears when a Studio user opens a notebook file and tries to select a kernel. You can ignore this error message by choosing Ok, but you must attach the Workspace to a cluster and select a kernel before you can run notebook code.

  • When you use Amazon EMR 6.2.0 with a security configuration to set up cluster security, the Workspace interface appears blank and doesn't work as expected. We recommend that you use a different supported version of Amazon EMR if you want to configure data encryption or Amazon S3 authorization for EMRFS for a cluster. EMR Studio works with Amazon EMR versions 5.32.0 (Amazon EMR 5.x series) and 6.2.0 (Amazon EMR 6.x series) and higher.

  • When you debug jobs on Amazon EMR running on Amazon EC2, the links to the on-cluster Spark UI might not work or might fail to appear. To regenerate the links, create a new notebook cell and run the %%info command.

  • Jupyter Enterprise Gateway doesn't clean up idle kernels on the primary node of a cluster in the following Amazon EMR release versions: 5.32.0, 5.33.0, 6.2.0, and 6.3.0. Idle kernels consume computing resources and can cause long-running clusters to fail. You can configure idle kernel cleanup for Jupyter Enterprise Gateway using the following example script. Connect to the primary node using SSH, or submit the script as a step. For more information, see Run commands and scripts on an Amazon EMR cluster.

    #!/bin/bash
    sudo tee -a /emr/notebook-env/conf/jupyter_enterprise_gateway_config.py << EOF
    c.MappingKernelManager.cull_connected = True
    c.MappingKernelManager.cull_idle_timeout = 10800
    c.MappingKernelManager.cull_interval = 300
    EOF
    sudo systemctl daemon-reload
    sudo systemctl restart jupyter_enterprise_gateway
  • When you use an auto-termination policy with Amazon EMR versions 5.32.0, 5.33.0, 6.2.0, or 6.3.0, Amazon EMR marks a cluster as idle and may automatically terminate the cluster even if you have an active Python3 kernel. This is because executing a Python3 kernel does not submit a Spark job on the cluster. To use auto-termination with a Python3 kernel, we recommend that you use Amazon EMR version 6.4.0 or later. For more information about auto-termination, see Using an auto-termination policy.

  • When you use %%display to display a Spark DataFrame in a table, very wide tables may get truncated. You can right-click the output and select Create New View for Output to get a scrollable view of the output.

  • Starting a Spark-based kernel, such as PySpark, Spark, or SparkR, starts a Spark session, and running a cell in a notebook queues up Spark jobs in that session. When you interrupt a running cell, the Spark job continues to run. To stop the Spark job, you should use the on-cluster Spark UI. For instructions on how to connect to the Spark UI, see Debug applications and jobs with EMR Studio.

Feature limitations

Amazon EMR Studio doesn't support the following Amazon EMR features:

  • Attaching and running jobs on EMR clusters with a security configuration that specifies Kerberos authentication

  • Clusters with multiple primary nodes

  • Clusters that use Amazon EC2 instances based on Amazon Graviton2 for Amazon EMR 6.x releases lower than 6.9.0, and 5.x releases lower than 5.36.1

The following features aren't supported from a Studio that uses trusted identity propagation:

  • Creating EMR clusters without a template.

  • Using EMR Serverless applications.

  • Launching Amazon EMR on EKS clusters.

  • Using a runtime role.

  • Enabling SQL Explorer or Workspace collaboration.

Service limits for EMR Studio

The following table displays service limits for EMR Studio.

Item                         Limit
EMR Studios                  Maximum of 100 per Amazon Web Services account
Subnets                      Maximum of 5 associated with each EMR Studio
IAM Identity Center Groups   Maximum of 5 assigned to each EMR Studio
IAM Identity Center Users    Maximum of 100 assigned to each EMR Studio

VPC and subnet best practices

Use the following best practices to set up an Amazon Virtual Private Cloud (Amazon VPC) with subnets for EMR Studio:

  • You can specify a maximum of five subnets in your VPC to associate with the Studio. We recommend that you provide multiple subnets in different Availability Zones in order to support Workspace availability and give Studio users access to clusters across different Availability Zones. To learn more about working with VPCs, subnets, and Availability Zones, see VPCs and subnets in the Amazon Virtual Private Cloud User Guide.

  • The subnets that you specify should be able to communicate with each other.

  • To let users link a Workspace to publicly hosted Git repositories, you should specify only private subnets that have access to the internet through Network Address Translation (NAT). For more information about setting up a private subnet for Amazon EMR, see Private subnets.

  • When you use Amazon EMR on EKS with EMR Studio, there must be at least one subnet in common between your Studio and the Amazon EKS cluster that you use to register a virtual cluster. Otherwise, your managed endpoint won't appear as an option in Studio Workspaces. You can create an Amazon EKS cluster and associate it with a subnet that belongs to the Studio, or create a Studio and specify your EKS cluster's subnets.

  • If you plan to use Amazon EMR on EKS with EMR Studio, choose the same VPC as your Amazon EKS cluster worker nodes.
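The subnet guidance above can be condensed into a quick pre-flight check. This is an illustrative sketch, not an Amazon EMR API; the subnet IDs and Availability Zones are made-up placeholders.

```python
# Sketch: validate a proposed EMR Studio subnet layout against the best
# practices above: at most 5 subnets, subnets in more than one Availability
# Zone, and (when using Amazon EMR on EKS) at least one subnet shared with
# the EKS cluster. All IDs are made-up placeholders.

def check_studio_subnets(studio_subnets, eks_subnets=None):
    """studio_subnets maps subnet ID -> Availability Zone."""
    problems = []
    if len(studio_subnets) > 5:
        problems.append("A Studio can be associated with at most 5 subnets.")
    if len(set(studio_subnets.values())) < 2:
        problems.append("Prefer subnets in multiple Availability Zones.")
    if eks_subnets is not None and not set(studio_subnets) & set(eks_subnets):
        problems.append("No subnet in common with the EKS cluster, so "
                        "managed endpoints won't appear in Workspaces.")
    return problems

subnets = {"subnet-0aaa": "us-east-1a", "subnet-0bbb": "us-east-1b"}
print(check_studio_subnets(subnets, eks_subnets={"subnet-0aaa"}))  # []
```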

Cluster requirements for Amazon EMR Studio

Amazon EMR Clusters Running on Amazon EC2

All Amazon EMR clusters running on Amazon EC2 that you create for an EMR Studio Workspace must meet the following requirements. Clusters that you create using the EMR Studio interface automatically meet these requirements.

  • The cluster must use Amazon EMR versions 5.32.0 (Amazon EMR 5.x series) or 6.2.0 (Amazon EMR 6.x series) or later. You can create a cluster using the Amazon EMR console, Amazon Command Line Interface, or SDK, and then attach it to an EMR Studio Workspace. Studio users can also provision and attach clusters when creating or working in an Amazon EMR Workspace. For more information, see Attach a compute to an EMR Studio Workspace.

  • The cluster must be within an Amazon Virtual Private Cloud. The EC2-Classic platform isn't supported.

  • The cluster must have Spark, Livy, and Jupyter Enterprise Gateway installed. If you plan to use the cluster for SQL Explorer, you should install both Presto and Spark.

  • To use SQL Explorer, the cluster must use Amazon EMR version 5.34.0 or later or version 6.4.0 or later and have Presto installed. If you want to specify the Amazon Glue Data Catalog as the Hive metastore for Presto, you must configure it on the cluster. For more information, see Using Presto with the Amazon Glue Data Catalog.

  • The cluster must be in a private subnet with network address translation (NAT) to use publicly hosted Git repositories with EMR Studio.

We recommend the following cluster configurations when you work with EMR Studio.

  • Set deploy mode for Spark sessions to cluster mode. Cluster mode places the application master processes on the core nodes and not on the primary node of a cluster. Doing so relieves the primary node of potential memory pressures. For more information, see Cluster Mode Overview in the Apache Spark documentation.

  • Change the Livy timeout from the default of one hour to six hours as in the following example configuration.

    { "classification":"livy-conf", "Properties":{ "livy.server.session.timeout":"6h", "livy.spark.deploy-mode":"cluster" } }
  • Create diverse instance fleets with up to 30 instances, and select multiple instance types in your Spot Instance fleet. For example, you might specify the following memory-optimized instance types for Spark workloads: r5.2x, r5.4x, r5.8x, r5.12x, r5.16x, r4.2x, r4.4x, r4.8x, and r4.12x. For more information, see Configure instance fleets.

  • Use the capacity-optimized allocation strategy for Spot Instances to help Amazon EMR make effective instance selections based on real-time capacity insights from Amazon EC2. For more information, see Allocation strategy for instance fleets.

  • Enable managed scaling on your cluster. Set the maximum core nodes parameter to the minimum persistent capacity that you plan to use, and configure scaling on a well-diversified task fleet that runs on Spot Instances to save on costs. For more information, see Using managed scaling in Amazon EMR.
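The diversified Spot fleet recommendation above can be sketched as the InstanceFleets structure that the RunJobFlow API accepts (for example, via boto3's run_job_flow). The instance types, weights, target capacity, and timeout values below are illustrative assumptions, not prescribed settings.

```python
# Sketch: a diversified TASK instance fleet with a capacity-optimized Spot
# allocation strategy, in the shape run_job_flow accepts. The (type, weight)
# pairs and timeout values are illustrative assumptions.

MEMORY_OPTIMIZED = [
    ("r5.2xlarge", 1), ("r5.4xlarge", 2), ("r5.8xlarge", 4),
    ("r4.2xlarge", 1), ("r4.4xlarge", 2), ("r4.8xlarge", 4),
]

def task_spot_fleet(target_spot_capacity):
    """Build one InstanceFleets entry for a diversified Spot task fleet."""
    return {
        "Name": "diversified-task-fleet",
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": target_spot_capacity,
        "InstanceTypeConfigs": [
            # Weight larger instances more heavily so capacity targets are
            # comparable across sizes.
            {"InstanceType": itype, "WeightedCapacity": weight}
            for itype, weight in MEMORY_OPTIMIZED
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "AllocationStrategy": "capacity-optimized",
                "TimeoutDurationMinutes": 10,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }

fleet = task_spot_fleet(target_spot_capacity=8)
print(len(fleet["InstanceTypeConfigs"]))  # 6
```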

We also urge you to keep Amazon EMR block public access enabled and to restrict inbound SSH traffic to trusted sources. Inbound access to a cluster lets users run notebooks on the cluster. For more information, see Using Amazon EMR block public access and Control network traffic with security groups.

Amazon EMR on EKS Clusters

In addition to EMR clusters running on Amazon EC2, you can set up and manage Amazon EMR on EKS clusters for EMR Studio using the Amazon CLI. Set up Amazon EMR on EKS clusters using the following guidelines:

  • Create a managed HTTPS endpoint for the Amazon EMR on EKS cluster. Users attach a Workspace to a managed endpoint. The Amazon Elastic Kubernetes Service (EKS) cluster that you use to register a virtual cluster must have a private subnet to support managed endpoints.

  • Use an Amazon EKS cluster with at least one private subnet and network address translation (NAT) when you want to use publicly hosted Git repositories.

  • Avoid using Amazon EKS optimized Arm Amazon Linux AMIs, which aren't supported for Amazon EMR on EKS managed endpoints.

  • Avoid using Amazon Fargate-only Amazon EKS clusters, which aren't supported.
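To make the managed endpoint step above concrete, the following sketch only assembles request parameters for the emr-containers CreateManagedEndpoint operation; it doesn't call Amazon Web Services. Every identifier, the release label, and the role ARN are hypothetical placeholders.

```python
# Sketch: request parameters for the emr-containers CreateManagedEndpoint
# operation, which backs "aws emr-containers create-managed-endpoint".
# All identifiers below are hypothetical placeholders.

endpoint_request = {
    "name": "studio-endpoint",
    "virtualClusterId": "EXAMPLE1234567890",
    "type": "JUPYTER_ENTERPRISE_GATEWAY",  # the endpoint type Workspaces attach to
    "releaseLabel": "emr-6.9.0-latest",
    "executionRoleArn": "arn:aws:iam::111122223333:role/ExampleJobExecutionRole",
}

# With credentials configured, the request could be sent with boto3:
#   boto3.client("emr-containers").create_managed_endpoint(**endpoint_request)
print(endpoint_request["type"])  # JUPYTER_ENTERPRISE_GATEWAY
```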