Requirements, differences in release versions, and security for EMR Notebooks
Note
EMR Notebooks are available as EMR Studio Workspaces in the console. The Create Workspace button in the console lets you create new notebooks. To access or create Workspaces, EMR Notebooks users need additional IAM role permissions. For more information, see Amazon EMR Notebooks are Amazon EMR Studio Workspaces in the console and Amazon EMR console.
Keep the following requirements, release version differences, security information, and other considerations in mind when you create clusters and develop solutions using EMR Notebooks.
Cluster requirements
-
Enable Amazon EMR Block Public Access – Inbound access to a cluster enables cluster users to execute notebook kernels. Ensure that only authorized users can access the cluster. We strongly recommend that you leave block public access enabled, and that you limit inbound SSH traffic to only trusted sources. For more information, see Using Amazon EMR block public access and Control network traffic with security groups for your Amazon EMR cluster.
-
Use a Compatible Cluster – A cluster attached to a notebook must meet the following requirements:
-
Only clusters created using Amazon EMR are supported. You can create a cluster independently within Amazon EMR and then attach an EMR notebook, or you can create a compatible cluster when you create an EMR notebook.
-
Only clusters created using Amazon EMR release version 5.18.0 and later are supported. See Differences in capabilities by cluster release version.
-
Clusters created using Amazon EC2 instances with AMD EPYC processors—for example, m5a.* and r5a.* instance types—are not supported.
-
EMR Notebooks works only with clusters created with `VisibleToAllUsers` set to `true`. `VisibleToAllUsers` is `true` by default.
-
The cluster must be launched within an EC2-VPC. Public and private subnets are supported. The EC2-Classic platform is not supported.
-
The cluster must be launched with Hadoop, Spark, and Livy installed. Other applications may be installed, but EMR Notebooks currently supports Spark clusters only.
Important
For Amazon EMR release versions 5.32.0 and later, or 6.2.0 and later, your cluster must also be running the Jupyter Enterprise Gateway application in order to work with EMR Notebooks.
-
Clusters using Kerberos authentication are not supported.
-
Clusters integrated with Amazon Lake Formation support the installation of notebook-scoped libraries only. Installing kernels and libraries on the cluster is not supported.
-
Clusters with multiple primary nodes are not supported.
-
Clusters using Amazon EC2 instances based on Amazon Graviton2 are not supported.
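Several of the requirements above can be checked mechanically before you attach a notebook. The following is a minimal sketch, not an AWS API: the function names and the dictionary keys (`release_label`, `applications`, `instance_types`, and so on) are hypothetical, and the application name `JupyterEnterpriseGateway` is assumed to match how the application appears in your cluster description. The Lake Formation and Graviton2 checks are left out.

```python
import re

# Instance families the requirements above name as unsupported (AMD EPYC).
UNSUPPORTED_FAMILIES = ("m5a", "r5a")
REQUIRED_APPS = {"Hadoop", "Spark", "Livy"}
# Assumed application name for the Jupyter Enterprise Gateway requirement.
JEG_APP = "JupyterEnterpriseGateway"


def parse_release(label):
    """Turn a release label such as 'emr-5.32.0' or '6.2.0' into (5, 32, 0)."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognized release label: {label!r}")
    return tuple(int(part) for part in match.groups())


def needs_jupyter_enterprise_gateway(release):
    """True for 5.32.0 and later in the 5.x line, or for 6.2.0 and later."""
    if release[0] == 5:
        return release >= (5, 32, 0)
    return release >= (6, 2, 0)


def compatibility_problems(cluster):
    """Return a list of reasons the described cluster cannot be attached.

    `cluster` is a plain dict with hypothetical keys; an empty list means
    none of the checks below found a problem.
    """
    problems = []
    release = parse_release(cluster["release_label"])
    if release < (5, 18, 0):
        problems.append("release is earlier than 5.18.0")

    apps = set(cluster.get("applications", []))
    missing = REQUIRED_APPS - apps
    if missing:
        problems.append("missing applications: " + ", ".join(sorted(missing)))
    if needs_jupyter_enterprise_gateway(release) and JEG_APP not in apps:
        problems.append("Jupyter Enterprise Gateway is not installed")

    for instance_type in cluster.get("instance_types", []):
        if instance_type.split(".")[0] in UNSUPPORTED_FAMILIES:
            problems.append(f"unsupported instance type: {instance_type}")

    if not cluster.get("visible_to_all_users", True):
        problems.append("VisibleToAllUsers is not true")
    if cluster.get("kerberos_enabled", False):
        problems.append("Kerberos authentication is enabled")
    if cluster.get("primary_node_count", 1) > 1:
        problems.append("cluster has multiple primary nodes")
    if not cluster.get("in_vpc", True):
        problems.append("cluster is not launched within an EC2-VPC")
    return problems
```

An empty result only means that these particular checks passed; it does not replace the full list of requirements.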
Differences in capabilities by cluster release version
We strongly recommend that you use EMR Notebooks with clusters created using Amazon EMR release versions 5.30.0, 5.32.0 or later, or 6.2.0 or later. With these versions, EMR Notebooks runs kernels on the attached Amazon EMR cluster. Kernels and libraries can be installed directly on the cluster primary node. Using EMR Notebooks with these cluster versions has the following benefits:
-
Improved performance – Notebook kernels run on clusters with EC2 instance types that you select. Earlier versions run kernels on a specialized instance that cannot be resized, accessed, or customized.
-
Ability to add and customize kernels – You can connect to the cluster to install kernel packages using `conda` and `pip`. In addition, `pip` installation is supported using terminal commands within notebook cells. In earlier versions, only pre-installed kernels were available (Python, PySpark, Spark, and SparkR). For more information, see Installing kernels and Python libraries on a cluster primary node.
-
Ability to install Python libraries – You can install Python libraries on the cluster primary node using `conda` and `pip`. We recommend using `conda`. With earlier versions, only notebook-scoped libraries for PySpark are supported.
Cluster release version | Notebook-scoped libraries for PySpark | Kernel installation on cluster | Python library installation on primary node |
---|---|---|---|
Earlier than 5.18.0 | EMR Notebooks not supported | | |
5.18.0–5.25.0 | No | No | No |
5.26.0–5.29.0 | Yes | No | No |
5.30.0 | Yes | Yes | Yes |
6.0.0 | No | No | No |
5.32.0 and later, and 6.2.0 and later | Yes | Yes | Yes |
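The capability table above can also be expressed as a small lookup. This is a sketch, not an AWS API: the function name and the dictionary keys are hypothetical, and releases that the table does not list raise an error instead of guessing. The values for 5.26.0–5.29.0 (notebook-scoped libraries only) and for 5.30.0 (all three capabilities) are assumptions drawn from the surrounding prose.

```python
# Capability flags, one key per capability column of the table above.
ALL = {
    "notebook_scoped_libraries": True,
    "kernel_installation": True,
    "library_installation": True,
}
NONE = {key: False for key in ALL}
# Assumption from the prose: earlier versions support only notebook-scoped libraries.
SCOPED_ONLY = dict(NONE, notebook_scoped_libraries=True)


def notebook_capabilities(release):
    """Map a (major, minor, patch) release tuple to its capability flags.

    Returns None when EMR Notebooks is not supported at all, and raises
    LookupError for releases the table does not list.
    """
    if release < (5, 18, 0):
        return None
    if (5, 18, 0) <= release <= (5, 25, 0):
        return dict(NONE)
    if (5, 26, 0) <= release <= (5, 29, 0):
        return dict(SCOPED_ONLY)
    if release == (5, 30, 0):
        return dict(ALL)  # assumption: 5.30.0 matches the recommended releases
    if release == (6, 0, 0):
        return dict(NONE)
    if (5, 32, 0) <= release < (6, 0, 0) or release >= (6, 2, 0):
        return dict(ALL)
    raise LookupError(f"release {release} is not listed in the table")
```

Raising for unlisted releases keeps the lookup from implying support that the table does not state.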
Limits for concurrently attached EMR Notebooks
When you create a cluster that supports notebooks, consider the EC2 instance type of the cluster primary node. The memory constraints of this instance determine how many notebooks can be ready simultaneously to run code and queries on the cluster.
Primary node EC2 instance type | Number of EMR Notebooks |
---|---|
*.medium | 2 |
*.large | 4 |
*.xlarge | 8 |
*.2xlarge | 16 |
*.4xlarge | 24 |
*.8xlarge | 24 |
*.16xlarge | 24 |
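When you size a cluster, the limits above reduce to a lookup on the size suffix of the primary node instance type. The sketch below assumes instance types are written as `family.size` (for example, `m5.xlarge`); the function name is hypothetical, and sizes that the table does not list raise an error.

```python
# Limits from the table above, keyed by the size suffix of the instance type.
NOTEBOOK_LIMITS = {
    "medium": 2,
    "large": 4,
    "xlarge": 8,
    "2xlarge": 16,
    "4xlarge": 24,
    "8xlarge": 24,
    "16xlarge": 24,
}


def max_attached_notebooks(instance_type):
    """Return the notebook limit for a primary node type such as 'm5.xlarge'."""
    size = instance_type.rsplit(".", 1)[-1]
    if size not in NOTEBOOK_LIMITS:
        raise KeyError(f"no limit listed for instance size {size!r}")
    return NOTEBOOK_LIMITS[size]
```

For example, `max_attached_notebooks("m5.2xlarge")` returns 16, and every size from `4xlarge` up returns the same cap of 24.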
Jupyter Notebook and Python versions
EMR Notebooks runs Jupyter Notebook version 6.0.2.
Security-related considerations
- Using encrypted S3 locations
-
If you specify an encrypted location in Amazon S3 to store notebook files, you must set up the service role for EMR Notebooks as a key user. The default service role is `EMR_Notebooks_DefaultRole`. If you are using an Amazon KMS key for encryption, see Using key policies in Amazon KMS in the Amazon Key Management Service Developer Guide and the support article for adding key users.
- Using cookies with hosting domains
-
To augment the security for the off-console applications that you might use with Amazon EMR, the application hosting domains are registered in the Public Suffix List (PSL). Examples of these hosting domains include the following: `emrstudio-prod.us-east-1.amazonaws.com`, `emrnotebooks-prod.us-east-1.amazonaws.com`, `emrappui-prod.us-east-1.amazonaws.com`. For further security, if you ever need to set sensitive cookies in the default domain name, we recommend that you use cookies with a `__Host-` prefix. This helps to defend your domain against cross-site request forgery (CSRF) attempts. For more information, see the Set-Cookie page in the Mozilla Developer Network.
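As an illustration of the `__Host-` recommendation, the sketch below builds a `Set-Cookie` value with Python's standard `http.cookies` module. Browsers accept a `__Host-` cookie only when it is marked `Secure`, has `Path=/`, and carries no `Domain` attribute. The function name and the cookie name and value are hypothetical, and the `HttpOnly` and `SameSite` attributes are optional hardening added here as an assumption, not something this page requires.

```python
from http.cookies import SimpleCookie


def host_prefixed_cookie(name, value):
    """Build a Set-Cookie header value for a cookie with the __Host- prefix."""
    key = "__Host-" + name
    cookie = SimpleCookie()
    cookie[key] = value
    # Required for the __Host- prefix: Secure, Path=/, and no Domain attribute.
    cookie[key]["secure"] = True
    cookie[key]["path"] = "/"
    # Optional hardening (assumption): hide the cookie from scripts and
    # keep it off cross-site requests. SameSite needs Python 3.8 or later.
    cookie[key]["httponly"] = True
    cookie[key]["samesite"] = "Strict"
    return cookie[key].OutputString()


header_value = host_prefixed_cookie("session", "abc123")
print("Set-Cookie:", header_value)
```

Leaving `Domain` unset is deliberate: it ties the cookie to the exact host that set it, which is what the prefix is meant to guarantee.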