
Installing and using kernels and libraries

Note

EMR Notebooks are available as EMR Studio Workspaces in the console. The Create Workspace button in the console lets you create new notebooks. To access or create Workspaces, EMR Notebooks users need additional IAM role permissions. For more information, see Amazon EMR Notebooks are Amazon EMR Studio Workspaces in the console and Amazon EMR console.

Each EMR notebook comes with a set of pre-installed libraries and kernels. You can install additional libraries and kernels in an EMR cluster if the cluster has access to the repository where the kernels and libraries are located. For example, for clusters in private subnets, you might need to configure network address translation (NAT) and provide a path for the cluster to access the public PyPI repository to install a library. For more information about configuring external access for different network configurations, see Scenarios and examples in the Amazon VPC User Guide.

EMR Serverless applications come with the following pre-installed libraries for Python and PySpark:

  • Python libraries – ggplot, matplotlib, numpy, pandas, plotly, bokeh, scikit-learn, scipy

  • PySpark libraries – ggplot, matplotlib, numpy, pandas, plotly, bokeh, scikit-learn, scipy

Installing kernels and Python libraries on a cluster primary node

With Amazon EMR release version 5.30.0 and later, excluding 6.0.0, you can install additional Python libraries and kernels on the primary node of the cluster. After installation, these kernels and libraries are available to any user running an EMR notebook attached to the cluster. Python libraries installed this way are available only to processes running on the primary node. The libraries are not installed on core or task nodes and are not available to executors running on those nodes.

Note

For Amazon EMR versions 5.30.1, 5.31.0, and 6.1.0, you must take additional steps in order to install kernels and libraries on the primary node of a cluster.

To enable the feature, do the following:

  1. Make sure that the permissions policy attached to the service role for EMR Notebooks allows the following action:

    elasticmapreduce:ListSteps

    For more information, see Service role for EMR Notebooks. An example policy statement follows this procedure.

  2. Use the Amazon CLI to run a step on the cluster that sets up EMR Notebooks as shown in the following example. You must use the step name EMRNotebooksSetup. Replace us-east-1 with the Region in which your cluster resides. For more information, see Adding steps to a cluster using the Amazon CLI.

    aws emr add-steps --cluster-id MyClusterID --steps Type=CUSTOM_JAR,Name=EMRNotebooksSetup,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://awssupportdatasvcs.com/bootstrap-actions/EMRNotebooksSetup/emr-notebooks-setup.sh"]
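
The following is a minimal example of an IAM policy statement that allows this action. It is illustrative only; scope the Resource element according to your own security requirements.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "elasticmapreduce:ListSteps",
            "Resource": "*"
        }
    ]
}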

You can install kernels and libraries using pip or conda in the /emr/notebook-env/bin directory on the primary node.

Example – Installing Python libraries

From the Python3 kernel, run the %pip magic from within a notebook cell to install Python libraries.

%pip install pmdarima

You may need to restart the kernel to use updated packages. You can also use the %%sh Spark magic to invoke pip.

%%sh
/emr/notebook-env/bin/pip install -U matplotlib
/emr/notebook-env/bin/pip install -U pmdarima

When using a PySpark kernel, you can either install libraries on the cluster using pip commands or use notebook-scoped libraries from within a PySpark notebook.

To run pip commands on the cluster from the terminal, first connect to the primary node using SSH, and then run commands such as the following.

sudo pip3 install -U matplotlib
sudo pip3 install -U pmdarima

Alternatively, you can use notebook-scoped libraries. With notebook-scoped libraries, your library installation is limited to the scope of your session and occurs on all Spark executors. For more information, see Using Notebook Scoped Libraries.

If you want to package multiple Python libraries within a PySpark kernel, you can also create an isolated Python virtual environment. For examples, see Using Virtualenv.
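As a rough sketch (the tool and package names here are illustrative, not prescribed by this guide), you can build the environment archive with the venv-pack tool on a machine that is binary-compatible with your cluster, and then upload it to Amazon S3 so that spark.yarn.dist.archives can reference it.

# Sketch only: build a relocatable virtual environment archive and upload it to S3.
# Assumes the venv-pack tool; build with a Python version and OS compatible with the cluster.
python3 -m venv my_pyspark_venv
source my_pyspark_venv/bin/activate
pip install pmdarima venv-pack
venv-pack -o my_pyspark_venv.tar.gz
aws s3 cp my_pyspark_venv.tar.gz s3://DOC-EXAMPLE-BUCKET/prefix/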

To create a Python virtual environment in a session, use the Spark property spark.yarn.dist.archives in the %%configure magic command in the first cell of a notebook, as the following example demonstrates.

%%configure -f
{
    "conf": {
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./environment/bin/python",
        "spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON": "./environment/bin/python",
        "spark.yarn.dist.archives": "s3://DOC-EXAMPLE-BUCKET/prefix/my_pyspark_venv.tar.gz#environment",
        "spark.submit.deployMode": "cluster"
    }
}

You can similarly create a Spark executor environment.

%%configure -f
{
    "conf": {
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./environment/bin/python",
        "spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON": "./environment/bin/python",
        "spark.executorEnv.PYSPARK_PYTHON": "./environment/bin/python",
        "spark.yarn.dist.archives": "s3://DOC-EXAMPLE-BUCKET/prefix/my_pyspark_venv.tar.gz#environment",
        "spark.submit.deployMode": "cluster"
    }
}

You can also use conda to install Python libraries. You don't need sudo access to use conda. You must connect to the primary node with SSH, and then run conda from the terminal. For more information, see Connect to the primary node using SSH.
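For example, after connecting you might install a library with the conda binary under /emr/notebook-env/bin. The channel and package below are placeholders; substitute the library you need.

# Illustrative only: install a Python library with the conda that ships on the primary node.
/emr/notebook-env/bin/conda install -y -c conda-forge pmdarima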

Example – Installing kernels

The following example demonstrates installing the Kotlin kernel using a terminal command while connected to the primary node of a cluster:

sudo /emr/notebook-env/bin/conda install kotlin-jupyter-kernel -c jetbrains

Note

These instructions do not install kernel dependencies. If your kernel has third-party dependencies, you may need to take additional setup steps before you can use the kernel with your notebook.

Considerations and limitations with notebook-scoped libraries

When you use notebook-scoped libraries, consider the following:

  • Notebook-scoped libraries are available for clusters that you create with Amazon EMR releases 5.26.0 and higher.

  • Notebook-scoped libraries are intended to be used only with the PySpark kernel.

  • Any user can install additional notebook-scoped libraries from within a notebook cell. These libraries are only available to that notebook user during a single notebook session. If other users need the same libraries, or the same user needs the same libraries in a different session, the library must be re-installed.

  • You can uninstall only the libraries that were installed with the install_pypi_package API. You cannot uninstall any libraries that were pre-installed on the cluster.

  • If the same libraries with different versions are installed on the cluster and as notebook-scoped libraries, the notebook-scoped library version overrides the cluster library version.

Working with notebook-scoped libraries

To install libraries, your Amazon EMR cluster must have access to the PyPI repository where the libraries are located.

The following examples demonstrate simple commands to list, install, and uninstall libraries from within a notebook cell using the PySpark kernel and APIs. For additional examples, see the Install Python libraries on a running cluster with EMR Notebooks post on the Amazon Big Data Blog.

Example – Listing current libraries

The following command lists the Python packages available for the current Spark notebook session. The output includes both libraries installed on the cluster and notebook-scoped libraries.

sc.list_packages()

Example – Installing the Celery library

The following command installs the Celery library as a notebook-scoped library.

sc.install_pypi_package("celery")

After installing the library, the following command confirms that the library is available on the Spark driver and executors.

import celery
sc.range(1,10000,1,100).map(lambda x: celery.__version__).collect()

Example – Installing the Arrow library, specifying the version and repository

The following command installs the Arrow library as a notebook-scoped library, with a specification of the library version and repository URL.

sc.install_pypi_package("arrow==0.14.0", "https://pypi.org/simple")

Example – Uninstalling a library

The following command uninstalls the Arrow library, removing it as a notebook-scoped library from the current session.

sc.uninstall_package("arrow")