Use the Nvidia RAPIDS Accelerator for Apache Spark - Amazon EMR
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Use the Nvidia RAPIDS Accelerator for Apache Spark

With Amazon EMR release 6.2.0 and later, you can use the RAPIDS Accelerator for Apache Spark plugin by Nvidia to accelerate Spark using EC2 graphics processing unit (GPU) instance types. RAPIDS Accelerator will GPU-accelerate your Apache Spark 3.0 data science pipelines without code changes, and speed up data processing and model training while substantially lowering infrastructure costs.

The following sections guide you through configuring your EMR cluster to use the Spark-RAPIDS Plugin for Spark.

Choose instance types

To use the Nvidia Spark-RAPIDS plugin for Spark, the core and task instance groups must use EC2 GPU instance types that meet the Hardware requirements of Spark-RAPIDS. To view a complete list of Amazon EMR supported GPU instance types, please see Supported instance types in the Amazon EMR Management Guide. Instance type for the primary instance group can be either GPU or non-GPU types, but ARM instance types aren't supported.

Set up application configurations for your cluster

1. Enable Amazon EMR to install the plugins on your new cluster

To install plugins, supply the following configuration when you create your cluster:

{ "Classification":"spark", "Properties":{ "enableSparkRapids":"true" } }

2. Configure YARN to use GPU

For details on how to use GPU on YARN, see Using GPU on YARN in Apache Hadoop documentation. The following examples show sample YARN configurations for Amazon EMR 6.x and 7.x releases:

Amazon EMR 7.x

Example YARN configuration for Amazon EMR 7.x

{ "Classification":"yarn-site", "Properties":{ "yarn.nodemanager.resource-plugins":"yarn.io/gpu", "yarn.resource-types":"yarn.io/gpu", "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto", "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin", "yarn.nodemanager.linux-container-executor.cgroups.mount":"true", "yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/spark-rapids-cgroup", "yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn", "yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor" } },{ "Classification":"container-executor", "Properties":{ }, "Configurations":[ { "Classification":"gpu", "Properties":{ "module.enabled":"true" } }, { "Classification":"cgroups", "Properties":{ "root":"/spark-rapids-cgroup", "yarn-hierarchy":"yarn" } } ] }
Amazon EMR 6.x

Example YARN configuration for Amazon EMR 6.x

{ "Classification":"yarn-site", "Properties":{ "yarn.nodemanager.resource-plugins":"yarn.io/gpu", "yarn.resource-types":"yarn.io/gpu", "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto", "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin", "yarn.nodemanager.linux-container-executor.cgroups.mount":"true", "yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/sys/fs/cgroup", "yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn", "yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor" } },{ "Classification":"container-executor", "Properties":{ }, "Configurations":[ { "Classification":"gpu", "Properties":{ "module.enabled":"true" } }, { "Classification":"cgroups", "Properties":{ "root":"/sys/fs/cgroup", "yarn-hierarchy":"yarn" } } ] }

3. Configure Spark to use RAPIDS

Here are the required configurations to enable Spark to use RAPIDS plugin:

{ "Classification":"spark-defaults", "Properties":{ "spark.plugins":"com.nvidia.spark.SQLPlugin", "spark.executor.resource.gpu.discoveryScript":"/usr/lib/spark/scripts/gpu/getGpusResources.sh", "spark.executor.extraLibraryPath":"/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native" } }

XGBoost4J-Spark library in XGBoost documentation is also available when the Spark RAPIDS plugin is enabled on your cluster. You can use the following configuration to integrate XGBoost with you Spark job:

{ "Classification":"spark-defaults", "Properties":{ "spark.submit.pyFiles":"/usr/lib/spark/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar" } }

For additional Spark configurations that you can use to tune a GPU-accelerated EMR cluster, please refer to the Rapids Accelerator for Apache Spark tuning guide in Nvidia.github.io documentation.

4. Configure YARN Capacity Scheduler

DominantResourceCalculator must be configured to enable GPU scheduling and isolation. For more information, please refer to Using GPU on YARN in Apache Hadoop documentation.

{ "Classification":"capacity-scheduler", "Properties":{ "yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator" } }

5. Create a JSON file to include your configurations

You can create a JSON file that contains your configuration to use the RAPIDS plugin for your Spark cluster. You supply the file later when you launch your cluster.

You can store the file locally or on S3. For more information of how to supply application configurations for your clusters, see Configure applications.

Use the following sample files as templates to build your own configurations.

Amazon EMR 7.x

Example my-configurations.json file for Amazon EMR 7.x

[ { "Classification":"spark", "Properties":{ "enableSparkRapids":"true" } }, { "Classification":"yarn-site", "Properties":{ "yarn.nodemanager.resource-plugins":"yarn.io/gpu", "yarn.resource-types":"yarn.io/gpu", "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto", "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin", "yarn.nodemanager.linux-container-executor.cgroups.mount":"true", "yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/spark-rapids-cgroup", "yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn", "yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor" } }, { "Classification":"container-executor", "Properties":{ }, "Configurations":[ { "Classification":"gpu", "Properties":{ "module.enabled":"true" } }, { "Classification":"cgroups", "Properties":{ "root":"/spark-rapids-cgroup", "yarn-hierarchy":"yarn" } } ] }, { "Classification":"spark-defaults", "Properties":{ "spark.plugins":"com.nvidia.spark.SQLPlugin", "spark.executor.resource.gpu.discoveryScript":"/usr/lib/spark/scripts/gpu/getGpusResources.sh", "spark.executor.extraLibraryPath":"/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native", "spark.submit.pyFiles":"/usr/lib/spark/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar", "spark.rapids.sql.concurrentGpuTasks":"1", "spark.executor.resource.gpu.amount":"1", "spark.executor.cores":"2", "spark.task.cpus":"1", "spark.task.resource.gpu.amount":"0.5", "spark.rapids.memory.pinnedPool.size":"0", "spark.executor.memoryOverhead":"2G", "spark.locality.wait":"0s", "spark.sql.shuffle.partitions":"200", "spark.sql.files.maxPartitionBytes":"512m" } }, { "Classification":"capacity-scheduler", "Properties":{ "yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator" } } ]
Amazon EMR 6.x

Example my-configurations.json file for Amazon EMR 6.x

[ { "Classification":"spark", "Properties":{ "enableSparkRapids":"true" } }, { "Classification":"yarn-site", "Properties":{ "yarn.nodemanager.resource-plugins":"yarn.io/gpu", "yarn.resource-types":"yarn.io/gpu", "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto", "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin", "yarn.nodemanager.linux-container-executor.cgroups.mount":"true", "yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/sys/fs/cgroup", "yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn", "yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor" } }, { "Classification":"container-executor", "Properties":{ }, "Configurations":[ { "Classification":"gpu", "Properties":{ "module.enabled":"true" } }, { "Classification":"cgroups", "Properties":{ "root":"/sys/fs/cgroup", "yarn-hierarchy":"yarn" } } ] }, { "Classification":"spark-defaults", "Properties":{ "spark.plugins":"com.nvidia.spark.SQLPlugin", "spark.executor.resource.gpu.discoveryScript":"/usr/lib/spark/scripts/gpu/getGpusResources.sh", "spark.executor.extraLibraryPath":"/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native", "spark.submit.pyFiles":"/usr/lib/spark/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar", "spark.rapids.sql.concurrentGpuTasks":"1", "spark.executor.resource.gpu.amount":"1", "spark.executor.cores":"2", "spark.task.cpus":"1", "spark.task.resource.gpu.amount":"0.5", "spark.rapids.memory.pinnedPool.size":"0", "spark.executor.memoryOverhead":"2G", "spark.locality.wait":"0s", "spark.sql.shuffle.partitions":"200", "spark.sql.files.maxPartitionBytes":"512m" } }, { "Classification":"capacity-scheduler", "Properties":{ "yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator" } } ]

Add a bootstrap action for your cluster

For more information on how to supply bootstrap action scripts when you create your cluster, see Bootstrap action basics in the Amazon EMR Management Guide.

The following example scripts show how to make a bootstrap action file for Amazon EMR 6.x and 7.x:

Amazon EMR 7.x

Example my-bootstrap-action.sh file for Amazon EMR 7.x

To use YARN to manage GPU resources with Amazon EMR 7.x releases, you must manually mount CGroup v1 on your cluster. You can do this with as bootstrap action script, as shown in this example.

#!/bin/bash set -ex sudo mkdir -p /spark-rapids-cgroup/devices sudo mount -t cgroup -o devices cgroupv1-devices /spark-rapids-cgroup/devices sudo chmod a+rwx -R /spark-rapids-cgroup
Amazon EMR 6.x

Example my-bootstrap-action.sh file for Amazon EMR 6.x

For Amazon EMR 6.x releases, you must open CGroup permissions to YARN on your cluster. You can do this with a bootstrap action script, as shown in this example.

#!/bin/bash set -ex sudo chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct sudo chmod a+rwx -R /sys/fs/cgroup/devices

Launch your cluster

The last step is to launch your cluster with the cluster configurations mentioned above. Here's an example command to launch a cluster from the Amazon EMR CLI:

aws emr create-cluster \ --release-label emr-7.0.0 \ --applications Name=Hadoop Name=Spark \ --service-role EMR_DefaultRole_V2 \ --ec2-attributes KeyName=my-key-pair,InstanceProfile=EMR_EC2_DefaultRole \ --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.4xlarge \ InstanceGroupType=CORE,InstanceCount=1,InstanceType=g4dn.2xlarge \ InstanceGroupType=TASK,InstanceCount=1,InstanceType=g4dn.2xlarge \ --configurations file:///my-configurations.json \ --bootstrap-actions Name='My Spark Rapids Bootstrap action',Path=s3://my-bucket/my-bootstrap-action.sh