Setting up IAM roles for service accounts (IRSA) for spark-submit
The following sections explain how to set up IAM roles for service accounts (IRSA) to authenticate and authorize Kubernetes service accounts so you can run Spark applications stored in Amazon S3.
Prerequisites
Before trying any of the examples in this documentation, make sure that you have completed the following prerequisites:
-
Created an S3 bucket and uploaded the spark application jar
Configuring a Kubernetes service account to assume an IAM role
The following steps cover how to configure a Kubernetes service account to assume an Amazon Identity and Access Management (IAM) role. After you configure the pods to use the service account, they can then access any Amazon Web Services service that the role has permissions to access.
-
Create a policy file to allow read-only access to the Amazon S3 object you uploaded:
cat >my-policy.json <<EOF { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::<
my-spark-jar-bucket
>", "arn:aws:s3:::<my-spark-jar-bucket
>/*" ] } ] } EOF -
Create the IAM policy.
aws iam create-policy --policy-name my-policy --policy-document file://my-policy.json
-
Create an IAM role and associate it with a Kubernetes service account for the Spark driver
eksctl create iamserviceaccount --name my-spark-driver-sa --namespace spark-operator \ --cluster my-cluster --role-name "my-role" \ --attach-policy-arn arn:aws:iam::111122223333:policy/my-policy --approve
-
Create a YAML file with the required permissions for the Spark driver service account:
cat >spark-rbac.yaml <<EOF apiVersion: rbac.authorization.k8s.io/v1 kind: Role metadata: namespace: default name: emr-containers-role-spark rules: - apiGroups: - "" resources: - pods verbs: - "*" - apiGroups: - "" resources: - services verbs: - "*" - apiGroups: - "" resources: - configmaps verbs: - "*" - apiGroups: - "" resources: - persistentvolumeclaims verbs: - "*" --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: name: spark-role-binding namespace: default roleRef: apiGroup: rbac.authorization.k8s.io kind: Role name: emr-containers-role-spark subjects: - kind: ServiceAccount name: emr-containers-sa-spark namespace: default EOF
-
Apply the cluster role binding configurations.
kubectl apply -f spark-rbac.yaml
-
The
kubectl
command should return confirmation of the created account.serviceaccount/emr-containers-sa-spark created clusterrolebinding.rbac.authorization.k8s.io/emr-containers-role-spark configured
Running the Spark application
Amazon EMR 6.10.0 and higher supports spark-submit for running Spark applications on an Amazon EKS cluster. To run the Spark application, follow these steps:
-
Make sure that you have completed the steps in Setting up spark-submit for Amazon EMR on EKS.
-
Set the values for the following environment variables:
export SPARK_HOME=spark-home export MASTER_URL=k8s://Amazon EKS-cluster-endpoint
-
Now, submit the Spark application with the following command:
$SPARK_HOME/bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master $MASTER_URL \ --conf spark.kubernetes.container.image=895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.15.0:latest \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=emr-containers-sa-spark \ --deploy-mode cluster \ --conf spark.kubernetes.namespace=default \ --conf "spark.driver.extraClassPath=/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*" \ --conf "spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native" \ --conf "spark.executor.extraClassPath=/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*" \ --conf "spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native" \ --conf spark.hadoop.fs.s3.customAWSCredentialsProvider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \ --conf spark.hadoop.fs.s3.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem \ --conf spark.hadoop.fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3.EMRFSDelegate \ --conf spark.hadoop.fs.s3.buffer.dir=/mnt/s3 \ --conf spark.hadoop.fs.s3.getObject.initialSocketTimeoutMilliseconds="2000" \ --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem="2" \ --conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem="true" \ s3://my-pod-bucket/spark-examples.jar 20
-
After the spark driver finishes the Spark job, you should see a log line at the end of the submission indicating that the Spark job has finished.
23/11/24 17:02:14 INFO LoggingPodStatusWatcherImpl: Application org.apache.spark.examples.SparkPi with submission ID default:org-apache-spark-examples-sparkpi-4980808c03ff3115-driver finished 23/11/24 17:02:14 INFO ShutdownHookManager: Shutdown hook called
Cleanup
When you're done running your applications, you can perform cleanup with the following command.
kubectl delete -f spark-rbac.yaml