
Setting up cluster access permissions with IAM roles for service accounts (IRSA)

This section uses an example to demonstrate how to configure a Kubernetes service account to assume an Amazon Identity and Access Management (IAM) role. Pods that use the service account can then access any Amazon service that the role has permissions to access.

The following example runs a Spark application that counts the words in a file in Amazon S3. To do this, you set up IAM roles for service accounts (IRSA) to authenticate and authorize Kubernetes service accounts.

Note

This example uses the "spark-operator" namespace both for the Spark operator and as the namespace where you submit the Spark application.

Prerequisites

Before you try the example on this page, complete the following prerequisites:

  * Install the Spark operator for Amazon EMR on EKS on your Amazon EKS cluster.

  * Upload a text file to your Amazon S3 bucket. The example refers to this file as s3://my-pod-bucket/poem.txt.

  * Install the eksctl, kubectl, and Amazon Command Line Interface tools, which the steps in this example use.
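For example, assuming that you already created a bucket named my-pod-bucket (the bucket name used throughout this example) and have a local file named poem.txt, a command like the following uploads the file that the word-count application reads:

aws s3 cp poem.txt s3://my-pod-bucket/poem.txt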

Configure a Kubernetes service account to assume an IAM role

Use the following steps to configure a Kubernetes service account to assume an IAM role that pods can use to access Amazon services that the role has permissions to access.
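IRSA relies on the IAM OpenID Connect (OIDC) provider that is associated with your Amazon EKS cluster. If your cluster doesn't have an OIDC provider yet, you can create the association with eksctl before you start. The following is a sketch that assumes the cluster name my-cluster used in the rest of this example:

eksctl utils associate-iam-oidc-provider --cluster my-cluster --approve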

  1. After completing the Prerequisites, create an example-policy.json file that allows read-only access to the file that you uploaded to Amazon S3:

    cat >example-policy.json <<EOF
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::my-pod-bucket",
            "arn:aws:s3:::my-pod-bucket/*"
          ]
        }
      ]
    }
    EOF
  2. Then, create an IAM policy example-policy:

    aws iam create-policy --policy-name example-policy --policy-document file://example-policy.json
  3. Next, create an IAM role example-role and associate it with a Kubernetes service account for the Spark driver:

    eksctl create iamserviceaccount --name driver-account-sa --namespace spark-operator \
      --cluster my-cluster --role-name "example-role" \
      --attach-policy-arn arn:aws:iam::111122223333:policy/example-policy --approve
  4. Create a YAML file with the cluster role binding that is required for the Spark driver service account:

    cat >spark-rbac.yaml <<EOF
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: driver-account-sa
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: spark-role
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: edit
    subjects:
      - kind: ServiceAccount
        name: driver-account-sa
        namespace: spark-operator
    EOF
  5. Apply the cluster role binding configurations:

    kubectl apply -f spark-rbac.yaml

The kubectl command should confirm the creation of the service account and the cluster role binding:

serviceaccount/driver-account-sa created
clusterrolebinding.rbac.authorization.k8s.io/spark-role configured
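The eksctl command in step 3 also annotates the service account with the ARN of the IAM role that it created. To confirm the association, you can describe the service account and look for the eks.amazonaws.com/role-arn annotation, for example:

kubectl describe serviceaccount driver-account-sa -n spark-operator

The output should include an annotation similar to eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/example-role.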

Running an application from the Spark operator

After you configure the Kubernetes service account, you can run a Spark application that counts the number of words in the text file that you uploaded as part of the Prerequisites.
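Before you submit the application, you can optionally confirm that the Spark operator is running in the spark-operator namespace:

kubectl get pods -n spark-operator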

  1. Create a new file, word-count.yaml, with a SparkApplication definition for your word-count application.

    cat >word-count.yaml <<EOF
    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: word-count
      namespace: spark-operator
    spec:
      type: Java
      mode: cluster
      image: "895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.10.0:latest"
      imagePullPolicy: Always
      mainClass: org.apache.spark.examples.JavaWordCount
      mainApplicationFile: local:///usr/lib/spark/examples/jars/spark-examples.jar
      arguments:
        - s3://my-pod-bucket/poem.txt
      hadoopConf:
        # EMRFS filesystem
        fs.s3.customAWSCredentialsProvider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider
        fs.s3.impl: com.amazon.ws.emr.hadoop.fs.EmrFileSystem
        fs.AbstractFileSystem.s3.impl: org.apache.hadoop.fs.s3.EMRFSDelegate
        fs.s3.buffer.dir: /mnt/s3
        fs.s3.getObject.initialSocketTimeoutMilliseconds: "2000"
        mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem: "2"
        mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem: "true"
      sparkConf:
        # Required for EMR Runtime
        spark.driver.extraClassPath: /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*
        spark.driver.extraLibraryPath: /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native
        spark.executor.extraClassPath: /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*
        spark.executor.extraLibraryPath: /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native
      sparkVersion: "3.3.1"
      restartPolicy:
        type: Never
      driver:
        cores: 1
        coreLimit: "1200m"
        memory: "512m"
        labels:
          version: 3.3.1
        # Service account that you associated with the IAM role in the previous section
        serviceAccount: driver-account-sa
      executor:
        cores: 1
        instances: 1
        memory: "512m"
        labels:
          version: 3.3.1
    EOF
  2. Submit the Spark application.

    kubectl apply -f word-count.yaml

    The kubectl command should return confirmation that you successfully created a SparkApplication object called word-count.

    sparkapplication.sparkoperator.k8s.io/word-count configured
  3. To check events for the SparkApplication object, run the following command:

    kubectl describe sparkapplication word-count -n spark-operator

    The kubectl command should return the description of the SparkApplication with the events:

    Events:
      Type     Reason                               Age                    From            Message
      ----     ------                               ----                   ----            -------
      Normal   SparkApplicationSpecUpdateProcessed  3m2s (x2 over 17h)     spark-operator  Successfully processed spec update for SparkApplication word-count
      Warning  SparkApplicationPendingRerun         3m2s (x2 over 17h)     spark-operator  SparkApplication word-count is pending rerun
      Normal   SparkApplicationSubmitted            2m58s (x2 over 17h)    spark-operator  SparkApplication word-count was submitted successfully
      Normal   SparkDriverRunning                   2m56s (x2 over 17h)    spark-operator  Driver word-count-driver is running
      Normal   SparkExecutorPending                 2m50s                  spark-operator  Executor [javawordcount-fdd1698807392c66-exec-1] is pending
      Normal   SparkExecutorRunning                 2m48s                  spark-operator  Executor [javawordcount-fdd1698807392c66-exec-1] is running
      Normal   SparkDriverCompleted                 2m31s (x2 over 17h)    spark-operator  Driver word-count-driver completed
      Normal   SparkApplicationCompleted            2m31s (x2 over 17h)    spark-operator  SparkApplication word-count completed
      Normal   SparkExecutorCompleted               2m31s (x2 over 2m31s)  spark-operator  Executor [javawordcount-fdd1698807392c66-exec-1] completed
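As an alternative to reading the events, you can check the overall state of the application with a command like the following; the output shows the current application state (for example, COMPLETED):

kubectl get sparkapplication word-count -n spark-operator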

The application is now counting the words in your S3 file. To see the word counts, check the logs for your driver pod:

kubectl logs pod/word-count-driver -n spark-operator

The kubectl command should return the contents of the log file with the results of your word-count application.

INFO DAGScheduler: Job 0 finished: collect at JavaWordCount.java:53, took 5.146519 s
Software: 1
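When you finish with the example, you can delete the SparkApplication object, which also cleans up its driver pod:

kubectl delete -f word-count.yaml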

For more information on how to submit applications to Spark through the Spark operator, see Using a SparkApplication in the Kubernetes Operator for Apache Spark (spark-on-k8s-operator) documentation on GitHub.