
Setting up IAM roles for service accounts (IRSA) for spark-submit

The following sections describe how to set up IAM roles for service accounts (IRSA) to authenticate and authorize the Kubernetes service account so that you can run a Spark application stored in Amazon S3.

Prerequisites

Before you try any of the examples in this documentation, make sure that you have completed the following prerequisites:
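For example, the spark-submit step later in this topic reads the application jar from Amazon S3, and the IAM policy below grants read access to it, so one prerequisite these examples assume is that the Spark examples jar has already been uploaded to a bucket you control. A minimal sketch, where the local jar path is a placeholder and my-pod-bucket matches the bucket used in the submit command:

    # Stage the examples jar in S3 so the driver and executors can fetch it (the local path is a placeholder).
    aws s3 cp ./spark-examples.jar s3://my-pod-bucket/spark-examples.jar
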

Configure a Kubernetes service account to assume an IAM role

The following steps show how to configure a Kubernetes service account to assume an Amazon Identity and Access Management (IAM) role. Pods that are configured to use the service account can then access any Amazon Web Services service that the role has permission to access.
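IRSA also depends on the cluster having an IAM OpenID Connect (OIDC) provider associated with it; eksctl can set this up if it hasn't been done already. A minimal sketch, assuming the cluster is named my-cluster as in the steps below:

    # Associate the cluster's OIDC issuer with IAM so Kubernetes service accounts can assume IAM roles.
    eksctl utils associate-iam-oidc-provider \
      --cluster my-cluster \
      --approve
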

  1. Create a policy file that allows read-only access to the Amazon S3 objects that you uploaded:

    cat >my-policy.json <<EOF
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::<my-spark-jar-bucket>",
            "arn:aws:s3:::<my-spark-jar-bucket>/*"
          ]
        }
      ]
    }
    EOF
  2. Create the IAM policy.

    aws iam create-policy --policy-name my-policy --policy-document file://my-policy.json
  3. Create an IAM role and associate it with a Kubernetes service account for the Spark driver:

    eksctl create iamserviceaccount --name my-spark-driver-sa --namespace spark-operator \
      --cluster my-cluster --role-name "my-role" \
      --attach-policy-arn arn:aws:iam::111122223333:policy/my-policy --approve
  4. Create a YAML file with the permissions that the Spark driver service account requires:

    cat >spark-rbac.yaml <<EOF
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      namespace: default
      name: emr-containers-role-spark
    rules:
      - apiGroups:
          - ""
        resources:
          - pods
        verbs:
          - "*"
      - apiGroups:
          - ""
        resources:
          - services
        verbs:
          - "*"
      - apiGroups:
          - ""
        resources:
          - configmaps
        verbs:
          - "*"
      - apiGroups:
          - ""
        resources:
          - persistentvolumeclaims
        verbs:
          - "*"
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: spark-role-binding
      namespace: default
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: emr-containers-role-spark
    subjects:
      - kind: ServiceAccount
        name: emr-containers-sa-spark
        namespace: default
    EOF
  5. Apply the role binding configuration.

    kubectl apply -f spark-rbac.yaml
  6. The kubectl command returns confirmation of the created service account and role binding; a quick verification sketch follows this list.

    serviceaccount/emr-containers-sa-spark created
    clusterrolebinding.rbac.authorization.k8s.io/emr-containers-role-spark configured
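Before you submit a job, you can optionally confirm that both pieces of the setup are in place. The following is a minimal check, assuming the names used above; eksctl records the role on the IRSA service account as an eks.amazonaws.com/role-arn annotation, and kubectl auth can-i tests the RBAC binding:

    # Confirm the IRSA service account carries the IAM role annotation added by eksctl.
    kubectl describe serviceaccount my-spark-driver-sa -n spark-operator | grep eks.amazonaws.com/role-arn

    # Confirm the driver service account bound above can create pods in the default namespace.
    kubectl auth can-i create pods \
      --as=system:serviceaccount:default:emr-containers-sa-spark \
      --namespace default
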

Run a Spark application

Amazon EMR 6.10.0 and higher supports spark-submit for running Spark applications on an Amazon EKS cluster. To run a Spark application, follow these steps:

  1. Make sure that you have completed the steps in Setting up spark-submit for Amazon EMR on EKS.

  2. Set the values for the following environment variables (one way to look up the Amazon EKS cluster endpoint is sketched after this list):

    export SPARK_HOME=spark-home
    export MASTER_URL=k8s://Amazon-EKS-cluster-endpoint
  3. Now, submit the Spark application with the following command:

    $SPARK_HOME/bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master $MASTER_URL \
      --conf spark.kubernetes.container.image=895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.15.0:latest \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=emr-containers-sa-spark \
      --deploy-mode cluster \
      --conf spark.kubernetes.namespace=default \
      --conf "spark.driver.extraClassPath=/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*" \
      --conf "spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native" \
      --conf "spark.executor.extraClassPath=/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*" \
      --conf "spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native" \
      --conf spark.hadoop.fs.s3.customAWSCredentialsProvider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
      --conf spark.hadoop.fs.s3.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem \
      --conf spark.hadoop.fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3.EMRFSDelegate \
      --conf spark.hadoop.fs.s3.buffer.dir=/mnt/s3 \
      --conf spark.hadoop.fs.s3.getObject.initialSocketTimeoutMilliseconds="2000" \
      --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem="2" \
      --conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem="true" \
      s3://my-pod-bucket/spark-examples.jar 20
  4. After the Spark driver completes the Spark job, you should see a log line at the end of the submission output indicating that the job finished. Commands for watching the driver pod while the job runs are sketched after this list.

    23/11/24 17:02:14 INFO LoggingPodStatusWatcherImpl: Application org.apache.spark.examples.SparkPi with submission ID default:org-apache-spark-examples-sparkpi-4980808c03ff3115-driver finished
    23/11/24 17:02:14 INFO ShutdownHookManager: Shutdown hook called
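Two details the steps above leave implicit are how to find the value for MASTER_URL and how to watch the job while it runs. The following is a minimal sketch, assuming the cluster is named my-cluster and the job runs in the default namespace; Spark on Kubernetes labels driver pods with spark-role=driver:

    # Look up the Amazon EKS API server endpoint and build MASTER_URL from it.
    export MASTER_URL=k8s://$(aws eks describe-cluster --name my-cluster \
      --query "cluster.endpoint" --output text)

    # Watch the driver pod that spark-submit creates in cluster deploy mode.
    kubectl get pods -n default -l spark-role=driver --watch

    # Inspect the driver logs for the application output (for SparkPi, the computed value of Pi).
    kubectl logs -n default -l spark-role=driver
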

Clean up

When you finish running your application, you can clean up with the following command.

kubectl delete -f spark-rbac.yaml
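The command above removes only the RBAC resources. If you also want to remove the IAM service account and policy created earlier, the following is a minimal sketch using the names from the setup steps (111122223333 is the placeholder account ID used above):

    # Delete the IRSA service account and the IAM role that eksctl created for it.
    eksctl delete iamserviceaccount --name my-spark-driver-sa --namespace spark-operator --cluster my-cluster

    # Delete the read-only S3 policy once it is no longer attached to any role.
    aws iam delete-policy --policy-arn arn:aws:iam::111122223333:policy/my-policy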