Troubleshooting - Amazon EMR

Troubleshooting

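This section describes how to troubleshoot problems with Amazon EMR on EKS. For information about how to troubleshoot general problems with Amazon EMR, see Troubleshoot a cluster in the Amazon EMR Management Guide.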

Resource mapping not found when installing the Helm chart

You might encounter the following error message when you install the Helm chart.

Error: INSTALLATION FAILED: pulling from host 1234567890.dkr.ecr.us-west-2.amazonaws.com failed with status code [manifests 6.13.0]: 403 Forbidden Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: [resource mapping not found for name: "flink-operator-serving-cert" namespace: "<the namespace to install your operator>" from "": no matches for kind "Certificate" in version "cert-manager.io/v1" ensure CRDs are installed first, resource mapping not found for name: "flink-operator-selfsigned-issuer" namespace: "<the namespace to install your operator>" " from "": no matches for kind "Issuer" in version "cert-manager.io/v1" ensure CRDs are installed first].

To resolve this error, install cert-manager, which enables the webhook component to be added. You must install cert-manager on each Amazon EKS cluster that you use.

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.12.0
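After cert-manager is installed, it can help to confirm that the two resource kinds named in the error message are now registered before retrying the Helm installation. The following is a minimal sketch, not part of the official procedure. The function name `check_cert_manager_crds` is hypothetical, and the sketch assumes the CRDs use the standard cert-manager names `certificates.cert-manager.io` and `issuers.cert-manager.io`.

```shell
# Hypothetical helper: reports whether the cert-manager CRDs that the
# Flink operator chart depends on are registered in the current cluster.
check_cert_manager_crds() {
  for crd in certificates.cert-manager.io issuers.cert-manager.io; do
    # "kubectl get crd <name>" exits non-zero when the CRD does not exist.
    if ! kubectl get crd "$crd" >/dev/null 2>&1; then
      echo "missing CRD: $crd"
      return 1
    fi
  done
  echo "cert-manager CRDs present"
}
```

Run `check_cert_manager_crds` against the same cluster context that you use for `helm install`. If it reports a missing CRD, the resource mapping error is likely to occur again.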

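If you see an access denied error, confirm that the IAM role set for operatorExecutionRoleArn in the Helm chart values.yaml file has the correct permissions. Also make sure that the IAM role under executionRoleArn in your FlinkDeployment specification has the correct permissions.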

If your FlinkDeployment gets stuck in an arrested state, use the following steps to force delete the deployment:

  1. Edit the deployment.

    kubectl edit -n <flink-namespace> flinkdeployments/<app-name>
  2. Remove this finalizer.

    finalizers:
      - flinkdeployments.flink.apache.org/finalizer
  3. Delete the deployment.

    kubectl delete -n <flink-namespace> flinkdeployments/<app-name>
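The steps above can also be run without an interactive editor. The following is a minimal sketch rather than the documented procedure. The function name `force_delete_flinkdeployment` is hypothetical. Note that the merge patch clears every finalizer on the resource, not only `flinkdeployments.flink.apache.org/finalizer`, so it assumes that no other finalizer needs to be kept.

```shell
# Hypothetical helper: removes the finalizers from a FlinkDeployment and
# then deletes it. Arguments: <namespace> <application name>.
force_delete_flinkdeployment() {
  namespace="$1"
  app_name="$2"
  # Setting metadata.finalizers to null with a merge patch removes ALL
  # finalizers, which lets Kubernetes complete a pending deletion.
  kubectl patch "flinkdeployments/$app_name" -n "$namespace" \
    --type=merge -p '{"metadata":{"finalizers":null}}' || return 1
  # If a deletion was already pending, the resource may disappear as soon
  # as the patch is applied, so a "not found" result is not an error here.
  kubectl delete "flinkdeployments/$app_name" -n "$namespace" --ignore-not-found
}
```

For example, `force_delete_flinkdeployment my-flink-namespace my-app`, where both values are placeholders for your own namespace and application name.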

If you run a Flink application in an opt-in AWS Region, you might see the following errors:

Caused by: org.apache.hadoop.fs.s3a.AWSBadRequestException: getFileStatus on s3://flink.txt: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: ABCDEFGHIJKL; S3 Extended Request ID: ABCDEFGHIJKLMNOP=; Proxy: null), S3 Extended Request ID: ABCDEFGHIJKLMNOP=:400 Bad Request: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: ABCDEFGHIJKL; S3 Extended Request ID: ABCDEFGHIJKLMNOP=; Proxy: null)
Caused by: org.apache.hadoop.fs.s3a.AWSBadRequestException: getS3Region on flink-application: software.amazon.awssdk.services.s3.model.S3Exception: null (Service: S3, Status Code: 400, Request ID: ABCDEFGHIJKLMNOP, Extended Request ID: ABCDEFGHIJKLMNOPQRST==):null: null (Service: S3, Status Code: 400, Request ID: ABCDEFGHIJKLMNOP, Extended Request ID: AHl42uDNaTUFOus/5IIVNvSakBcMjMCH7dd37ky0vE6jhABCDEFGHIJKLMNOPQRST==)

To fix these errors, use the following configuration in your FlinkDeployment definition file.

spec:
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
    fs.s3a.endpoint.region: OPT_IN_AWS_REGION_NAME

We also recommend that you use the SDKv2 credentials provider:

fs.s3a.aws.credentials.provider: software.amazon.awssdk.auth.credentials.WebIdentityTokenFileCredentialsProvider

If you want to use the SDKv1 credentials provider, make sure that your SDK supports your opt-in Region. For more information, see the aws-sdk-java GitHub repository.

If you receive an S3 AWSBadRequestException when you run Flink SQL statements in an opt-in Region, make sure that you set fs.s3a.endpoint.region: OPT_IN_AWS_REGION_NAME in your Flink configuration spec.
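To check whether the setting actually reached a deployed FlinkDeployment, one option is to read it back from the cluster. The following is a minimal sketch. The function name `get_s3a_endpoint_region` is hypothetical, and the sketch assumes the value is stored under `spec.flinkConfiguration` as in the configuration shown earlier. The dots inside the key are escaped with backslashes so that the kubectl JSONPath parser treats `fs.s3a.endpoint.region` as a single key.

```shell
# Hypothetical helper: prints the fs.s3a.endpoint.region value configured
# on a FlinkDeployment. Arguments: <namespace> <application name>.
get_s3a_endpoint_region() {
  namespace="$1"
  app_name="$2"
  # Backslash-escaped dots keep the dotted key from being split into
  # nested fields by JSONPath.
  kubectl get "flinkdeployments/$app_name" -n "$namespace" \
    -o jsonpath='{.spec.flinkConfiguration.fs\.s3a\.endpoint\.region}'
}
```

Empty output suggests that the key is not set on that resource.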

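For Amazon EMR releases 6.15.0 through 7.2.0, you might receive the following error message when you run a Flink session job in the China Regions, which include China (Beijing) and China (Ningxia):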

Error: {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.hadoop.fs.s3a.AWSBadRequestException: getFileStatus on s3://ABCDPath: software.amazon.awssdk.services.s3.model.S3Exception: null (Service: S3, Status Code: 400, Request ID: ABCDEFGH, Extended Request ID: ABCDEFGH:null: null (Service: S3, Status Code: 400, Request ID: ABCDEFGH, Extended Request ID: ABCDEFGH","additionalMetadata":{},"throwableList": [{"type":"org.apache.hadoop.fs.s3a.AWSBadRequestException","message":"getFileStatus on s3://ABCDPath: software.amazon.awssdk.services.s3.model.S3Exception: null (Service: S3, Status Code: 400, Request ID: ABCDEFGH, Extended Request ID: ABCDEFGH:null: null (Service: S3, Status Code: 400, Request ID: ABCDEFGH, Extended Request ID: ABCDEFGH","additionalMetadata":{}},{"type":"software.amazon.awssdk.services.s3.model.S3Exception","message":"null (Service: S3, Status Code: 400, Request ID: ABCDEFGH, Extended Request ID: ABCDEFGH","additionalMetadata":{}}]}

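This is a known issue. The team is working on patching the Flink operator for these releases. Until the patch is available, you can fix this error by downloading the Flink operator Helm chart, extracting the archive, and changing the configuration in the Helm chart.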

The specific steps are as follows:

  1. Change to the local directory where you want to keep the Helm chart, then run the following commands to pull the Helm chart and extract it.

    helm pull oci://public.ecr.aws/emr-on-eks/flink-kubernetes-operator \
      --version $VERSION \
      --namespace $NAMESPACE
    tar -zxvf flink-kubernetes-operator-$VERSION.tgz
  2. Go into the Helm chart folder and find the templates/flink-operator.yaml file.

  3. In that file, find the flink-operator-config ConfigMap and add the following fs.s3a.endpoint.region configuration to flink-conf.yaml. For example:

    {{- if .Values.defaultConfiguration.create }}
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: flink-operator-config
      namespace: {{ .Release.Namespace }}
      labels:
        {{- include "flink-operator.labels" . | nindent 4 }}
    data:
      flink-conf.yaml: |+
        fs.s3a.endpoint.region: {{ .Values.emrContainers.awsRegion }}
  4. Install the local Helm chart and run your job.
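The last step can be guarded so that the chart is only installed after the template has been edited. The following is a minimal sketch under stated assumptions. The function name `install_patched_chart` and the default release name `flink-kubernetes-operator` are hypothetical, and the sketch assumes that the archive extracts into a folder that you pass as the first argument. Add any other values that your installation needs, such as the operator execution role, through your usual Helm options.

```shell
# Hypothetical helper: installs the locally modified chart, but only if
# the fs.s3a.endpoint.region setting is present in the operator template.
# Arguments: <chart directory> <namespace> [release name]
install_patched_chart() {
  chart_dir="$1"
  namespace="$2"
  release="${3:-flink-kubernetes-operator}"   # assumed default release name
  template="$chart_dir/templates/flink-operator.yaml"
  # -F matches the key literally, so the dots are not treated as wildcards.
  if ! grep -F -q 'fs.s3a.endpoint.region' "$template"; then
    echo "fs.s3a.endpoint.region not found in $template" >&2
    return 1
  fi
  helm install "$release" "$chart_dir" --namespace "$namespace"
}
```

For example, `install_patched_chart ./flink-kubernetes-operator "$NAMESPACE"`, run from the directory where you extracted the chart.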