Amazon Glue for Ray end of support
After careful consideration, we decided to close Amazon Glue for Ray to new customers starting April 30, 2026. If you would like to use Amazon Glue for Ray, sign up prior to that date. Existing customers can continue to use the service as normal.
Amazon continues to invest in security and availability improvements for Amazon Glue for Ray. Note that we do not plan to introduce new features to Amazon Glue for Ray, except for security and availability enhancements.
As an alternative to Amazon Glue for Ray, we recommend using Amazon Elastic Kubernetes Service. Amazon Elastic Kubernetes Service is a fully managed, certified Kubernetes conformant service that simplifies the process of building, securing, operating, and maintaining Kubernetes clusters on Amazon. It is a highly customizable option that relies on open-source KubeRay Operator to deploy and manage Ray clusters on Kubernetes, offering improved resource utilization, simplified infrastructure management, and full support for Ray features.
Migrating a Ray job to Amazon Elastic Kubernetes Service
This section provides steps for migrating from Amazon Glue for Ray to Ray on Amazon Elastic Kubernetes Service. These steps are helpful for two migration scenarios:
- Standard Migration (x86/amd64): For these use cases, the migration strategy uses the open-source Ray container for basic implementations and executes scripts directly on the base container.
- ARM64 Migration: For these use cases, the migration strategy supports custom container builds for ARM64-specific dependencies and architecture requirements.
Prerequisites for migration
Install the following CLI tools: aws, kubectl, eksctl, helm, Python 3.9+. These CLI tools are required to provision and manage your Ray on EKS environment. eksctl simplifies creating and managing EKS clusters. kubectl is the standard Kubernetes CLI for deploying and troubleshooting workloads on your cluster. helm is used to install and manage KubeRay (the operator that runs Ray on Kubernetes). Python 3.9+ is required for Ray itself and to run job submission scripts locally.
Install eksctl
Follow the instructions on Installation options for Eksctl or use the instructions below for installation.
For macOS:
brew tap weaveworks/tap
brew install weaveworks/tap/eksctl
For Linux:
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp

# Move the extracted binary to /usr/local/bin
sudo mv /tmp/eksctl /usr/local/bin

# Test the installation
eksctl version
Install kubectl
Follow the instructions on Set up kubectl and eksctl or use the instructions below for installation.
For macOS:
brew install kubectl
For Linux:
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
Install helm
Follow the instructions on Installing Helm or use the instructions below for installation.
For macOS:
brew install helm
For Linux:
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
Step 1. Build or choose a Docker Image for Ray
Option 1: Use the official Ray image (no build required)
This option uses the official Ray Docker image on Docker Hub, rayproject/ray:2.4.0-py39, which is maintained by the Ray project.
Note
This image is amd64-only. Use this if your dependencies are compatible with amd64 and you don't require ARM-specific builds.
Option 2: Build and publish your own arm64 Ray 2.4.0 image
This option is useful when using Graviton (ARM) nodes, consistent with what Amazon Glue for Ray uses internally. You can create a custom image pinned to the same dependency versions as Amazon Glue for Ray, to reduce compatibility mismatches.
Create a Dockerfile locally:
# Build an ARM64 image
FROM --platform=linux/arm64 python:3.9-slim-bullseye

# Handy tools: wget for KubeRay probes; CA certs; keep image small
RUN apt-get update && apt-get install -y --no-install-recommends \
    wget ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Keep pip/setuptools modern enough for wheels resolution
RUN python -m pip install -U "pip<24" "setuptools<70" wheel

# ---- Install Ray 2.4.0 (ARM64 / Py3.9) and Glue-like dependencies ----
# 1) Download the exact Ray 2.4.0 wheel for aarch64 (no network at runtime)
RUN python -m pip download --only-binary=:all: --no-deps --dest /tmp/wheels ray==2.4.0

# 2) Core libs used in Glue (pin to Glue-era versions)
#    + the dashboard & jobs API dependencies compatible with Ray 2.4.0.
#    (Pins matter: newer major versions break 2.4.0's dashboard.)
RUN python -m pip install --no-cache-dir \
    /tmp/wheels/ray-2.4.0-*.whl \
    "pyarrow==11.0.0" \
    "pandas==1.5.3" \
    "boto3==1.26.133" \
    "botocore==1.29.133" \
    "numpy==1.24.3" \
    "fsspec==2023.4.0" \
    "protobuf<4" \
# --- dashboard / jobs server deps ---
    "aiohttp==3.8.5" \
    "aiohttp-cors==0.7.0" \
    "yarl<1.10" "multidict<7.0" "frozenlist<1.4" "aiosignal<1.4" "async_timeout<5" \
    "pydantic<2" \
    "opencensus<0.12" \
    "prometheus_client<0.17" \
# --- needed if using py_modules ---
    "smart_open[s3]==6.4.0"

ENV PYTHONUNBUFFERED=1
WORKDIR /app

# Optional: prove Ray & arch at container start.
# KubeRay overrides the start command; this is just a harmless default.
CMD ["python","-c","import ray,platform; print('Ray', ray.__version__, 'on', platform.machine())"]
# Set environment variables
export AWS_REGION=us-east-1
export AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REPO=ray-2-4-arm64
export IMAGE=${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO}:v1

# Create repository and login
aws ecr create-repository --repository-name $REPO >/dev/null 2>&1 || true
aws ecr get-login-password --region $AWS_REGION \
  | docker login --username AWS --password-stdin ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com

# Enable Buildx (for cross-builds on non-ARM hosts)
docker buildx create --name multi --driver docker-container --use 2>/dev/null || true

# Build & push ARM64 image
docker buildx build \
  --platform linux/arm64 \
  -t "$IMAGE" \
  --push \
  .

# Verify the image architecture remotely
aws ecr batch-get-image \
  --repository-name $REPO \
  --image-ids imageTag=v1 \
  --accepted-media-types application/vnd.docker.distribution.manifest.v2+json \
  | jq -r '.images[0].imageManifest' \
  | jq -r 'fromjson.config.digest'
Once done, reference this ARM64 image in the RayCluster spec and add nodeSelector: { kubernetes.io/arch: arm64 } so the pods are scheduled on ARM nodes.

spec:
  rayVersion: "2.4.0"
  headGroupSpec:
    template:
      spec:
        nodeSelector:
          kubernetes.io/arch: arm64
        containers:
          - name: ray-head
            image: <your ECR image>
Step 2. Convert Amazon Glue for Ray Job Configuration to Ray on Amazon Elastic Kubernetes Service
Amazon Glue for Ray jobs support a set of job arguments that configure workers, dependencies, memory, and logging. When migrating to Amazon Elastic Kubernetes Service with KubeRay, these arguments need to be translated into RayCluster spec fields or Ray Job runtime environment settings.
Job Argument Mapping
| Amazon Glue for Ray argument | What it does in Amazon Glue for Ray | Ray on Amazon Elastic Kubernetes Service equivalent |
|---|---|---|
| --min-workers | Minimum workers the job must allocate. | workerGroupSpecs[].minReplicas in your RayCluster. |
| --working-dir | Distributes a zip (S3 URI) to all nodes. | Use the Ray runtime env: working_dir if you're submitting from local files; py_modules to point at an S3 zip artifact. |
| --s3-py-modules | Adds Python wheels/dists from S3. | Use the Ray runtime env: py_modules: ["s3://.../xxx.whl", ...]. |
| --pip-install | Installs extra PyPI packages for the job. | Ray runtime env: pip: ["pkg==ver", ...] (Ray Jobs CLI --runtime-env-json or RayJob runtimeEnvYAML). |
| --object_store_memory_head | % of memory for the head node's Plasma store. | headGroupSpec.rayStartParams.object-store-memory in your RayCluster. Note this value must be in bytes: Amazon Glue uses a percentage, while Ray uses bytes. |
| --object_store_memory_worker | % of memory for worker nodes' Plasma store. | Same as above, but set in each worker group's rayStartParams.object-store-memory (bytes). |
| --object_spilling_config | Configures Ray object spilling. | headGroupSpec.rayStartParams.object-spilling-config |
| --logging_configuration | Amazon Glue-managed logs (CloudWatch, S3). | Check pod stdout/stderr with kubectl -n ray logs <pod-name> --follow, or view task and job logs in the Ray Dashboard (port-forward to :8265). |
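As a worked example of the object-store rows above: Amazon Glue expresses the Plasma store size as a percentage of node memory, while rayStartParams.object-store-memory expects bytes. A minimal conversion sketch; the helper name and the 64 GiB / 25% figures are illustrative, not Glue defaults:

```python
def object_store_bytes(node_memory_gib: float, glue_percent: float) -> int:
    """Convert a Glue-style percentage of node memory into the byte
    value that Ray's rayStartParams.object-store-memory expects."""
    return int(node_memory_gib * (1024 ** 3) * glue_percent / 100)

# e.g. a worker with 64 GiB RAM whose Glue job reserved 25% for the object store
mem_bytes = object_store_bytes(64, 25)

# rayStartParams values are strings in the RayCluster spec
ray_start_params = {"object-store-memory": str(mem_bytes)}
```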
Job Configuration Mapping
| Configuration | What it does in Amazon Glue for Ray | Ray on EKS equivalent |
|---|---|---|
| Worker type | Sets the type of predefined worker that is allowed when a job runs. Defaults to Z.2X (8 vCPU, 64 GB RAM). | Node group instance type in EKS (e.g., r7g.2xlarge ≈ 8 vCPU / 64 GB for ARM, r7a.2xlarge for x86). |
| Maximum number of workers | The number of workers you want Amazon Glue to allocate to this job. | Set workerGroupSpecs[].maxReplicas to the same number you used in Amazon Glue; this is the upper bound for autoscaling. Similarly, set minReplicas as the lower bound. You can start with replicas: 0, minReplicas: 0. |
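The worker-count mapping above can be folded into a small helper that emits the corresponding workerGroupSpecs entry. A sketch under the assumption that you start empty and let autoscaling fill in; the function name and defaults are illustrative, not part of KubeRay:

```python
def worker_group_spec(max_workers: int, arch: str = "arm64") -> dict:
    """Translate Glue's 'maximum number of workers' into a RayCluster
    workerGroupSpecs entry that starts empty and autoscales up."""
    return {
        "groupName": "workers",
        "replicas": 0,               # start with just the head node
        "minReplicas": 0,            # lower bound for autoscaling
        "maxReplicas": max_workers,  # Glue's "maximum number of workers"
        "template": {
            "spec": {
                # steer worker pods onto the matching nodegroup architecture
                "nodeSelector": {"kubernetes.io/arch": arch},
            }
        },
    }

spec = worker_group_spec(max_workers=10)
```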
Step 3. Set up Amazon Elastic Kubernetes Service
You can either create a new Amazon Elastic Kubernetes Service cluster or reuse an existing Amazon Elastic Kubernetes Service cluster. If using an existing cluster, skip the create cluster commands and jump to Add a node group, IRSA, and install KubeRay.
Create an Amazon Elastic Kubernetes Service cluster
Note
If you have an existing Amazon Elastic Kubernetes Service cluster, skip the commands to create a new cluster and just add a node group.
# Environment Variables
export AWS_REGION=us-east-1
export CLUSTER=ray-eks
export NS=ray   # namespace for your Ray jobs (you can reuse another if you like)

# Create a cluster (OIDC is required for IRSA)
eksctl create cluster \
  --name $CLUSTER \
  --region $AWS_REGION \
  --with-oidc \
  --managed
Add a node group
# ARM/Graviton (matches Glue's typical runtime):
eksctl create nodegroup \
  --cluster $CLUSTER \
  --region $AWS_REGION \
  --name arm64-ng \
  --node-type m7g.large \
  --nodes 2 --nodes-min 1 --nodes-max 5 \
  --managed \
  --node-labels "workload=ray"

# x86/amd64 (use if your image is amd64-only):
eksctl create nodegroup \
  --cluster $CLUSTER \
  --region $AWS_REGION \
  --name amd64-ng \
  --node-type m5.large \
  --nodes 2 --nodes-min 1 --nodes-max 5 \
  --managed \
  --node-labels "workload=ray"
Note
If you are using an existing Amazon Elastic Kubernetes Service cluster that was created without OIDC, associate an OIDC provider (for example, with eksctl utils associate-iam-oidc-provider --cluster $CLUSTER --approve) before creating the IAM service account below.
Create namespace + IAM role for Service Accounts (IRSA) for S3
A Kubernetes namespace is a logical grouping for resources (pods, services, roles, etc.). You can create a new namespace or reuse an existing one. You will also need an IAM policy for S3 that mirrors your Amazon Glue job's access; use the same custom permissions your Amazon Glue job role had (typically S3 read/write to specific buckets). To grant Amazon Elastic Kubernetes Service permissions similar to the AWSGlueServiceRole, create a service account (IRSA) bound to this IAM policy. Refer to IAM Roles for Service Accounts for instructions to set up this service account.
# Create (or reuse) namespace kubectl create namespace $NS || true
Save the following policy document as example.json, replacing YOUR-BUCKET with your bucket name:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws-cn:s3:::YOUR-BUCKET",
      "arn:aws-cn:s3:::YOUR-BUCKET/*"
    ]
  }]
}
# Create the IAM policy and wire IRSA:
aws iam create-policy \
  --policy-name RayS3Policy \
  --policy-document file://example.json || true

# Create a service account (IRSA) bound to that policy.
eksctl create iamserviceaccount \
  --cluster $CLUSTER \
  --region $AWS_REGION \
  --namespace $NS \
  --name ray-s3-access \
  --attach-policy-arn arn:aws-cn:iam::${AWS_ACCOUNT}:policy/RayS3Policy \
  --approve \
  --override-existing-serviceaccounts
Install KubeRay operator (controller that runs Ray on K8s)
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm upgrade --install kuberay-operator kuberay/kuberay-operator \
  --namespace kuberay-system \
  --create-namespace

# Validate that the operator pod is Running
kubectl -n kuberay-system get pods
Step 4. Spin up a Ray cluster
Create a YAML file to define the Ray cluster. Below is a sample configuration (raycluster.yaml):
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: glue-like
  namespace: ray
spec:
  rayVersion: "2.4.0"
  headGroupSpec:
    template:
      spec:
        nodeSelector:
          kubernetes.io/arch: amd64
        serviceAccountName: ray-s3-access
        containers:
          - name: ray-head
            image: rayproject/ray:2.4.0-py39
            imagePullPolicy: Always
            resources:
              requests: { cpu: "1", memory: "2Gi" }
              limits: { cpu: "1", memory: "2Gi" }
  workerGroupSpecs:
    - groupName: workers
      replicas: 0      # start with just a head (like a small Glue dev job) and tune the number of replicas later
      minReplicas: 0
      maxReplicas: 5
      template:
        spec:
          nodeSelector:
            kubernetes.io/arch: amd64
          serviceAccountName: ray-s3-access
          containers:
            - name: ray-worker
              image: rayproject/ray:2.4.0-py39
              imagePullPolicy: Always
              resources:
                requests: { cpu: "1", memory: "2Gi" }
                limits: { cpu: "1", memory: "2Gi" }
Deploy the Ray cluster on Amazon Elastic Kubernetes Service cluster
kubectl apply -n $NS -f raycluster.yaml

# Validate that the head pod reaches the READY/RUNNING state
kubectl -n $NS get pods -l ray.io/cluster=glue-like -w
If you need to modify the deployed YAML, delete the cluster first and then re-apply the updated YAML:
kubectl -n $NS delete raycluster glue-like
kubectl -n $NS apply -f raycluster.yaml
Accessing the Ray Dashboard
You can access the Ray dashboard by enabling port-forwarding using kubectl:
# Get the head service name
SVC=$(kubectl -n $NS get svc -l ray.io/cluster=glue-like,ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')

# Make the Ray dashboard accessible at http://localhost:8265 on your local machine
kubectl -n $NS port-forward svc/$SVC 8265:8265
Step 5. Submit Ray Job
To submit a Ray job, use the Ray Jobs CLI. The CLI version can be newer than the cluster's; it is backward compatible. As a prerequisite, store your job script locally in a file, e.g. job.py.
python3 -m venv ~/raycli && source ~/raycli/bin/activate
pip install "ray[default]==2.49.2"

# Submit your Ray job, supplying all Python dependencies that were added to your Glue job
ray job submit --address http://127.0.0.1:8265 --working-dir . \
  --runtime-env-json '{ "pip": ["boto3==1.28.*","pyarrow==12.*","pandas==2.0.*"] }' \
  -- python job.py
The job can be monitored on the Ray dashboard.
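As an alternative to watching the dashboard, the same job can be submitted and polled from Python with Ray's job-submission SDK (ray.job_submission.JobSubmissionClient). A sketch assuming the port-forward from Step 4 is still active and ray[default] is installed in the local virtualenv; the helper names are illustrative:

```python
import time

# Terminal states for a Ray job, as reported by the Jobs API
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}

def is_terminal(status) -> bool:
    """True once a Ray job has reached a final state; accepts the
    JobStatus enum or its plain string value."""
    return getattr(status, "value", status) in TERMINAL_STATES

def submit_and_wait(address: str = "http://127.0.0.1:8265",
                    entrypoint: str = "python job.py") -> str:
    # Imported lazily so the helpers above also work without ray installed
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    job_id = client.submit_job(
        entrypoint=entrypoint,
        runtime_env={"working_dir": ".",
                     "pip": ["boto3==1.28.*", "pyarrow==12.*", "pandas==2.0.*"]},
    )
    # Poll until the job finishes, mirroring what the dashboard shows
    while not is_terminal(client.get_job_status(job_id)):
        time.sleep(5)
    return job_id
```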