使用 Amazon ParallelCluster API - Amazon ParallelCluster
Amazon Web Services 文档中描述的 Amazon Web Services 服务或功能可能因区域而异。要查看适用于中国区域的差异,请参阅 中国的 Amazon Web Services 服务入门 (PDF)

本文属于机器翻译版本。若本译文内容与英语原文存在差异,则一律以英文原文为准。

使用 Amazon ParallelCluster API

在本教程中,您将使用 Amazon API Gateway 和 Amazon ParallelCluster CloudFormation 模板构建和测试 API。然后,您将使用 GitHub 上提供的示例客户端来使用该 API。有关使用 API 的更多信息,请参阅 Amazon ParallelCluster API

本教程摘自面向公共部门客户的 HPC 研讨会

使用 Amazon ParallelCluster 命令行界面 (CLI) 或 API 时,您只需为创建或更新 Amazon ParallelCluster 映像和集群时创建的 Amazon 资源付费。有关更多信息,请参阅 Amazon ParallelCluster 使用的 Amazon 服务

Amazon ParallelCluster UI 基于无服务器的架构而构建,在大多数情况下,可以在 Amazon Free Tier 类别中使用。有关更多信息,请参阅 Amazon ParallelCluster UI 成本

先决条件

留在您的主用户目录中并激活您的虚拟环境:
  1. 安装有用的 JSON 命令行处理器。

    $ sudo yum groupinstall -y "Development Tools" sudo yum install -y jq python3-devel
  2. 运行以下命令以获取您的 Amazon ParallelCluster 版本并将其分配给环境变量。

    $ PCLUSTER_VERSION=$(pcluster version | jq -r '.version') echo "export PCLUSTER_VERSION=${PCLUSTER_VERSION}" |tee -a ~/.bashrc
  3. 创建环境变量并将您的区域 ID 分配给该变量。

    $ export AWS_DEFAULT_REGION="us-east-1" echo "export AWS_DEFAULT_REGION=${AWS_DEFAULT_REGION}" |tee -a ~/.bashrc
  4. 运行以下命令以部署 API。

    API_STACK_NAME="pc-api-stack" echo "export API_STACK_NAME=${API_STACK_NAME}" |tee -a ~/.bashrc
    aws cloudformation create-stack \ --region ${AWS_DEFAULT_REGION} \ --stack-name ${API_STACK_NAME} \ --template-url https://${AWS_DEFAULT_REGION}-aws-parallelcluster.s3.${AWS_DEFAULT_REGION}.amazonaws.com/parallelcluster/${PCLUSTER_VERSION}/api/parallelcluster-api.yaml \ --capabilities CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND \ --parameters ParameterKey=EnableIamAdminAccess,ParameterValue=true { "StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/my-api-stack/abcd1234-ef56-gh78-ei90-1234abcd5678" }

    过程完成后,继续执行下一步。

  1. 登录到 Amazon Web Services Management Console。

  2. 导航到 Amazon API Gateway 控制台

  3. 选择您的 API 部署。

    显示可供您选择的网关列表的 Amazon API Gateway 控制台。
  4. 选择阶段,然后选择一个阶段。

    可供您选择的阶段的控制台视图。您还可以查看 API Gateway 为您的 API 提供的 URL。
  5. 记下 API Gateway 提供的用于访问或调用您的 API 的 URL。它以蓝色突出显示。

  6. 选择资源,然后选择 /clusters 下面的 GET

  7. 选择测试图标,然后向下滚动并选择测试图标。

    API 资源和测试机制的控制台视图。

    将显示 /clusters GET 的响应。

    API 资源、测试机制和测试请求响应的控制台视图。

将 Amazon ParallelCluster 源代码 cd 克隆到 api 目录,并安装 Python 客户端库。

  1. $ git clone -b v${PCLUSTER_VERSION} https://github.com/aws/aws-parallelcluster aws-parallelcluster-v${PCLUSTER_VERSION} cd aws-parallelcluster-v${PCLUSTER_VERSION}/api
    $ pip3 install client/src
  2. 导航回您的主用户目录。

  3. 导出客户端在运行时使用的 API Gateway 基本 URL。

    $ export PCLUSTER_API_URL=$( aws cloudformation describe-stacks --stack-name ${API_STACK_NAME} --query 'Stacks[0].Outputs[?OutputKey==`ParallelClusterApiInvokeUrl`].OutputValue' --output text ) echo "export PCLUSTER_API_URL=${PCLUSTER_API_URL}" |tee -a ~/.bashrc
  4. 导出客户端用于创建集群的集群名称。

    $ export CLUSTER_NAME="test-api-cluster" echo "export CLUSTER_NAME=${CLUSTER_NAME}" |tee -a ~/.bashrc
  5. 运行以下命令以存储示例客户端用于访问 API 的凭证。

    $ export PCLUSTER_API_USER_ROLE=$( aws cloudformation describe-stacks --stack-name ${API_STACK_NAME} --query 'Stacks[0].Outputs[?OutputKey==`ParallelClusterApiUserRole`].OutputValue' --output text ) echo "export PCLUSTER_API_USER_ROLE=${PCLUSTER_API_USER_ROLE}" |tee -a ~/.bashrc
  1. 将以下示例客户端代码复制到您的主用户目录中的 test_pcluster_client.py。客户端代码发出执行以下操作的请求:

    • 创建集群。

    • 描述集群。

    • 列出集群。

    • 描述计算实例集。

    • 描述集群实例。

    # Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. # SPDX-License-Identifier: MIT-0 # # Permission is hereby granted, free of charge, to any person obtaining a copy of this # software and associated documentation files (the "Software"), to deal in the Software # without restriction, including without limitation the rights to use, copy, modify, # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to # permit persons to whom the Software is furnished to do so. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. # # Author: Evan F. Bollig (Github: bollig) import time, datetime import os import pcluster_client from pprint import pprint from pcluster_client.api import ( cluster_compute_fleet_api, cluster_instances_api, cluster_operations_api ) from pcluster_client.model.create_cluster_request_content import CreateClusterRequestContent from pcluster_client.model.cluster_status import ClusterStatus region=os.environ.get("AWS_DEFAULT_REGION") # Defining the host is optional and defaults to http://localhost # See configuration.py for a list of all supported configuration parameters. configuration = pcluster_client.Configuration( host = os.environ.get("PCLUSTER_API_URL") ) cluster_name=os.environ.get("CLUSTER_NAME") # Enter a context with an instance of the API client with pcluster_client.ApiClient(configuration) as api_client: cluster_ops = cluster_operations_api.ClusterOperationsApi(api_client) fleet_ops = cluster_compute_fleet_api.ClusterComputeFleetApi(api_client) instance_ops = cluster_instances_api.ClusterInstancesApi(api_client) # Create cluster build_done = False try: with open('cluster-config.yaml', encoding="utf-8") as f: body = CreateClusterRequestContent(cluster_name=cluster_name, cluster_configuration=f.read()) api_response = cluster_ops.create_cluster(body, region=region) except pcluster_client.ApiException as e: print("Exception when calling create_cluster: %s\n" % e) build_done = True time.sleep(60) # Confirm cluster status with describe_cluster while not build_done: try: api_response = cluster_ops.describe_cluster(cluster_name, region=region) pprint(api_response) if api_response.cluster_status == ClusterStatus('CREATE_IN_PROGRESS'): print('. . . working . . .', end='', flush=True) time.sleep(60) elif api_response.cluster_status == ClusterStatus('CREATE_COMPLETE'): print('READY!') build_done = True else: print('ERROR!!!!') build_done = True except pcluster_client.ApiException as e: print("Exception when calling describe_cluster: %s\n" % e) # List clusters try: api_response = cluster_ops.list_clusters(region=region) pprint(api_response) except pcluster_client.ApiException as e: print("Exception when calling list_clusters: %s\n" % e) # DescribeComputeFleet try: api_response = fleet_ops.describe_compute_fleet(cluster_name, region=region) pprint(api_response) except pcluster_client.ApiException as e: print("Exception when calling compute fleet: %s\n" % e) # DescribeClusterInstances try: api_response = instance_ops.describe_cluster_instances(cluster_name, region=region) pprint(api_response) except pcluster_client.ApiException as e: print("Exception when calling describe_cluster_instances: %s\n" % e)
  2. 创建集群配置。

    $ pcluster configure --config cluster-config.yaml
  3. API 客户端库将自动从您的环境变量(例如 AWS_ACCESS_KEY_IDAWS_SECRET_ACCESS_KEYAWS_SESSION_TOKEN)或 $HOME/.aws 中检测配置详细信息。以下命令可将您当前的 IAM 角色切换到指定的 ParallelClusterApiUserRole。

    $ eval $(aws sts assume-role --role-arn ${PCLUSTER_API_USER_ROLE} --role-session-name ApiTestSession | jq -r '.Credentials | "export AWS_ACCESS_KEY_ID=\(.AccessKeyId)\nexport AWS_SECRET_ACCESS_KEY=\(.SecretAccessKey)\nexport AWS_SESSION_TOKEN=\(.SessionToken)\n"')

    需要注意的错误:

    如果您看到类似于下面的错误,则表示您已经假定 ParallelClusterApiUserRole 和您的 AWS_SESSION_TOKEN 已过期。

    An error occurred (AccessDenied) when calling the AssumeRole operation: 
    User: arn:aws:sts::XXXXXXXXXXXX:assumed-role/ParallelClusterApiUserRole-XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/ApiTestSession 
    is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::XXXXXXXXXXXX:role/ParallelClusterApiUserRole-XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX

    删除该角色,然后重新运行 aws sts assume-role 命令以使用 ParallelClusterApiUserRole。

    $ unset AWS_SESSION_TOKEN unset AWS_SECRET_ACCESS_KEY unset AWS_ACCESS_KEY_ID

    要向您当前的用户提供 API 访问权限,您必须扩展资源策略

  4. 运行以下命令以测试示例客户端。

    $ python3 test_pcluster_client.py {'cluster_configuration': 'Region: us-east-1\n' 'Image:\n' ' Os: alinux2\n' 'HeadNode:\n' ' InstanceType: t2.micro\n' ' Networking . . . :\n' ' SubnetId: subnet-1234567890abcdef0\n' ' Ssh:\n' ' KeyName: adpc\n' 'Scheduling:\n' ' Scheduler: slurm\n' ' SlurmQueues:\n' ' - Name: queue1\n' ' ComputeResources:\n' ' - Name: t2micro\n' ' InstanceType: t2.micro\n' ' MinCount: 0\n' ' MaxCount: 10\n' ' Networking . . . :\n' ' SubnetIds:\n' ' - subnet-1234567890abcdef0\n', 'cluster_name': 'test-api-cluster'} {'cloud_formation_stack_status': 'CREATE_IN_PROGRESS', 'cloudformation_stack_arn': 'arn:aws:cloudformation:us-east-1:123456789012:stack/test-api-cluster/abcd1234-ef56-gh78-ij90-1234abcd5678', 'cluster_configuration': {'url': 'https://parallelcluster-021345abcdef6789-v1-do-not-delete...}, 'cluster_name': 'test-api-cluster', 'cluster_status': 'CREATE_IN_PROGRESS', 'compute_fleet_status': 'UNKNOWN', 'creation_time': datetime.datetime(2022, 4, 28, 16, 18, 47, 972000, tzinfo=tzlocal()), 'last_updated_time': datetime.datetime(2022, 4, 28, 16, 18, 47, 972000, tzinfo=tzlocal()), 'region': 'us-east-1', 'tags': [{'key': 'parallelcluster:version', 'value': '3.1.3'}], 'version': '3.1.3'} . . . . . . working . . . {'cloud_formation_stack_status': 'CREATE_COMPLETE', 'cloudformation_stack_arn': 'arn:aws:cloudformation:us-east-1:123456789012:stack/test-api-cluster/abcd1234-ef56-gh78-ij90-1234abcd5678', 'cluster_configuration': {'url': 'https://parallelcluster-021345abcdef6789-v1-do-not-delete...}, 'cluster_name': 'test-api-cluster', 'cluster_status': 'CREATE_COMPLETE', 'compute_fleet_status': 'RUNNING', 'creation_time': datetime.datetime(2022, 4, 28, 16, 18, 47, 972000, tzinfo=tzlocal()), 'head_node': {'instance_id': 'i-abcdef01234567890', 'instance_type': 't2.micro', 'launch_time': datetime.datetime(2022, 4, 28, 16, 21, 46, tzinfo=tzlocal()), 'private_ip_address': '172.31.27.153', 'public_ip_address': '52.90.156.51', 'state': 'running'}, 'last_updated_time': datetime.datetime(2022, 4, 28, 16, 18, 47, 972000, tzinfo=tzlocal()), 'region': 'us-east-1', 'tags': [{'key': 'parallelcluster:version', 'value': '3.1.3'}], 'version': '3.1.3'} READY!
  1. 将以下示例客户端代码复制到 delete_cluster_client.py。客户端代码发出删除集群的请求。

    # Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. # SPDX-License-Identifier: MIT-0 # # Permission is hereby granted, free of charge, to any person obtaining a copy of this # software and associated documentation files (the "Software"), to deal in the Software # without restriction, including without limitation the rights to use, copy, modify, # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to # permit persons to whom the Software is furnished to do so. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. # # Author: Evan F. Bollig (Github: bollig) import time, datetime import os import pcluster_client from pprint import pprint from pcluster_client.api import ( cluster_compute_fleet_api, cluster_instances_api, cluster_operations_api ) from pcluster_client.model.create_cluster_request_content import CreateClusterRequestContent from pcluster_client.model.cluster_status import ClusterStatus region=os.environ.get("AWS_DEFAULT_REGION") # Defining the host is optional and defaults to http://localhost # See configuration.py for a list of all supported configuration parameters. configuration = pcluster_client.Configuration( host = os.environ.get("PCLUSTER_API_URL") ) cluster_name=os.environ.get("CLUSTER_NAME") # Enter a context with an instance of the API client with pcluster_client.ApiClient(configuration) as api_client: cluster_ops = cluster_operations_api.ClusterOperationsApi(api_client) # Delete the cluster gone = False try: api_response = cluster_ops.delete_cluster(cluster_name, region=region) except pcluster_client.ApiException as e: print("Exception when calling delete_cluster: %s\n" % e) time.sleep(60) # Confirm cluster status with describe_cluster while not gone: try: api_response = cluster_ops.describe_cluster(cluster_name, region=region) pprint(api_response) if api_response.cluster_status == ClusterStatus('DELETE_IN_PROGRESS'): print('. . . working . . .', end='', flush=True) time.sleep(60) except pcluster_client.ApiException as e: gone = True print("DELETE COMPLETE or Exception when calling describe_cluster: %s\n" % e)
  2. 运行以下命令以删除集群。

    $ python3 delete_cluster_client.py {'cloud_formation_stack_status': 'DELETE_IN_PROGRESS', 'cloudformation_stack_arn': 'arn:aws:cloudformation:us-east-1:123456789012:stack/test-api-cluster/abcd1234-ef56-gh78-ij90-1234abcd5678', 'cluster_configuration': {'url': 'https://parallelcluster-021345abcdef6789-v1-do-not-delete...}, 'cluster_name': 'test-api-cluster', 'cluster_status': 'DELETE_IN_PROGRESS', 'compute_fleet_status': 'UNKNOWN', 'creation_time': datetime.datetime(2022, 4, 28, 16, 50, 47, 943000, tzinfo=tzlocal()), 'head_node': {'instance_id': 'i-abcdef01234567890', 'instance_type': 't2.micro', 'launch_time': datetime.datetime(2022, 4, 28, 16, 53, 48, tzinfo=tzlocal()), 'private_ip_address': '172.31.17.132', 'public_ip_address': '34.201.100.37', 'state': 'running'}, 'last_updated_time': datetime.datetime(2022, 4, 28, 16, 50, 47, 943000, tzinfo=tzlocal()), 'region': 'us-east-1', 'tags': [{'key': 'parallelcluster:version', 'value': '3.1.3'}], 'version': '3.1.3'} . . . . . . working . . . {'cloud_formation_stack_status': 'DELETE_IN_PROGRESS', 'cloudformation_stack_arn': 'arn:aws:cloudformation:us-east-1:123456789012:stack/test-api-cluster/abcd1234-ef56-gh78-ij90-1234abcd5678', 'cluster_configuration': {'url': 'https://parallelcluster-021345abcdef6789-v1-do-not-delete...}, 'cluster_name': 'test-api-cluster', 'cluster_status': 'DELETE_IN_PROGRESS', 'compute_fleet_status': 'UNKNOWN', 'creation_time': datetime.datetime(2022, 4, 28, 16, 50, 47, 943000, tzinfo=tzlocal()), 'last_updated_time': datetime.datetime(2022, 4, 28, 16, 50, 47, 943000, tzinfo=tzlocal()), 'region': 'us-east-1', 'tags': [{'key': 'parallelcluster:version', 'value': '3.1.3'}], 'version': '3.1.3'} . . . working . . . DELETE COMPLETE or Exception when calling describe_cluster: (404) Reason: Not Found . . . HTTP response body: {"message":"Cluster 'test-api-cluster' does not exist or belongs to an incompatible ParallelCluster major version."}
  3. 测试完成后,取消设置环境变量。

    $ unset AWS_SESSION_TOKEN unset AWS_SECRET_ACCESS_KEY unset AWS_ACCESS_KEY_ID

您可以使用 Amazon Web Services Management Console或 Amazon CLI 删除您的 API。

  1. 在 Amazon CloudFormation 控制台中,选择 API 堆栈,然后选择删除

  2. 如果使用的是 Amazon CLI,请运行以下命令。

    使用Amazon CloudFormation。

    $ aws cloudformation delete-stack --stack-name ${API_STACK_NAME}