Create a cluster with Terraform - Amazon ParallelCluster

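Create a cluster with Terraform

When you use Amazon ParallelCluster, you pay only for the Amazon resources that are created when you create or update Amazon ParallelCluster images and clusters. For more information, see Amazon services used by Amazon ParallelCluster.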

Prerequisites

Define a Terraform project

In this tutorial, you define a simple Terraform project to deploy a cluster.

  1. Create a directory named my-clusters.

    All the files that you create will be in this directory.

  2. Create the file terraform.tf to import the ParallelCluster provider.

    terraform {
      required_version = ">= 1.5.7"
      required_providers {
        aws-parallelcluster = {
          source  = "aws-tf/aws-parallelcluster"
          version = "1.0.0"
        }
      }
    }
  3. Create the file providers.tf to configure the ParallelCluster and AWS providers.

    provider "aws" {
      region  = var.region
      profile = var.profile
    }

    provider "aws-parallelcluster" {
      region         = var.region
      profile        = var.profile
      api_stack_name = var.api_stack_name
      use_user_role  = true
    }
  4. Create the file main.tf to define the resources by using the ParallelCluster module.

    module "pcluster" {
      source  = "aws-tf/parallelcluster/aws"
      version = "1.0.0"

      region              = var.region
      api_stack_name      = var.api_stack_name
      api_version         = var.api_version
      deploy_pcluster_api = false

      template_vars   = local.config_vars
      cluster_configs = local.cluster_configs
      config_path     = "config/clusters.yaml"
    }
  5. Create the file clusters.tf to define multiple clusters as Terraform local variables.

    Note

    You can define multiple clusters within the cluster_configs element. For each cluster, you can either define the cluster properties explicitly within the local variables (see DemoCluster01) or reference an external file (see DemoCluster02).

    To review the cluster properties that you can set within the configuration element, see Cluster configuration file.

    To review the options that you can set for cluster creation, see pcluster create-cluster.

    locals {
      cluster_configs = {
        DemoCluster01 : {
          region : local.config_vars.region
          rollbackOnFailure : false
          validationFailureLevel : "WARNING"
          suppressValidators : [
            "type:KeyPairValidator"
          ]
          configuration : {
            Region : local.config_vars.region
            Image : {
              Os : "alinux2"
            }
            HeadNode : {
              InstanceType : "t3.small"
              Networking : {
                SubnetId : local.config_vars.subnet
              }
              Iam : {
                AdditionalIamPolicies : [
                  {
                    Policy : "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
                  }
                ]
              }
            }
            Scheduling : {
              Scheduler : "slurm"
              SlurmQueues : [{
                Name : "queue1"
                CapacityType : "ONDEMAND"
                Networking : {
                  SubnetIds : [local.config_vars.subnet]
                }
                Iam : {
                  AdditionalIamPolicies : [
                    {
                      Policy : "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
                    }
                  ]
                }
                ComputeResources : [{
                  Name : "compute"
                  InstanceType : "t3.small"
                  MinCount : "1"
                  MaxCount : "4"
                }]
              }]
              SlurmSettings : {
                QueueUpdateStrategy : "TERMINATE"
              }
            }
          }
        }
        DemoCluster02 : {
          configuration : "config/cluster_config.yaml"
        }
      }
    }
  6. Create the file config/clusters.yaml to define multiple clusters as YAML configuration.

    DemoCluster03:
      region: ${region}
      rollbackOnFailure: true
      validationFailureLevel: WARNING
      suppressValidators:
        - type:KeyPairValidator
      configuration: config/cluster_config.yaml
    DemoCluster04:
      region: ${region}
      rollbackOnFailure: false
      configuration: config/cluster_config.yaml
  7. Create the file config/cluster_config.yaml, which is a standard ParallelCluster configuration file where Terraform variables can be injected.

    To review the cluster properties that you can set within the configuration element, see Cluster configuration file.

    Region: ${region}
    Image:
      Os: alinux2
    HeadNode:
      InstanceType: t3.small
      Networking:
        SubnetId: ${subnet}
      Iam:
        AdditionalIamPolicies:
          - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    Scheduling:
      Scheduler: slurm
      SlurmQueues:
        - Name: queue1
          CapacityType: ONDEMAND
          Networking:
            SubnetIds:
              - ${subnet}
          Iam:
            AdditionalIamPolicies:
              - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
          ComputeResources:
            - Name: compute
              InstanceType: t3.small
              MinCount: 1
              MaxCount: 5
      SlurmSettings:
        QueueUpdateStrategy: TERMINATE
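  8. Create the file clusters_vars.tf to define the variables that can be injected into cluster configurations.

    This file lets you define dynamic values that can be used in cluster configurations, such as the Region and the subnet.

    This example retrieves the values directly from the project variables, but you might need custom logic to determine them.

    locals {
      config_vars = {
        subnet = var.subnet_id
        region = var.cluster_region
      }
    }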
  9. Create the file variables.tf to define the variables that can be injected for this project.

    variable "region" {
      description = "The region the ParallelCluster API is deployed in."
      type        = string
      default     = "us-east-1"
    }

    variable "cluster_region" {
      description = "The region the clusters will be deployed in."
      type        = string
      default     = "us-east-1"
    }

    variable "profile" {
      type        = string
      description = "The AWS profile used to deploy the clusters."
      default     = null
    }

    variable "subnet_id" {
      type        = string
      description = "The id of the subnet to be used for the ParallelCluster instances."
    }

    variable "api_stack_name" {
      type        = string
      description = "The name of the CloudFormation stack used to deploy the ParallelCluster API."
      default     = "ParallelCluster"
    }

    variable "api_version" {
      type        = string
      description = "The version of the ParallelCluster API."
    }
  10. Create the file terraform.tfvars to set your own values for the variables.

    The following file deploys the clusters in eu-west-1, in the subnet subnet-123456789, by using the existing ParallelCluster API 3.10.0, which is already deployed in us-east-1 with the stack name MyParallelClusterAPI-310.

    region         = "us-east-1"
    api_stack_name = "MyParallelClusterAPI-310"
    api_version    = "3.10.0"

    cluster_region = "eu-west-1"
    subnet_id      = "subnet-123456789"
  11. Create the file outputs.tf to define the outputs returned by this project.

    output "clusters" {
      value = module.pcluster.clusters
    }

    The project directory is:

    my-clusters
    ├── config
    │   ├── cluster_config.yaml - Cluster configuration, where Terraform variables can be injected.
    │   └── clusters.yaml - File listing all the clusters to deploy.
    ├── clusters.tf - Clusters defined as Terraform local variables.
    ├── clusters_vars.tf - Variables that can be injected into cluster configurations.
    ├── main.tf - Terraform entrypoint where the ParallelCluster module is configured.
    ├── outputs.tf - Defines the cluster as a Terraform output.
    ├── providers.tf - Configures the providers: ParallelCluster and AWS.
    ├── terraform.tf - Import the ParallelCluster provider.
    ├── terraform.tfvars - Defines values for variables, e.g. region, PCAPI stack name.
    └── variables.tf - Defines the variables, e.g. region, PCAPI stack name.
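
Step 8 mentions that you might need custom logic to determine the injected values. The following is a minimal sketch of one such approach, not part of the original project: it picks the subnet from a per-Region map and falls back to subnet_id when the Region has no entry. The variable subnet_ids_by_region is a hypothetical name introduced only for illustration. This block would replace the locals block in clusters_vars.tf, because config_vars can be defined only once.

```hcl
# clusters_vars.tf (alternative sketch, replaces the locals block from step 8)

# Hypothetical variable, not part of the original project:
# maps a Region name to the subnet that clusters in that Region use.
variable "subnet_ids_by_region" {
  type        = map(string)
  description = "Subnet ID to use for clusters, keyed by Region."
  default     = {}
}

locals {
  config_vars = {
    region = var.cluster_region

    # Use the per-Region entry when present; otherwise fall back to
    # the subnet_id variable defined in variables.tf.
    subnet = lookup(var.subnet_ids_by_region, var.cluster_region, var.subnet_id)
  }
}
```

With an empty map (the default), this sketch behaves like the original step 8 file, so it can be adopted without changing terraform.tfvars.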

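Deploy the cluster

To deploy the cluster, run the standard Terraform commands in order.

Note

This example assumes that you have already deployed the ParallelCluster API in your account.

  1. Build the project:

    terraform init
  2. Define the deployment plan:

    terraform plan -out tfplan
  3. Deploy the plan:

    terraform apply tfplan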

Deploy the ParallelCluster API with clusters

If you haven't deployed the ParallelCluster API yet and want to deploy it together with the clusters, change the following files:

  • main.tf

    module "pcluster" {
      source  = "aws-tf/parallelcluster/aws"
      version = "1.0.0"

      region              = var.region
      api_stack_name      = var.api_stack_name
      api_version         = var.api_version
      deploy_pcluster_api = true

      template_vars   = local.config_vars
      cluster_configs = local.cluster_configs
      config_path     = "config/clusters.yaml"
    }
  • providers.tf

    provider "aws-parallelcluster" {
      region   = var.region
      profile  = var.profile
      endpoint = module.pcluster.pcluster_api_stack_outputs.ParallelClusterApiInvokeUrl
      role_arn = module.pcluster.pcluster_api_stack_outputs.ParallelClusterApiUserRole
    }
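
When the API is deployed by the module, it can be convenient to expose its invoke URL as a project output. The following is a small sketch that reuses the module output referenced in providers.tf above. The output name pcluster_api_invoke_url is a hypothetical choice, and the sketch assumes that pcluster_api_stack_outputs is populated only when deploy_pcluster_api is true.

```hcl
# outputs.tf (optional addition, sketch)

# Hypothetical output name. The value is the same module output that
# providers.tf uses as the provider endpoint.
output "pcluster_api_invoke_url" {
  description = "Invoke URL of the ParallelCluster API deployed by the module."
  value       = module.pcluster.pcluster_api_stack_outputs.ParallelClusterApiInvokeUrl
}
```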

Required permissions

You need the following permissions to deploy a cluster with Terraform:

  • Assume the ParallelCluster API role, which is in charge of interacting with the ParallelCluster API.

  • Describe the Amazon CloudFormation stack of the ParallelCluster API, to verify that it exists and to retrieve its parameters and outputs.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Resource": "arn:PARTITION:iam::ACCOUNT:role/PCAPIUserRole-*",
      "Effect": "Allow",
      "Sid": "AssumePCAPIUserRole"
    },
    {
      "Action": [
        "cloudformation:DescribeStacks"
      ],
      "Resource": "arn:PARTITION:cloudformation:REGION:ACCOUNT:stack/*",
      "Effect": "Allow",
      "Sid": "CloudFormation"
    }
  ]
}
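
If you manage IAM with Terraform as well, the same policy can be expressed in HCL. The following is a sketch under stated assumptions, not the documented way to grant these permissions: it fills PARTITION and ACCOUNT from the AWS provider's aws_partition and aws_caller_identity data sources, takes REGION from the project's region variable, and uses a hypothetical policy name. It would usually live in a separate configuration that an administrator applies, because the identity that deploys the clusters needs these permissions beforehand.

```hcl
# Resolve the partition and account ID of the current credentials.
data "aws_partition" "current" {}
data "aws_caller_identity" "current" {}

# Same two statements as the JSON policy above.
data "aws_iam_policy_document" "pcluster_terraform" {
  statement {
    sid     = "AssumePCAPIUserRole"
    effect  = "Allow"
    actions = ["sts:AssumeRole"]
    resources = [
      "arn:${data.aws_partition.current.partition}:iam::${data.aws_caller_identity.current.account_id}:role/PCAPIUserRole-*"
    ]
  }

  statement {
    sid     = "CloudFormation"
    effect  = "Allow"
    actions = ["cloudformation:DescribeStacks"]
    resources = [
      # var.region is the Region where the ParallelCluster API stack is deployed.
      "arn:${data.aws_partition.current.partition}:cloudformation:${var.region}:${data.aws_caller_identity.current.account_id}:stack/*"
    ]
  }
}

resource "aws_iam_policy" "pcluster_terraform" {
  name   = "ParallelClusterTerraformDeploy" # hypothetical name
  policy = data.aws_iam_policy_document.pcluster_terraform.json
}
```

The resulting policy still has to be attached to the user or role that runs Terraform; that attachment is left out here because it depends on how identities are managed in your account.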