Deploy custom fine-tuned models from Amazon S3 and Amazon FSx using kubectl - Amazon SageMaker AI

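The following steps show you how to deploy models stored on Amazon S3 or Amazon FSx to an Amazon SageMaker HyperPod cluster using kubectl.

The following instructions contain code cells and commands designed to run in a Jupyter notebook environment, such as Amazon SageMaker Studio or SageMaker notebook instances. Each code block represents a notebook cell that should be executed sequentially. The interactive elements, including model discovery tables and status monitoring commands, are optimized for the notebook interface and might not function properly in other environments. Before you proceed, make sure that you have access to a notebook environment with the necessary AWS permissions.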

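Prerequisites

Verify that you've set up inference capabilities on your Amazon SageMaker HyperPod cluster. For more information, see Setting up your HyperPod clusters for model deployment.

Setup and configuration

Replace all placeholder values with your actual resource identifiers.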
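
Because several cells in this walkthrough start from placeholders such as <Hyperpod_cluster_name>, a quick check before deploying can catch values that were never filled in. The sketch below is not part of the original procedure: find_unreplaced_placeholders is a hypothetical helper, and the sample values are illustrative assumptions. In particular, region_name and tls_certificate_s3_location are used by later cells but are not defined on this page, so the values shown for them are guesses that you would replace with your own.

```python
import re

# Matches any "<something>" fragment left inside a value
_PLACEHOLDER = re.compile(r"<[^<>\s]+>")

def find_unreplaced_placeholders(settings):
    """Return the names of settings whose values still contain <placeholder> text."""
    return sorted(
        name for name, value in settings.items()
        if isinstance(value, str) and _PLACEHOLDER.search(value)
    )

# Hypothetical example values; substitute the variables from your own notebook.
example_settings = {
    "hyperpod_cluster_name": "<Hyperpod_cluster_name>",    # still a placeholder
    "cluster_namespace": "ns-team-a",
    "region_name": "us-east-2",                            # assumed Region
    "tls_certificate_s3_location": "s3://<bucket>/certs",  # assumed, partly filled in
}

missing = find_unreplaced_placeholders(example_settings)
print(missing)
```

An empty list means that no angle-bracket placeholders remain in the values you passed in.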

  1. Initialize your cluster name. This identifies the HyperPod cluster where your model will be deployed.

    # Specify your HyperPod cluster name here
    hyperpod_cluster_name = "<Hyperpod_cluster_name>"

    # NOTE: For the sample deployment, we use g5.8xlarge for the deepseek-r1 1.5b model,
    # which has sufficient memory and GPU
    instance_type = "ml.g5.8xlarge"
  2. Initialize your cluster namespace. Your cluster admin should have already created a hyperpod-inference service account in your namespace.

    cluster_namespace="<namespace>"
  3. Define a helper method to create YAML files for deployment

    The following helper function generates the Kubernetes YAML configuration file needed to deploy your model. It creates a different YAML structure depending on whether your model is stored on Amazon S3 or Amazon FSx, and handles the storage-specific configuration automatically. In the following sections, you use this function to generate the deployment file for your chosen storage backend.

    import yaml  # PyYAML, used to write the configuration file

    def generate_inferenceendpointconfig_yaml(deployment_name, model_id, namespace, instance_type,
                                              output_file_path, region, tls_certificate_s3_location,
                                              model_location, sagemaker_endpoint_name,
                                              fsxFileSystemId="", isFsx=False, s3_bucket=None):
        """
        Generate an InferenceEndpointConfig YAML file for S3 or FSx storage with the provided parameters.

        Args:
            deployment_name (str): The deployment name
            model_id (str): The model ID
            namespace (str): The namespace
            instance_type (str): The instance type
            output_file_path (str): Path where the YAML file will be saved
            region (str): Region where bucket exists
            tls_certificate_s3_location (str): S3 location for TLS certificate
            model_location (str): Location of the model
            sagemaker_endpoint_name (str): Name of the SageMaker endpoint
            fsxFileSystemId (str): FSx filesystem ID (optional)
            isFsx (bool): Whether to use FSx storage (optional)
            s3_bucket (str): S3 bucket where model exists (optional, only needed when isFsx is False)
        """
        # Create the YAML structure
        model_config = {
            "apiVersion": "inference.sagemaker.aws.amazon.com/v1alpha1",
            "kind": "InferenceEndpointConfig",
            "metadata": {
                "name": deployment_name,
                "namespace": namespace
            },
            "spec": {
                "modelName": model_id,
                "endpointName": sagemaker_endpoint_name,
                "invocationEndpoint": "invocations",
                "instanceType": instance_type,
                "modelSourceConfig": {},
                "worker": {
                    "resources": {
                        "limits": {
                            "nvidia.com/gpu": 1,
                        },
                        "requests": {
                            "nvidia.com/gpu": 1,
                            "cpu": "30000m",
                            "memory": "100Gi"
                        }
                    },
                    "image": "763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0",
                    "modelInvocationPort": {
                        "containerPort": 8080,
                        "name": "http"
                    },
                    "modelVolumeMount": {
                        "name": "model-weights",
                        "mountPath": "/opt/ml/model"
                    },
                    "environmentVariables": [
                        {
                            "name": "HF_MODEL_ID",
                            "value": "/opt/ml/model"
                        },
                        {
                            "name": "SAGEMAKER_PROGRAM",
                            "value": "inference.py",
                        },
                        {
                            "name": "SAGEMAKER_SUBMIT_DIRECTORY",
                            "value": "/opt/ml/model/code",
                        },
                        {
                            "name": "MODEL_CACHE_ROOT",
                            "value": "/opt/ml/model"
                        },
                        {
                            "name": "SAGEMAKER_ENV",
                            "value": "1",
                        }
                    ]
                },
                "tlsConfig": {
                    "tlsCertificateOutputS3Uri": tls_certificate_s3_location,
                }
            },
        }

        # Fill in the storage-specific model source
        if not isFsx:
            if s3_bucket is None:
                raise ValueError("s3_bucket is required when isFsx is False")
            model_config["spec"]["modelSourceConfig"] = {
                "modelSourceType": "s3",
                "s3Storage": {
                    "bucketName": s3_bucket,
                    "region": region,
                },
                "modelLocation": model_location
            }
        else:
            model_config["spec"]["modelSourceConfig"] = {
                "modelSourceType": "fsx",
                "fsxStorage": {
                    "fileSystemId": fsxFileSystemId,
                },
                "modelLocation": model_location
            }

        # Write to YAML file
        with open(output_file_path, 'w') as file:
            yaml.dump(model_config, file, default_flow_style=False)

        print(f"YAML file created successfully at: {output_file_path}")

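To make the storage-specific branch of the helper easier to see in isolation, the following sketch rebuilds only the modelSourceConfig section for each backend and prints it as JSON. build_model_source_config is a hypothetical name introduced here, and the bucket name and file system ID are made-up sample values. The field names mirror those used in the helper function.

```python
import json

def build_model_source_config(model_location, is_fsx=False, s3_bucket=None,
                              region=None, fsx_file_system_id=""):
    """Return the modelSourceConfig mapping for either S3 or FSx storage."""
    if is_fsx:
        return {
            "modelSourceType": "fsx",
            "fsxStorage": {"fileSystemId": fsx_file_system_id},
            "modelLocation": model_location,
        }
    if s3_bucket is None:
        raise ValueError("s3_bucket is required when is_fsx is False")
    return {
        "modelSourceType": "s3",
        "s3Storage": {"bucketName": s3_bucket, "region": region},
        "modelLocation": model_location,
    }

# Sample values only; the bucket and file system ID below do not exist.
s3_source = build_model_source_config(
    "deepseek15b", s3_bucket="example-bucket", region="us-east-2")
fsx_source = build_model_source_config(
    "deepseek-1-5b", is_fsx=True, fsx_file_system_id="fs-0123456789abcdef0")

print(json.dumps(s3_source, indent=2))
print(json.dumps(fsx_source, indent=2))
```

Only one of s3Storage or fsxStorage appears in the result, which is why the S3-only arguments can be omitted when deploying from FSx.
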
Deploy your model from Amazon S3 or Amazon FSx

Stage the model to Amazon S3
  1. Create an Amazon S3 bucket to store your model artifacts. The S3 bucket must be in the same Region as your HyperPod cluster.

    import boto3
    import botocore.exceptions

    s3_client = boto3.client('s3', region_name=region_name, config=boto3_config)
    base_name = "hyperpod-inference-s3-beta"

    def get_account_id():
        sts = boto3.client('sts')
        return sts.get_caller_identity()["Account"]

    account_id = get_account_id()
    s3_bucket = f"{base_name}-{account_id}"

    try:
        if region_name == "us-east-1":
            # us-east-1 is the default location and rejects an explicit LocationConstraint
            s3_client.create_bucket(Bucket=s3_bucket)
        else:
            s3_client.create_bucket(
                Bucket=s3_bucket,
                CreateBucketConfiguration={"LocationConstraint": region_name}
            )
        print(f"Bucket '{s3_bucket}' is created successfully.")
    except botocore.exceptions.ClientError as e:
        error_code = e.response["Error"]["Code"]
        if error_code in ("BucketAlreadyExists", "BucketAlreadyOwnedByYou"):
            print(f"Bucket '{s3_bucket}' already exists. Skipping creation.")
        else:
            raise  # Re-raise unexpected exceptions
  2. Get the deployment YAML to deploy the model from the S3 bucket.

    import os
    from datetime import datetime

    # Get current time in a format suitable for the endpoint name
    current_time = datetime.now().strftime("%Y%m%d%H%M%S")

    model_id = "deepseek15b"  ## Can be a name of your choice
    deployment_name = f"{model_id}-{current_time}"
    model_location = "deepseek15b"  ## This is the folder in your S3 bucket where the model is located
    sagemaker_endpoint_name = f"{model_id}-{current_time}"
    output_file_path = f"inferenceendpointconfig-s3-model-{model_id}.yaml"

    generate_inferenceendpointconfig_yaml(
        deployment_name=deployment_name,
        model_id=model_id,
        model_location=model_location,
        namespace=cluster_namespace,
        instance_type=instance_type,
        output_file_path=output_file_path,
        sagemaker_endpoint_name=sagemaker_endpoint_name,
        s3_bucket=s3_bucket,
        region=region_name,
        tls_certificate_s3_location=tls_certificate_s3_location
    )

    os.environ["INFERENCE_ENDPOINT_CONFIG_YAML_FILE_PATH"] = output_file_path
    os.environ["MODEL_ID"] = model_id
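
The S3 steps create the bucket and point model_location at a folder in it, but they don't show how the model artifacts get there. One possible approach, sketched under assumptions, is to walk a local directory and map each file to an object key under the model_location prefix. plan_model_upload and upload_model are hypothetical helpers, and the local directory is whatever path holds your model files. upload_file is the standard boto3 S3 client method.

```python
import os

def plan_model_upload(local_dir, model_location):
    """Map every file under local_dir to an S3 key below the model_location prefix."""
    plan = []
    for root, _dirs, files in os.walk(local_dir):
        for file_name in files:
            local_path = os.path.join(root, file_name)
            relative = os.path.relpath(local_path, local_dir)
            # S3 keys always use forward slashes, regardless of the local OS
            key = "/".join([model_location.strip("/")] + relative.split(os.sep))
            plan.append((local_path, key))
    # Sort by key so the upload order is predictable
    return sorted(plan, key=lambda item: item[1])

def upload_model(s3_client, s3_bucket, local_dir, model_location):
    """Upload the planned files with an existing boto3 S3 client."""
    for local_path, key in plan_model_upload(local_dir, model_location):
        s3_client.upload_file(local_path, s3_bucket, key)
        print(f"Uploaded {local_path} to s3://{s3_bucket}/{key}")
```

For example, upload_model(s3_client, s3_bucket, "./my-model", model_location) would reuse the client and variables from the S3 cells, where ./my-model is a placeholder path.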
Stage the model to Amazon FSx
  1. (Optional) Create an FSx volume. This step is optional because you might already have an existing FSx file system that uses the same VPC, security group, and subnet ID as your HyperPod cluster.

    import time
    import boto3

    # Initialize the subnet ID and security group for FSx.
    # These should be the same as those of the HyperPod cluster.
    SUBNET_ID = "<HyperPod-subnet-id>"
    SECURITY_GROUP_ID = "<HyperPod-security-group-id>"

    # Configuration
    CONFIG = {
        'SUBNET_ID': SUBNET_ID,
        'SECURITY_GROUP_ID': SECURITY_GROUP_ID,
        'STORAGE_CAPACITY': 1200,
        'DEPLOYMENT_TYPE': 'PERSISTENT_2',
        'THROUGHPUT': 250,
        'COMPRESSION_TYPE': 'LZ4',
        'LUSTRE_VERSION': '2.15'
    }

    JUMPSTART_MODEL_LOCATION_ON_S3 = "s3://jumpstart-cache-prod-us-east-2/deepseek-llm/deepseek-llm-r1-distill-qwen-1-5b/artifacts/inference-prepack/v2.0.0/"

    # Create FSx client
    fsx = boto3.client('fsx')

    # Create FSx for Lustre file system
    response = fsx.create_file_system(
        FileSystemType='LUSTRE',
        FileSystemTypeVersion=CONFIG['LUSTRE_VERSION'],
        StorageCapacity=CONFIG['STORAGE_CAPACITY'],
        SubnetIds=[CONFIG['SUBNET_ID']],
        SecurityGroupIds=[CONFIG['SECURITY_GROUP_ID']],
        LustreConfiguration={
            'DeploymentType': CONFIG['DEPLOYMENT_TYPE'],
            'PerUnitStorageThroughput': CONFIG['THROUGHPUT'],
            'DataCompressionType': CONFIG['COMPRESSION_TYPE'],
        }
    )

    # Get the file system ID
    file_system_id = response['FileSystem']['FileSystemId']
    print(f"Creating FSx filesystem with ID: {file_system_id}")
    print(f"In subnet: {CONFIG['SUBNET_ID']}")
    print(f"With security group: {CONFIG['SECURITY_GROUP_ID']}")

    # Wait for the file system to become available
    while True:
        response = fsx.describe_file_systems(FileSystemIds=[file_system_id])
        status = response['FileSystems'][0]['Lifecycle']
        if status == 'AVAILABLE':
            break
        print(f"Waiting for file system to become available... Current status: {status}")
        time.sleep(30)

    dns_name = response['FileSystems'][0]['DNSName']
    mount_name = response['FileSystems'][0]['LustreConfiguration']['MountName']

    # Print the file system details
    print("\nFile System Details:")
    print(f"File System ID: {file_system_id}")
    print(f"DNS Name: {dns_name}")
    print(f"Mount Name: {mount_name}")
  2. (Optional) Mount FSx and copy data from S3 to FSx. This step is optional because your model data might already exist on the FSx file system. It is required only if you want to copy data from S3 to FSx.

    Note

    If you are using your own FSx file system instead of the one created in the previous step, replace the values of file_system_id, dns_name, and mount_name with those of your file system.

    ## NOTE: Replace the values of file_system_id, dns_name, and mount_name with those of your
    ## own FSx file system if you are not using the one created in the previous step.
    # file_system_id = response['FileSystems'][0]['FileSystemId']
    # dns_name = response['FileSystems'][0]['DNSName']
    # mount_name = response['FileSystems'][0]['LustreConfiguration']['MountName']
    # print(f"File System ID: {file_system_id}")
    # print(f"DNS Name: {dns_name}")
    # print(f"Mount Name: {mount_name}")

    # FSx file system details
    mount_point = f'/mnt/fsx_{file_system_id}'  # The path is /mnt/fsx_ followed by the file system ID
    print(f"Creating mount point at: {mount_point}")

    # Create mount directory if it doesn't exist
    !sudo mkdir -p {mount_point}

    # Mount the FSx Lustre file system
    mount_command = f"sudo mount -t lustre {dns_name}@tcp:/{mount_name} {mount_point}"
    !{mount_command}

    # Verify the mount
    !df -h | grep fsx
    print(f"File system mounted at {mount_point}")

    # Copy the model from S3 to FSx, then list the result
    !sudo chmod 777 {mount_point}
    !aws s3 cp $JUMPSTART_MODEL_LOCATION_ON_S3 $mount_point/deepseek-1-5b --recursive
    !ls $mount_point

    # Unmount and remove the local mount point
    !sudo umount {mount_point}
    !sudo rm -rf {mount_point}
  3. Get the deployment YAML to deploy the model from FSx data.

    import os
    from datetime import datetime

    # Get current time in a format suitable for the endpoint name
    current_time = datetime.now().strftime("%Y%m%d%H%M%S")

    model_id = "deepseek15b"  ## Can be a name of your choice
    deployment_name = f"{model_id}-{current_time}"
    model_location = "deepseek-1-5b"  ## This is the folder on your FSx file system where the model is located
    sagemaker_endpoint_name = f"{model_id}-{current_time}"
    output_file_path = f"inferenceendpointconfig-fsx-model-{model_id}.yaml"

    generate_inferenceendpointconfig_yaml(
        deployment_name=deployment_name,
        model_id=model_id,
        model_location=model_location,
        namespace=cluster_namespace,
        instance_type=instance_type,
        output_file_path=output_file_path,
        region=region_name,
        tls_certificate_s3_location=tls_certificate_s3_location,
        sagemaker_endpoint_name=sagemaker_endpoint_name,
        fsxFileSystemId=file_system_id,
        isFsx=True
    )

    os.environ["INFERENCE_ENDPOINT_CONFIG_YAML_FILE_PATH"] = output_file_path
    os.environ["MODEL_ID"] = model_id
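
The wait loop in the optional FSx creation step polls until the file system reports AVAILABLE, but it would keep polling indefinitely if creation failed. A hedged variant with a timeout and failure detection might look like the following. wait_for_fsx_available is a hypothetical helper, the 30-minute default timeout is an arbitrary assumption, and the status lookup is passed in as a function so that the logic can be exercised without AWS access.

```python
import time

def wait_for_fsx_available(get_lifecycle, timeout_seconds=1800,
                           poll_seconds=30, sleep=time.sleep):
    """Poll get_lifecycle() until it returns 'AVAILABLE'.

    Raises RuntimeError if the file system reports FAILED, and TimeoutError
    if it is still not available once timeout_seconds have elapsed.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_lifecycle()
        if status == "AVAILABLE":
            return status
        if status == "FAILED":
            raise RuntimeError("FSx file system creation failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Timed out waiting for FSx; last status: {status}")
        print(f"Waiting for file system to become available... Current status: {status}")
        sleep(poll_seconds)

# Simulated lifecycle sequence standing in for fsx.describe_file_systems
statuses = iter(["CREATING", "CREATING", "AVAILABLE"])
result = wait_for_fsx_available(lambda: next(statuses), sleep=lambda seconds: None)
print(result)
```

With a real client, the callable could be lambda: fsx.describe_file_systems(FileSystemIds=[file_system_id])['FileSystems'][0]['Lifecycle'].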
Deploy the model to your cluster
  1. Get the Amazon EKS cluster name from the HyperPod cluster ARN for kubectl authentication.

    cluster_arn = !aws sagemaker describe-cluster --cluster-name $hyperpod_cluster_name --query "Orchestrator.Eks.ClusterArn" --region $region_name
    cluster_name = cluster_arn[0].strip('"').split('/')[-1]
    print(cluster_name)
  2. Configure kubectl to authenticate with the HyperPod EKS cluster using AWS credentials.

    !aws eks update-kubeconfig --name $cluster_name --region $region_name
  3. Deploy your InferenceEndpointConfig model.

    !kubectl apply -f $INFERENCE_ENDPOINT_CONFIG_YAML_FILE_PATH
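
Step 1 of this procedure extracts the EKS cluster name by splitting the ARN on the / character. The same parsing can be sketched as a small function that also validates the shape of the ARN before using it. eks_cluster_name_from_arn is a hypothetical helper, and the sample ARN uses a made-up account ID and cluster name.

```python
def eks_cluster_name_from_arn(cluster_arn):
    """Extract the cluster name from an EKS cluster ARN.

    Accepts raw CLI output, which may still carry surrounding double quotes.
    """
    arn = cluster_arn.strip().strip('"')
    # ARN layout: arn:partition:service:region:account-id:resource
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "eks":
        raise ValueError(f"Not an EKS ARN: {cluster_arn!r}")
    resource_type, _, name = parts[5].partition("/")
    if resource_type != "cluster" or not name:
        raise ValueError(f"Not an EKS cluster ARN: {cluster_arn!r}")
    return name

# Made-up ARN for illustration only
sample_arn = '"arn:aws:eks:us-east-2:111122223333:cluster/my-hyperpod-eks"'
cluster_name_example = eks_cluster_name_from_arn(sample_arn)
print(cluster_name_example)
```

Failing early on an unexpected value makes it easier to notice when describe-cluster returned an error message or an empty result instead of an ARN.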

Verify your deployment status

  1. Check if the model deployed successfully.

    !kubectl describe InferenceEndpointConfig $deployment_name -n $cluster_namespace

    The command returns output similar to the following:

    Name:                             deepseek15b-20250624043033
    Reason:                           NativeDeploymentObjectFound
    Status:
      Conditions:
        Last Transition Time:  2025-07-10T18:39:51Z
        Message:               Deployment, ALB Creation or SageMaker endpoint registration creation for model is in progress
        Reason:                InProgress
        Status:                True
        Type:                  DeploymentInProgress
        Last Transition Time:  2025-07-10T18:47:26Z
        Message:               Deployment and SageMaker endpoint registration for model have been created successfully
        Reason:                Success
        Status:                True
        Type:                  DeploymentComplete
  2. Check if the endpoint was created successfully.

    !kubectl describe SageMakerEndpointRegistration $sagemaker_endpoint_name -n $cluster_namespace

    The command returns output similar to the following:

    Name:         deepseek15b-20250624043033
    Namespace:    ns-team-a
    Kind:         SageMakerEndpointRegistration
    
    Status:
      Conditions:
        Last Transition Time:  2025-06-24T04:33:42Z
        Message:               Endpoint created.
        Status:                True
        Type:                  EndpointCreated
        State:                 CreationCompleted
  3. Test the deployed endpoint to verify that it works correctly. This step confirms that your model is successfully deployed and can process inference requests.

    import json
    import boto3

    # Build the request body with json.dumps so that the payload is valid JSON
    prompt = json.dumps({"inputs": "How tall is Mt Everest?"})

    runtime_client = boto3.client('sagemaker-runtime', region_name=region_name, config=boto3_config)
    response = runtime_client.invoke_endpoint(
        EndpointName=sagemaker_endpoint_name,
        ContentType="application/json",
        Body=prompt
    )
    print(response["Body"].read().decode())

    The response is similar to the following:

    [{"generated_text":"As of the last update in July 2024, Mount Everest stands at a height of **8,850 meters** (29,029 feet) above sea level. The exact elevation can vary slightly due to changes caused by tectonic activity and the melting of ice sheets."}]
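
The describe output in step 1 of this section reports progress through status conditions such as DeploymentInProgress and DeploymentComplete. For scripted checks, the same information could be read from the resource as JSON, for example from kubectl get with the -o json option. The sketch below assumes that the conditions appear under status.conditions with type and status fields, which is the usual Kubernetes convention but is not confirmed by this page. is_deployment_complete is a hypothetical helper.

```python
import json

def is_deployment_complete(resource):
    """Return True if the resource has a DeploymentComplete condition whose status is True."""
    conditions = resource.get("status", {}).get("conditions", [])
    return any(
        condition.get("type") == "DeploymentComplete"
        and str(condition.get("status")).lower() == "true"
        for condition in conditions
    )

# Sample document shaped like the describe output (assumed JSON layout)
sample = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "DeploymentInProgress", "status": "True", "reason": "InProgress"},
      {"type": "DeploymentComplete", "status": "True", "reason": "Success"}
    ]
  }
}
""")

complete = is_deployment_complete(sample)
print(complete)
```

A check like this can gate the endpoint test so that invoke_endpoint runs only after the deployment has finished.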

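Manage your deployment

When you finish testing your deployment, use the following commands to clean up your resources.

Note

Before you proceed, verify that you no longer need the deployed model or the stored data.

Clean up your resources
  1. Delete the inference deployment and associated Kubernetes resources. This stops the running model containers and removes the SageMaker endpoint.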

    !kubectl delete inferenceendpointconfig.inference.sagemaker.aws.amazon.com/$deployment_name -n $cluster_namespace
  2. (Optional) Delete the FSx volume.

    try:
        # Delete the file system
        response = fsx.delete_file_system(
            FileSystemId=file_system_id
        )
        print(f"Deleting FSx filesystem: {file_system_id}")

        # Optional: Wait for deletion to complete
        while True:
            try:
                response = fsx.describe_file_systems(FileSystemIds=[file_system_id])
                status = response['FileSystems'][0]['Lifecycle']
                print(f"Current status: {status}")
                time.sleep(30)
            except fsx.exceptions.FileSystemNotFound:
                print("File system deleted successfully")
                break

    except Exception as e:
        print(f"Error deleting file system: {str(e)}")
  3. Verify that the cleanup completed successfully.

    # Check that Kubernetes resources are removed
    !kubectl get pods,svc,deployment,InferenceEndpointConfig,sagemakerendpointregistration -n $cluster_namespace

    # Verify SageMaker endpoint is deleted (should return error or empty)
    !aws sagemaker describe-endpoint --endpoint-name $sagemaker_endpoint_name --region $region_name
Troubleshooting
  1. Check the Kubernetes deployment status.

    !kubectl describe deployment $deployment_name -n $cluster_namespace
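  2. Check the InferenceEndpointConfig status to see the high-level deployment state and any configuration issues.

    !kubectl describe InferenceEndpointConfig $deployment_name -n $cluster_namespace
  3. Check the status of all Kubernetes objects. This gives you a comprehensive view of all related Kubernetes resources in your namespace, so that you can quickly see what is running and what might be missing.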

    !kubectl get pods,svc,deployment,InferenceEndpointConfig,sagemakerendpointregistration -n $cluster_namespace
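
The ! prefix used in these cells works only in a notebook. Outside one, the same checks could be run through subprocess. The sketch below builds the argument list for the resource overview command and runs it only if kubectl is found on the PATH. kubectl_get_args and run_kubectl are hypothetical helpers, and ns-team-a is the sample namespace that appears in the example output earlier on this page.

```python
import shutil
import subprocess

RESOURCE_KINDS = [
    "pods", "svc", "deployment",
    "InferenceEndpointConfig", "sagemakerendpointregistration",
]

def kubectl_get_args(namespace, kinds=RESOURCE_KINDS):
    """Build the kubectl argument list for listing the given kinds in a namespace."""
    return ["kubectl", "get", ",".join(kinds), "-n", namespace]

def run_kubectl(args):
    """Run the command and return its standard output, or None if kubectl is not installed."""
    if shutil.which(args[0]) is None:
        return None
    completed = subprocess.run(args, capture_output=True, text=True, check=False)
    return completed.stdout

args = kubectl_get_args("ns-team-a")
print(" ".join(args))
```

Passing the arguments as a list avoids shell quoting issues when a namespace or resource name contains unusual characters.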