Troubleshooting Model Parallelism - Amazon SageMaker


Troubleshooting Model Parallelism

If you run into an error, you can use the following list to try to troubleshoot your training job. If the problem persists, contact Amazon Support.

Considerations for Using SageMaker Debugger with SageMaker Distributed Model Parallel

SageMaker Debugger is not available for SageMaker distributed model parallel. Debugger is enabled by default for all SageMaker TensorFlow and PyTorch training jobs, and you might see an error that looks like the following:

FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/checkpoints/metadata.json.sagemaker-uploading

To fix this, disable Debugger by passing debugger_hook_config=False when creating the framework estimator, as shown in the following example.

bucket = sagemaker.Session().default_bucket()
base_job_name = "sagemaker-checkpoint-test"
checkpoint_in_bucket = "checkpoints"

# The S3 URI to store the checkpoints
checkpoint_s3_bucket = "s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)

estimator = TensorFlow(
    ...
    distribution={"smdistributed": {"modelparallel": {"enabled": True}}},
    checkpoint_s3_uri=checkpoint_s3_bucket,
    checkpoint_local_path="/opt/ml/checkpoints",
    debugger_hook_config=False
)

Saving Checkpoints

You might run into the following error when saving checkpoints of a large model on SageMaker:

InternalServerError: We encountered an internal error. Please try again

This could be caused by a SageMaker limitation while uploading the local checkpoint to Amazon S3 during training. To disable checkpointing in SageMaker, use the following example to explicitly upload the checkpoints.

If you run into the preceding error, do not use checkpoint_s3_uri with the SageMaker estimator call. While saving checkpoints for larger models, we recommend saving checkpoints to a custom directory and passing that directory to the helper function (as the local_path argument).

import os

def aws_s3_sync(source, destination):
    """aws s3 sync in quiet mode and time profile"""
    import time, subprocess
    cmd = ["aws", "s3", "sync", "--quiet", source, destination]
    print(f"Syncing files from {source} to {destination}")
    start_time = time.time()
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.wait()
    end_time = time.time()
    print("Time Taken to Sync: ", (end_time - start_time))
    return

def sync_local_checkpoints_to_s3(local_path="/opt/ml/checkpoints",
                                 s3_uri=os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', ''))) + '/checkpoints'):
    """ sample function to sync checkpoints from local path to s3 """
    import boto3
    # check if local path exists
    if not os.path.exists(local_path):
        raise RuntimeError(f"Provided local path {local_path} does not exist. Please check")
    # check if s3 bucket exists
    s3 = boto3.resource('s3')
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"Provided s3 uri {s3_uri} is not valid.")
    s3_bucket = s3_uri.replace('s3://', '').split('/')[0]
    print(f"S3 Bucket: {s3_bucket}")
    try:
        s3.meta.client.head_bucket(Bucket=s3_bucket)
    except Exception as e:
        raise e
    aws_s3_sync(local_path, s3_uri)
    return

def sync_s3_checkpoints_to_local(local_path="/opt/ml/checkpoints",
                                 s3_uri=os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', ''))) + '/checkpoints'):
    """ sample function to sync checkpoints from s3 to local path """
    import boto3
    # try to create local path if it does not exist
    if not os.path.exists(local_path):
        print(f"Provided local path {local_path} does not exist. Creating...")
        try:
            os.makedirs(local_path)
        except Exception as e:
            raise RuntimeError(f"Failed to create {local_path}")
    # check if s3 bucket exists
    s3 = boto3.resource('s3')
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"Provided s3 uri {s3_uri} is not valid.")
    s3_bucket = s3_uri.replace('s3://', '').split('/')[0]
    print(f"S3 Bucket: {s3_bucket}")
    try:
        s3.meta.client.head_bucket(Bucket=s3_bucket)
    except Exception as e:
        raise e
    aws_s3_sync(s3_uri, local_path)
    return

Usage of the helper functions:

# base_s3_uri - user input s3 uri or save to model directory (default)
# curr_host - to save checkpoints of current host
# iteration - current step/epoch during which checkpoint is saved
# save checkpoints on every node using local_rank
if smp.local_rank() == 0:
    base_s3_uri = os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', '')))
    curr_host = os.environ['SM_CURRENT_HOST']
    full_s3_uri = f'{base_s3_uri}/checkpoints/{curr_host}/{iteration}'
    sync_local_checkpoints_to_s3(local_path=checkpoint_dir, s3_uri=full_s3_uri)
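On restart, the mirror-image helper sync_s3_checkpoints_to_local can download the checkpoints before training resumes. Because each checkpoint is saved under an iteration-numbered prefix, the resume side must pick the most recent one. A hedged sketch with a hypothetical helper (latest_checkpoint_uri is not part of SageMaker; the iteration list would come from listing the bucket, e.g. via boto3, and is passed in here to keep the logic self-contained):

```python
def latest_checkpoint_uri(base_s3_uri, curr_host, iterations):
    """Pick the S3 prefix of the most recent checkpoint for this host.

    Mirrors the save-side layout: {base}/checkpoints/{host}/{iteration}.
    """
    latest = max(iterations)
    return f"{base_s3_uri}/checkpoints/{curr_host}/{latest}"

# Hypothetical usage before resuming training:
# sync_s3_checkpoints_to_local(
#     local_path=checkpoint_dir,
#     s3_uri=latest_checkpoint_uri(base_s3_uri, curr_host, [100, 200, 300]))
```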

Convergence Using Model Parallel and TensorFlow

When using SageMaker multi-node training with TensorFlow and distributed model parallel, the loss may not converge as expected, because the order of training input files can differ from node to node. This may cause different ranks in the same model parallel group to work on different input files, causing inconsistencies. To prevent this, ensure that the input files are ordered the same way across all ranks before they are converted to TensorFlow datasets. One way to achieve this is by sorting the input file names in the training script.
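The sorting step can be sketched as follows. This is a minimal, hedged example: the glob pattern and input path are hypothetical, and the dataset construction is shown only as a comment (tf.data.TFRecordDataset is the standard TensorFlow API for reading TFRecord files):

```python
import glob

def ordered_input_files(pattern):
    """Return input file paths in a deterministic (lexicographic) order,
    so that every rank in a model parallel group sees the same sequence."""
    return sorted(glob.glob(pattern))

# The sorted list would then be passed to the dataset constructor, e.g.:
# dataset = tf.data.TFRecordDataset(
#     ordered_input_files("/opt/ml/input/data/train/*.tfrecord"))
```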