Troubleshooting - Amazon SageMaker
AWS 文档中描述的 AWS 服务或功能可能因区域而异。要查看适用于中国区域的差异,请参阅中国的 AWS 服务入门

本文属于机器翻译版本。若本译文内容与英语原文存在差异,则一律以英文原文为准。

Troubleshooting

如果您遇到错误,则可以使用以下列表尝试对训练作业进行故障排除。如果问题仍存在,请联系 AWS Support。

  • 在启用 调试程序(默认情况下为所有 SageMaker TensorFlow 和 PyTorch 作业启用)后,您可能会看到如下所示的错误:

    FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/checkpoints/metadata.json.sagemaker-uploading

    要修复此问题,请在 调试程序 Python 开发工具包中创建框架 debugger_hook_config=False 时传递 estimator 以禁用 SageMaker,如以下示例所示。

    estimator = TensorFlow( role=role, instance_count=1, instance_type=instance_type, debugger_hook_config=False )
  • 在 SageMaker 上保存大型模型的检查点时,您可能会遇到以下错误:

    InternalServerError: We encountered an internal error. Please try again

    在训练期间,这可能是由于将本地检查点上传到 SageMaker 时的 Amazon S3 限制导致的。要在 SageMaker 中禁用检查点操作并按照以下示例明确上传检查点。

    如果您遇到上述错误,请不要将 checkpoint_s3_uri 与 SageMaker estimator 调用结合使用。在为较大的模型保存检查点时,我们建议将检查点保存到自定义目录并将相同的检查点传递到帮助程序函数(作为 local_path 参数)。

    import os def aws_s3_sync(source, destination): """aws s3 sync in quiet mode and time profile""" import time, subprocess cmd = ["aws", "s3", "sync", "--quiet", source, destination] print(f"Syncing files from {source} to {destination}") start_time = time.time() p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE) p.wait() end_time = time.time() print("Time Taken to Sync: ", (end_time-start_time)) return def sync_local_checkpoints_to_s3(local_path="/opt/ml/checkpoints", s3_uri=os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', '')))+'/checkpoints'): """ sample function to sync checkpoints from local path to s3 """ import boto3 #check if local path exists if not os.path.exists(local_path): raise RuntimeError("Provided local path {local_path} does not exist. Please check") #check if s3 bucket exists s3 = boto3.resource('s3') if not s3_uri.startswith("s3://"): raise ValueError(f"Provided s3 uri {s3_uri} is not valid.") s3_bucket = s3_uri.replace('s3://','').split('/')[0] print(f"S3 Bucket: {s3_bucket}") try: s3.meta.client.head_bucket(Bucket=s3_bucket) except Exception as e: raise e aws_s3_sync(local_path, s3_uri) return def sync_s3_checkpoints_to_local(local_path="/opt/ml/checkpoints", s3_uri=os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', '')))+'/checkpoints'): """ sample function to sync checkpoints from s3 to local path """ import boto3 #try to create local path if it does not exist if not os.path.exists(local_path): print(f"Provided local path {local_path} does not exist. Creating...") try: os.makedirs(local_path) except Exception as e: raise RuntimeError(f"Failed to create {local_path}") #check if s3 bucket exists s3 = boto3.resource('s3') if not s3_uri.startswith("s3://"): raise ValueError(f"Provided s3 uri {s3_uri} is not valid.") s3_bucket = s3_uri.replace('s3://','').split('/')[0] print(f"S3 Bucket: {s3_bucket}") try: s3.meta.client.head_bucket(Bucket=s3_bucket) except Exception as e: raise e aws_s3_sync(s3_uri, local_path) return

    帮助程序函数的用法:

    #base_s3_uri - user input s3 uri or save to model directory (default) #curr_host - to save checkpoints of current host #iteration - current step/epoch during which checkpoint is saved # save checkpoints on every node using local_rank if smp.local_rank() == 0: base_s3_uri = os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', ''))) curr_host = os.environ['SM_CURRENT_HOST'] full_s3_uri = f'{base_s3_uri}/checkpoints/{curr_host}/{iteration}' sync_local_checkpoints_to_s3(local_path=checkpoint_dir, s3_uri=full_s3_uri)