Amazon Batch support for SageMaker AI training jobs - Amazon SageMaker AI

Amazon Batch support for SageMaker AI training jobs

An Amazon Batch job queue stores and prioritizes submitted jobs before they run on compute resources. You can submit SageMaker AI training jobs to a job queue to take advantage of the serverless job scheduling and prioritization tools that Amazon Batch provides.

How it works

The following steps describe how to use an Amazon Batch job queue with SageMaker AI training jobs. For more detailed tutorials and example notebooks, see the Get started section.

  • Set up Amazon Batch and any necessary permissions. For more information, see Setting up Amazon Batch in the Amazon Batch User Guide.

  • Create the Amazon Batch resources that your training jobs need, including the job queue, in the console or using the Amazon CLI.

  • Configure the request for your SageMaker AI training job, including details such as the training container image. To submit a training job to an Amazon Batch queue, you can use the Amazon CLI, the Amazon SDK for Python (Boto3), or the SageMaker AI Python SDK.

  • Submit your training jobs to the job queue. You can use the following options to submit jobs:

    • Use the Amazon Batch SubmitServiceJob API.

    • Use the aws_batch module from the SageMaker AI Python SDK. After creating a TrainingQueue object and a model training object (such as an Estimator or ModelTrainer), you can submit training jobs to the TrainingQueue using the queue.submit() method.

  • After submitting jobs, view your job queue and job status with the Amazon Batch console, the Amazon Batch DescribeServiceJob API, or the SageMaker AI DescribeTrainingJob API.
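
The submit and monitor steps above can be sketched with the Amazon SDK for Python (Boto3). This is a minimal sketch under stated assumptions, not a definitive implementation: the payload follows the shape of a SageMaker AI CreateTrainingJob request, the keyword arguments passed to submit_service_job and describe_service_job reflect one reading of the Amazon Batch API and should be checked against the API reference, and the helper names, default instance settings, and terminal status values are hypothetical choices made for illustration.

```python
import json
import time


def build_training_payload(job_name, image_uri, role_arn, output_s3_uri,
                           instance_type="ml.m5.xlarge", instance_count=1,
                           max_runtime_seconds=3600):
    """Return a CreateTrainingJob-style request serialized as a JSON string."""
    request = {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,       # your training container image
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,    # illustrative default
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 30,             # illustrative default
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }
    return json.dumps(request)


def submit_training_job(batch_client, queue_name, job_name, payload):
    """Submit the payload to a Batch job queue and return the Batch job ID.

    Assumption: SubmitServiceJob accepts these keyword arguments and
    returns a response that contains "jobId".
    """
    response = batch_client.submit_service_job(
        jobName=job_name,
        jobQueue=queue_name,
        serviceJobType="SAGEMAKER_TRAINING",
        serviceRequestPayload=payload,
    )
    return response["jobId"]


def wait_for_job(batch_client, job_id, poll_seconds=30,
                 terminal_states=("SUCCEEDED", "FAILED")):
    """Poll DescribeServiceJob until the job reaches a terminal status.

    Assumption: the response carries a top-level "status" field and the
    terminal values are SUCCEEDED and FAILED.
    """
    while True:
        status = batch_client.describe_service_job(jobId=job_id)["status"]
        if status in terminal_states:
            return status
        time.sleep(poll_seconds)
```

To run the sketch against a real account, pass boto3.client("batch") as batch_client along with the name of an existing job queue. Keeping the client as a parameter also lets you exercise the helpers with a stub before you submit real jobs.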

Cost and availability

For detailed pricing information about training jobs, see Amazon SageMaker AI pricing. With Amazon Batch, you pay only for the Amazon resources that you use, such as Amazon EC2 instances. For more information, see Amazon Batch pricing.

You can use Amazon Batch for SageMaker AI training jobs in any Amazon Web Services Region where training jobs are available. For more information, see Amazon SageMaker AI endpoints and quotas.

To make sure that capacity is available when you need it, you can use SageMaker AI Flexible Training Plans (FTP) to reserve capacity for your training jobs. Combined with the queuing capabilities of Amazon Batch, a plan lets you maximize utilization of the reserved capacity for the plan's duration. For more information, see Reserve training plans for your training jobs or HyperPod clusters.

Get started

For a tutorial on how to set up an Amazon Batch job queue and submit SageMaker AI training jobs, see Getting started with Amazon Batch on SageMaker AI in the Amazon Batch User Guide.
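
As a rough preview of what that setup involves, the following Boto3 sketch creates a job queue for training jobs. It rests on assumptions that this page does not state and that you should confirm in the Amazon Batch API reference: that the queue is backed by a Batch service environment of type SAGEMAKER_TRAINING, that capacity is limited in units of NUM_INSTANCES, and that create_service_environment and create_job_queue accept the keyword arguments shown. The helper names are hypothetical.

```python
def service_environment_request(name, max_instances):
    """Build keyword arguments for create_service_environment (assumed shape)."""
    return {
        "serviceEnvironmentName": name,
        "serviceEnvironmentType": "SAGEMAKER_TRAINING",
        "capacityLimits": [
            {"maxCapacity": max_instances, "capacityUnit": "NUM_INSTANCES"}
        ],
    }


def job_queue_request(name, service_environment_arn, priority=1):
    """Build keyword arguments for create_job_queue (assumed shape)."""
    return {
        "jobQueueName": name,
        "jobQueueType": "SAGEMAKER_TRAINING",
        "state": "ENABLED",
        "priority": priority,
        "serviceEnvironmentOrder": [
            {"order": 1, "serviceEnvironment": service_environment_arn}
        ],
    }


def create_training_queue(batch_client, env_name, queue_name, max_instances):
    """Create the service environment, then a queue that points at it.

    Assumption: the responses contain "serviceEnvironmentArn" and
    "jobQueueArn" respectively.
    """
    env = batch_client.create_service_environment(
        **service_environment_request(env_name, max_instances)
    )
    queue = batch_client.create_job_queue(
        **job_queue_request(queue_name, env["serviceEnvironmentArn"])
    )
    return queue["jobQueueArn"]
```

Pass boto3.client("batch") as batch_client to run it against an account that already has the permissions described in Setting up Amazon Batch. The request builders are separate functions so that you can inspect or adjust the arguments before any resource is created.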

For Jupyter notebooks that show how to use the aws_batch module in the SageMaker AI Python SDK, see the Amazon Batch for SageMaker AI Training jobs notebook examples in the amazon-sagemaker-examples GitHub repository.