Use Managed Spot Training in Amazon SageMaker - Amazon SageMaker
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Use Managed Spot Training in Amazon SageMaker

Amazon SageMaker makes it easy to train machine learning models using managed Amazon EC2 Spot instances. Managed spot training can optimize the cost of training models up to 90% over on-demand instances. SageMaker manages the Spot interruptions on your behalf.

Managed Spot Training uses Amazon EC2 Spot instance to run training jobs instead of on-demand instances. You can specify which training jobs use spot instances and a stopping condition that specifies how long SageMaker waits for a job to run using Amazon EC2 Spot instances. Metrics and logs generated during training runs are available in CloudWatch.

Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, can use managed spot training. For more information on automatic model tuning, see Perform Automatic Model Tuning with SageMaker.

Spot instances can be interrupted, causing jobs to take longer to start or finish. You can configure your managed spot training job to use checkpoints. SageMaker copies checkpoint data from a local path to Amazon S3. When the job is restarted, SageMaker copies the data from Amazon S3 back into the local path. The training job can then resume from the last checkpoint instead of restarting. For more information about checkpointing, see Use checkpoints in Amazon SageMaker.

Note

Unless your training job will complete quickly, we recommend you use checkpointing with managed spot training. SageMaker built-in algorithms and marketplace algorithms that do not checkpoint are currently limited to a MaxWaitTimeInSeconds of 3600 seconds (60 minutes).

Using Managed Spot Training

To use managed spot training, create a training job. Set EnableManagedSpotTraining to True and specify the MaxWaitTimeInSeconds. MaxWaitTimeInSeconds must be larger than MaxRuntimeInSeconds. For more information about creating a training job, see DescribeTrainingJob.

You can calculate the savings from using managed spot training using the formula (1 - (BillableTimeInSeconds / TrainingTimeInSeconds)) * 100. For example, if BillableTimeInSeconds is 100 and TrainingTimeInSeconds is 500, this means that your training job ran for 500 seconds, but you were billed for only 100 seconds. Your savings is (1 - (100 / 500)) * 100 = 80%.

To learn how to run training jobs on Amazon SageMaker spot instances and how managed spot training works and reduces the billable time, see the following example notebooks:

Managed Spot Training Lifecycle

You can monitor a training job using TrainingJobStatus and SecondaryStatus returned by DescribeTrainingJob. The list below shows how TrainingJobStatus and SecondaryStatus values change depending on the training scenario:

  • Spot instances acquired with no interruption during training

    1. InProgress: StartingDownloadingTrainingUploading

  • Spot instances interrupted once. Later, enough spot instances were acquired to finish the training job.

    1. InProgress: StartingDownloadingTrainingInterruptedStartingDownloadingTrainingUploading

  • Spot instances interrupted twice and MaxWaitTimeInSeconds exceeded.

    1. InProgress: StartingDownloadingTrainingInterruptedStartingDownloadingTrainingInterruptedDownloadingTraining

    2. Stopping: Stopping

    3. Stopped: MaxWaitTimeExceeded

  • Spot instances were never launched.

    1. InProgress: Starting

    2. Stopping: Stopping

    3. Stopped: MaxWaitTimeExceeded