
Distributed computing with SageMaker best practices

This best practices page presents various flavors of distributed computing for machine learning (ML) jobs in general. The term distributed computing in this page encompasses distributed training for ML tasks and parallel computing for data processing, data generation, feature engineering, and reinforcement learning. This page discusses common challenges in distributed computing and the options available in SageMaker Training and SageMaker Processing. For additional reading materials about distributed computing, see What Is Distributed Computing?

You can configure ML tasks to run in a distributed manner across multiple nodes (instances), accelerators (NVIDIA GPUs, Amazon Trainium chips), and vCPU cores. By running distributed computation, you can achieve a variety of goals such as computing operations faster, handling large datasets, or training large ML models.

The following list covers common challenges that you might face when you run an ML training job at scale.

  • You need to decide how to distribute computation depending on the ML task, the software libraries you want to use, and the compute resources available.

  • Not all ML tasks are straightforward to distribute. Also, not all ML libraries support distributed computation.

  • Distributed computation does not always result in a linear increase in compute efficiency. In particular, you need to identify whether data I/O and inter-GPU communication create bottlenecks or add overhead.

  • Distributed computation might disturb numerical processes and change model accuracy. Specifically, in data-parallel neural network training, when you change the global batch size while scaling up to a larger compute cluster, you also need to adjust the learning rate accordingly.

SageMaker provides distributed training solutions to ease such challenges for various use cases. Choose one of the following options that best fits your use case.

Option 1: Use a SageMaker built-in algorithm that supports distributed training

SageMaker provides built-in algorithms that you can use out of the box through the SageMaker console or the SageMaker Python SDK. With the built-in algorithms, you don’t need to spend time on code customization, understanding the science behind the models, or running Docker on provisioned Amazon EC2 instances.

A subset of the SageMaker built-in algorithms supports distributed training. To check whether the algorithm of your choice supports distributed training, see the Parallelizable column in the Common Information About Built-in Algorithms table. Some of the algorithms support multi-instance distributed training, while the rest of the parallelizable algorithms support parallelization across multiple GPUs in a single instance.
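
For example, the following sketch uses the SageMaker Python SDK to launch a parallelizable built-in algorithm (XGBoost here) on two instances. The S3 paths, IAM role, and hyperparameter values are placeholders.

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# A minimal sketch: distributed training with a built-in algorithm by setting
# instance_count greater than 1. Paths and the IAM role are placeholders.
session = sagemaker.Session()
image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    instance_count=2,                # train across two instances
    instance_type="ml.m5.2xlarge",
    output_path="s3://amzn-s3-demo-bucket/output",  # placeholder
    hyperparameters={"objective": "reg:squarederror", "num_round": "100"},
    sagemaker_session=session,
)

estimator.fit({"train": TrainingInput("s3://amzn-s3-demo-bucket/train", content_type="text/csv")})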

Option 2: Run custom ML code in the SageMaker managed training or processing environment

SageMaker jobs can instantiate a distributed training environment for specific use cases and frameworks. This environment acts as a ready-to-use whiteboard, where you can bring and run your own ML code.

If your ML code uses a deep learning framework

You can launch distributed training jobs using the Deep Learning Containers (DLC) for SageMaker Training, which you can orchestrate either through the dedicated Python modules in the SageMaker Python SDK, or through the SageMaker APIs with the Amazon CLI or the Amazon SDK for Python (Boto3). SageMaker provides training containers for machine learning frameworks, including PyTorch, TensorFlow, Hugging Face Transformers, and Apache MXNet. You have two options for writing deep learning code for distributed training.

  • The SageMaker distributed training libraries

    The SageMaker distributed training libraries provide Amazon-managed code for neural network data parallelism and model parallelism. SageMaker distributed training also comes with launcher clients built into the SageMaker Python SDK, so you don’t need to author parallel launch code. To learn more, see SageMaker's data parallelism library and SageMaker's model parallelism library.

  • Open-source distributed training libraries

    Open-source frameworks have their own distribution mechanisms, such as DistributedDataParallel (DDP) in PyTorch or the tf.distribute modules in TensorFlow. You can choose to run these distributed training frameworks in the SageMaker-managed framework containers. For example, the sample code for training MaskRCNN in SageMaker shows how to use both PyTorch DDP in the SageMaker PyTorch framework container and Horovod in the SageMaker TensorFlow framework container.
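
For either option, you typically select the parallelism strategy through the distribution parameter of the framework estimator in the SageMaker Python SDK. The following is a minimal sketch assuming a PyTorch entry point script named train.py and a placeholder IAM role; the framework versions and instance types that each distribution option supports vary, so check the SDK documentation for valid combinations.

from sagemaker.pytorch import PyTorch

# A minimal sketch: launch a PyTorch training job with the SageMaker data
# parallelism library enabled. "train.py" and the IAM role are placeholders.
estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    framework_version="2.0.0",
    py_version="py310",
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    # Option A: SageMaker's data parallelism library
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    # Option B (alternative): native PyTorch DDP launched with torchrun
    # distribution={"torch_distributed": {"enabled": True}},
)

estimator.fit("s3://amzn-s3-demo-bucket/training-data")  # placeholder S3 path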

SageMaker ML containers also come with MPI preinstalled, so you can parallelize your entry point script using mpi4py. Using the MPI-integrated training containers is a great option when you launch a third-party distributed training launcher or write ad hoc parallel code in the SageMaker managed training environment.
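
As an illustration, the following sketch shows a minimal mpi4py entry point that shards work across ranks. You would launch it by passing an MPI configuration such as distribution={"mpi": {"enabled": True, "processes_per_host": 8}} to the framework estimator; the work items and the sharding logic here are purely illustrative placeholders.

from mpi4py import MPI

# A minimal, illustrative entry point that shards a list of work items
# across MPI ranks. The items themselves are placeholders.
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
world_size = comm.Get_size()

work_items = [f"item-{i}" for i in range(100)]   # placeholder workload
my_items = work_items[rank::world_size]          # round-robin sharding

print(f"Rank {rank}/{world_size} processing {len(my_items)} items")

# Optionally gather results on rank 0 for aggregation or checkpointing.
results = comm.gather(len(my_items), root=0)
if rank == 0:
    print(f"Total items processed: {sum(results)}")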

Notes for data-parallel neural network training on GPUs

  • Scale to multi-GPU and multi-machine parallelism when appropriate

    Neural network training jobs often run on multiple-CPU or multiple-GPU instances. Each GPU-based instance usually contains multiple GPU devices. Consequently, distributed GPU computing can happen either within a single GPU instance with multiple GPUs (single-node multi-GPU training), or across multiple GPU instances with multiple GPU cores in each (multi-node multi-GPU training). Single-instance training is easier to code and debug, and intra-node GPU-to-GPU throughput is usually faster than inter-node GPU-to-GPU throughput. Therefore, it is a good idea to scale data parallelism vertically first (use one GPU instance with multiple GPUs) and expand to multiple GPU instances if needed. This might not apply to cases where the CPU budget is high (for example, a massive workload for data pre-processing) or where the CPU-to-GPU ratio of a multi-GPU instance is too low. In all cases, you need to experiment with different combinations of instance types based on your own ML training needs and workload.

  • Monitor the quality of convergence

    When training a neural network with data parallelism, increasing the number of GPUs while keeping the mini-batch size per GPU constant increases the size of the global mini-batch for the mini-batch stochastic gradient descent (MSGD) process. The size of the mini-batches for MSGD is known to impact the descent noise and convergence. To scale up properly while preserving accuracy, you need to adjust other hyperparameters, such as the learning rate; for example, the linear scaling rule multiplies the base learning rate by the factor by which you increase the global batch size [Goyal et al. (2017)].

  • Monitor I/O bottlenecks

    As you increase the number of GPUs, the throughput for reading and writing storage should also increase. Make sure that your data source and pipeline don’t become bottlenecks.

  • Modify your training script as needed

    Training scripts written for single-GPU training must be modified for multi-node multi-GPU training. In most data parallelism libraries, you must modify your script to do the following (see the sketch after this list).

    • Assign batches of training data to each GPU.

    • Use an optimizer that can deal with gradient computation and parameter updates across multiple GPUs.

    • Assign responsibility of checkpointing to a specific host and GPU.
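
The following is a minimal sketch of those modifications using open-source PyTorch DistributedDataParallel (DDP). It assumes that the launcher (for example, torchrun or the SageMaker distribution configuration) sets the usual RANK, LOCAL_RANK, and WORLD_SIZE environment variables; the model, dataset, and hyperparameter values are placeholders.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Assumes RANK, LOCAL_RANK, and WORLD_SIZE are set by the launcher.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Placeholder model and dataset.
model = torch.nn.Linear(32, 1).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))

# 1. Assign batches of training data to each GPU with a DistributedSampler.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

# 2. Adjust hyperparameters for the larger global batch, for example by
#    scaling the learning rate with the world size (linear scaling rule).
base_lr = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr * dist.get_world_size())
loss_fn = torch.nn.MSELoss()

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle differently every epoch
    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()  # DDP averages gradients across GPUs
        optimizer.step()

# 3. Checkpoint from a single, designated process to avoid write conflicts.
#    /opt/ml/model is the SageMaker model output directory.
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), "/opt/ml/model/model.pt")

dist.destroy_process_group()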

If your ML code involves tabular data processing

PySpark is a Python frontend of Apache Spark, which is an open-source distributed computing framework. PySpark has been widely adopted for distributed tabular data processing for large-scale production workloads. If you want to run tabular data processing code, consider using the SageMaker Processing PySpark containers and running parallel jobs. You can also run data processing jobs in parallel using SageMaker Training and SageMaker Processing APIs in Amazon SageMaker Studio Classic, which is integrated with Amazon EMR and Amazon Glue.
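
For example, the following sketch uses the SageMaker Python SDK to run a PySpark script as a two-node SageMaker Processing job. The script name, S3 paths, IAM role, and Spark container version are placeholders; check the SDK documentation for the versions available in your Region.

from sagemaker.spark.processing import PySparkProcessor

# A minimal sketch: run a PySpark script on a two-node Spark cluster managed
# by SageMaker Processing. Paths and the IAM role are placeholders.
spark_processor = PySparkProcessor(
    base_job_name="spark-preprocess",
    framework_version="3.1",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    instance_count=2,
    instance_type="ml.m5.xlarge",
)

spark_processor.run(
    submit_app="preprocess.py",  # placeholder PySpark script
    arguments=[
        "--input", "s3://amzn-s3-demo-bucket/raw/",
        "--output", "s3://amzn-s3-demo-bucket/processed/",
    ],
)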

Option 3: Write your own custom distributed training code

When you submit a training or processing job to SageMaker, the SageMaker Training and SageMaker Processing APIs launch Amazon EC2 compute instances. You can customize the training and processing environment on the instances by running your own Docker container or by installing additional libraries in the Amazon managed containers. For more information about Docker with SageMaker Training, see Adapting your own Docker container to work with SageMaker and Create a container with your own algorithms and models. For more information about Docker with SageMaker Processing, see Use Your Own Processing Code.
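
For example, after you push a custom image to Amazon ECR, you might point a generic estimator at it, as in the following sketch. The image URI, IAM role, and S3 path are placeholders.

from sagemaker.estimator import Estimator

# A minimal sketch: run your own Docker image as a two-node training job.
# The ECR image URI and IAM role are placeholders.
estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-west-2.amazonaws.com/my-training-image:latest",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_count=2,
    instance_type="ml.g5.12xlarge",
)

estimator.fit("s3://amzn-s3-demo-bucket/training-data")  # placeholder S3 path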

Every SageMaker training job environment contains a configuration file at /opt/ml/input/config/resourceconfig.json, and every SageMaker processing job environment contains a similar configuration file at /opt/ml/config/resourceconfig.json. Your code can read this file to find hostnames and establish inter-node communications. To learn more, including the schema of the JSON file, see Distributed Training Configuration and How Amazon SageMaker Processing Configures Your Processing Container. You can also install and use third-party distributed computing libraries such as Ray or DeepSpeed in SageMaker.
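
For example, a training entry point might read this file as in the following sketch; the keys shown (current_host, hosts) follow the training job schema linked above, and the leader-election convention is only an illustration.

import json

# Read the SageMaker training resource configuration to discover the cluster
# topology. (For Processing jobs, the file is at /opt/ml/config/resourceconfig.json.)
with open("/opt/ml/input/config/resourceconfig.json") as f:
    resource_config = json.load(f)

current_host = resource_config["current_host"]   # for example, "algo-1"
hosts = resource_config["hosts"]                 # for example, ["algo-1", "algo-2"]

# A common convention: treat the first host in sorted order as the coordinator.
is_leader = current_host == sorted(hosts)[0]
print(f"{current_host} is leader: {is_leader}, cluster size: {len(hosts)}")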

You can also use SageMaker Training and SageMaker Processing to run custom distributed computations that do not require inter-worker communication. In the computing literature, those tasks are often described as embarrassingly parallel or share-nothing. Examples include parallel processing of data files, training models in parallel on different configurations, or running batch inference on a collection of records. You can trivially parallelize such share-nothing use cases with Amazon SageMaker. When you launch a SageMaker Training or SageMaker Processing job on a cluster with multiple nodes, SageMaker by default replicates and launches your training code (in Python or Docker) on all the nodes. For tasks that require spreading the input data across those nodes, set S3DataDistributionType=ShardedByS3Key in the data input configuration of the SageMaker TrainingInput API.
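
The following sketch shows how that might look in the SageMaker Python SDK, where the distribution parameter of TrainingInput maps to S3DataDistributionType. The image URI, IAM role, and S3 paths are placeholders.

from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# A minimal sketch: shard the S3 objects under a prefix across training nodes
# instead of replicating the full dataset to every node. The image URI, role,
# and paths are placeholders.
estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-west-2.amazonaws.com/my-parallel-job:latest",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_count=4,
    instance_type="ml.c5.2xlarge",
)

sharded_input = TrainingInput(
    s3_data="s3://amzn-s3-demo-bucket/files-to-process/",
    distribution="ShardedByS3Key",  # each node receives a disjoint subset of objects
)

estimator.fit({"train": sharded_input})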

Option 4: Launch multiple jobs in parallel or sequentially

You can also distribute an ML compute workflow into smaller parallel or sequential compute tasks, each represented by its own SageMaker Training or SageMaker Processing job. Splitting a task into multiple jobs can be beneficial for the following situations or tasks:

  • When you have specific data channels and metadata entries (such as hyperparameters, model configuration, or instance types) for each sub-task.

  • When you implement retry steps at a sub-task level.

  • When you vary the configuration of the sub-tasks over the course of the workload, such as when training on increasing batch sizes.

  • When you need to run an ML task that takes longer than the maximum runtime allowed for a single training job (28 days).

  • When different steps of a compute workflow require different instance types.

For the specific case of hyperparameter search, use SageMaker Automated Model Tuning. SageMaker Automated Model Tuning is a serverless parameter search orchestrator that launches multiple training jobs on your behalf, according to a search strategy that can be random, Bayesian, or Hyperband.
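
For example, a tuning job might be configured as in the following sketch. The image URI, IAM role, objective metric, regular expression, and hyperparameter ranges are placeholders and must match what your training code actually emits and accepts.

from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# A minimal sketch: let SageMaker Automated Model Tuning launch and manage
# multiple training jobs on your behalf. All values shown are placeholders.
estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-west-2.amazonaws.com/my-training-image:latest",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.g5.2xlarge",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:loss",
    objective_type="Minimize",
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-5, 1e-2),
        "batch_size": IntegerParameter(32, 256),
    },
    metric_definitions=[{"Name": "validation:loss", "Regex": "val_loss=([0-9\\.]+)"}],
    strategy="Bayesian",
    max_jobs=20,
    max_parallel_jobs=4,
)

tuner.fit({"train": "s3://amzn-s3-demo-bucket/training-data"})  # placeholder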

To orchestrate multiple training jobs, you can also consider workflow orchestration tools such as SageMaker Pipelines, Amazon Step Functions, and Apache Airflow supported by Amazon Managed Workflows for Apache Airflow (MWAA), as well as SageMaker Workflows.
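
As a brief illustration, the following sketch chains two training jobs into a SageMaker Pipeline so that the second step runs only after the first completes. The step names, image URIs, and IAM role are placeholders, and newer versions of the SageMaker Python SDK may prefer passing step_args to TrainingStep instead of an estimator.

from sagemaker.estimator import Estimator
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

# A minimal sketch: run two training jobs sequentially in one pipeline.
# The image URIs and IAM role are placeholders.
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"

def make_estimator(image_tag, instance_type):
    # Illustrative helper that builds a bare-bones estimator.
    return Estimator(
        image_uri=f"111122223333.dkr.ecr.us-west-2.amazonaws.com/{image_tag}",
        role=role,
        instance_count=1,
        instance_type=instance_type,
    )

step_a = TrainingStep(name="PrepareData", estimator=make_estimator("prep:latest", "ml.m5.xlarge"))
step_b = TrainingStep(
    name="TrainModel",
    estimator=make_estimator("train:latest", "ml.g5.2xlarge"),
    depends_on=[step_a],  # enforce sequential ordering
)

pipeline = Pipeline(name="MultiJobWorkflowExample", steps=[step_a, step_b])
pipeline.upsert(role_arn=role)
pipeline.start()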