
Amazon SageMaker distributed data parallelism library FAQ

Use the following to find answers to commonly asked questions about the SMDDP library.

Q: When using the library, how are the allreduce-supporting CPU instances managed? Do I have to create heterogeneous CPU-GPU clusters, or does the SageMaker service create extra C5s for jobs that use the SMDDP library?

The SMDDP library supports only GPU instances, specifically P4d and P4de instances with NVIDIA A100 GPUs and EFA. No additional C5 or CPU instances are launched; if your SageMaker training job runs on an 8-node P4d cluster, only 8 ml.p4d.24xlarge instances are used, and no additional instances are provisioned.

Q: I have a training job that takes 5 days on a single ml.p3.24xlarge instance with a set of hyperparameters H1 (learning rate, batch size, optimizer, and so on). Is using the SageMaker data parallelism library with a cluster five times larger enough to achieve an approximate fivefold speedup? Or do I have to revisit the training hyperparameters after activating the SMDDP library?

The library changes the overall batch size: the new overall batch size scales linearly with the number of training instances used. As a result, hyperparameters such as the learning rate have to be adjusted to ensure convergence, as illustrated in the sketch below.
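For example, a common heuristic is the linear scaling rule: multiply the learning rate tuned for a single worker by the number of workers. The following is a minimal sketch of that heuristic in PyTorch; the base_lr value is a placeholder, and the rule is a starting point rather than a guarantee of convergence.

```python
import torch.distributed as dist

# Assumes the process group has already been initialized
# (for example, with the smddp backend).
base_lr = 0.001                      # placeholder: LR tuned on one instance
world_size = dist.get_world_size()   # total number of workers in the cluster

# Linear scaling rule: the overall batch size grows with the number of
# workers, so scale the learning rate proportionally.
scaled_lr = base_lr * world_size
```

In practice, a learning rate warmup period is often combined with this rule at large scale.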

Q: Does the SMDDP library support managed spot training?

Yes. You can use managed spot training. You specify the path to the checkpoint file in the SageMaker training job, and you save and restore checkpoints in your training script as described in the last steps of Use the SMDDP library in your TensorFlow training script (deprecated) and Use the SMDDP library in your PyTorch training script.
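As an illustration, the following sketch configures a managed spot training job with checkpointing through the SageMaker Python SDK; the entry point, IAM role, S3 paths, and version numbers are placeholders for your own resources.

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",              # placeholder training script
    role="<your-IAM-role-ARN>",
    instance_type="ml.p4d.24xlarge",
    instance_count=2,
    framework_version="1.12.0",
    py_version="py38",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    use_spot_instances=True,             # enable managed spot training
    max_run=72000,                       # max training time, in seconds
    max_wait=86400,                      # max time to wait for spot capacity
    # SageMaker syncs the local checkpoint path (/opt/ml/checkpoints by
    # default) with this S3 URI, so an interrupted job can resume.
    checkpoint_s3_uri="s3://<your-bucket>/checkpoints/",
)
```

Your training script then writes checkpoints to the local checkpoint path and, at startup, loads the latest checkpoint if one exists.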

Q: Is the SMDDP library relevant in a single-host, multi-device setup?

The library can be used in single-host, multi-device training, but it offers performance improvements only in multi-host training.

Q: Where should the training dataset be stored?

The training dataset can be stored in an Amazon S3 bucket or in an Amazon FSx for Lustre file system. For the input file systems supported for a training job, see the SageMaker documentation.

Q: When using the SMDDP library, is it mandatory to have training data in FSx for Lustre? Can Amazon EFS and Amazon S3 be used?

We generally recommend you use Amazon FSx because of its lower latency and higher throughput. If you prefer, you can use Amazon EFS or Amazon S3.
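If you choose Amazon FSx for Lustre, a training job can mount it as a data channel through the SageMaker Python SDK. The following is a sketch; the file system ID and directory path are placeholders, and the job must run in a VPC with access to the file system.

```python
from sagemaker.inputs import FileSystemInput

train_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",   # placeholder FSx file system ID
    file_system_type="FSxLustre",
    directory_path="/<mount-name>/train",    # placeholder path in the file system
    file_system_access_mode="ro",            # read-only suffices for training data
)

# Pass the channel to your estimator:
# estimator.fit({"train": train_input})
```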

Q: Can the library be used with CPU nodes?

No. To find instance types supported by the SMDDP library, see Supported instance types.

Q: What frameworks and framework versions does the SMDDP library currently support?

The SMDDP library currently supports PyTorch v1.6.0 or later and TensorFlow v2.3.0 or later. It doesn't support TensorFlow 1.x. For more information about which version of the SMDDP library is packaged within Amazon Deep Learning Containers, see Release Notes for Deep Learning Containers.

Q: Does the library support AMP?

Yes, the SMDDP library supports automatic mixed precision (AMP) out of the box. No extra action is needed to use AMP other than the framework-level modifications to your training script. If gradients are in FP16, the SageMaker data parallelism library runs its AllReduce operation in FP16. For more information about implementing AMP APIs in your training script, see the AMP documentation for your framework.
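For example, in PyTorch a training step with the native torch.cuda.amp API looks like the following minimal single-step sketch; the model, data, and hyperparameters are placeholders, and with SMDDP the same pattern applies after wrapping the model in DistributedDataParallel.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Placeholder model, optimizer, and data for illustration.
model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = GradScaler()

inputs = torch.randn(32, 128, device="cuda")
targets = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
with autocast():                           # forward pass runs in mixed precision
    loss = loss_fn(model(inputs), targets)
scaler.scale(loss).backward()              # backward produces scaled gradients
scaler.step(optimizer)                     # unscales gradients, then steps
scaler.update()
```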

Q: How do I identify whether my distributed training job is slowed down by an I/O bottleneck?

With a larger cluster, the training job requires more I/O throughput, so the training throughput might take longer (more epochs) to ramp up to maximum performance. If throughput ramps up slowly or plateaus below expectations, I/O is likely the bottleneck: the cache is harder to build up as you add nodes, because the throughput requirement is higher and the network topology is more complex. For more information about monitoring Amazon FSx throughput in CloudWatch, see Monitoring FSx for Lustre in the FSx for Lustre User Guide.

Q: How do I resolve I/O bottlenecks when running a distributed training job with data parallelism?

We highly recommend that you use Amazon FSx as your data channel if you are currently using Amazon S3. If you are already using Amazon FSx but are still having I/O bottleneck problems, you might have set up your Amazon FSx file system with low I/O throughput and small storage capacity. For more information about how to estimate and choose the right size of I/O throughput capacity, see Use Amazon FSx and set up an optimal storage and throughput capacity.

Q: (For the SMDDP library v1.4.0 or later) How do I resolve the Invalid backend error when initializing a process group?

If you encounter the error message ValueError: Invalid backend: 'smddp' when calling init_process_group, this is due to a breaking change in the SMDDP library v1.4.0 and later. You must import the PyTorch client of the library, smdistributed.dataparallel.torch.torch_smddp, which registers smddp as a backend for PyTorch. To learn more, see Use the SMDDP library in your PyTorch training script.
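The fix, as documented for the library v1.4.0 and later, looks like this:

```python
import torch.distributed as dist

# Importing the SMDDP PyTorch client registers "smddp" as a
# torch.distributed backend (required in SMDDP v1.4.0 and later).
import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401

dist.init_process_group(backend="smddp")
```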

Q: (For the SMDDP library v1.4.0 or later) I would like to call the collective primitives of the torch.distributed interface. Which primitives does the smddp backend support?

In v1.4.0 and later, the SMDDP library supports all_reduce, broadcast, reduce, all_gather, and barrier of the torch.distributed interface.
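Once the smddp backend is initialized, you call these collectives through the standard torch.distributed API. A brief sketch follows; the use of the LOCAL_RANK environment variable for device selection is an assumption about how your launcher assigns GPUs.

```python
import os
import torch
import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp  # registers the smddp backend

dist.init_process_group(backend="smddp")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # assumes the launcher sets LOCAL_RANK

# all_reduce: sum a tensor across all workers in place.
t = torch.ones(1, device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # each rank now holds the world size

# barrier: block until every rank reaches this point.
dist.barrier()
```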

Q: (For the SMDDP library v1.4.0 or later) Does this new API work with other custom DDP classes or libraries like Apex DDP?

The SMDDP library is tested with other third-party distributed data parallel libraries and framework implementations that use the torch.distributed modules. Using the SMDDP library with custom DDP classes works as long as the collective operations used by the custom DDP classes are supported by the SMDDP library. See the preceding question for a list of supported collectives. If you have these use cases and need further support, reach out to the SageMaker team through the Amazon Support Center or Amazon Developer Forums for Amazon SageMaker.

Q: Does the SMDDP library support the bring-your-own-container (BYOC) option? If so, how do I install the library and run a distributed training job by writing a custom Dockerfile?

If you want to integrate the SMDDP library and its minimum dependencies into your own Docker container, BYOC is the right approach. You can build your own container using the binary file of the library. The recommended process is to write a custom Dockerfile with the library and its dependencies, build the Docker container, host it in Amazon ECR, and use the ECR image URI to launch a training job with the SageMaker generic estimator class, as sketched below. For instructions on how to prepare a custom Dockerfile for distributed training in SageMaker with the SMDDP library, see Create your own Docker container with the SageMaker distributed data parallel library.
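At a high level, the launch step of that process looks like the following sketch with the generic estimator; the image URI, role, and data path are placeholders, and passing the distribution argument to the generic Estimator assumes a recent version of the SageMaker Python SDK.

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    # Placeholder URI of your custom image hosted in Amazon ECR.
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>",
    role="<your-IAM-role-ARN>",
    instance_type="ml.p4d.24xlarge",
    instance_count=2,
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit("s3://<your-bucket>/train/")
```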