Use the PyTorch framework estimators in the SageMaker Python SDK
You can launch distributed training by adding the distribution argument to the SageMaker AI framework estimators, PyTorch or TensorFlow.
- PyTorch
The following launcher options are available for launching PyTorch distributed training.
- pytorchddp – This option runs mpirun and sets up the environment variables needed for running PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the distribution parameter.

  { "pytorchddp": { "enabled": True } }
- torch_distributed – This option runs torchrun and sets up the environment variables needed for running PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the distribution parameter.

  { "torch_distributed": { "enabled": True } }
- smdistributed – This option also runs mpirun, but with smddprun, which sets up the environment variables needed for running PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the distribution parameter.

  { "smdistributed": { "dataparallel": { "enabled": True } } }
If you chose to replace NCCL AllGather with SMDDP AllGather, you can use all three options. Choose the option that fits your use case.

If you chose to replace NCCL AllReduce with SMDDP AllReduce, you should choose one of the mpirun-based options: smdistributed or pytorchddp. You can also add additional MPI options as follows.

{ "pytorchddp": { "enabled": True, "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION" } }

{ "smdistributed": { "dataparallel": { "enabled": True, "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION" } } }

The following code sample shows the basic structure of a PyTorch estimator with distributed training options.
from sagemaker.pytorch import PyTorch

pt_estimator = PyTorch(
    base_job_name="training_job_name_prefix",
    source_dir="subdirectory-to-your-code",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    py_version="py310",
    framework_version="2.0.1",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2, 3, 4, ..., 8
    instance_count=2,

    # Instance types supported by the SageMaker AI data parallel library:
    # ml.p4d.24xlarge, ml.p4de.24xlarge
    instance_type="ml.p4d.24xlarge",

    # Activate distributed training with SMDDP
    distribution={ "pytorchddp": { "enabled": True } }  # mpirun, activates SMDDP AllReduce OR AllGather
    # distribution={ "torch_distributed": { "enabled": True } }  # torchrun, activates SMDDP AllGather
    # distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }  # mpirun, activates SMDDP AllReduce OR AllGather
)

pt_estimator.fit("s3://bucket/path/to/training/data")
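The estimator launches the script given as entry_point (here, adapted-training-script.py) on each process. The following is a minimal sketch of what the distributed setup inside such a script might look like, assuming the smddp process group backend that recent SMDDP releases register through the smdistributed.dataparallel.torch.torch_smddp import; the model and environment variable handling are illustrative placeholders, not taken from this page.

# adapted-training-script.py (illustrative sketch only)
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401, assumed to register the "smddp" backend

def main():
    # Assumption: the SageMaker launcher (mpirun or torchrun) sets RANK, WORLD_SIZE, and LOCAL_RANK.
    dist.init_process_group(backend="smddp")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; replace with your real model, data loading, and training loop.
    model = torch.nn.Linear(10, 1).to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # ... training loop goes here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()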
Note

PyTorch Lightning and its utility libraries, such as Lightning Bolts, are not preinstalled in the SageMaker AI PyTorch DLCs. Create the following requirements.txt file and save it in the source directory where you save the training script.

# requirements.txt
pytorch-lightning
lightning-bolts

For example, the tree-structured directory should look like the following.
├── pytorch_training_launcher_jupyter_notebook.ipynb
└── sub-folder-for-your-code
    ├── adapted-training-script.py
    └── requirements.txt

For more information about specifying the source directory in which to place the requirements.txt file along with your training script, and about submitting a job, see Using third-party libraries in the Amazon SageMaker AI Python SDK documentation.

Considerations for activating SMDDP collective operations and using the right distributed training launcher options
- SMDDP AllReduce and SMDDP AllGather are not mutually compatible at present.
- SMDDP AllReduce is activated by default when using smdistributed or pytorchddp, the mpirun-based launchers, and NCCL AllGather is used.
- SMDDP AllGather is activated by default when using the torch_distributed launcher, and AllReduce falls back to NCCL.
- SMDDP AllGather can also be activated when using the mpirun-based launchers by setting an additional environment variable as follows; one way to set it on the estimator is sketched after this list.

  export SMDATAPARALLEL_OPTIMIZE_SDP=true
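Rather than exporting the variable manually inside the container, you can pass it through the estimator. The following is a minimal sketch under the assumption that your version of the SageMaker Python SDK supports the environment argument on the PyTorch estimator and forwards it to the training containers.

from sagemaker.pytorch import PyTorch

pt_estimator = PyTorch(
    base_job_name="training_job_name_prefix",
    source_dir="subdirectory-to-your-code",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    py_version="py310",
    framework_version="2.0.1",
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    # mpirun-based launcher; SMDDP AllReduce is active by default
    distribution={ "pytorchddp": { "enabled": True } },
    # Assumption: this dictionary is injected as environment variables in the
    # training containers, which activates SMDDP AllGather for mpirun-based launchers.
    environment={ "SMDATAPARALLEL_OPTIMIZE_SDP": "true" },
)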
- TensorFlow
Important
The SMDDP library has discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow versions later than v2.11.0. To find previous TensorFlow DLCs with the SMDDP library installed, see TensorFlow (deprecated).
The following code sample shows the basic structure of a TensorFlow estimator with the distributed training option.

from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
    base_job_name="training_job_name_prefix",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    framework_version="2.11.0",
    py_version="py38",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2, 3, 4, ..., 8
    instance_count=2,

    # Instance types supported by the SageMaker AI data parallel library:
    # ml.p4d.24xlarge, ml.p3dn.24xlarge, and ml.p3.16xlarge
    instance_type="ml.p3.16xlarge",

    # Training using the SageMaker AI data parallel distributed training strategy
    distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }
)

tf_estimator.fit("s3://bucket/path/to/training/data")