
Use the SMDDP library in your PyTorch Lightning training script

If you bring your own PyTorch Lightning training script and want to run a distributed data parallel training job in SageMaker, you can do so with minimal changes to your training script. The necessary changes are as follows: import the smdistributed.dataparallel library’s PyTorch modules, set up the LightningEnvironment so that PyTorch Lightning reads the environment variables preset by the SageMaker training toolkit, and activate the SMDDP library by setting the process group backend to "smddp". The following instructions break these changes down step by step with code examples; a consolidated example script follows the steps.

Note

The PyTorch Lightning support is available in the SageMaker data parallel library v1.5.0 and later.

  1. Import the PyTorch Lightning (lightning) library and the smdistributed.dataparallel.torch modules.

    import lightning as pl
    import smdistributed.dataparallel.torch.torch_smddp
  2. Instantiate the LightningEnvironment and map its world size and global rank to the WORLD_SIZE and RANK environment variables set by SageMaker.

    import os

    from lightning.fabric.plugins.environments.lightning import LightningEnvironment

    env = LightningEnvironment()
    env.world_size = lambda: int(os.environ["WORLD_SIZE"])
    env.global_rank = lambda: int(os.environ["RANK"])
  3. For PyTorch DDP – Create an object of the DDPStrategy class with "smddp" for process_group_backend and "gpu" for accelerator, and pass that to the Trainer class.

    import lightning as pl
    from lightning.pytorch.strategies import DDPStrategy

    ddp = DDPStrategy(
        cluster_environment=env,
        process_group_backend="smddp",
        accelerator="gpu"
    )
    trainer = pl.Trainer(
        max_epochs=200,
        strategy=ddp,
        devices=num_gpus,
        num_nodes=num_nodes
    )

    For PyTorch FSDP – Create an object of the FSDPStrategy class (with wrapping policy of choice) with "smddp" for process_group_backend and "gpu" for accelerator, and pass that to the Trainer class.

    import lightning as pl
    from lightning.pytorch.strategies import FSDPStrategy
    from functools import partial
    from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

    policy = partial(
        size_based_auto_wrap_policy,
        min_num_params=10000
    )
    fsdp = FSDPStrategy(
        auto_wrap_policy=policy,
        process_group_backend="smddp",
        cluster_environment=env
    )
    trainer = pl.Trainer(
        max_epochs=200,
        strategy=fsdp,
        devices=num_gpus,
        num_nodes=num_nodes
    )

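For reference, here is a minimal sketch, assuming the DDP variant from step 3, that combines the changes above into a single training script. The ToyModel class, the synthetic dataset, and the trainer hyperparameters are hypothetical placeholders; only the smdistributed.dataparallel import, the LightningEnvironment setup, and the "smddp" process group backend come from the preceding steps. SM_NUM_GPUS and SM_HOSTS are environment variables that the SageMaker training toolkit sets on each training instance.

import json
import os

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

import lightning as pl
import smdistributed.dataparallel.torch.torch_smddp  # registers the "smddp" process group backend
from lightning.fabric.plugins.environments.lightning import LightningEnvironment
from lightning.pytorch.strategies import DDPStrategy

class ToyModel(pl.LightningModule):  # hypothetical placeholder model
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

if __name__ == "__main__":
    # Map PyTorch Lightning's view of the cluster to the environment
    # variables preset by the SageMaker training toolkit.
    env = LightningEnvironment()
    env.world_size = lambda: int(os.environ["WORLD_SIZE"])
    env.global_rank = lambda: int(os.environ["RANK"])
    num_gpus = int(os.environ["SM_NUM_GPUS"])             # GPUs per instance
    num_nodes = len(json.loads(os.environ["SM_HOSTS"]))   # number of instances

    # Activate the SMDDP library by setting the process group backend to "smddp".
    ddp = DDPStrategy(
        cluster_environment=env,
        process_group_backend="smddp",
        accelerator="gpu",
    )

    # Hypothetical synthetic data; replace with your own DataLoader.
    dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
    trainer = pl.Trainer(
        max_epochs=2,
        strategy=ddp,
        devices=num_gpus,
        num_nodes=num_nodes,
    )
    trainer.fit(ToyModel(), DataLoader(dataset, batch_size=32))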
After you have completed adapting your training script, proceed to Step 2: Launch a distributed training job using the SageMaker Python SDK.

Note

When you construct a SageMaker PyTorch estimator and submit a training job request in Step 2: Launch a distributed training job using the SageMaker Python SDK, you need to provide requirements.txt to install pytorch-lightning and lightning-bolts in the SageMaker PyTorch training container.

# requirements.txt
pytorch-lightning
lightning-bolts

For more information about specifying the source directory to place the requirements.txt file along with your training script and a job submission, see Using third-party libraries in the Amazon SageMaker Python SDK documentation.
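
As an illustration only, the following sketch shows what such a job submission might look like with the SageMaker Python SDK. The entry point name, source directory, IAM role, instance settings, and framework versions are hypothetical placeholders; check the SMDDP library release notes for the framework versions, Python versions, and instance types that the library supports.

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",           # your adapted PyTorch Lightning script (hypothetical name)
    source_dir="src",                 # directory containing train.py and requirements.txt
    role="<your-iam-role-arn>",       # IAM role with SageMaker permissions
    framework_version="2.0.0",        # example version; use one supported by the SMDDP library
    py_version="py310",
    instance_count=2,
    instance_type="ml.p4d.24xlarge",  # an SMDDP-supported instance type
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit()

With this setup, SageMaker installs the packages listed in requirements.txt into the training container before it runs the entry point script.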