FP16 Training with Model Parallelism
For FP16 training, apply the following modifications to your training script and estimator.
Note
This feature is available for PyTorch in the SageMaker model parallelism library v1.10.0 and later.
Adapt your PyTorch training script
- Wrap your model using the smdistributed.modelparallel.torch.model_creation() context manager.

  ```python
  # fp16_training_script.py

  import torch
  import smdistributed.modelparallel.torch as smp

  with smp.model_creation(
      dtype=torch.float16 if args.fp16 else torch.get_default_dtype()
  ):
      model = ...
  ```

  Tip

  If you are using tensor parallelism, add tensor_parallelism=smp.tp_size() > 1 to the smp.model_creation context manager. Adding this line also helps automatically detect whether tensor parallelism is activated.

  ```python
  with smp.model_creation(
      ...,
      tensor_parallelism=smp.tp_size() > 1
  ):
      model = ...
  ```
- When you wrap the optimizer with smdistributed.modelparallel.torch.DistributedOptimizer, set either the static_loss_scale or dynamic_loss_scale argument. By default, static_loss_scale is set to 1.0, and dynamic_loss_scale is set to False. If you set dynamic_loss_scale=True, you can feed dynamic loss scaling options as a dictionary through the dynamic_loss_args argument. In most cases, we recommend that you use dynamic loss scaling with the default options. For more information, options, and examples of the optimizer wrapper function, see the smdistributed.modelparallel.torch.DistributedOptimizer API. The following code is an example of wrapping an Adadelta optimizer object with dynamic loss scaling for FP16 training. (A sketch of a full training step using the wrapped model and optimizer follows this list.)

  ```python
  optimizer = torch.optim.Adadelta(...)
  optimizer = smp.DistributedOptimizer(
      optimizer,
      static_loss_scale=None,
      dynamic_loss_scale=True,
      dynamic_loss_args={
          "scale_window": 1000,
          "min_scale": 1,
          "delayed_shift": 2
      }
  )
  ```
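The two pieces fit together in an ordinary training loop. The following is a minimal sketch, assuming the model is additionally wrapped with smp.DistributedModel per the library's standard workflow and that dataloader, data, and target are defined elsewhere; loss scaling itself happens inside the wrappers described later in this section.

```python
import torch.nn.functional as F
import smdistributed.modelparallel.torch as smp

model = smp.DistributedModel(model)  # standard smp wrapping (assumed here)

# The smp.step decorator splits each batch into microbatches and runs
# the pipelined forward/backward passes.
@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target)
    model.backward(loss)  # with smp, call model.backward instead of loss.backward
    return loss

for data, target in dataloader:
    optimizer.zero_grad()
    loss_mb = train_step(model, data, target)
    loss = loss_mb.reduce_mean()  # average the per-microbatch losses
    optimizer.step()
```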
Configure a SageMaker PyTorch estimator
Add the FP16 parameter ("fp16") to the distribution configuration for model parallelism when creating a SageMaker PyTorch estimator object. For a complete list of the configuration parameters for model parallelism, see Parameters for smdistributed.

```python
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,
        "tensor_parallel_degree": 2,
        ...,
        "fp16": True
    }
}

fp16_estimator = PyTorch(
    entry_point="fp16_training_script.py",  # Specify your train script
    ...,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {...}
    }
)

fp16_estimator.fit(...)
```
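Inside the training script, the "fp16" entry from this distribution configuration is exposed through the library's configuration state; this is the same smp.state.cfg.fp16 flag used in the scheduler tip below. For example:

```python
import smdistributed.modelparallel.torch as smp

smp.init()
use_fp16 = smp.state.cfg.fp16  # True when "fp16": True is set in smp_options
```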
When FP16 training starts, the model and the optimizer are wrapped by FP16_Module and FP16_Optimizer respectively, which are modified smdistributed versions of the Apex utilities. FP16_Module converts the model to FP16 dtype and handles the forward pass in FP16.
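As a rough illustration of that division of labor (a sketch, not documented library behavior), you can inspect the wrapped model's parameter dtypes; FP16_Optimizer keeps separate FP32 master weights for the update, which is why the gradient-clipping tip below operates on master gradients.

```python
# After FP16_Module wraps the model, its parameters are generally half precision.
for name, param in model.named_parameters():
    print(name, param.dtype)  # expect torch.float16 for most parameters
```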
Tip
You can apply gradient clipping by calling clip_master_grads before
optimizer.step.
```python
optimizer.clip_master_grads(max_norm)  # max_norm (float or int): max norm of the gradients
```
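In a training loop, the call sits between the backward pass and the optimizer step, for example (reusing the sketched train_step from earlier in this section):

```python
optimizer.zero_grad()
loss_mb = train_step(model, data, target)
optimizer.clip_master_grads(1.0)  # clip the FP32 master gradients before the update
optimizer.step()
```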
Tip
When using torch.optim.lr_scheduler with FP16 training, pass optimizer.optimizer (the underlying torch optimizer held by FP16_Optimizer) to the LR scheduler rather than the FP16 optimizer wrapper itself. See the following example code.

```python
from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(
    optimizer.optimizer if smp.state.cfg.fp16 else optimizer,
    step_size=1,
    gamma=args.gamma
)
```
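The scheduler is then stepped as usual (for example, scheduler.step() once per epoch); only its constructor needs the underlying optimizer rather than the FP16 wrapper.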