
Activation Offloading

When activation checkpointing and pipeline parallelism are turned on and the number of microbatches is greater than one, activation offloading is an additional feature that can further reduce memory usage. Activation offloading asynchronously moves to the CPU the checkpointed activations of the microbatches that are not currently running on the GPU. Right before the GPU needs the activations for a microbatch's backward pass, this functionality prefetches the offloaded activations back from the CPU.

Note

This feature is available for PyTorch in the SageMaker model parallelism library v1.6.0 and later.

How to Use Activation Offloading

Use activation offloading to reduce memory usage when the number of microbatches is greater than 1 and activation checkpointing is turned on (see Activation Checkpointing). When activation checkpointing is not used, activation offloading has no effect. When only one microbatch is used, it does not save memory.

To use activation offloading, set "offload_activations": True in the modelparallel configuration.

Activation offloading asynchronously moves the checkpointed activations in nn.Sequential modules to the CPU. The data transfer over the PCIe link overlaps with GPU computation. Offloading happens as soon as the forward pass for a particular checkpointed layer is computed. The activations are loaded back to the GPU shortly before they are needed for the backward pass of the corresponding microbatch. The CPU-to-GPU transfer similarly overlaps with computation.

To adjust how early the activations are loaded back into the GPU, use the configuration parameter "activation_loading_horizon" (the default is 4; the value must be an integer greater than 0). A larger activation loading horizon causes the activations to be loaded back to the GPU earlier. If the horizon is too large, the memory-saving impact of activation offloading is diminished. If the horizon is too small, the activations might not be loaded back in time, reducing the amount of overlap and degrading performance.
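For reference, the following is a minimal sketch of the offloading-related entries in the modelparallel configuration. The microbatch and pipeline-parallel values are illustrative assumptions; the complete estimator configuration appears in the next section, and activation checkpointing must also be applied in the training script for these settings to have any effect.

# Minimal sketch: only the entries relevant to activation offloading.
# Offloading also requires activation checkpointing in the training script
# and more than one microbatch; the values below are illustrative.
smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 4,                  # must be greater than 1
        "pipeline_parallel_degree": 2,
        "offload_activations": True,        # turn on activation offloading
        "activation_loading_horizon": 4     # optional; default is 4, must be > 0
    }
}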

Tip

Activation offloading can be useful for large models with over a hundred billion parameters.

Configure a SageMaker PyTorch estimator

mpi_options = {
    "enabled": True,
    "processes_per_host": 8,              # 8 processes per host
    "custom_mpi_options": "--mca btl_vader_single_copy_mechanism none "
}

smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,    # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 2,      # tensor parallelism over 2 devices
        "ddp": True,
        "offload_activations": True,
        "activation_loading_horizon": 4   # optional; default is 4
    }
}
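The sketch below shows one way to pass these option dictionaries to a SageMaker PyTorch estimator through the distribution argument. The entry point, IAM role, instance type, and framework/Python versions are placeholder assumptions; substitute values that match your training setup.

# Sketch of constructing the estimator with the options defined above.
from sagemaker.pytorch import PyTorch

smp_estimator = PyTorch(
    entry_point="train.py",              # your training script (placeholder name)
    role="<your-IAM-role-ARN>",
    instance_type="ml.p3.16xlarge",      # example instance type with 8 GPUs
    instance_count=1,
    framework_version="1.8.1",           # example framework version
    py_version="py36",
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options,
    },
)

smp_estimator.fit("s3://<your-bucket>/<training-data-prefix>")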