
HyperPod Slurm cluster pre-training tutorial (GPU)

The following tutorial sets up a Slurm environment and starts a training job for an 8-billion-parameter Llama 3 model.

Prerequisites

Before you start setting up your environment to run the recipe, make sure you have:

  • Set up a HyperPod GPU Slurm cluster.

    • Your HyperPod Slurm cluster must have NVIDIA Enroot and Pyxis enabled (these are enabled by default; see the verification sketch after this list).

  • A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.

  • Data in one of the following formats:

    • JSON

    • JSONGZ (Compressed JSON)

    • ARROW

  • (Optional) A HuggingFace token, if you're using model weights from HuggingFace for pre-training or fine-tuning. For more information about getting a token, see User access tokens.
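
Before moving on, you can optionally confirm from the head node that Enroot, Pyxis, and your shared storage are in place. The following is a minimal sketch; the /fsx mount point is an assumption, so adjust it to match your cluster.

# Verify Enroot is installed (prints the Enroot version)
enroot version

# Verify the Pyxis Slurm plugin is loaded; when it is, srun exposes
# container-related flags such as --container-image
srun --help | grep -i container

# Verify the shared file system is mounted (assumes it's mounted at /fsx)
df -h /fsx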

HyperPod GPU Slurm environment setup

To initiate a training job on a HyperPod GPU Slurm cluster, do the following:

  1. SSH into the head node of your Slurm cluster.

  2. After you log in, set up the virtual environment. Make sure you're using Python 3.9 or greater.

    # set up a virtual environment
    python3 -m venv ${PWD}/venv
    source venv/bin/activate
  3. Clone the SageMaker HyperPod recipes and SageMaker HyperPod adapter repositories to a shared storage location.

    git clone https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo.git
    git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
    cd sagemaker-hyperpod-recipes
    pip3 install -r requirements.txt
  4. Create a squash file using Enroot. To find the most recent release of the SMP container, see Release notes for the SageMaker model parallelism library. To gain a deeper understanding of how to use the Enroot file, see Build Amazon-optimized Nemo-Launcher image.

    REGION="<region>" IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121" aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE} mv $PWD/smdistributed-modelparallel.sqsh "/fsx/<any-path-in-the-shared-filesystem>"
  5. To use the Enroot squash file to start training, use the following example to modify the recipes_collection/config.yaml file.

    container: /fsx/path/to/your/smdistributed-modelparallel.sqsh
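
If you prefer to script this edit, the following minimal sketch rewrites the container line in place. Run it from the sagemaker-hyperpod-recipes directory; the squash-file path is an example and should match the location you chose in the previous step.

# Point recipes_collection/config.yaml at your Enroot squash file
SQUASH_FILE="/fsx/path/to/your/smdistributed-modelparallel.sqsh"
sed -i "s|^container:.*|container: ${SQUASH_FILE}|" recipes_collection/config.yaml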

Launch the training job

After you've installed the dependencies, start a training job from the sagemaker-hyperpod-recipes/launcher_scripts directory. (You got the dependencies when you cloned the SageMaker HyperPod recipes repository during environment setup.)

First, pick your training recipe from GitHub; the model name is specified as part of the recipe. In the following example, we use the launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh script to launch the llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain recipe, which pre-trains a Llama 3 8B model with a 16,384-token (16k) sequence length.
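
If you're not sure which recipe to pick, you can list the launcher scripts that ship with the repository; each script corresponds to a recipe under recipes_collection. For example, run the following from the sagemaker-hyperpod-recipes directory:

# List the available Llama launcher scripts
ls launcher_scripts/llama/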

  • IMAGE: The container from the environment setup section.

  • (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair (the sketch after the script below shows where it goes):

    recipes.model.hf_access_token=<your_hf_token>
#!/bin/bash
IMAGE="${YOUR_IMAGE}"
SAGEMAKER_TRAINING_LAUNCHER_DIR="${SAGEMAKER_TRAINING_LAUNCHER_DIR:-${PWD}}"
TRAIN_DIR="${YOUR_TRAIN_DIR}" # Location of training dataset
VAL_DIR="${YOUR_VAL_DIR}" # Location of validation dataset
EXP_DIR="${YOUR_EXP_DIR}" # Experiment output directory

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf_llama3_8b" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR" \
    container="${IMAGE}" \
    +cluster.container_mounts.0="/fsx:/fsx"
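
If you need the HuggingFace token described in the list above, pass it as one more key-value override on the same command. The following sketch repeats the launcher call with the token added; the token value is a placeholder, and the shell variables are the ones defined in the script above.

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf_llama3_8b" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR" \
    recipes.model.hf_access_token="<your_hf_token>" \
    container="${IMAGE}" \
    +cluster.container_mounts.0="/fsx:/fsx"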

After you've configured all the required parameters in the launcher script, you can run the script using the following command.

bash launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh
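
Once the job is submitted, you can watch it with standard Slurm tooling. The log path below is an assumption based on the base_results_dir and run name configured above; the exact layout can vary between recipe versions.

# Confirm the training job is queued or running
squeue -u $USER

# Tail the most recent log for the run (adjust the path if your layout differs)
tail -f "$(ls -t results/hf_llama3_8b/log-*.out | head -n 1)"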

For more information about the Slurm cluster configuration, see Running a training job on HyperPod Slurm.