Distributed GPU Training
This section covers distributed training on GPU-based clusters.
Make sure that your cluster has GPU nodes before you run the examples.
If your cluster does not have GPU nodes, use the following command to add a nodegroup.
Be sure to select an Amazon EC2 instance (node-type) in the Accelerated Computing category.
    eksctl create nodegroup --cluster $CLUSTER_NAME \
        --region $CLUSTER_REGION \
        --nodes 2 \
        --nodes-min 1 \
        --nodes-max 3 \
        --node-type p3.2xlarge
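After the nodegroup is active, you can confirm that the GPUs are visible to Kubernetes before running any job. The following check is a minimal sketch; it assumes the NVIDIA device plugin (verified in the steps below) registers the nvidia.com/gpu resource on each node.

    # List each node and the number of GPUs it advertises as allocatable.
    kubectl get nodes "-o=custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

Nodes where the device plugin has not yet registered show <none> in the GPU column.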
For a complete list of Deep Learning Containers, see Deep Learning Containers Images.
PyTorch distributed GPU training
This tutorial guides you through training a classification model on MNIST with a PyTorch distributed data parallel job.
- Create a PyTorchJob.

  - Verify that the PyTorch custom resource is installed.

      kubectl get crd

    The output should include pytorchjobs.kubeflow.org.

  - Ensure that the NVIDIA plugin daemonset is running.

      kubectl get daemonset -n kube-system

    The output should look similar to the following.

      NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
      nvidia-device-plugin-daemonset   3         3         3       3            3           <none>          35h
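If the daemonset is missing, one common way to add it is to apply the device plugin manifest from the NVIDIA/k8s-device-plugin repository. This is a hedged sketch: the version tag (v0.14.1 here) is an assumption, so pin whichever release matches your cluster.

    # Deploy the NVIDIA device plugin daemonset (version tag is an example).
    kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

Re-run the kubectl get daemonset command afterward to confirm the pods reach the READY state.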
  - Use the following text to create a gloo-based distributed data parallel job. Save it in a file named pt_distributed.yaml.

      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      metadata:
        name: "kubeflow-pytorch-gpu-dist-job"
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                - name: "pytorch"
                  image: "763104351884.dkr.ecr.us-west-2.amazonaws.com/aws-samples-pytorch-training:2.0-gpu-py310-ec2"
                  args:
                    - "--backend"
                    - "gloo"
                    - "--epochs"
                    - "5"
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                - name: "pytorch"
                  image: "763104351884.dkr.ecr.us-west-2.amazonaws.com/aws-samples-pytorch-training:2.0-gpu-py310-ec2"
                  args:
                    - "--backend"
                    - "gloo"
                    - "--epochs"
                    - "5"
                  resources:
                    limits:
                      nvidia.com/gpu: 1
  - Run a distributed training job.

      kubectl create -f pt_distributed.yaml -n ${NAMESPACE}
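Once the job is created, the master and worker pods have to be scheduled and pull the training image, which can take a few minutes. A simple way to follow this is to watch the pod list; the -w flag streams updates until you interrupt it.

    # Watch the job's pods move from Pending to Running.
    kubectl get pods -n ${NAMESPACE} -w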
- Monitor your PyTorchJob.

  - See the status section to monitor the job status. The following command returns the job in YAML format; when training finishes, its status section reports the successful completion.

      kubectl get -o yaml pytorchjobs kubeflow-pytorch-gpu-dist-job -n ${NAMESPACE}
  - Check the logs for each pod. The first command prints a list of pods for a specific PyTorchJob, as shown in the following example.

      kubectl get pods -l job-name=kubeflow-pytorch-gpu-dist-job -o name -n ${NAMESPACE}

    The second command tails the logs for a specific pod. Replace <pod name> with a name from the previous output. (A sketch that combines both commands appears after this list.)

      kubectl logs <pod name> -n ${NAMESPACE}
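Both monitoring steps can be combined into small shell helpers. This is a minimal sketch, reusing the job name and label from the steps above; the jsonpath expression assumes the training operator records its usual conditions (Created, Running, Succeeded) on the job.

    # Print the latest recorded condition type (e.g. Running, Succeeded).
    kubectl get pytorchjobs kubeflow-pytorch-gpu-dist-job -n ${NAMESPACE} \
        -o jsonpath='{.status.conditions[-1].type}'

    # Dump the logs of every pod that belongs to the job.
    for pod in $(kubectl get pods -l job-name=kubeflow-pytorch-gpu-dist-job -o name -n ${NAMESPACE}); do
        echo "=== ${pod} ==="
        kubectl logs ${pod} -n ${NAMESPACE}
    done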
- See Cleanup for information about cleaning up a cluster after you finish using it.
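If you only want to remove this training job rather than the whole cluster, deleting the resources you created is enough; this assumes the manifest file from the steps above.

    # Delete the PyTorchJob and its pods.
    kubectl delete -f pt_distributed.yaml -n ${NAMESPACE}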
TensorFlow with Horovod distributed GPU training
This tutorial guides you through distributed training with Horovod and TensorFlow.
The example requires a GPU instance with at least 2 GPUs. You can use node-type=p3.16xlarge or above.
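If your cluster does not yet have such nodes, a nodegroup can be added the same way as at the top of this section. This sketch simply swaps in the larger instance type; the node counts are example values.

    # Example: add a nodegroup of multi-GPU instances for the Horovod job.
    eksctl create nodegroup --cluster $CLUSTER_NAME \
        --region $CLUSTER_REGION \
        --nodes 2 \
        --nodes-min 1 \
        --nodes-max 3 \
        --node-type p3.16xlarge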
- Create an MPIJob.

  - Verify that the MPIJob custom resource is installed.

      kubectl get crd

    The output should include mpijobs.kubeflow.org.

  - Ensure that the NVIDIA plugin daemonset is running.

      kubectl get daemonset -n kube-system

    The output should look similar to the following.

      NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
      nvidia-device-plugin-daemonset   3         3         3       3            3           <none>          35h
  - Use the following text to create an MPIJob. Save it in a file named tf_distributed.yaml.

      apiVersion: kubeflow.org/v1
      kind: MPIJob
      metadata:
        name: tensorflow-tf-dist
      spec:
        slotsPerWorker: 1
        cleanPodPolicy: Running
        mpiReplicaSpecs:
          Launcher:
            replicas: 1
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                - image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/aws-samples-tensorflow-training:2.12-gpu-py310-ec2
                  name: tensorflow-launcher
                  command:
                    - mpirun
                    - -mca
                    - btl_tcp_if_exclude
                    - lo
                    - -mca
                    - pml
                    - ob1
                    - -mca
                    - btl
                    - ^openib
                    - --bind-to
                    - none
                    - -map-by
                    - slot
                    - -x
                    - LD_LIBRARY_PATH
                    - -x
                    - PATH
                    - -x
                    - NCCL_SOCKET_IFNAME=eth0
                    - -x
                    - NCCL_DEBUG=INFO
                    - -x
                    - MXNET_CUDNN_AUTOTUNE_DEFAULT=0
                    - python
                    - /deep-learning-models/models/resnet/tensorflow2/train_tf2_resnet.py
                  args:
                    - --num_epochs
                    - "10"
                    - --synthetic
          Worker:
            replicas: 2
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                - image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/aws-samples-tensorflow-training:2.12-gpu-py310-ec2
                  name: tensorflow-worker
                  resources:
                    limits:
                      nvidia.com/gpu: 1
  - Run a distributed training job.

      kubectl create -f tf_distributed.yaml -n ${NAMESPACE}
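The MPI operator typically creates one launcher pod, which runs mpirun, plus the worker pods, and training output only starts once the workers are Running. A quick way to follow scheduling is a watch; pod names such as tensorflow-tf-dist-launcher and tensorflow-tf-dist-worker-0 assume the operator's usual naming convention.

    # Watch the launcher and worker pods for the MPIJob start up.
    kubectl get pods -n ${NAMESPACE} -w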
- Monitor your MPIJob.

  - See the status section to monitor the job status. The following command returns the job in YAML format; when training finishes, its status section reports the successful completion.

      kubectl get -o yaml mpijob tensorflow-tf-dist -n ${NAMESPACE}
  - Check the logs for each pod. The first command prints a list of pods for the MPIJob, filtered by the job-name prefix, as shown in the following example.

      kubectl get pods -n ${NAMESPACE} -o name | grep tensorflow-tf-dist

    The second command tails the logs for a specific pod. Replace <pod name> with a name from the previous output. (For Horovod, the aggregated training output appears in the launcher pod's logs; see the sketch after this list.)

      kubectl logs <pod name> -n ${NAMESPACE}
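With Horovod, mpirun runs in the launcher pod and the per-rank training output is aggregated there, so the launcher log is usually the one to follow. A minimal sketch, assuming the mpi-operator convention of naming the launcher pod with a -launcher suffix:

    # Follow the aggregated Horovod training output from the launcher pod.
    kubectl logs -f -n ${NAMESPACE} \
        $(kubectl get pods -n ${NAMESPACE} -o name | grep tensorflow-tf-dist-launcher)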
- See Cleanup for information about cleaning up a cluster after you finish using it.
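As with the PyTorch example, you can remove just this job instead of the whole cluster; this assumes the manifest file created above.

    # Delete the MPIJob and its launcher and worker pods.
    kubectl delete -f tf_distributed.yaml -n ${NAMESPACE}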