CPU Training
This section shows how to train a model on CPU instances by using Kubeflow training operators.
For a complete list of Deep Learning Containers, see Deep Learning Containers Images. For tips about configuration settings when using the Intel Math Kernel Library (MKL), see Amazon Deep Learning Containers Intel Math Kernel Library (MKL) Recommendations.
PyTorch CPU training
Your deployment of Kubeflow on Amazon comes with PyTorchJob. This tutorial guides you through training a classification model on MNIST with PyTorch.
To create a PyTorchJob, follow these steps.

1. Create the job configuration file.

   Open `vi` or `vim`, then copy and paste the following content. Save this file as `pytorch.yaml`.

   ```yaml
   apiVersion: "kubeflow.org/v1"
   kind: PyTorchJob
   metadata:
     name: pytorch-training
   spec:
     pytorchReplicaSpecs:
       Worker:
         restartPolicy: OnFailure
         template:
           metadata:
             annotations:
               sidecar.istio.io/inject: "false"
           spec:
             containers:
             - name: pytorch
               imagePullPolicy: Always
               image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.0-cpu-py310-ubuntu20.04-ec2
               command:
                 - "/bin/sh"
                 - "-c"
               args:
                 - "git clone https://github.com/pytorch/examples.git && python examples/mnist/main.py --no-cuda --epochs=1"
               env:
               - name: OMP_NUM_THREADS
                 value: "36"
               - name: KMP_AFFINITY
                 value: "granularity=fine,verbose,compact,1,0"
               - name: KMP_BLOCKTIME
                 value: "1"
   ```
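The `OMP_NUM_THREADS` value of `36` in the manifest above is tuned for one particular instance size; the usual MKL recommendation is to match the number of CPU cores the pod can actually use. A minimal sketch, assuming `nproc` on the node reflects the cores available to the container:

```shell
# Sketch: derive an OMP_NUM_THREADS value from the node's core count.
# Assumes nproc reflects the CPUs available to the container; set the
# manifest's env value to match your own instance type.
CORES=$(nproc)
echo "OMP_NUM_THREADS=${CORES}"
```

See the Intel MKL recommendations linked at the top of this section for more detailed guidance on `KMP_AFFINITY` and `KMP_BLOCKTIME`.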
2. Deploy the PyTorchJob configuration file using `kubectl` to start training.

   ```shell
   kubectl create -f pytorch.yaml -n ${NAMESPACE}
   ```

   The job creates a pod that runs the container from Deep Learning Containers referenced in the `spec.containers.image` field of the YAML file, under the container name `pytorch`.

   You should see the following output.

   ```
   pytorchjob.kubeflow.org/pytorch-training created
   ```
3. Check the status.

   The name of the job, `pytorch-training`, appears in the status. It might take some time for the job to reach a `Running` state. Run the following `watch` command to monitor the state of the job.

   ```shell
   kubectl get pods -n ${NAMESPACE} -w
   ```

   You should see the following output.

   ```
   NAME               READY   STATUS    RESTARTS   AGE
   pytorch-training   0/1     Running   8          19m
   ```
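If the pod never reaches `Running` (for example, it stays `Pending` or restarts repeatedly), the standard kubectl triage commands apply. A sketch, assuming the worker pod follows the training operator's `<job-name>-worker-0` naming convention:

```shell
# Describe the pod to see scheduling events (e.g. insufficient CPU on the node)
kubectl describe pod pytorch-training-worker-0 -n ${NAMESPACE}

# List recent events in the namespace, oldest first
kubectl get events -n ${NAMESPACE} --sort-by=.metadata.creationTimestamp
```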
4. Monitor your PyTorchJob.

   Check the logs to watch the training progress.

   ```shell
   kubectl logs pytorch-training-worker-0 -n ${NAMESPACE}
   ```

   You should see something similar to the following output.

   ```
   Cloning into 'examples'...
   Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
   9920512it [00:00, 40133996.38it/s]
   Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw
   Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
   Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw
   32768it [00:00, 831315.84it/s]
   Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
   1654784it [00:00, 13019129.43it/s]
   Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw
   Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
   8192it [00:00, 337197.38it/s]
   Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw
   Processing...
   Done!
   Train Epoch: 1 [0/60000 (0%)]    Loss: 2.300039
   Train Epoch: 1 [640/60000 (1%)]  Loss: 2.213470
   Train Epoch: 1 [1280/60000 (2%)] Loss: 2.170460
   Train Epoch: 1 [1920/60000 (3%)] Loss: 2.076699
   Train Epoch: 1 [2560/60000 (4%)] Loss: 1.868078
   Train Epoch: 1 [3200/60000 (5%)] Loss: 1.414199
   Train Epoch: 1 [3840/60000 (6%)] Loss: 1.000870
   ```
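To stream the training log continuously instead of taking a one-shot snapshot, `kubectl logs` accepts the `-f` (follow) flag:

```shell
kubectl logs -f pytorch-training-worker-0 -n ${NAMESPACE}
```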
5. Monitor the job state.

   Run the following command to refresh the job state. When the status changes to `Succeeded`, the training job is done.

   ```shell
   watch -n 5 kubectl get pytorchjobs pytorch-training -n ${NAMESPACE}
   ```

See Cleanup for information on cleaning up a cluster after you are done using it.
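As an alternative to polling with `watch`, you can block until the job finishes with `kubectl wait`. A sketch, assuming the training operator sets a `Succeeded` condition on the PyTorchJob (the Kubeflow convention):

```shell
kubectl wait --for=condition=Succeeded pytorchjob/pytorch-training \
  -n ${NAMESPACE} --timeout=30m
```

The command exits nonzero if the condition is not met within the timeout, which makes it convenient in scripts.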
TensorFlow CPU training
Your deployment of Kubeflow on Amazon comes with TFJob. This tutorial guides you through training a classification model on MNIST with Keras.
To create a TFJob, follow these steps.

1. Create the job configuration file.

   Open `vi` or `vim`, then copy and paste the following content. Save this file as `tf.yaml`.

   ```yaml
   apiVersion: kubeflow.org/v1
   kind: TFJob
   metadata:
     name: tensorflow-training
   spec:
     tfReplicaSpecs:
       Worker:
         restartPolicy: OnFailure
         template:
           metadata:
             annotations:
               sidecar.istio.io/inject: "false"
           spec:
             containers:
             - name: tensorflow
               image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.12.0-cpu-py310-ubuntu20.04-ec2
               command: ["/bin/sh", "-c"]
               args: ["git clone https://github.com/keras-team/keras-io.git && python keras-io/examples/vision/mnist_convnet.py"]
   ```
2. To start training, deploy the TFJob configuration file using `kubectl`.

   ```shell
   kubectl create -f tf.yaml -n ${NAMESPACE}
   ```

   The job creates a pod that runs the container from Deep Learning Containers referenced in the `spec.containers.image` field of the YAML file, under the container name `tensorflow`.

   You should see the following output.

   ```
   tfjob.kubeflow.org/tensorflow-training created
   ```
3. Check the status.

   The name of the job, `tensorflow-training`, appears in the status. It might take some time for the job to reach a `Running` state. Run the following `watch` command to monitor the state of the job.

   ```shell
   kubectl get pods -n ${NAMESPACE} -w
   ```

   You should see the following output.

   ```
   NAME                  READY   STATUS    RESTARTS   AGE
   tensorflow-training   0/1     Running   8          19m
   ```
4. Monitor your TFJob.

   Check the logs to watch the training progress.

   ```shell
   kubectl logs tensorflow-training-worker-0 -n ${NAMESPACE}
   ```

   You should see something similar to the following output.

   ```
   Cloning into 'keras'...
   Using TensorFlow backend.
   Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
       8192/11490434 [..............................] - ETA: 0s
    6479872/11490434 [===============>..............] - ETA: 0s
    8740864/11490434 [=====================>........] - ETA: 0s
   11493376/11490434 [==============================] - 0s 0us/step
   x_train shape: (60000, 28, 28, 1)
   60000 train samples
   10000 test samples
   Train on 60000 samples, validate on 10000 samples
   Epoch 1/12
   2019-03-19 01:52:33.863598: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
   2019-03-19 01:52:33.867616: I tensorflow/core/common_runtime/process_util.cc:69] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
     128/60000 [..............................] - ETA: 10:43 - loss: 2.3076 - acc: 0.0625
     256/60000 [..............................] - ETA: 5:59 - loss: 2.2528 - acc: 0.1445
     384/60000 [..............................] - ETA: 4:24 - loss: 2.2183 - acc: 0.1875
     512/60000 [..............................] - ETA: 3:35 - loss: 2.1652 - acc: 0.1953
     640/60000 [..............................] - ETA: 3:05 - loss: 2.1078 - acc: 0.2422
   ...
   ```
5. Monitor the job state.

   Run the following command to refresh the job state. When the status changes to `Succeeded`, the training job is done.

   ```shell
   watch -n 5 kubectl get tfjobs tensorflow-training -n ${NAMESPACE}
   ```

See Cleanup for information on cleaning up a cluster after you are done using it.
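Independently of full cluster cleanup, the two training jobs created in this section can be removed by deleting the manifests you applied:

```shell
# Remove the PyTorchJob and TFJob created above (and their worker pods)
kubectl delete -f pytorch.yaml -n ${NAMESPACE}
kubectl delete -f tf.yaml -n ${NAMESPACE}
```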
Next steps
To learn about CPU-based inference on Kubeflow on Amazon using PyTorch or TensorFlow with Deep Learning Containers, see Inference.