Kubeflow on Amazon Setup
This section provides installation instructions to set up a deep learning environment using Amazon Deep Learning Containers with Kubeflow on Amazon, an open source distribution of Kubeflow. After you finish Kubeflow on Amazon setup, you can continue with training tutorials in this series.
Deploy Kubeflow on Amazon
To deploy Kubeflow on Amazon, follow the Vanilla deployment option.
If you deployed a GPU cluster following the previous instructions, the NVIDIA device plug-in for Kubernetes is already installed. You do not need any additional setup.
Note
The following tutorials use the Vanilla version of Kubeflow on Amazon as an example. However, you can run all training and inference tutorials in this Kubeflow on Amazon section with any other deployment option of Kubeflow on Amazon.
For information about setting up and configuring Amazon RDS, Amazon S3, and Amazon Cognito resources as part of your Kubeflow on Amazon deployment, see Deployment options.
After you have set up your Amazon EKS cluster, you can verify that your context points to your cluster in the following section.
Verify cluster connection
These steps show how to verify your context. This is to make sure that you interact with the correct cluster.
-
First, confirm that the cluster is active by running the following command.
aws eks --region <region> describe-cluster --name <cluster-name> --query cluster.status
You should see the following output.
"ACTIVE"
-
To check your current context, run this command. The current-context field in the output should contain your cluster name.
kubectl config view
If your current-context is not the cluster you want to interact with, run the following command to update it. For more information about updating your kubeconfig, visit the Amazon EKS documentation.
aws eks update-kubeconfig --region <region> --name <cluster-name>
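The two checks above can be combined into a small script. The following is a minimal sketch, not an official tool: the function names cluster_status, context_matches_cluster, and verify_cluster_connection are hypothetical helpers introduced here for illustration. It assumes that the AWS CLI and kubectl are installed and configured, and it uses kubectl config current-context to read only the context name instead of the full kubectl config view output.

```shell
# Sketch only: these helper names are hypothetical, not part of the AWS CLI or kubectl.

# Print the cluster status (for example ACTIVE) as plain text.
cluster_status() {
  local region="$1" cluster="$2"
  aws eks --region "$region" describe-cluster --name "$cluster" \
    --query cluster.status --output text
}

# Succeed only if the current kubectl context contains the cluster name.
context_matches_cluster() {
  local cluster="$1" context
  context="$(kubectl config current-context)" || return 1
  case "$context" in
    *"$cluster"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Check that the cluster is ACTIVE, then update kubeconfig if the
# current context does not point to it.
verify_cluster_connection() {
  local region="$1" cluster="$2" status
  status="$(cluster_status "$region" "$cluster")" || return 1
  if [ "$status" != "ACTIVE" ]; then
    echo "Cluster $cluster is $status, not ACTIVE" >&2
    return 1
  fi
  if ! context_matches_cluster "$cluster"; then
    aws eks update-kubeconfig --region "$region" --name "$cluster" || return 1
  fi
  echo "Context points to $cluster"
}
```

After loading these functions into your shell, you could run verify_cluster_connection <region> <cluster-name>. The substring match mirrors the guidance that current-context should contain your cluster name. Tighten it if several of your clusters share a name prefix.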
After you have deployed Kubeflow on Amazon and updated your current context, verify that your Kubeflow user profile uses the right namespace in the following section.
Verify your namespace
These steps show how to verify that your active Kubeflow user profile uses the namespace kubeflow-user-example-com. All tutorials in this series run in this namespace.
-
Note
In Kubeflow, all namespaces should be created via profiles. The Kubeflow on Amazon Vanilla installation creates a user profile with the namespace kubeflow-user-example-com by default.
Ensure that a namespace named kubeflow-user-example-com exists by running the following command.
kubectl get namespace
If the namespace does not appear in the output, create a new Kubeflow profile as follows.
-
Open vi or vim, then copy and paste the following content. Save this profile description file as profile.yaml. Make sure to replace the email under owner.name with your email.
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  # replace with the name of profile you want, this is the user's namespace name
  name: kubeflow-user-example-com
spec:
  owner:
    kind: User
    # replace with the email of the user
    name: user@example.com
-
Run the following command to create the corresponding profile resource.
kubectl apply -f profile.yaml
-
Export the NAMESPACE variable.
export NAMESPACE=kubeflow-user-example-com
We refer to this namespace as the variable ${NAMESPACE} in all Kubeflow on Amazon tutorials.
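If you prefer to script the namespace check and profile creation instead of editing profile.yaml by hand, the steps above can be sketched as follows. The function names namespace_exists, render_profile, and ensure_profile are hypothetical helpers introduced for illustration. The manifest is the same Profile shown earlier, and the sketch assumes that kubectl already points to the correct cluster.

```shell
# Sketch only: these helper names are hypothetical, not part of kubectl or Kubeflow.

# Succeed if the given namespace exists in the current cluster.
namespace_exists() {
  kubectl get namespace "$1" >/dev/null 2>&1
}

# Print a Profile manifest for the given namespace ($1) and owner email ($2).
render_profile() {
  cat <<EOF
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: $1
spec:
  owner:
    kind: User
    name: $2
EOF
}

# Create the profile only when the namespace is missing.
ensure_profile() {
  local namespace="$1" email="$2"
  if namespace_exists "$namespace"; then
    echo "Namespace $namespace already exists"
  else
    # "-f -" tells kubectl apply to read the manifest from standard input.
    render_profile "$namespace" "$email" | kubectl apply -f -
  fi
}
```

For example, after exporting NAMESPACE you could run ensure_profile "${NAMESPACE}" user@example.com, replacing the email with your own. Piping the manifest to kubectl apply avoids leaving a temporary file behind, at the cost of not keeping profile.yaml for later review.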
Next steps
Now that you have finished the Kubeflow on Amazon setup, you can continue with the training and inference tutorials.
To learn about training and inference with Deep Learning Containers on Kubeflow on Amazon, see the Training or Inference guides.
Cleanup
This section provides cleanup instructions after you have finished running your tutorials.
Clean Jobs
You can delete a specific training job when you are done running an example. To list the jobs of a specific type (PyTorchJob, MPIJob, or TFJob) running in a given namespace, run the following command, replacing job_type with the type you want to list.
kubectl get job_type -n ${NAMESPACE}
Retrieve the name of the job you want to delete, then run the following command.
kubectl delete job_type job_name -n ${NAMESPACE}
Your output should look similar to the following.
job_type.kubeflow.org "job_name" deleted
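The list and delete commands above can be wrapped in small helpers that reject unexpected resource types before anything is deleted. This is a minimal sketch under stated assumptions: the function names are hypothetical, NAMESPACE is assumed to be exported as described earlier, and the lowercase names pytorchjob, mpijob, and tfjob are assumed to be the resource names that the Kubeflow training CRDs register in your cluster.

```shell
# Sketch only: these helper names are hypothetical, not part of kubectl or Kubeflow.

# Lowercase a job type so that PyTorchJob and pytorchjob are treated alike.
normalize_job_type() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

# Accept only the job types named in this section.
is_supported_job_type() {
  case "$(normalize_job_type "$1")" in
    pytorchjob|mpijob|tfjob) return 0 ;;
    *) return 1 ;;
  esac
}

# List the jobs of one type in ${NAMESPACE}, one resource name per line.
list_jobs() {
  if ! is_supported_job_type "$1"; then
    echo "Unsupported job type: $1" >&2
    return 1
  fi
  kubectl get "$(normalize_job_type "$1")" -n "${NAMESPACE}" -o name
}

# Delete a single named job of the given type in ${NAMESPACE}.
delete_job() {
  if ! is_supported_job_type "$1"; then
    echo "Unsupported job type: $1" >&2
    return 1
  fi
  kubectl delete "$(normalize_job_type "$1")" "$2" -n "${NAMESPACE}"
}
```

For example, list_jobs PyTorchJob prints the matching jobs, and delete_job PyTorchJob job_name removes one of them. The type check is a guard against typos that would otherwise delete a different kind of resource with the same name.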
Uninstall Kubeflow on Amazon
The Kubeflow on Amazon documentation provides uninstall commands. Make sure that you run the command that corresponds to your deployment method: Kustomize or Helm.
Delete an Amazon EKS cluster
The Kubeflow on Amazon documentation provides a single command to delete your entire Amazon EKS cluster.