Kubeflow on Amazon Setup - Amazon Deep Learning Containers

Kubeflow on Amazon Setup

This section provides installation instructions to set up a deep learning environment using Amazon Deep Learning Containers with Kubeflow on Amazon, an open source distribution of Kubeflow. After you finish the Kubeflow on Amazon setup, you can continue with the training tutorials in this series.

Deploy Kubeflow on Amazon

To deploy Kubeflow on Amazon, follow the Vanilla deployment option in the Kubeflow on Amazon documentation. Make sure that you follow all the prerequisites. The installation instructions guide you through creating an Amazon EKS cluster before deploying Kubeflow on Amazon.

If you deployed a GPU cluster following the previous instructions, the NVIDIA device plug-in for Kubernetes is already installed. You do not need any additional setup.

Note

The following tutorials use the Vanilla version of Kubeflow on Amazon as an example. However, you can run all training and inference tutorials in this Kubeflow on Amazon section with any other deployment option of Kubeflow on Amazon.

For information about setting up and configuring Amazon RDS, Amazon S3, and Amazon Cognito resources as part of your Kubeflow on Amazon deployment, see Deployment options in the Kubeflow on Amazon documentation.

After you have set up your Amazon EKS cluster, verify that your context points to your cluster, as described in the following section.

Verify cluster connection

These steps show how to verify your context so that you interact with the correct cluster.

  1. First, confirm that the cluster is active by running the following command.

    aws eks --region <region> describe-cluster --name <cluster-name> --query cluster.status

    You should see the following output.

    "ACTIVE"
  2. To check your current context, run this command. The current-context field in the output should contain your cluster name.

    kubectl config view

    If your current-context is not the cluster you want to interact with, run the following command to update it. For more information about updating your kubeconfig, see the Amazon EKS documentation.

    aws eks update-kubeconfig --region <region> --name <cluster-name>
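
The two checks above can also be scripted. The following is a minimal sketch, not part of the Kubeflow on Amazon documentation. The function names cluster_is_active and context_matches_cluster are hypothetical, and the sketch assumes that the AWS CLI and kubectl are installed and configured. It uses the --output text option so that the status is printed without quotes, and kubectl config current-context to print only the name of the active context.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not from the official docs.

# Succeeds only when the EKS cluster reports the status ACTIVE.
cluster_is_active() {
  local region="$1" cluster="$2" status
  status="$(aws eks --region "$region" describe-cluster --name "$cluster" \
    --query cluster.status --output text)" || return 1
  [ "$status" = "ACTIVE" ]
}

# Succeeds only when the current kubectl context contains the cluster name.
context_matches_cluster() {
  local cluster="$1" context
  context="$(kubectl config current-context)" || return 1
  case "$context" in
    *"$cluster"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage (replace the placeholders with your own values):
#   cluster_is_active <region> <cluster-name> && echo "cluster is active"
#   context_matches_cluster <cluster-name> || echo "update your kubeconfig"
```

If the context check fails, run the aws eks update-kubeconfig command shown in step 2.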

After you have deployed Kubeflow on Amazon and updated your current context, verify that your Kubeflow user profile uses the right namespace, as described in the following section.

Verify your namespace

These steps show how to verify that your active Kubeflow user profile uses the namespace kubeflow-user-example-com. All tutorials in this series run in this namespace.

  1. Ensure that a namespace named kubeflow-user-example-com exists by running the following command.

    kubectl get namespace

    Note

    In Kubeflow, all namespaces should be created via profiles. Kubeflow on Amazon Vanilla installation creates a user profile with the namespace kubeflow-user-example-com by default.

    If the namespace does not appear in the output, create a new Kubeflow profile as follows.

  2. Open vi or vim, then copy and paste the following content. Save this profile description file as profile.yaml. Make sure to replace the email under owner.name with your email.

    apiVersion: kubeflow.org/v1beta1
    kind: Profile
    metadata:
      # replace with the name of profile you want, this is the user's namespace name
      name: kubeflow-user-example-com
    spec:
      owner:
        kind: User
        # replace with the email of the user
        name: user@example.com
  3. Run the following command to create the corresponding profile resource.

    kubectl apply -f profile.yaml
  4. Export the NAMESPACE variable.

    export NAMESPACE=kubeflow-user-example-com

    We refer to this namespace as the variable ${NAMESPACE} in all Kubeflow on Amazon tutorials.
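
The namespace steps above can be sketched as a script that avoids opening an editor. This is an illustrative sketch under stated assumptions, not the documented procedure. The function names write_profile and ensure_profile_namespace are hypothetical, and the sketch relies on kubectl get namespace with a name argument returning a non-zero exit status when the namespace does not exist.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not from the official docs.

# Write a Kubeflow Profile manifest for the given namespace and owner email.
write_profile() {
  local namespace="$1" owner_email="$2" out_file="$3"
  cat > "$out_file" <<EOF
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: ${namespace}
spec:
  owner:
    kind: User
    name: ${owner_email}
EOF
}

# Apply the profile only when the namespace is missing.
ensure_profile_namespace() {
  local namespace="$1" profile_file="$2"
  if kubectl get namespace "$namespace" >/dev/null 2>&1; then
    echo "namespace ${namespace} already exists"
  else
    echo "namespace ${namespace} not found; applying ${profile_file}"
    kubectl apply -f "$profile_file"
  fi
}

# Example usage (replace the email with your own):
#   export NAMESPACE=kubeflow-user-example-com
#   write_profile "$NAMESPACE" user@example.com profile.yaml
#   ensure_profile_namespace "$NAMESPACE" profile.yaml
```

Checking before applying keeps the script safe to rerun, because an existing profile namespace is left untouched.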

Next steps

Now that you have finished the Kubeflow on Amazon setup, you can continue with the training and inference tutorials.

To learn about training and inference with Deep Learning Containers on Kubeflow on Amazon, see the Training or Inference guides.

Cleanup

This section provides cleanup instructions to follow after you have finished running the tutorials.

Clean Jobs

You can delete a specific training job when you are done running an example. To list the jobs of a specific type (PyTorchJob, MPIJob, TFJob) running in a given namespace, run the following command, replacing job_type with the type of job.

kubectl get job_type -n ${NAMESPACE}

Retrieve the name of the job you want to delete, then run the following command.

kubectl delete job_type job_name -n ${NAMESPACE}

Your output should look similar to the following.

job_type.kubeflow.org "job_name" deleted
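
To review jobs of all three types before deleting anything, the commands above can be wrapped in a loop. The following is a minimal sketch. The function names list_training_jobs and delete_training_job are hypothetical, and the lowercase plural resource names pytorchjobs, mpijobs, and tfjobs are an assumption about how the job types are registered in your cluster. Adjust them if kubectl reports an unknown resource type.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and resource names are assumptions.

# List jobs of each training job type in the given namespace.
list_training_jobs() {
  local namespace="$1" job_type
  for job_type in pytorchjobs mpijobs tfjobs; do
    echo "== ${job_type} =="
    kubectl get "$job_type" -n "$namespace" \
      || echo "could not list ${job_type}" >&2
  done
}

# Delete one job by type and name.
delete_training_job() {
  local job_type="$1" job_name="$2" namespace="$3"
  kubectl delete "$job_type" "$job_name" -n "$namespace"
}

# Example usage (job_type and job_name are placeholders):
#   list_training_jobs "${NAMESPACE}"
#   delete_training_job job_type job_name "${NAMESPACE}"
```

The loop continues past a job type that cannot be listed, so a missing resource type does not hide the others.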

Uninstall Kubeflow on Amazon

The Kubeflow on Amazon documentation provides uninstall commands. Make sure that you run the command that corresponds to your deployment method: Kustomize, Helm, or Terraform.

Delete an Amazon EKS cluster

The Kubeflow on Amazon documentation provides a single command to delete your entire Amazon EKS cluster.