Kubernetes is becoming an increasingly popular platform for AI deployments. If your business runs AI/ML workloads on Kubernetes, chances are you already use tools like Kubeflow to reduce complexity, costs and deployment time. If not, you may be missing out!
With AI/ML among the hottest topics in tech, GPUs play a critical role in the space. NVIDIA, a prominent player in the GPU space, is one of the top choices for most stakeholders in the field. NVIDIA takes its commitment a step further with the launch of the GPU Operator open-source project at Mobile World Congress LA.
What is the GPU Operator?
The GPU is a high-performance compute resource in the cluster, and several components must be installed before application workloads can be deployed onto it. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin, the container runtime, etc. The GPU Operator manages these resources in a Kubernetes cluster and automates the tasks involved in bootstrapping GPU nodes.
The NVIDIA GPU Operator currently supports and has been validated with the following:
● Pascal+ GPUs are supported (incl. Tesla V100 and T4)
● Kubernetes v1.13+
- Canonical’s Charmed Kubernetes v1.15 has been tested with, and supports, the NVIDIA GPU Operator. The GPU Operator works out of the box with Canonical’s Charmed Kubernetes and is supported from day one.
- Note: Helm may fail to initialize in Kubernetes v1.16. The Helm installation step below includes a workaround for this. More details can be found in the GitHub issue.
● Helm 2
● Ubuntu 18.04.3 LTS
The GPU Operator includes the following NVIDIA components:
● Docker CE 19.03.2
● NVIDIA Container Toolkit 1.0.5
● NVIDIA Kubernetes Device Plugin 1.0.0-beta4
● NVIDIA Tesla Driver 418.87.01
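The minimum versions listed above (for example Kubernetes v1.13+) can be checked with a small helper. The following is only a sketch: the function name version_ge is hypothetical, it relies on the version sort (sort -V) found in GNU coreutils, and it expects plain version strings with any leading "v" already stripped.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
# Hypothetical helper name; not part of the GPU Operator tooling.
version_ge() {
  # After a version sort, the smaller version comes first; $1 >= $2
  # exactly when that first line is $2.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example, version_ge 1.15 1.13 succeeds, while version_ge 1.9 1.13 fails because a version sort places 1.9 before 1.13.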
The GPU Operator has a few prerequisites:
- It requires a fresh configuration of nodes – nodes must not be pre-configured with NVIDIA components (driver, container runtime, device plugin).
- The i2c_core and ipmi_msghandler kernel modules need to be loaded.
The following command ensures these modules are loaded:
$ sudo modprobe -a i2c_core ipmi_msghandler
Modules loaded this way do not persist across reboots. To load them at every boot, add them to a modules-load.d config file as shown:
$ echo -e "i2c_core\nipmi_msghandler" | sudo tee /etc/modules-load.d/driver.conf
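To confirm that both modules are actually loaded, one option is to compare the required names against /proc/modules. The helper below is a sketch with a hypothetical name (missing_modules). It reads a /proc/modules-style listing on stdin so it can be tried on sample text. Note the assumption that the modules are built as loadable modules: a module compiled into the kernel does not appear in /proc/modules and would be reported as missing.

```shell
# Print every module named in the arguments that is absent from a
# /proc/modules-style listing supplied on stdin (module name is column 1).
# Hypothetical helper; runs in a subshell so its variables stay local.
missing_modules() (
  loaded=$(awk '{print $1}')
  for mod in "$@"; do
    # -x: whole-line match, -F: fixed string, -q: quiet
    printf '%s\n' "$loaded" | grep -qxF "$mod" || printf '%s\n' "$mod"
  done
)
```

On a node this could be run as: missing_modules i2c_core ipmi_msghandler < /proc/modules. Empty output means both names were found.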
- Node Feature Discovery (NFD) is required on each node. By default, the NFD master and worker are deployed automatically.
If NFD is already running in the cluster prior to the deployment of the operator, set the variable nfd.enabled=false at the helm install step:
$ helm install --devel --set nfd.enabled=false nvidia/gpu-operator -n test-operator
See notes on NFD setup
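One rough way to tell whether NFD is already active is to look for the feature.node.kubernetes.io/ label prefix that NFD applies to nodes. The sketch below assumes the labels come from kubectl get nodes --show-labels, and the function name nfd_helm_flag is hypothetical. Treat it as a heuristic only: labels can remain on nodes after NFD has been removed.

```shell
# Read `kubectl get nodes --show-labels` output on stdin and print the Helm
# flag that disables the bundled NFD when NFD-style labels are already present.
# Hypothetical helper; prints nothing when no such label is found.
nfd_helm_flag() {
  if grep -q 'feature\.node\.kubernetes\.io/'; then
    printf '%s\n' '--set nfd.enabled=false'
  fi
}
```

Usage might look like: kubectl get nodes --show-labels | nfd_helm_flag, with any printed flag then added to the helm install command.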
Install Helm
$ curl -L https://git.io/get_helm.sh | bash
Create a service account for Helm:
$ kubectl create serviceaccount -n kube-system tiller
$ kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
$ helm init --service-account tiller --wait
Note that if you already have Helm deployed in your cluster and you are adding a new node, run this instead:
$ helm init --client-only
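The commands in this post (helm init, Tiller, and -n as the release name) are specific to Helm 2, so it can be worth checking the client version before continuing. The sketch below parses a version string of the form Helm prints with helm version --short --client, for example "Client: v2.14.3+g0e7f3b6". That output format is an assumption and the function name helm_major is hypothetical.

```shell
# Extract the major version from `helm version --short --client` style
# output on stdin, e.g. "Client: v2.14.3+g0e7f3b6" gives "2".
# Hypothetical helper; the input format is assumed, not guaranteed.
helm_major() {
  sed -n 's/^.*v\([0-9][0-9]*\)\.[0-9].*$/\1/p' | head -n 1
}
```

A guard such as [ "$(helm version --short --client | helm_major)" = "2" ] could then stop a script early when a different major version is installed.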
Install the GPU Operator
Note that after running this command, NFD will be automatically deployed.
$ helm install --devel nvidia/gpu-operator -n test-operator --wait
$ kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/manifests/cr/sro_cr_sched_none.yaml
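The install command refers to the chart as nvidia/gpu-operator, which presumes that a Helm repository named nvidia has already been registered. The sketch below wraps the steps in one function and adds that registration. The repository URL is an assumption not stated in this post (it matches the host used for the notebook example), so confirm it against the project README. The function name install_gpu_operator is hypothetical.

```shell
# Register the chart repository, then install the operator and apply the
# custom resource as in the steps above. Each step runs only if the
# previous one succeeded.
# ASSUMPTION: the repository URL below is not given in this post.
install_gpu_operator() {
  helm repo add nvidia https://nvidia.github.io/gpu-operator &&
  helm repo update &&
  helm install --devel nvidia/gpu-operator -n test-operator --wait &&
  kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/manifests/cr/sro_cr_sched_none.yaml
}
```

Defining the function does nothing by itself; calling install_gpu_operator runs the four commands in order.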
To check the gpu-operator version:
$ helm ls
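Beyond the release listing, it can help to confirm that the operator's pods have come up. The following sketch filters kubectl get pods --no-headers output (columns NAME, READY, STATUS, ...) for pods that are not yet Running or Completed. The function name pods_not_ready is hypothetical, and the namespace in the usage note is an assumption that may differ between operator versions.

```shell
# Read `kubectl get pods --no-headers` output on stdin (NAME READY STATUS ...)
# and print the name of each pod whose STATUS is neither Running nor Completed.
# Hypothetical helper; expects single-namespace output (no NAMESPACE column).
pods_not_ready() {
  awk '$3 != "Running" && $3 != "Completed" {print $1}'
}
```

For instance: kubectl get pods -n gpu-operator-resources --no-headers | pods_not_ready (namespace assumed). Empty output suggests every listed pod has started.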
Running a Sample GPU Application
Create a TensorFlow notebook example:
$ kubectl apply -f https://nvidia.github.io/gpu-operator/notebook-example.yml
Grab the token from the pod once it is created
$ kubectl get pod tf-notebook
$ kubectl logs tf-notebook
The pod log prints a URL containing the token to log in with when you connect for the first time.
You can now access the notebook at http://localhost:30001/?token=MY_TOKEN, replacing MY_TOKEN with the token from the log.
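Copying the token by hand can be avoided with a little text processing. The sketch below assumes the log contains a URL with a token= query parameter made of letters and digits, as Jupyter typically prints. Both function names are hypothetical, and port 30001 is taken from the URL in this post.

```shell
# Pull the first login token out of notebook log text supplied on stdin.
# Assumes the token appears as "?token=<letters and digits>" in a URL.
extract_token() {
  sed -n 's/.*[?&]token=\([0-9A-Za-z]*\).*/\1/p' | head -n 1
}

# Build the notebook URL used in this post for a given token.
notebook_url() {
  printf 'http://localhost:30001/?token=%s\n' "$1"
}
```

Usage could be: notebook_url "$(kubectl logs tf-notebook | extract_token)".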
NVIDIA and Canonical will continue partnering to improve the AI/ML space and enable innovators. One area of interest is extending the GPU Operator to MicroK8s. MicroK8s takes Kubernetes simplification one step further: a lightweight Kubernetes distribution with Kubeflow, GPUs, Helm and the GPU Operator all in one package. Get started in seconds!
If you find a bug, have technical issues or would like to contribute to the NVIDIA GPU Operator, please visit the official GitHub page.
To report issues with or contribute to Canonical’s Kubernetes, please visit the GitHub page. You can also reach out to us on Twitter @canonical @ubuntu.
Canonical and NVIDIA look forward to your valuable feedback!