NVIDIA GPU Operator – Simplifying AI/ML Deployments on the Canonical Platform

anaqvi

on 22 October 2019

Kubernetes is becoming an increasingly popular platform for AI deployments. If your business runs AI/ML workloads on Kubernetes, chances are you already use tools like Kubeflow to reduce complexity, cost and deployment time. If not, you may be missing out!

With AI/ML among the most talked-about topics in tech, GPUs play a critical role in the space. NVIDIA, a prominent player in the GPU market, is one of the top choices for most stakeholders in the field. NVIDIA takes its commitment a step further with the launch of the open-source GPU Operator project at Mobile World Congress LA.

What is the GPU Operator?

A GPU is a high-performance compute resource in the cluster, and several components must be installed on a node before application workloads can be deployed onto it. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin, the container runtime and others. With the GPU Operator, you can manage GPU resources in a Kubernetes cluster and automate the tasks needed to bootstrap GPU nodes.

Supported Platforms

The NVIDIA GPU Operator currently supports, and has been validated with, the following:

●     Pascal+ GPUs (incl. Tesla V100 and T4)

●     Kubernetes v1.13+

  • Canonical’s Charmed Kubernetes v1.15 has been tested with, and supports, the NVIDIA GPU Operator. The GPU Operator works out of the box with Canonical’s Charmed Kubernetes and is supported from day one.

– Note: Helm may fail to initialize in Kubernetes v1.16. The Helm installation step below includes a workaround for this. More details can be found in the GitHub issue.

●     Helm 2

●     Ubuntu 18.04.3 LTS

The GPU Operator includes the following NVIDIA components:

●     Docker CE 19.03.2

●     NVIDIA Container Toolkit 1.0.5

●      NVIDIA Kubernetes Device Plugin 1.0.0-beta4

●      NVIDIA Tesla Driver 418.87.01

Set-Up

Prerequisites

The GPU Operator has a few prerequisites:

  • It requires a fresh configuration of nodes – nodes must not be pre-configured with NVIDIA components (driver, container runtime, device plugin).
  • The i2c_core and ipmi_msghandler kernel modules must be loaded.

The following command ensures these modules are loaded:

$ sudo modprobe -a i2c_core ipmi_msghandler

Modules loaded this way do not persist across a reboot. To load them at every boot, add them to a configuration file under /etc/modules-load.d as shown:

$ echo -e "i2c_core\nipmi_msghandler" | sudo tee /etc/modules-load.d/driver.conf
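Before moving on, it can be useful to confirm that both modules are actually loaded. The following is a minimal sketch rather than part of the official procedure: the function name missing_modules is hypothetical, and it simply reads the module names from /proc/modules. A module compiled directly into the kernel does not appear in that file, so it would be reported as missing here even though it is available.

```shell
# Print each required kernel module that is not listed as loaded.
# $1: file in /proc/modules format (defaults to /proc/modules itself).
# missing_modules is a hypothetical helper name used for illustration.
missing_modules() {
  local modules_file="${1:-/proc/modules}"
  local mod
  for mod in i2c_core ipmi_msghandler; do
    # The first field of each /proc/modules line is the module name.
    if ! awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$modules_file"; then
      echo "$mod"
    fi
  done
}
```

Calling missing_modules with no argument checks the running node. Empty output means both modules were found.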

  • Node Feature Discovery (NFD) is required on each node. By default, the NFD master and worker are deployed automatically.

If NFD is already running in the cluster prior to the deployment of the operator, set the variable nfd.enabled=false at the helm install step:

$ helm install --devel --set nfd.enabled=false nvidia/gpu-operator -n test-operator

See notes on NFD setup
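One way to guess whether NFD is already running is to look for the labels it writes on nodes, which use the feature.node.kubernetes.io/ prefix. The sketch below assumes that heuristic. The function names nfd_labels_present and nfd_helm_flags are hypothetical, and labels can remain on a node after NFD has been removed, so treat the result as a hint rather than proof.

```shell
# Succeeds if any node label on stdin carries the NFD label prefix.
# Expects the output of: kubectl get nodes --show-labels
# (hypothetical helper names, for illustration only)
nfd_labels_present() {
  grep -q 'feature\.node\.kubernetes\.io/'
}

# Print the extra helm flag to pass when NFD labels were found,
# and nothing otherwise.
nfd_helm_flags() {
  if nfd_labels_present; then
    echo "--set nfd.enabled=false"
  fi
}
```

For example, kubectl get nodes --show-labels | nfd_helm_flags prints the flag to add to the helm install command, or nothing if no NFD labels are present.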

Install Helm

$ curl -L https://git.io/get_helm.sh | bash

Create a service account for Helm

$ kubectl create serviceaccount -n kube-system tiller

$ kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller

Initialize Helm

$ helm init --service-account tiller --wait

Note that if Helm is already deployed in your cluster and you are adding a new node, run this instead:

$ helm init --client-only
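The supported platforms list names Helm 2, and the install commands refer to a chart repository called nvidia that these steps do not show being added. The sketch below covers both points under stated assumptions: the function names are hypothetical, and the default repository URL is an assumption based on the host that serves the notebook example later in this post, so check it against the GPU Operator documentation before relying on it.

```shell
# Succeeds if the version string on stdin reports a Helm 2 client.
# Expects the output of: helm version --client --short
# (hypothetical helper name, for illustration only)
is_helm2_client() {
  grep -q '^Client: v2\.'
}

# Register the chart repository under the name used by the install commands.
# $1: repository URL. The default is an ASSUMPTION, not taken from this post.
add_nvidia_repo() {
  local url="${1:-https://nvidia.github.io/gpu-operator}"
  helm repo add nvidia "$url" && helm repo update
}
```

A possible sequence is helm version --client --short | is_helm2_client && add_nvidia_repo, which only touches the repository list when a Helm 2 client is detected.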

 

Install the GPU Operator

Note that after running this command, NFD will be automatically deployed.

$ helm install --devel nvidia/gpu-operator -n test-operator --wait

$ kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/manifests/cr/sro_cr_sched_none.yaml
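Once the operator has finished deploying its components, GPU nodes should advertise the nvidia.com/gpu resource through the device plugin. The following sketch, with the hypothetical function name gpu_nodes, filters a kubectl node listing down to the nodes that report at least one allocatable GPU. It assumes the listing was produced with the custom-columns expression shown in the comment.

```shell
# Print the names of nodes that advertise at least one allocatable GPU.
# Expects on stdin the output of:
#   kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
# (gpu_nodes is a hypothetical helper name, for illustration only)
gpu_nodes() {
  # Skip the header row; nodes without the resource show "<none>" in column 2.
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}
```

If the output is empty shortly after installation, the driver and device plugin pods may still be starting, so it is worth checking again after a few minutes.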

To check the gpu-operator version

$ helm ls

Running a Sample GPU Application

Create a TensorFlow notebook example

$ kubectl apply -f https://nvidia.github.io/gpu-operator/notebook-example.yml

Grab the token from the pod once it is created

$ kubectl get pod tf-notebook

$ kubectl logs tf-notebook

Use the following URL in your browser when you connect for the first time, to login with a token:

http://localhost:8888/?token=MY_TOKEN

You can now access the notebook on http://localhost:30001/?token=MY_TOKEN
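Copying the token by hand is easy to get wrong, so the lookup can be scripted. This is a minimal sketch with hypothetical function names (extract_token, notebook_url). It assumes the token consists only of letters and digits and that the notebook is reachable on port 30001 as stated above.

```shell
# Print the first Jupyter token found in the log text on stdin.
# Assumes the token is made up of letters and digits only.
# (hypothetical helper names, for illustration only)
extract_token() {
  grep -o 'token=[A-Za-z0-9]*' | head -n 1 | cut -d= -f2
}

# Build the notebook URL on the port quoted in this post (30001).
# Prints nothing and returns non-zero if no token was found.
notebook_url() {
  local token
  token="$(extract_token)"
  [ -n "$token" ] && echo "http://localhost:30001/?token=${token}"
}
```

For example, kubectl logs tf-notebook | notebook_url prints a URL that can be pasted straight into the browser.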

What’s next

NVIDIA and Canonical will continue partnering to improve the AI/ML space and enable innovators. One area of interest is extending the GPU Operator to MicroK8s. MicroK8s takes Kubernetes simplification one step further: a lightweight Kubernetes distribution with Kubeflow, GPUs, Helm and the GPU Operator all in one package. Get started in seconds!

Contributing

If you find a bug, have technical issues or would like to contribute to the NVIDIA GPU Operator, please visit the official GitHub page.

For issues with, or contributions to, Canonical’s Kubernetes, please visit the GitHub page. You can also reach out to us on Twitter @canonical @ubuntu.

Canonical and NVIDIA look forward to your valuable feedback!
