Building Kubeflow pipelines: Data science workflows on Kubernetes – Part 2

Rui Vasconcelos

on 2 July 2020

Tags: AI , deep learning , Kubeflow , machine learning , MLOps , pipeline


This blog series is part of the joint collaboration between Canonical and Manceps.
Visit our AI consulting and delivery services page to know more.

Introduction

Kubeflow Pipelines is a great way to build portable, scalable machine learning workflows. It is part of the Kubeflow project, which aims to reduce the complexity and time involved in training and deploying machine learning models at scale. For more on Kubeflow, read our Kubernetes for data science: meet Kubeflow post.

In this blog series, we demystify Kubeflow pipelines and showcase this method to produce reusable and reproducible data science. 🚀

In Part 1, we covered WHY Kubeflow brings the right standardization to data science workflows. Now, let’s see HOW you can accomplish that with Kubeflow Pipelines.

In Part 2 of this blog series, we’ll work on building your first Kubeflow Pipeline as you gain an understanding of how it’s used to deploy reusable and reproducible ML pipelines. 🚀

Now, it is time to get our hands dirty! 👨🏻‍🔬

Building your first Kubeflow pipeline

In this experiment, we will take the Fashion MNIST dataset and the Basic classification with TensorFlow example and turn them into a Kubeflow pipeline, so you can repeat the same process with any notebook or script you have already worked on.

You can follow the migration into a pipeline in this Jupyter notebook.

Ready? 🚀

Step 1: Deploy Kubeflow and access the dashboard

If you haven’t had the opportunity to launch Kubeflow, that is ok! You can deploy Kubeflow easily using MicroK8s by following the tutorial – Deploy Kubeflow on Ubuntu, Windows and macOS.

We recommend deploying Kubeflow on your workstation if you have a machine with 16GB of RAM or more. Otherwise, spin up a virtual machine with these resources (e.g. t2.xlarge EC2 instance) and follow the same deployment process.

You can find alternative deployment options here.

Step 2: Launch notebook server

Once you have access to the Kubeflow dashboard, setting up a Jupyter notebook server is fairly straightforward. You can follow the steps here.

Launch the server, wait a few seconds, and connect to it.

Step 3: Git clone example notebook

Once in the Notebook server, launch a new terminal from the menu on the right (New > Terminal).

In the terminal, download the notebook from GitHub:

$ git clone https://github.com/manceps/manceps-canonical.git

Now, open the “KF_Fashion_MNIST” notebook:

Jupyter notebook for this experiment – download here.

Step 4: Initiate Kubeflow pipelines SDK

Now that we’re on the same page, we can kickstart our project together in the browser. As you can see, the first section is adapted from the Basic classification with TensorFlow example. Let’s skip that and get on with converting this model into a running pipeline.

To ensure access to the packages needed through your Jupyter notebook instance, begin by installing Kubeflow Pipelines SDK (kfp) in the current userspace:

!pip install -q kfp --upgrade --user

Step 5: Convert Python scripts to Docker containers

The Kubeflow Pipelines SDK allows you to build lightweight components by defining Python functions and converting them using func_to_container_op.

To package your Python code inside containers, you define a standard Python function for each logical step in your pipeline. In this case, we have defined two functions: train and predict.

The train component will train, evaluate, and save our model.

The predict component takes the model and makes a prediction on an image from the test dataset.

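# Grab an image from the test dataset
img = test_images[image_number]

# Add a batch dimension: Keras models predict on batches, not single images
img = np.expand_dims(img, 0)

# Predict the label of the image
predictions = probability_model.predict(img)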

The code used in these components is in the second part of the Basic classification with TensorFlow example, in the “Build the model” section.
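To make the shape of such a function concrete, here is a minimal sketch of what a train step could look like, assuming the model from the TensorFlow tutorial. The epoch count and the way the output path is built are illustrative assumptions rather than the notebook’s exact code. The imports sit inside the function body because a lightweight component must be self-contained: only the function’s own source is packaged into the container.

```python
def train(data_path, model_file):
    """Train, evaluate, and save a Fashion MNIST classifier (illustrative sketch)."""
    # Imports live inside the function so they travel with the component.
    import os
    import tensorflow as tf

    (train_images, train_labels), (test_images, test_labels) = (
        tf.keras.datasets.fashion_mnist.load_data()
    )
    # Scale pixel values from 0-255 down to 0-1.
    train_images, test_images = train_images / 255.0, test_images / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

    model.fit(train_images, train_labels, epochs=10)  # epoch count is an assumption
    model.evaluate(test_images, test_labels, verbose=2)

    # Save to the shared path; model_file is assumed to end in .h5
    model.save(os.path.join(data_path, model_file))
```

Defining the function does not require TensorFlow; the import only runs when the component executes inside its container image.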

The final step in this section is to transform these functions into container components. You can do this with the func_to_container_op method as follows.

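import kfp.components as comp
# Wrap each function as a component that runs inside the given base image
train_op = comp.func_to_container_op(train, base_image='tensorflow/tensorflow:latest-gpu-py3')
predict_op = comp.func_to_container_op(predict, base_image='tensorflow/tensorflow:latest-gpu-py3')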

Step 6: Define Kubeflow pipeline

Kubeflow uses Kubernetes resources, which are defined using YAML templates. The Kubeflow Pipelines SDK allows you to define how your code is run without having to manually manipulate YAML files.

At compile time, the SDK creates a compressed YAML file that defines your pipeline. This file can later be reused or shared, making the pipeline both scalable and reproducible.

Start by instantiating a Kubeflow Pipelines client. It wraps the Kubeflow Pipelines API, allowing you to create experiments, and runs within those experiments, from the Jupyter notebook.

import kfp
client = kfp.Client()

We then define the pipeline name and description, which can be visualized on the Kubeflow dashboard.

Next, define the pipeline by adding the arguments that will be fed into it.

In this case, the arguments are the path where data will be written, the file where the model is to be stored, and an integer representing the index of an image in the test dataset.
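A minimal sketch of this step, assuming the v1 Kubeflow Pipelines SDK (later major versions changed these APIs). The pipeline name, description, and the build_steps callback are hypothetical placeholders; the three parameters mirror the arguments described above.

```python
def define_pipeline(build_steps):
    """Return a pipeline function whose body is supplied by `build_steps`.

    `build_steps` is a hypothetical callback that receives the three
    pipeline parameters and creates the volume and component steps.
    """
    # Assumes the v1 SDK; imported here so the sketch is self-contained.
    import kfp.dsl as dsl

    @dsl.pipeline(
        name="MNIST Pipeline",  # shown on the Kubeflow dashboard
        description="Trains a Fashion MNIST model and predicts one test image.",
    )
    def mnist_container_pipeline(data_path: str, model_file: str, image_number: int):
        build_steps(data_path, model_file, image_number)

    return mnist_container_pipeline
```

The decorator only attaches metadata; nothing is sent to the cluster until the pipeline is compiled and submitted.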

Step 7: Create a persistent volume

One additional concept we need is that of Persistent Volumes. Each pipeline component runs in its own container, so anything written to a container’s filesystem is gone once that step finishes. Without persistent volumes, we would also lose all the data if our notebook were terminated for any reason. kfp allows for the creation of persistent volumes using the VolumeOp object.

VolumeOp parameters include:

name – the name displayed for the volume creation operation in the UI.
resource_name – the name that other resources can use to reference the volume.
size – the size of the volume claim.
modes – the access mode for the volume (see the Kubernetes docs for more details on access modes).
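These parameters might be put together as follows. This is a sketch assuming the v1 SDK; the volume name, resource name, and 1Gi size are example values, and the helper function itself is hypothetical. A VolumeOp has to be created while a pipeline is being defined, so the helper is meant to be called from inside the pipeline function.

```python
def create_data_volume():
    """Create the volume-creation step; call from inside a pipeline function."""
    import kfp.dsl as dsl  # assumes the v1 SDK

    return dsl.VolumeOp(
        name="create_volume",         # label shown in the UI
        resource_name="data-volume",  # example name for the volume claim
        size="1Gi",                   # example size
        modes=dsl.VOLUME_MODE_RWO,    # ReadWriteOnce
    )
```

The returned object exposes the created volume as its volume attribute, which is what gets mounted into the first component.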

Step 8: Define pipeline components

It is now time to define your pipeline components and dependencies. We do this with ContainerOp, an object that defines a pipeline component from a container.

The train_op and predict_op components take the arguments that were declared in the original Python functions. At the end of each call we attach our volume, passing a dictionary of paths and associated Persistent Volumes to be mounted to the container before execution.

Notice that while train_op uses the vop.volume value in the pvolumes dictionary, the other components use the <Container_Op>.pvolume attribute of the preceding component, which ensures that the volume from the previous ContainerOp is used rather than a new one being created.

This inherently tells Kubeflow about our intended order of operations. Consequently, Kubeflow will only mount that volume once the previous <Container_Op> has completed execution.

The final print_prediction component is defined somewhat differently. Here we define a container to be used and add arguments to be executed at runtime.

This is done by directly using the ContainerOp object.

ContainerOp parameters include:

name – the name displayed for the component execution during runtime.
image – image tag for the Docker container to be used.
pvolumes – dictionary of paths and associated Persistent Volumes to be mounted to the container before execution.
arguments – command to be run by the container at runtime.
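Putting the pieces of this step together, the wiring could look like the sketch below, assuming the v1 SDK. The helper add_steps, the bash image tag, and the result.txt file name are assumptions for illustration; train_op, predict_op, and vop stand for the component factories and the VolumeOp described above.

```python
def add_steps(train_op, predict_op, vop, data_path, model_file, image_number):
    """Chain train -> predict -> print on one shared volume (call inside a pipeline)."""
    import kfp.dsl as dsl  # assumes the v1 SDK

    # The first step mounts the freshly created volume at data_path.
    training = train_op(data_path, model_file).add_pvolumes(
        {data_path: vop.volume}
    )

    # Reusing training.pvolume mounts the same volume and makes this step
    # wait until training has finished.
    prediction = predict_op(data_path, model_file, image_number).add_pvolumes(
        {data_path: training.pvolume}
    )

    # A plain container that prints the prediction file to its logs.
    return dsl.ContainerOp(
        name="print_prediction",
        image="library/bash:4.4.23",                   # example image with `cat`
        pvolumes={data_path: prediction.pvolume},
        arguments=["cat", f"{data_path}/result.txt"],  # hypothetical file name
    )
```

Because each step mounts the pvolume of the step before it, the three containers run in sequence: train, predict, then print_prediction.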

Step 9: Compile and run

Finally, the notebook compiles your pipeline code and runs it within an experiment. The names of the run and of the experiment (a group of runs) are specified in the notebook and then presented in the Kubeflow dashboard. You can now view your pipeline running in the Kubeflow Pipelines UI by clicking the run link in the notebook output.
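The compile-and-run step can be sketched as follows, again assuming the v1 SDK and a notebook server that can reach the Pipelines API with a default kfp.Client(). The helper name and the run-name format are illustrative choices.

```python
def compile_and_run(pipeline_func, arguments, experiment_name):
    """Compile `pipeline_func` to a package and submit it as a run."""
    import kfp
    import kfp.compiler

    # Write a compressed YAML definition that can be reused or shared.
    kfp.compiler.Compiler().compile(pipeline_func, f"{experiment_name}.zip")

    client = kfp.Client()
    return client.create_run_from_pipeline_func(
        pipeline_func,
        arguments=arguments,              # dict keyed by pipeline parameter name
        experiment_name=experiment_name,  # experiment the run is grouped under
        run_name=f"{pipeline_func.__name__} run",
    )
```

The arguments dictionary supplies one value per pipeline parameter, in this case the data path, the model file name, and the image index.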

Results

Now that the pipeline has been created and set to run, it is time to check out the results. Navigate to the Kubeflow Pipelines dashboard by clicking the run link in the notebook output, or go to Pipelines → Experiments → fashion_mnist_kubeflow. The components defined in the pipeline will be displayed. As they complete, the path of the data pipeline will be updated.

To see the details for a component, we can click directly on the component and dig into a few tabs. Click on the logs tab to see the logs generated while running the component.

Once the print_prediction component finishes executing, you can check the result by observing the logs for that component. It will display the class of the image being predicted, the confidence of the model in its prediction, and the actual label for the image.
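For illustration, a log line of that kind could be produced by a small helper like the one below. This is a sketch, not the notebook’s code: the function name and output format are assumptions, and it expects the per-class probabilities for a single image (for example, one row of the predictions array). The class names are the ten Fashion MNIST labels in index order.

```python
CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def format_prediction(probabilities, actual_label):
    """Summarize one prediction: predicted class, confidence, and true class."""
    # Index of the highest probability is the predicted class.
    predicted = max(range(len(probabilities)), key=lambda i: probabilities[i])
    confidence = probabilities[predicted]
    return (
        f"Predicted: {CLASS_NAMES[predicted]} | "
        f"Confidence: {confidence:.2%} | "
        f"Actual: {CLASS_NAMES[actual_label]}"
    )


if __name__ == "__main__":
    example = [0.01] * 9 + [0.91]
    print(format_prediction(example, actual_label=9))
```

Writing this string to a file on the shared volume is one way the final component could then print it with a simple cat command.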

Final thoughts

Kubeflow and Kubeflow Pipelines promise to revolutionize the way data science and operations teams handle machine learning operations (MLOps) and pipeline workflows. However, this fast-evolving technology can be challenging to keep up with.

In this blog series we went through a conceptual overview, in part 1, and a hands-on demonstration in part 2. We hope this will get you started on your road to faster development, easier experimentation, and convenient sharing between data science and DevOps teams.

How to try Kubeflow?

To try Kubeflow on your Windows, macOS, or Ubuntu machine follow one of these:

Tutorial here

Would you like us to deploy and maintain your Kubeflow deployments? Find out more on Canonical’s Kubeflow consulting page.
