Description
A Deep Dive into Kubeflow Pipelines - Senthil Raja Chermapandian, Ericsson
A Machine Learning model is only a tiny piece in a series of multiple processing steps executed as part of an ML workflow. A pipeline is a description of an ML workflow, including all the components in the workflow and how they combine in the form of a graph. Kubeflow Pipelines (KFP) is an open-source project that helps to run Cloud-native ML pipelines on Kubernetes. While most previous talks on KFP have focused on Data Scientists and Data Engineers, this talk will dive deep into KFP, covering its architecture, platform components and how the platform components work together in executing the workflow.
While you might have heard of or watched several talks about Kubeflow Pipelines before, I presume most of those talks were highly data-scientist or data-engineer focused, meaning the focus was on how to write pipelines more efficiently, how to build components from scratch, or how to convert a Python function into components and eventually build a pipeline. That is fine.
But today I'm going to talk to you about Kubeflow Pipelines more from, let's say, an ML engineer's point of view, an MLOps engineer's point of view, or even a DevOps point of view. So today I'll try to cover how Kubeflow Pipelines is composed, what components come along with Kubeflow Pipelines, how these components interact with each other, and eventually how these components execute a pipeline that is submitted to Kubeflow Pipelines.
So let's get into the talk. I am Senthil. I work as a Principal Software Engineer at Ericsson, and my job at Ericsson is primarily to architect cloud-native AI/ML platforms. These are platforms that are highly distributed in nature and use Kubernetes as the underlying platform for compute and other resources. Apart from work, I take time to pursue other aspirations of mine.
I am the maintainer of an open-source project called kube-fledged. This project is an operator that helps you cache container images directly on the worker nodes of a Kubernetes cluster. I am also an occasional speaker; I would say I am not very active in speaking, but whenever I talk, I love to talk about Kubernetes and cloud-native technologies, and very recently I have also picked up an interest in talking about MLOps. I am a tech blogger as well.
You can read my blogs on Medium, although nowadays I am not that active in tech blogging due to my preoccupation with organizing Kubernetes Community Days Chennai. I am fairly active on social media sites like Twitter and LinkedIn, so do check out my profiles on those platforms.
The agenda for today is very simple: I'm going to talk about ML workflows and the various ML pipelining tools, then pick out Kubeflow and talk about the platform components that make up Kubeflow Pipelines. I'll be talking at length about the Kubeflow Pipelines architecture; that is where I will cover the various components that make up Kubeflow Pipelines and how they interact with each other, and I'll try to dig deeper into Kubeflow Pipelines.
Each component in a pipeline has its own distinct set of inputs and its own distinct set of outputs. The input can be a very simple parameter like a string, integer, or float, or it can be a huge dataset stored somewhere in a data store. Similarly, the output can be a very simple file, or it can be a huge dataset that is, for instance, pushed into Kafka or stored into MinIO object storage, whatever it may be.
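To make that concrete, here is a minimal sketch, not taken from the talk, of a component with simple parameter inputs and a small file output; it assumes the KFP v1 SDK, and the function, values, and base image are hypothetical.

```python
# Minimal sketch of a KFP component: simple parameter inputs, one file output.
# Assumes the KFP v1 SDK; function name, values, and base image are hypothetical.
from kfp.components import create_component_from_func, OutputPath


def scale_value(value: float, factor: float, result_path: OutputPath(str)):
    """Multiply a value by a factor and write the result to an output artifact file."""
    with open(result_path, "w") as f:
        f.write(str(value * factor))


# Wrap the Python function into a reusable pipeline component.
scale_op = create_component_from_func(scale_value, base_image="python:3.9")
```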
Today I am going to focus on one single tool, which is called Kubeflow. Kubeflow is, by the way, an open-source project that provides you not only with pipelining capabilities but also with the whole gamut of features and functionality you would expect from an end-to-end machine learning platform. For instance, there is KServe, which takes care of serving models in production at scale and provides features like A/B testing,
multi-armed bandits, and things like that. Kubeflow also provides development capabilities, where you can use Jupyter notebooks and various machine learning frameworks to develop your model, and it provides capabilities for training your model, retraining your model, and so on.
Google ran TensorFlow models internally; we know that TensorFlow is a very popular and widely used machine learning framework, and once a TensorFlow model is developed, you need to run it. Google was internally using some of the features that you find today in Kubeflow to run their TensorFlow models.
In fact, it began as just a simpler way to run TensorFlow jobs on Kubernetes; it aimed to remove the complexities associated with running TensorFlow jobs on Kubernetes. That is how it all started, and since then Kubeflow has expanded into a multi-architecture, multi-cloud framework for running end-to-end machine learning workflows.
What I mean by end-to-end is that it caters to each and every step of a typical machine learning lifecycle, starting from data exploration, or even from defining your model accuracy and metrics criteria, up to deploying the model and monitoring it in production. So it offers an end-to-end platform, and Kubeflow provides components, as I said earlier, for each and every stage in the ML lifecycle: for exploration, for training, for deployment, for monitoring, for retraining, and so on.
So what are the installation options available for Kubeflow Pipelines? You can install Kubeflow Pipelines as a standalone platform, or you can choose to install the complete Kubeflow platform and then use only the Kubeflow Pipelines part of it. There is also a third option: you can consume Kubeflow Pipelines as a fully managed service.
When we talk about Kubeflow Pipelines, it is predominantly built of four components. First and foremost, you have a user interface for managing and tracking the various machine learning experiments, jobs, and runs, and there is a core workflow engine that performs the hard work of executing the workflow.
We will talk later about what this engine is made up of. A third important feature of Kubeflow Pipelines is that it provides an SDK for you to write your pipelines and even to build reusable components, so that these components can be used across different pipelines. So it provides you with an SDK, and there is also a REST API.
If you want to consume KFP in the form of REST APIs, that is available; if you want to do it using the SDK, that is also possible; or if you just want to use the UI, submit jobs via the UI, and then look at the artifacts and so on, that is also possible. KFP additionally provides some built-in notebooks for you to easily interact with KFP using the SDK.
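As a hedged illustration of the SDK path (not shown in the talk), submitting a run programmatically typically looks like this; the host URL, experiment name, and pipeline function are hypothetical placeholders.

```python
# Hedged sketch: submitting a pipeline run with the KFP SDK client (v1-style API).
# The host URL, experiment name, and pipeline function are hypothetical.
import kfp

client = kfp.Client(host="http://localhost:8080")  # KFP API endpoint (assumed)

run = client.create_run_from_pipeline_func(
    my_pipeline,                 # a function decorated with @kfp.dsl.pipeline
    arguments={"factor": 2.0},   # pipeline parameters
    experiment_name="demo-experiment",
)
print(run.run_id)
```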
Let's spend more time on this slide, where you see the architecture of Kubeflow Pipelines. At the top you have the UI, which is served by the pipeline web server, and the UI itself has several capabilities. For instance, you can submit a pipeline in the UI, and once the pipeline is submitted and you have run it,
you can see the history of the runs in the UI along with various metadata. You can, in fact, drill down deeper into the job history and see which steps were executed, what the input for each step was, what the output was, and even where that output is stored. You can use this for debugging and things like that, and there is also a capability to visualize the run.
For instance, if you are training your machine learning model with various hyperparameters, you can visually see how the model performs with those different hyperparameters. So the UI caters to a wide set of features; that is one good thing about Kubeflow Pipelines. Underneath, you have the orchestration system, the primary engine that performs all the hard work necessary for executing a Kubeflow pipeline.
On top of everything you have the pipeline service. The responsibility of the pipeline service is this: whatever pipeline you submit to KFP, it is the pipeline service that interprets it and parses it. It understands the Python DSL that is defined for writing pipelines; it parses the DSL and then eventually compiles it.
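To make the DSL-to-YAML step concrete, here is a hedged sketch of defining a tiny pipeline and compiling it with the SDK compiler; the component, pipeline, and file names are hypothetical, and the KFP v1 SDK is assumed.

```python
# Hedged sketch: a toy two-step pipeline compiled to the workflow YAML package.
# Assumes the KFP v1 SDK; names are hypothetical.
import kfp
from kfp import dsl
from kfp.components import create_component_from_func


def echo(msg: str) -> str:
    """Trivial step: print and return the message so it can feed the next step."""
    print(msg)
    return msg


echo_op = create_component_from_func(echo, base_image="python:3.9")


@dsl.pipeline(name="demo-pipeline", description="Toy two-step pipeline")
def demo_pipeline(message: str = "hello"):
    step1 = echo_op(msg=message)
    step2 = echo_op(msg=step1.output)  # consuming step1's output creates a DAG edge


# Compile the Python DSL into the YAML package that is submitted to KFP.
kfp.compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")
```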
The pipeline service compiles the pipeline code and prepares the pipeline YAML; that is its job. And whatever the pipeline service does, at every point in time it makes sure to store the metadata into the metadata database, which, by the way, is a MySQL database.
Once the pipeline service has determined what actions or tasks have to be performed for a particular pipeline run, it goes ahead and creates the necessary Kubernetes resources required for executing the pipeline. In KFP, each and every step of the pipeline is executed as a Kubernetes pod: there is a container image, and each container runs within a Kubernetes pod. So essentially, whatever Kubernetes resources are necessary to execute the pipeline are created by the pipeline service, and the pipeline persistence agent persists all of these Kubernetes resources.
Let's move on. Underneath the orchestration system you will have a bunch of orchestration controllers; Kubeflow Pipelines is built in such a way that it can support multiple orchestration controllers. The primary controller used for task-driven workflows is Argo Workflows, which is itself a separate CNCF project for executing workflows. You will also see instances where ML pipelines are written directly in Argo Workflows using YAML constructs, whereas in Kubeflow Pipelines you have a pipeline service and an SDK.
Let's move on to choosing an Argo Workflows executor. As I said earlier, Kubeflow Pipelines runs on Argo Workflows, so Argo Workflows is the primary workflow engine that actually executes the ML workflow. You can either use the Docker executor for Argo Workflows or the very latest Emissary executor, and, by the way, the Emissary executor is the default executor from version 1.8.0 onwards.
The Docker executor, for instance, supports only the Docker container runtime, and we know very well that in version 1.24 of Kubernetes the dockershim has been removed (1.24 is already out), which means the Docker executor can be used only if you are on an older version of Kubernetes. And from a security perspective, the Docker executor needs privileged access to the Docker socket on the host,
so it is not preferable to use such an approach in production. The Emissary executor, on the other hand, supports any container runtime and is also more secure. So moving forward it is going to be the Emissary executor by default, and it already is the default executor from version 1.8.0 onwards.
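If you need to switch executors on an existing installation yourself, one hedged way, assuming the Argo controller reads its configuration from a ConfigMap named workflow-controller-configmap in the kubeflow namespace (verify both names in your install), is to patch that ConfigMap, for example with the Kubernetes Python client:

```python
# Hedged sketch: selecting the Emissary executor by patching the Argo Workflows
# controller ConfigMap. The ConfigMap name, namespace, and key are assumptions;
# check your installation before applying anything like this.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
core_v1 = client.CoreV1Api()

core_v1.patch_namespaced_config_map(
    name="workflow-controller-configmap",   # assumed Argo controller ConfigMap
    namespace="kubeflow",                   # assumed KFP installation namespace
    body={"data": {"containerRuntimeExecutor": "emissary"}},
)
```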
Now, some other notable features of KFP. I wanted to cover these because they will help you understand KFP more deeply. It provides out-of-the-box multi-user isolation for pipelines, and by the way, this is available only in the full Kubeflow deployment; it is not yet available in the standalone KFP deployment.
Basically, this feature allows you to separate the Kubernetes resources of multiple users. You can create multiple profiles, and each profile is mapped to a Kubernetes namespace. So if you create a user profile, then when that particular user runs a Kubeflow pipeline, whatever resources are created for that pipeline run will get created only in that particular namespace.
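In multi-user mode the SDK client can be pointed at a user's profile namespace so that runs land there; a hedged sketch, with the host, namespace, experiment name, and pipeline function as hypothetical placeholders:

```python
# Hedged sketch: targeting a user's profile namespace in multi-user mode.
# Host, namespace, experiment name, and pipeline function are hypothetical.
import kfp

client = kfp.Client(host="http://localhost:8080", namespace="alice-profile")

client.create_run_from_pipeline_func(
    my_pipeline,                   # a @dsl.pipeline-decorated function
    arguments={},
    experiment_name="alice-experiments",
    namespace="alice-profile",     # the run's resources are created in this namespace
)
```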
So this provides you with isolation when you are sharing a Kubeflow instance with multiple users. Another good feature is step caching. We saw that a pipeline is executed in multiple steps. Let's say you create a pipeline run and then recreate the run, this time just modifying the hyperparameters alone, and let's assume this modification is specific to one particular step. With step caching, the steps whose inputs have not changed are not executed again; their cached outputs are reused.
This also makes efficient use of the pipeline's resources, and you can control when cache invalidation should happen and when caching should be disabled, or you can enable or disable the caching feature altogether.
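For illustration, one hedged way to control caching per step in the v1 SDK is through the task's caching strategy; setting the maximum cache staleness to "P0D" means a cached result is never considered fresh, effectively disabling the cache for that step. The component and pipeline names are hypothetical.

```python
# Hedged sketch: per-step cache control (KFP v1-style SDK assumed; names hypothetical).
from kfp import dsl
from kfp.components import create_component_from_func


def echo(msg: str) -> str:
    return msg


echo_op = create_component_from_func(echo, base_image="python:3.9")


@dsl.pipeline(name="caching-demo")
def caching_demo(message: str = "hello"):
    cached_step = echo_op(msg=message)   # reuses a cached result when inputs match
    fresh_step = echo_op(msg=message)
    # "P0D" staleness: the cached result is never fresh, so this step always re-runs.
    fresh_step.execution_options.caching_strategy.max_cache_staleness = "P0D"
```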
Another feature that was recently introduced in the SDK v2 is the pipeline root. This essentially represents an artifact repository where the pipeline stores its artifacts. Originally only MinIO was supported, and only the MinIO instance packaged along with Kubeflow Pipelines; that was the only way to store your artifacts. Now you have different options: you can keep the bundled MinIO or bring your own, you can use any S3-compatible object storage, or you can even use GCS, Google Cloud Storage.
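A hedged sketch of pointing a v2-style pipeline at an external artifact store via the pipeline root; the bucket URI and pipeline name are placeholders, and the steps themselves are omitted.

```python
# Hedged sketch: setting the pipeline root to an S3-compatible bucket (v2-style SDK).
# The bucket URI and pipeline name are hypothetical placeholders.
from kfp import dsl


@dsl.pipeline(
    name="pipeline-root-demo",
    pipeline_root="s3://my-ml-artifacts/kfp",   # every step's artifacts land here
)
def pipeline_root_demo():
    ...  # steps omitted in this sketch
```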
So let me end the slide show, and before I open the UI, let me show the list of pods running for a Kubeflow Pipelines installation. Here you can see MinIO, which is the artifact repository; the MySQL database, which is the metadata store; and the workflow-controller, which is the Argo workflow controller, since this installation has only the Argo workflow controller. This is the pipeline service, which accepts the pipeline and then creates the various Kubernetes resources.
This is the pipeline persistence agent, which persists all the Kubernetes resources, their inputs and outputs, everything, into the ML metadata store. The scheduled-workflow component is used whenever we need scheduled workflows rather than one-time workflows; when we have scheduled workflows, the scheduling is taken care of by this component. And you have a bunch of other components, which are all UI-related: the pipeline UI, the pipeline viewer CRD, as well as the pipeline visualization server.
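As a hedged aside, a scheduled (recurring) run is typically created through the SDK client along these lines; the host, experiment, package path, and cron expression are hypothetical placeholders.

```python
# Hedged sketch: creating a recurring (scheduled) run via the SDK client.
# Host, experiment name, pipeline package, and cron schedule are hypothetical.
import kfp

client = kfp.Client(host="http://localhost:8080")
experiment = client.create_experiment(name="nightly-training")

client.create_recurring_run(
    experiment_id=experiment.id,
    job_name="nightly-train",
    cron_expression="0 0 2 * * *",            # assumed 6-field cron: daily at 02:00
    pipeline_package_path="demo_pipeline.yaml",
    enabled=True,
)
```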
Along with the installation of KFP, some default pipelines are installed, and today I am going to use one such pipeline, which is the pipeline I explained using this slide. This is how it looks graphically, and I am going to run it by clicking on Start.
Once I do that, a new run is created. I can click on this run, and it shows me a visual graph of the progress of that particular run. As you can see, this step has completed, and it produced two output artifacts: one is a table, which was stored in the artifact repository, and the other is the logs. We can also see the Kubernetes pod that was created for executing the step.
The initial model training has also completed, and we can see in this step that the initial model, as well as the dataset, are sent as inputs to the step, and the output of the step is the trained model along with the model config plus the logs. That is what we see as the output. Again, we can see the pod that was created, and we can also see the logs produced by the container that ran this step.
The pipeline has run successfully, and now, if we come here, we can see all the pods that were created by Argo Workflows in order to execute the pipeline. For every step in the pipeline you will see a corresponding pod, so you can also use kubectl commands to look at the pods, the logs produced by these pods, and the events they produced, the same information that you saw in the UI.
This is a very simple pipeline of the kind we typically find during the model exploration and model development phase, and we saw that Kubeflow Pipelines was able to execute it successfully. That is pretty much what I intended to cover. I really hope that you enjoyed the talk and that its content will be useful, and by the way, if you have any questions about this talk, feel free to post them as text questions in the corresponding Slack channel, and I'll.