From YouTube: Orchestrating Cloud Native ML workflows in Kubernetes
Description
A Machine Learning model is only a tiny piece in a series of multiple processing steps executed as part of an ML workflow. A pipeline is a description of an ML workflow, including all the components in the workflow and how they combine in the form of a graph. This talk provides an overview of ML pipelines, their common components, and how to orchestrate the execution of Cloud Native ML pipelines in Kubernetes using the open-source project Kubeflow.
I am also the maintainer of an open-source project called kube-fledged. Kube-fledged is an operator which helps you cache container images directly on the worker nodes, so it helps in use cases where you need quick startup of applications or rapid scaling of applications. I am also a speaker; I speak occasionally, not very regularly, on Kubernetes, Cloud Native, and very recently even on MLOps. And I am a tech blogger.
So the agenda for the day is actually very simple. I am going to talk about ML workloads: how ML workloads are unique in nature when compared to typical software workloads, and the importance of workflows in an ML system.
And when we talk about workflows, there is a need for pipelining tools. For instance, if you are familiar with DevOps and CI/CD, you need a pipelining tool for running your CI/CD processes, for instance tools like Jenkins. Similarly, in the ML world you will need a pipelining tool, and I am going to introduce you to an open-source, Kubernetes-native pipelining tool which is heavily used in the ML space, called Kubeflow.
So what are the unique characteristics of machine learning? This is a very famous picture, originally published back in 2015 in one of Google's papers.
You need to cleanse the data, normalize the data, and label the data. There is a feature-extraction phase where feature engineering happens. And machine learning training is a resource-intensive activity, so you need to take care of the infrastructure requirements for this highly resource-intensive stage, both training and serving infrastructure. Whenever you talk about a highly scalable application that uses ML, your serving infrastructure has to cater to that high scalability, and then there is reliability.
For instance, paramount importance is placed on monitoring, because in traditional software monitoring you watch certain metrics and act once they go beyond a threshold. With machine learning, you need to know whether you can still rely on the predictions of the model, so you need to constantly monitor the performance of the model.
There are various ways in which you can monitor. So there is a whole lot of other components that need to work together coherently for you to be successful in building a machine learning system, deploying it successfully, and running it on a production-grade system.
Okay, so that's where you tend to become more focused on how you can solve all these problems in a more holistic fashion. And by the way, what does a machine learning model development life cycle look like? It typically starts with you defining the metrics and acceptance criteria: what will be the success criteria for your machine learning project itself?
And it's all about data: gathering data, cleansing the data, transforming the data, analyzing the data, and sometimes visualizing the data. You need a whole bunch of data-store technologies, be it SQL or NoSQL; it all depends on what type of use case you are developing with your machine learning. And of course you need to spend time developing the machine learning model, training it with whatever training data you have, and once you are satisfied with the accuracy of the model, that's when you decide to deploy the model into production. Then you need to constantly monitor the performance of the model, so you need to define metrics, and these metrics have to be emitted by the model itself.
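The idea that the model itself should emit its metrics can be sketched in plain Python: a thin wrapper counts predictions and tracks average confidence so a monitoring system could scrape them. The model and metric names below are hypothetical stand-ins, not from the talk.

```python
from collections import Counter

class InstrumentedModel:
    """Wraps a predict function so the model 'emits' its own metrics."""

    def __init__(self, model_fn):
        self.model_fn = model_fn      # the underlying predict function
        self.metrics = Counter()      # counters a scraper could export
        self.confidence_sum = 0.0

    def predict(self, features):
        label, confidence = self.model_fn(features)
        self.metrics["predictions_total"] += 1
        self.metrics[f"predictions_label_{label}"] += 1
        self.confidence_sum += confidence
        return label

    def export_metrics(self):
        total = self.metrics["predictions_total"]
        avg_conf = self.confidence_sum / total if total else 0.0
        return dict(self.metrics, avg_confidence=avg_conf)

# Toy stand-in model: "positive" if the sum of features is >= 1.
def toy_model(features):
    score = sum(features)
    return ("positive" if score >= 1 else "negative", min(abs(score), 1.0))

model = InstrumentedModel(toy_model)
for x in [[0.5, 0.7], [-1.0, 0.2], [1.5, 0.1]]:
    model.predict(x)
print(model.export_metrics())
```

In a real deployment these counters would be exposed via an endpoint that a monitoring system polls; here they are just returned as a dictionary.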
You don't have to develop a new kind of infrastructure just because you need to run machine learning models, or just because you need to process huge amounts of data: Kubernetes is well equipped to handle all of these requirements, and Kubernetes is fast becoming the substrate for machine learning. There is no denying that Kubernetes is a fundamental piece of infrastructure for machine learning model development, whether you develop the model, deploy it, monitor it, or scale it.
Now let me introduce you to an example ML workflow. First of all, there is a model-building phase. In this phase you rely heavily on training data, and out of the model-building phase you come up with some candidate models. These are models which have to be evaluated, and that is where you use test data: using test data, you evaluate the models, and then you get into an iterative loop of experimentation until you come up with a chosen model.
So it is a highly iterative process in nature. You come up with the chosen model which fulfills your acceptance criteria and your metrics criteria, and once you are satisfied with the accuracy of the model and with its predictions, that is when you productionize your model. Productionizing a model is basically how you take your model from your Jupyter notebooks into a production-ready artifact which can be deployed into a production environment. That is what we essentially mean by productionizing the model.
You have the application code, and you deploy the model into the production infrastructure. The application code will invoke the model, use the predictions, and then perform its business logic. More importantly, a huge amount of emphasis needs to be placed on monitoring the model and monitoring its performance. For instance, once you deploy the model, the model can experience drift, and drift can be of various natures.
For instance, the model could have been trained on certain training data, whereas in production the statistical distribution of the data itself could be different. This could cause the model to underperform, and this is very normal, so you need to constantly monitor the performance of the model.
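As a rough illustration of that kind of drift monitoring, here is a minimal sketch (assuming a single numeric feature) that flags production data whose mean has shifted far from the training mean, measured in training standard deviations. Real systems use stronger tests, such as Kolmogorov-Smirnov or the population stability index; the threshold here is purely illustrative.

```python
import statistics

def drift_score(training_values, production_values):
    """Distance of the production mean from the training mean,
    in units of the training standard deviation."""
    mu = statistics.mean(training_values)
    sigma = statistics.stdev(training_values)
    return abs(statistics.mean(production_values) - mu) / sigma

def has_drifted(training_values, production_values, threshold=3.0):
    # threshold=3.0 is an illustrative choice, not a standard
    return drift_score(training_values, production_values) > threshold

train = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
prod_ok = [10.1, 9.9, 10.4]        # similar distribution: no alert
prod_shifted = [15.0, 16.2, 14.8]  # shifted distribution: alert

print(has_drifted(train, prod_ok))       # False
print(has_drifted(train, prod_shifted))  # True
```

A check like this would typically run periodically on a window of recent production inputs, triggering retraining when it fires.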
Actually, what I am trying to explain is this: forget about all the phases I have spoken about. What I am trying to bring out is that there is a workflow. There is a distinct chunk of work that happens in a stage, with a defined input and a defined output, and the next phase takes the output of the previous phase, performs some actions, and then delivers an output. Work gets done in each and every stage, and together it is a complete workflow.
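The stage-with-a-defined-input-and-output idea can be sketched as plain Python functions chained together; the stage names and data below are hypothetical, just to show how each stage consumes the previous stage's output.

```python
def ingest():
    # Stage 1: defined output is a list of raw records.
    return [" 4.0", "2.0 ", "bad", "6.0"]

def cleanse(raw_records):
    # Stage 2: defined input is raw records; defined output is parsed numbers.
    cleaned = []
    for record in raw_records:
        try:
            cleaned.append(float(record.strip()))
        except ValueError:
            pass  # drop records that cannot be parsed
    return cleaned

def train(values):
    # Stage 3: defined input is numbers; defined output is a trivial
    # "model" (here, just the mean of the values).
    return sum(values) / len(values)

def run_workflow():
    raw = ingest()
    cleaned = cleanse(raw)       # output of stage 1 feeds stage 2
    model = train(cleaned)       # output of stage 2 feeds stage 3
    return model

print(run_workflow())  # 4.0
```

A pipelining tool formalizes exactly this chaining: each stage runs in isolation, and the orchestrator moves artifacts between them.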
So that is what is essentially happening in any typical ML project: there is a workflow, and the workflow has to be executed on suitable infrastructure. And speaking of workflows, you need pipelining tools. There are plenty of pipelining tools available, both open source and from the public cloud vendors, but today I am going to talk about an open-source tool called Kubeflow. By the way, Kubeflow has seen wide adoption in recent times; it is treated as the machine learning toolkit for Kubernetes. It actually started as an open-sourcing of the way Google used to run their TensorFlow models. By the way, TensorFlow is a machine learning framework that was originally developed at Google; it too is open source.
It all began just as a simpler way to run your TensorFlow jobs on Kubernetes, but after a period of time it took on its own roadmap and finally developed into an end-to-end machine learning workflow system. What I mean by an end-to-end machine learning workflow system is that whatever you need for your ML life cycle is provided by Kubeflow, whether you need to do model exploration, model training, or deploying the model into production and monitoring it.
So all this is offered by Kubeflow itself, and that's why it is called the machine learning toolkit for Kubernetes. And it runs entirely on Kubernetes.
And specifically, within Kubeflow you have the Kubeflow Pipelines platform component. Kubeflow has many different features, and one of the prominent, widely used ones is Kubeflow Pipelines, a purpose-built pipeline platform for running machine learning workflows. It has these main components. First of all, Kubeflow Pipelines provides you with a user interface with which you can submit your workflows, see how your workflows are running, perform experiments, and things like that.
It provides you with a Python SDK, so you can use the SDK to codify your pipelines; you can create reusable components and use those components in your Kubeflow pipelines. And of course it also provides you with Jupyter notebooks, so that you can use the Kubeflow Pipelines SDK from your Jupyter notebook to create your pipelines as well.
And this is how the architecture of Kubeflow Pipelines looks. By the way, don't get overwhelmed by its complexity. The notable thing here is that you have a pipeline service: the pipeline service is the core service which accepts the input, so whenever you want to submit a pipeline, it is submitted to the pipeline service that you see over here. Once the pipeline is submitted to the pipeline service, there are multiple workflow orchestrators that can be used with Kubeflow Pipelines. For instance, in this picture you will see that Argo Workflows is being used. By the way, Argo Workflows is widely used in Kubernetes and cloud-native systems for executing workflows, so Kubeflow Pipelines also uses Argo Workflows, and Kubeflow Pipelines can also be plugged into other workflow systems. But for this talk I limit myself to the Argo workflow controller.
So basically what happens is this: once you submit a pipeline, the pipeline service creates an Argo Workflow, and from there on the Argo controller watches the workflow and starts executing it. That is how the workflow essentially gets executed, and the artifacts that come out of the workflow are stored in artifact storage, which is nothing but a MinIO object store.
So this is actually a very simple architecture, though there are other nuances and nitty-gritty details, certain other components for caching and things like that, which we will not touch upon. And that is pretty much about Kubeflow. Next, we are going to look at a very simple demo. In this demo, in step number one, a model will be trained initially using some training data, and in step number two we evaluate the model's accuracy.
If the accuracy is not as per our expectation, we once again retrain the model, so it goes into a loop. This is a very typical machine learning workflow that you will come across in production systems, and that is what I am going to show you now.
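The train, evaluate, retrain loop just described can be sketched in plain Python. The "training" below is a toy stand-in with made-up numbers; only the control flow mirrors the demo pipeline.

```python
import random

random.seed(7)

def train(epochs):
    # Toy stand-in: more epochs gives higher (noisy) accuracy, capped at 1.0.
    return min(1.0, 0.6 + 0.05 * epochs + random.uniform(-0.02, 0.02))

def run_training_loop(target_accuracy=0.9, max_rounds=10):
    """Train, evaluate, and retrain until accuracy meets the target
    (or give up after max_rounds)."""
    for round_no in range(1, max_rounds + 1):
        accuracy = train(epochs=round_no)
        if accuracy >= target_accuracy:
            return round_no, accuracy    # loop terminates: model accepted
    return max_rounds, accuracy          # loop exhausted without acceptance

rounds, acc = run_training_loop()
print(f"accepted after {rounds} round(s), accuracy={acc:.2f}")
```

In the real pipeline, each iteration of this loop is a set of pods (train, predict, calculate metrics) rather than in-process function calls, but the decision logic is the same.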
So this is the Kubeflow Pipelines UI; I have already opened it. By the way, I have already installed Kubeflow Pipelines in one of my namespaces. Here you will see all the components running; just ignore the component that is in CrashLoopBackOff, because we don't really need that component. So here you will see all the components up and running.
And as soon as I start the UI, I am able to see the list of pipelines that I have. For this demo I am going to use this pipeline, and this is how the pipeline looks. Basically, as I said earlier, an initial model will get trained, and then, based on the metrics output from the model's predictions, the model will be retrained continuously. So I'm going to run this pipeline.
Okay, an Argo Workflow is again a custom resource that is watched by the Argo workflow controller, and the workflow will contain all the steps that need to be executed as part of it. Each and every step is executed inside a pod; that is how Argo Workflows works. So if you look at the terminal now, you will see a list of pods getting created.
Okay, so these are the pods that are actually created by Argo Workflows in order to execute this workflow, and as and when each pod completes its activity, it gets updated here. You can click on one of these items and see what the input artifact for this particular step is, and what the output artifacts of the step are.
In this case, this step has taken an input from MinIO, the training data set, and it has transformed it: a transformation step has happened, and it has transformed the data. Likewise the rest of the workflow gets executed. What is happening here is that the initial model has been trained, and then the model has been trained again, because we have a training loop.
So this is the initial model training, and this is the training that happened again with another set of data, and then a prediction happens. The output of the prediction will again be a data set; the workflow calculates metrics from this data set and decides whether the model is performing as per expectations or not. That is the calculation that was performed, and then it decided that it has to retrain, so it retrained again.
That is what you see here, and then again it is predicting; this step is running. So again a prediction is being run. Let's see what happens now: let's see if the loop gets terminated, whether the workflow has determined that it is satisfied with the output of the model. That is the prediction that is happening.
Okay, so let's close this, and here you go: you now see that the workflow has completed. What has happened is that once the model was retrained, it reached a stage where the workflow is satisfied with the accuracy of the model, and finally the workflow has also completed. And that is actually pretty much what I wanted to show you. So, basically, to sum it up: Kubernetes is the de facto infrastructure that is also used for running ML workflows and ML…