From YouTube: Keynote: Machine Learning on Kubernetes Made Easy With Kubeflow - Masoud Mirmomeni & Jimmy Guerrero
Keynote: Machine Learning on Kubernetes Made Easy With Kubeflow - Masoud Mirmomeni, Lead Data Scientist, Shell & Jimmy Guerrero, Vice President of Marketing, Arrikto
A
Good morning. I don't know if I can top the opening remarks, but thanks for joining us for our first keynote: Machine Learning on Kubernetes.
B
I'm extremely excited to be here and share the game-changing experience I have had using Kubeflow in my day-to-day job as a lead data scientist at Shell. Kubeflow makes the machine learning process very easy for data scientists and machine learning engineers; it creates a very efficient platform for them to collaborate, share ideas, and learn from each other's projects and experiences; and finally, it reduces the cost of model building by managing computational and storage resources efficiently.
A
Hi, my name is Jimmy and I run developer relations at Arrikto. For those of you who may not be familiar with Arrikto: Arrikto was a key contributor to the Kubeflow 1.3 release, as well as the recently cut 1.4 release. Besides participating in a couple of the Kubeflow working groups, they're also the primary maintainers of the MiniKF and Kale projects, plus EKF, which is the Enterprise Kubeflow MLOps platform.
A
Now
I
know
that
we're
at
cubecon,
so
this
is
going
to
be
a
conference.
It's
all
about
cloud
native
architectures,
not
necessarily
data
science
or
machine
learning,
which
is
to
say
that
likely,
the
majority
of
the
audience
or
the
viewers
that
are
joining
us
virtually
are
going
to
be
cloud
native
developers
or
architects,
not
necessarily
ml
ops,
practitioners
or
data
scientists.
So
it's
not
going
to
hurt
to
spend
just
a
minute
to
talk
about
why
the
combination
of
kubernetes
and
machine
learning
are
actually
a
match
made
in
heaven.
Here's
why?
A
So these are the things we're going to be doing: data management, security, maybe front-end visualizations, etc., and here we're probably going to want a microservices-based container architecture. Here again, Kubernetes is going to be a slam dunk for us. Finally, machine learning loves GPUs, but GPUs are expensive, so it's not always about how quickly we can spin up an environment and get access to all the resources we need.
A
In
this
case,
it
could
be
just
as
important
how
quickly
we
can
spin
down
that
environment
back
down
to
zero.
So
here
again,
containers
are
going
to
be
a
perfect
fit.
Unfortunately,
there's
an
open
secret
in
in
in
the
industry
that
not
a
lot
of
or
a
a
lot
of
machine
learning
models
are
not
being
suspect,
successful
in
making
it
to
production,
and
the
question
is:
why
is
this
well?
A
Well, there's a combination of factors going on here that involve skills, software, methodology, and the ability to efficiently collaborate in an organization, big organizations being what they are. Skills in the sense that we're often asking data scientists to be Kubernetes experts, and we're asking Kubernetes experts to be data scientists.
A
Therefore, finding the right methodology, the right software, and perhaps a little bit of empathy that's needed in order to collaborate across these teams and be successful can prove to be a little bit elusive. So what are we to do? Enter Kubeflow. Kubeflow is the open source project smack dab in the middle of this big convergence, and here I'm talking specifically about the combined ubiquity of cloud native architectures and the needs of machine learning workflows.
A
As we know, Kubeflow was originally launched by Google back in 2017 and has since become the most robust open source, cloud-native-by-design (not as an afterthought) ML platform for data scientists as well as operations folks. It's a complete toolkit of components that allow both data scientists and operators to manage, train, tune, and even monitor their models and workflows.
A
Now
that
I've
said
a
little
bit
of
context,
I'm
going
to
hand
it
back
over
to
massoud
who's
going
to
walk
us
through
part
of
shell's
data
science
and
machine
learning
journey.
So
we
can
understand
how
q
flow
and
its
ecosystem
of
integrations
helped
solve
many
of
the
challenges
that
they
were
facing
massoud
over
to
you.
B
Thank you. Most of you might know Shell as the oil giant. However, in recent years Shell has expanded its focus to other sorts of energies that are green and renewable, and toward that effort it actually spent roughly two billion dollars annually through 2020 on these kinds of new resources, and it is expected to expand that spending even more in the years to come.
B
So, you know, it's obvious, stepping into this very, very large-scale environment, that you need to get your resources from different sources of energy and distribute and transmit them to users that are increasing day by day and that have drastically different consumption patterns. You need a smart, very fast, agile control system, and without having artificial intelligence at that scale, this is not going to be achievable. But having AI at Shell's scale can create some challenges, and our team at Shell faced some of these challenges.
B
The first challenge we had was creating proper development environments for these kinds of challenging problems. As a data scientist, I used to work in local environments: building some simple machine learning models using local data. Now we are going to use large-scale data sets from the grids of different countries, all around the US and Europe, and we want to build models from them.
B
So
it's
going
to
be
extremely
hard.
If
you
want
to
create
an
environment
like
that
for
to
be,
you
know
capable
of
doing
some,
some
modeling
like
that.
Well,
the
second
challenge
we
want
to
work
in
these
environments.
These
environments
require
the
specialized
skills.
Okay,
if
you
want
to
work
on
your
local
machine
as
a
simple
model,
it's
going
to
be
really
really
easy,
but
you
want
when
you
want
to
go
and
grab
this
data
set.
Let's
say
you
want
to
forecast
price,
you
want
to
create
load
consumption.
B
You
want
to
figure
out
whether
you
want
to
figure
out
the
generation.
Is
the
consistency
in
the
grids
you
need
to
have
a
graph
of
your
network
and
you
need
to
combine
and
get
the
data
it's
going
to
be
extremely
hard,
and
now
you
want
to
run
them
on
kubernetes.
This
is
great,
but
before
that
you
need
to
know
about
containers,
you
need
to
about
the
how
to
scale
you
need
to
know
about
gpus
before
even
getting
to
the
modeling.
This
can
take
very
very
long
time.
B
We cannot simply give a couple of GPUs to every data scientist. On top of that, machine learning is a very spiky process. When your code is ready and it's in production, you just need a couple of CPUs to have it running. But when you are in the modeling phase — and, as I mentioned, the problem is very large — you have a very big search space, you need to tune so many parameters, and you need huge computational power.
B
If
you
are
in
the
production
environment
and
you
have
a
huge
computational
power,
you
lose
it
money
because
you
don't
need
that.
On
the
other
hand,
when
you
are
in
the
modeling
phase,
you
want
to
have
a
huge
computational
power,
because
if
you
don't
have
it
you're
going
to
lose
money
by
wasting
the
expensive
time
of
your
data
scientist
and
now
we
want
to
see
how
kubeflow
actually
help
us
to
address
all
of
these
challenges.
B
The
first
thing
is
going
to
be:
the
tupelo
actually
creates
a
self-serving
model
for
us,
so
you
and
data
scientists
can
go
and
grab
computational
power
and
storage
and
they
have
pre-configured
ml
toolkits
in
that
that
is
that
exists
in
the
secure
cloud
environments.
How
cool
is
that
now
we
can
actually
bring
all
of
those
things
and
do
the
machine
learning
projects
easily.
You
know
from
minute
zero.
B
The
second
one
with
having
q
flow,
automated
pipeline
engine
or
say
mkhk.
We
are
going
to
fill
in
the
gap
between
the
data,
science
and
software
engineering
and
mlrs.
Now
our
data
scientists
are
capable
of
using
their
simple
code
and
and
bring
it
and
pass
it
to
the
mla
so,
and
this
makes
the
process
much
faster
for
us
to
put
things
into
production.
And
finally,
since
we
are
using
kubernetes,
we
can
smartly
manage
our
computational
and
storage
resources.
B
We monitor our notebook servers, as I mentioned, and if they are sitting idle for more than 24 hours, we create a snapshot and release the resources. If a data scientist needs to use that old server again, they can use the snapshot and start working where he or she stopped earlier.
B
Now,
I'm
going
to
give
you
a
demo
how
easy
it
is
to
run
launch
a
notebook
server
in
the
queue
flow
ui.
So
first
thing
we
need
to
create
a
new
server.
B
We
just
need
to
give
a
name,
let's
say
cubecon
and
you
can
see.
We
have
jupyter
notebook
environment,
we
have
visual
studio,
r
studio
and-
and
if
you
remember
I
mentioned,
we
have
a
different
ml
toolkit
pre-configured
ml
toolkit,
you
have
something
for
deep
learning,
different
version
of
tensorflow
pytorch.
We
have
something
for
spark.
We
have
gpu
version
of
that
and
if
there
is
something
that
doesn't
exist
here,
it's
easy
to
actually
bring
it
up
here
for
other
applications.
B
After that, with some simple configuration, we are ready to go. We just need to say how many CPUs I need, how much memory I need for my server, and whether I need a GPU or not — for example, here I don't need a GPU, for simplicity. After that, I'm just going to say how much storage I need for my notebook server. I'm going to skip some of the other configuration, again for simplicity, and we're ready to go.
B
Just
click
on
this
beautiful
launch
button
and
you're
going
to
see
my
notebook
survey
is
going
to
start
in
couple
of
seconds
which
could
take
me
like
a
couple
of
months.
You
know
without
having
these
things
now
we
are
ready
to
connect
and,
as
you
can
see,
I
just
need
to
have
a
web
browser
and
a
secure
internet
connection.
Now
I
am
in
my
server,
I
have
visual
studio.
B
I
have
jupyter
lab
and
now,
if
you
go
to
the
jupiter
lab
you're,
going
to
see
it's
very
similar
to
our
lovely
jupiter,
notebook
and
but
there's
something
more
to
that,
we
are,
we
have
secularly
connected
to
aws
and
all
of
my
data
is
located
there.
So
I
can
bring
and
drag
everything
to
my
jupiter
notebook
and
I
can
start
doing
some
data
science
and
cool
stuff
from
minute
zero.
B
I'm going to spend a little bit of time on this graph, and I'm going to share with you the beauty I see in it. To some it might be a very simple flow graph, but this graph was very, very lovely to me. It gave me one of those aha moments when I saw it for the first time. I was super excited when I joined Shell as a data scientist; my first assignment was to build a predictive model.
B
I
needed
to
grab
data
from
different
sources,
I
needed
to
subsample
them
and
I
needed
to
use
different
model
configuration,
but
I
couldn't
have
a
huge
search,
especially
because
I
was
working
locally,
long
story
short.
It
took
me
a
month
or
two
months
actually
to
come
up
with
the
like
in
the
modeling,
the
proof
of
concept
and
jupiter
format.
I
pass
it
to
my
co-worker
and
I
said:
can
you
productionize
that
it
took
him
a
month
to
come
up
with
a
model
and
the
performance
was
not
that
great?
B
I was lucky at that time to be introduced to Arrikto and Kubeflow. With the help of my co-worker, we built the machine learning discipline and repeated the same experiment in just 35 days, from data processing to deployment, and exponentially reduced the time: our second effort took a couple of days. Now, in our team, we have some team members with basic programming skills who can apply cutting-edge machine learning and deep learning in just a couple of hours. And the story doesn't end here; it's getting even simpler and better for data scientists.
B
We
data
scientists,
love
jupiter,
no,
what's
going
on
jupiter
notebook
now
we
just
grab
this
jupyter
notebook,
add
some
add
some
text
to
those
cells
like
import
pipeline
skip
and
some
others,
and
we
can
push
a
button
and
create
a
pipeline
from
it.
Kale
is
going
to
take
over
that
code
for
us
and
create
a
valid
pipeline,
and
it's
going
to
take
care
of
all
the
data
dependencies
and
it's
going
to
manage
the
life
cycle
of
this
cube
flow
pipeline.
B
And finally, of course, snapshot policies allow us to release idle resources without losing any work. And this, ladies and gentlemen, was the game-changing experience I wanted to share with you as a data scientist: how Kubeflow actually helped me focus on my work and avoid all of those distractions that I was always hesitant to touch, so I could focus on my work and on the challenges we have in the huge projects at Shell, be productive, and deliver projects in a timely manner.