Description
Panel: Building Kubernetes Operators for AI and ML Workloads.
Filmed October 28th, 2019 in San Francisco.
A
So we really want to thank everybody for the time and effort it takes to build an operator, and congratulate those of you who are about to get yours on OperatorHub. But if anybody out there in the audience is thinking about building an operator and needs some help, please see me during the break and I will make sure you get that help. Without further ado, over to you all.
B
Right, we have microphones, we're all good to go. Everyone has a chair except me, all right, so thank you. We'll just pass the microphones around. Yes, we'll do that. I think it's going to be a really fun discussion here. We'll talk about operators and, you know, the process that everyone has gone through with operators. So with that said, let's just kick things off with a round of introductions.
D
My name is Gopal Krishna and I'm the VP of engineering and customer support for CognitiveScale. We make our customers, primarily in financial services and health care, successful with AI with our product Cortex, which allows them to build solutions in a predictable way, as well as to make sure that those solutions are trusted. Earlier we heard about ethics; we embed a lot of concepts to make sure the systems are trusted and can go through some level of compliance. So that's our product offering.
E
Let's try this one, shall we? Hey, I'm Ryan, I'm a developer at Seldon. We make it easy for data scientists to deploy machine learning models to Kubernetes and serve predictions as HTTP requests, and we provide out-of-the-box observability features, so monitoring and tracing metrics, but also features like making it easy to define rollout strategies with traffic splitting, and to define inference graphs. So you can have reusable steps within the graph and do more complex kinds of deployments, like multi-armed bandits.
H
Hey everyone, my name is Pramod. I'm a senior product manager in our data center business unit at NVIDIA, so we make GPUs. I product-manage CUDA, which is our GPU computing platform, and we're building both hardware and software that enables AI, ML, and data analytics workloads, and many different types of use cases that will be accelerated using GPUs.
B
Thanks
everyone,
we
talked
a
lot
about
kubernetes.
We
talked
a
lot
about
operators,
kubernetes
is
kind
of
taking
the
world
by
storm
and
operators,
in
my
opinion,
kind
of
you
know
very
close
behind
that.
At
what
stage
of
your
your
your
companies,
each
one
of
you
have
different
products.
Some
of
you
may
have
started
off
more
and
they
the
kubernetes
or
container
space.
Some
of
you
may
be
traditional
applications,
and
what
stage
did
you
start
to
think
about?
G
Yes, so we'd obviously been tracking Kubernetes for some time, but about a year ago, I would say, we realized we needed it to scale Presto deployments to more and more different environments. You know, traditionally you'd do some on-prem and maybe AWS, but now you have to go to Azure, to Google Cloud, to different environments.
D
It's been a long journey for us; our company has been at it for five, six years now. Early on, in the genesis of the company, we were trying to leverage container technologies as well as the orchestration mechanisms. Several times during that journey we evaluated Kubernetes and it fell short for one reason or another; the timing was not right.
D
So last fall is when we made a concerted effort to embrace Kubernetes, and we worked with Microsoft, actually, to render our product on Azure, and that was a very positive experience for us. We quickly learned that the investment we made in Kubernetes was rewarding us by allowing us to go to the next set of cloud vendors, like EKS and Red Hat OpenShift, as well as GKE. So we are able to evolve and rapidly satisfy our customers' needs to go on-prem, or on their own cloud, or on our cloud to support them.
D
You know, in terms of organization and selection of technologies to move forward, we are somewhat held back: being a small company, we can't just be experimental about everything, so we are driven by our customer priorities. We are very closely monitoring operators and pushing our customers towards them, and we find them a very welcome addition to Kubernetes because of the promise they hold for maintenance and software upgrades, as well as for moving towards at least phase two, if not all the way to phase five, at this point. Sunny?
C
Our solution is built to help customers do cost and resource optimization, so it's quite natural for us to use the Operator Framework. When a customer asks how to install our solution to help them do cost optimization, we tell them it's an operator — we are actually a fully certified Level 5 operator on OpenShift — and then there are no further questions asked. Okay.
H
So we started working on an operator for GPU deployment, essentially, with Red Hat a couple of months ago. We've been building with Red Hat a GPU operator that basically helps us provision GPUs. GPUs are special resources within Kubernetes, so bringing up a GPU can be fairly complex, because we need to install user-mode drivers and kernel modules, and then we need to install device plugins to have these resources exposed to the Kubernetes control plane.
H
So setting up Kubernetes and deploying GPUs is fairly complex, and even once you deploy it, GPUs are throughput machines, which means you need to constantly feed the GPU with data so that it's operating at full throughput capacity. So the GPU operator that we've been building with Red Hat essentially automates the management and deployment of all of these GPU components, and the device plugin and monitoring stacks, automatically.
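To make that concrete: once the device plugin exposes GPUs to the control plane, workloads request them as an extended resource. Here is a minimal Go sketch using the standard Kubernetes API types — `nvidia.com/gpu` is the conventional resource name, and the image tag is purely illustrative:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A pod that asks the scheduler for one GPU. The device plugin
	// installed by the GPU operator is what makes "nvidia.com/gpu"
	// schedulable in the first place.
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "cuda-job"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "trainer",
				Image: "nvidia/cuda:10.1-base", // illustrative image tag
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						// Extended resources are requested via limits.
						"nvidia.com/gpu": resource.MustParse("1"),
					},
				},
			}},
		},
	}
	fmt.Printf("%+v\n", pod.Spec.Containers[0].Resources.Limits)
}
```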
H
So it's been pretty cool technology for us, because it lets users deploy these GPU worker nodes very easily. We've also been looking into using operators to do other things now. Specifically, MPI jobs can also be fairly complex, because you need to have all of these worker nodes participating together. We're also looking at using GPUDirect technology, which is our technology that allows GPUs to communicate with NICs and storage directly, without having to go through the CPU.
F
Okay, so basically, for us, because we work with customers, each customer has different needs and wants. So in terms of adoption of AI — if I understood your question correctly, how they used it before — we started last December with one of the biggest shopping center and supermarket chains in Australia, which wanted to have their training pipeline on Kubernetes. So we can see the trend from our customers: they like to have operators and to use Kubernetes in the AI space.
E
Certainly. We chose to use an operator because it allowed us to condense the serving specification in a very neat and expressive way. It's important that we can express it in a way that the data scientist can understand and control, so that the data scientist is empowered to put together that serving specification and run it without having to wait for the ops team to figure out how to get something up and running.
G
So in our case, I guess, we didn't do Ansible or Helm charts before; however, we had developed a custom framework to deploy Presto on a big cluster, and it was a lot of scripting, a lot of Python, running things in parallel. You have to maintain that code and deal with all the failure cases — you know, what if some machine doesn't update properly — and it's just really, really difficult to do all of this manually.
G
So then, when we decided to move to Kubernetes, most of those things are provided by the framework, right? You still have to instrument the framework and orchestrate all of this, but you don't have to deal with an individual machine in a cluster coming up or down, or being misconfigured, or all those different problems. So that was a huge simplification of how much code we have to develop and provide to our customers, versus what's guaranteed by the framework.
F
So the way I see it, Helm is for packaging the desired state, and then the operator is responsible for making it happen. In my opinion, they are two separate concerns: Helm is for packaging, and the operator is following that and making it happen, or just monitoring and making sure that the desired state is always there. Okay.
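That split — a declared desired state, and a loop that drives the actual state toward it — is the core of the operator pattern. A minimal, framework-free Go sketch of the idea (the `DesiredState`, `observe`, and `converge` names are illustrative, not from any real framework):

```go
package main

import (
	"fmt"
	"time"
)

// DesiredState is what the user declared (e.g. in a custom resource);
// ActualState is what is currently running in the cluster.
type DesiredState struct{ Replicas int }
type ActualState struct{ Replicas int }

// observe would normally query the Kubernetes API; stubbed here.
func observe() ActualState { return ActualState{Replicas: 2} }

// converge takes one step toward the desired state, e.g. by creating
// or deleting pods. Stubbed here to just report the difference.
func converge(want DesiredState, have ActualState) {
	if have.Replicas < want.Replicas {
		fmt.Println("scaling up:", want.Replicas-have.Replicas, "more replicas")
	} else if have.Replicas > want.Replicas {
		fmt.Println("scaling down:", have.Replicas-want.Replicas, "replicas")
	}
}

func main() {
	want := DesiredState{Replicas: 3} // the packaged, declared state
	// The operator's control loop: observe, compare, act, repeat.
	for i := 0; i < 3; i++ {
		converge(want, observe())
		time.Sleep(100 * time.Millisecond)
	}
}
```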
E
Yeah, so I guess in our particular case, because we have the option to create a kind of complex graph of steps — where you have pre-processing, then model prediction, and you might have some sort of post-processing step, or there might be more complex graphs than that — to capture that, it made sense to be able to do it in a custom resource and then have the operator fulfill it.
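As a rough sketch of what an inference graph captured in a custom resource can look like, here are illustrative Go spec types in the style Kubebuilder generates CRDs from; the type and field names are hypothetical, not Seldon's actual API:

```go
package v1alpha1

// InferenceGraphSpec is the user-declared part of a hypothetical
// inference-graph custom resource: a tree of processing steps that
// the operator turns into running services.
type InferenceGraphSpec struct {
	// Root of the graph, e.g. a pre-processing step whose children
	// include the model and a post-processing step.
	Graph GraphNode `json:"graph"`
}

// GraphNode is one step in the graph.
type GraphNode struct {
	Name     string      `json:"name"`
	Image    string      `json:"image"`              // container serving this step
	Type     string      `json:"type"`               // e.g. "transformer" or "model"
	Children []GraphNode `json:"children,omitempty"` // downstream steps
}

// InferenceGraphStatus is what the operator reports back.
type InferenceGraphStatus struct {
	Ready bool   `json:"ready"`
	URL   string `json:"url,omitempty"` // endpoint for HTTP predictions
}
```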
D
One of the things we are finding with large customers is that, however much we want them to just absorb changes and put them into production right away, that's not happening. There is always change management and windows for maintenance — all of them come into play — and we operate with a very low tolerance for anything going wrong. And specifically in the AI/ML world, you've got to change the algorithm probably more frequently, as it learns and updates your model.
D
So you may be deploying new versions of the model every couple of weeks, and sometimes maybe every week, depending on the sensitivity of the problem. When you try to do that without any formalism and simplification like the Operator Framework, it's tough, and you do a lot of, you know, dotting the i's and crossing the t's in order to get through the change management windows. This, we believe, will minimize the risk.
B
That reminds me: I just recently went to a doctor's office and they're still on Windows XP. You get this change management situation where there's not enough value to move off of it, but in the meantime your company has to support the older version.
B
Especially with the rapid iteration of these changes, there are things coming out all the time. And I guess another added benefit is that if you do have a platform like OpenShift and the customer has many different environments, they can upgrade all of those environments, maybe in a single pass. Sunny, maybe you want to go into what that means for your customers as well?
C
Yes, because, as with any machine learning model, you have to keep updating it, sometimes as frequently as on a weekly basis. So once you have these operators in, it will be much easier for customers to understand what it means to get an update, because a customer doesn't want a lot of updates too frequently, and operators make it so much easier.
B
We
talked
a
great
deal
about
the
benefits
of
operators.
Maybe
someone
wants
to
volunteer
and
talk
about
the
challenges
you
faced.
Building
an
operator,
it's
an
I.
Imagine
just
like
any
new
technology.
There's
gonna
be
some
some
herbs
and
flows
and
ups
and
downs.
So
does
anyone
want
to
take
a
first
stab
at
some
of
the
the
complexities
they've
run
into.
E
Actually, I was going to pick up on a point that you made earlier about the variety of tools that are out there. It's a space that's moving so quickly — actually, not so long ago Seldon's operator was written in Java, and we realized the Java support really wasn't keeping up with the changes and had to switch to the Kubebuilder stuff.
F
So — I could talk about challenges for an hour; it's a different talk. But one of the high-level challenges is the build and test pipeline. So imagine that our operator basically creates resources on Azure. Now we are open source, and you want to run an integration test on the PRs.
F
So you want to make sure that it's secure, so that someone can't just provision a cluster of ten nodes through the pipeline, because our build pipeline is part of our GitHub repos. So one challenge is around that. Another challenge is the versioning of the CRDs and the specs that you are saving, between different versions of the operator — if you change the mapping or the model that you're saving (I mean your data model, not your machine learning model), you still have to handle it.
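To make the versioning problem concrete, here is an illustration with hypothetical v1alpha1/v1beta1 spec shapes; a change to the saved data model typically needs a conversion step (in practice hung off a CRD conversion webhook) so specs stored under the old schema still load:

```go
package main

import "fmt"

// Hypothetical old and new shapes of the same saved spec. In v1alpha1
// the model URI and its format lived in one string; v1beta1 splits them.
type SpecV1Alpha1 struct {
	ModelURI string // e.g. "s3://bucket/model:tensorflow"
}

type SpecV1Beta1 struct {
	ModelURI string
	Format   string
}

// convert upgrades a stored v1alpha1 spec to v1beta1 by splitting at
// the last colon. Real operators hang logic like this off CRD
// conversion webhooks so both versions stay servable.
func convert(old SpecV1Alpha1) SpecV1Beta1 {
	uri, format := old.ModelURI, "unknown"
	for i := len(uri) - 1; i >= 0; i-- {
		if uri[i] == ':' {
			uri, format = uri[:i], uri[i+1:]
			break
		}
	}
	return SpecV1Beta1{ModelURI: uri, Format: format}
}

func main() {
	fmt.Printf("%+v\n", convert(SpecV1Alpha1{ModelURI: "s3://bucket/model:tensorflow"}))
}
```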
F
We used to be there, but then we looked into the Operator Framework, and it was quite similar. We have a team of SMEs that we really chatted with before we picked, and we chose the framework based on the skill set of our customers and our team. We picked Kubebuilder because we found it easier to use, but functionality-wise it's exactly the same.
B
And the great thing is there are tools out there. We also have resources in the OpenShift community to help out with getting operators into OperatorHub and, you know, with using the Operator SDK and the framework. So that's great. What about those of you — I think some of the panel members have certified operators — does anyone want to speak about the process? First off, why did you want to go the route of having a certified operator, and if you're currently going through the process, what is it like?
E
I'm
happy
to
say
something
about
that:
yeah,
so
I
think
it
for
us.
It
was
it's
just
it's
really
cool
that
people
thought
you
were
using.
Openshift
can
just
click
the
button
and
they
get
the
operator
installed
it.
It's
like
it's
very
obvious,
win
for
us
to
make
it
the
install
that
easy,
but
also
that
gives
a
lot
of
confidence
to
the
customers
that
are
running
on
openshift,
that
they
know.
G
I would second that. I think the best experience was the help from the team, you know, in terms of testing, troubleshooting any issues, and providing infrastructure to certify our Kubernetes integration on OpenShift. That was the great part. I think the process is still fairly new, so it has some rough edges, but with that extra help it was a really smooth experience. And yeah, there are definitely some notes I have to share for improving the process for the future generation of certifications, but I think it's really good.
G
I
mean
we're
actually
successfully
there
on
open
operator
hub
right
now
and
that's
really
important
for
some
of
our
customers.
Obviously
right
if
you
go
to
a
bank
and
break
Enterprise
and
the
fact
that
it's
certified
and
open
ship,
the
password
we
are
actually
using
for
kubernetes,
is
it's
a
win.
You
know
immediately
sunny.
C
I want to reiterate what the other members said: first of all, confidence for the customers, and the certification process was made so much easier because of the support from Red Hat. We got it done, I think, in a very, very short time period. Of course my team has done it, but they told me Red Hat has been a great partner.
H
I have a different perspective: we haven't been certified yet. I mean, it is kind of encouraging that, you know, the certification process is in place and Red Hat plays a very supportive role for us. I think having a certified operator is very important as we, you know, scale with Red Hat into enterprises, because we have a lot of components that are managed by the operators. So having a certified operator will give customers assurance that, you know, there's support behind it.
B
Those
not
familiar
with
the
certification
process
and
what
it
means.
The
great
thing
about
certification
process
is.
It
gets
the
the
all
of
the
components
of
the
operator
on
the
same
type
of
operating
system.
In
this
case
we
use
ubi,
which
is
you
know
our
way
red
hats
way
of
saying
this
is
a
nice
certified
operator
in
terms
of
wheels
support
it,
but
then
also
you
get
the
support
from
the
partners
as
well.
Then
you
have
that
dual
level
of
support
both
from
red
head
and
the
partners
alright.
E
Well, I suppose I should say we do something similar. We have a way to enable sort of data science functionality in the custom resource. So you can, for example, turn on an outlier detector component, and that will then run alongside your model, and then, if a particular data point is marked as an outlier, you know that's potentially a case where you might get a lower-quality prediction for those particular records.
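Continuing the hypothetical spec sketch from earlier (repeated here in self-contained form), toggling a component like an outlier detector is usually just an optional field on the custom resource that the operator acts on when present:

```go
package v1alpha1

// GraphNode is as in the earlier sketch, trimmed to one field here.
type GraphNode struct {
	Name string `json:"name"`
}

// OutlierDetectorSpec is a hypothetical optional component: when the
// field below is set, the operator runs the detector alongside the
// model and flags predictions whose inputs look anomalous.
type OutlierDetectorSpec struct {
	Enabled bool `json:"enabled"`
}

type InferenceGraphSpec struct {
	Graph           GraphNode            `json:"graph"`
	OutlierDetector *OutlierDetectorSpec `json:"outlierDetector,omitempty"`
}
```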
D
A couple of dimensions to this one. I think, when I heard your question — you're talking about how, in the operator framework, you are doing a lot of data collection and a lot of performance telemetry, etc., so you can use that. Definitely: where there's data, there's always opportunity to learn and make it easier for people to operationalize some of their decisions. At CognitiveScale we have just released a product called Certifai — with "AI" at the end instead of a "Y."
D
So our intention is to migrate towards chaining that to the operator framework, so that when models change — either during change management, or while it's in the process of change management, or while it's functioning — there is a way to measure its dimensions, report the outliers, and be able to initiate, you know, remediation activities. That's what we are hoping; we'll get there as soon as we can.
D
For us, it's more about aligning with our customers — large banks and healthcare companies — and making sure we are going hand in glove with Red Hat, Kubernetes, and OpenShift, and whatever they ask us to do to be, you know, credible as well as operable in their environment. So that's our next goal.
E
Another
thing
we're
working
on
more
closely
in
the
operator
space
actually
is
that
a
collaboration
with
the
coop
flow
project
to
do
serv
model
serving
to
serve
HTTP
predictions,
but
in
a
server
'lest
way
based
on
key
native.
So
you
can
scale
to
zero
and
make
more
use
of
the
underlying
infrastructure
resources.
G
So I think, after the successful release of the initial version of the operator, we want to expand and take more advantage of the native Presto features that we are now developing in conjunction with the Operator Framework capabilities. For example, in a Presto environment you have multiple nodes running, and sometimes you have spikes in load, sometimes it's slower, and there are many different reasons, potential causes, and approaches to improving the performance of the cluster. Sometimes you just have to add more nodes for some time, and that may be based on CPU, which we support today, but sometimes you simply need more memory — this query requires more memory, and otherwise we won't be able to finish successfully in a short time. So we want to add that capability and tap into those values.
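As a sketch of the kind of decision logic this implies — with made-up metric fields and thresholds, not the actual Presto operator's behavior — scaling on memory pressure as well as CPU might look like:

```go
package main

import "fmt"

// ClusterMetrics holds signals an operator might scrape from the
// cluster; the fields and thresholds here are illustrative only.
type ClusterMetrics struct {
	CPUUtilization    float64 // 0.0 - 1.0, averaged across workers
	MemoryUtilization float64 // 0.0 - 1.0, averaged across workers
	Workers           int
}

// desiredWorkers scales up when either resource is under pressure,
// so a memory-hungry query isn't starved just because CPU looks fine.
func desiredWorkers(m ClusterMetrics) int {
	const high = 0.85
	if m.CPUUtilization > high || m.MemoryUtilization > high {
		return m.Workers + 1 // the reconcile loop adds a worker pod
	}
	return m.Workers
}

func main() {
	m := ClusterMetrics{CPUUtilization: 0.40, MemoryUtilization: 0.92, Workers: 4}
	fmt.Println("desired workers:", desiredWorkers(m)) // 5: memory-bound
}
```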
H
So
for
our
from
from
invidious
perspective
there
there
are
two
things
that
I
can
think
of
one
is
we
operate
a
fairly
large
kubernetes
cluster
for
I
mean
internally
to
serve.
You
know
the
companies
compute
needs
for
their.
We
offer
I
know
our
users
and
MPI
operator
just
to
manage
their
MPI
batch
workloads
for
for
for
training,
and
you
know
quickly.
We
are
trying
to
use
kubernetes
to
do
multi,
node,
deep
learning
training
because,
as
the
models
get
supremely
complex,
you
have
billions
of
parameters
to
train
models.
H
So
we
are
looking
to
use
communities
to
do
multi,
node,
training
and
I'm.
Pretty
sure
that
you
know,
there's
going
to
be
more
operator
work
there
as
we
try
to
get
all
these
nodes
cooperating
together
to
do
deep
learning
training,
then,
on
the
other
side
of
the
spectrum,
we're
also
working
on
edge,
you
know
edge
use
cases.
So
there's
going
to
be
a
lot
of
you
know
tiny
little
gpu-accelerated
devices
and
I'm
trying
to
do
inferencing
at
the
edge
related
to
video,
analytics,
smart
cities
and
so
on.
B
Fantastic. Although each one of you has vastly different products, the nice thing is there's a single thread of making things easier and enabling customers, and I think that's really, when you think about OpenShift and the foundation of OpenShift and why it's so popular, you know, at the core of it. And I think operators are following right along there, just making it much easier for customers. So I want to thank you all for participating.