Description
David Aronchick (Google) gives an introduction to Kubeflow to the Machine Learning on OpenShift SIG of OpenShift Commons.
A: So I'll try to be brief. This is an introduction to Kubeflow. Again, to be very, very crystal clear: this is the OpenShift SIG, and it is not my intent to drive this, but we've collaborated with Red Hat to help make Kubeflow work excellently on OpenShift. Obviously that's a work in progress, but it is absolutely part of our core goals that we make this happen. This is just a brief introduction to what we're doing in the Kubeflow project.
A: At a high level, everyone's heard about ML. It's very attractive because it really helps you solve a lot of problems where it's very hard to describe what the answer is. There's an interesting example: at Google we actually used TensorFlow, and some of the DeepMind work, to make massive improvements and cost savings.
A: This is what it looked like before we turned the ML control on. We saw a huge reduction in overall cost, and then we turned it off and the cost came back. So again, we're very, very excited about ML generally, and we want to help bring it to market. The problem is that, for all the magical ML goodness in the world, most people are over here, and in between the two there's a lot of pain. The pain could be adapting your existing processes.
A: It could be understanding whether or not ML actually has an impact on you, and whether or not your problems are actually solvable via ML. So what we really want to do is help people bring ML into their business with as little disruption as possible. We want to do that in a cloud-native way, and by cloud native we mean these core components: composable, portable, and scalable. In this case, composability means the following.
A: Obviously, building a model is just the very smallest part of solving your problem with ML. In fact, when you look at it, it's all the things around ML that end up taking up the majority of your work, and you want to be able to compose those components so that you can use the tools that make sense for you. We are not fans of delivering a single stack and saying only that stack works. We know that every enterprise and every use case is different.
A: We can skip past that. And then finally, you want scalability. ML workloads tend to be extremely bursty: when you're training, you want every possible cycle you can get, and then, when you're done, you want to shut everything down. Beyond that, you want to be able to scale in a roughly linear fashion, so that if you're not training fast enough, or you're not able to support the load, you can directly scale up the machines.
A: When we looked at these problems, we basically said: what's pretty good at solving these? Containers and Kubernetes. It already supports high scale. It is also highly portable: it runs in many, many different locations, on-prem, in the cloud, on OpenShift, no matter where it is, it supports those things. Highly portable means you can take workloads and, as long as they're containerized and described properly, they run everywhere. And it's very scalable: it goes up to thousands of nodes, including things like accelerators.
A: So that's great, except that if you want to run ML on Kubernetes, it often requires understanding a lot of things, and that's very painful, and certainly not something in a data scientist's core job description. That's why we have introduced Kubeflow. The idea with Kubeflow is that we want to make it easy for everyone to develop, deploy, and manage portable, distributed ML on Kubernetes everywhere.
A: So this is our core vision, and we are a loosely bound community; we're happy to change anything, including the vision, if the community decides. As I mentioned, it is highly composable: it allows you to create these components and wire them together using nothing more than YAML. It's very portable: it works everywhere that Kubernetes does, and uses only native Kubernetes APIs; we don't make any changes. And finally, it's quite scalable. We're just getting started: right now, in the box, we have JupyterHub.
A: We have a TensorFlow training controller and a TensorFlow Serving deployment, and then the wiring to make it work on Kubernetes everywhere. So in the box are just these components out of the overall pipeline or stack we were talking about. We are actively talking to folks everywhere.
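To make the serving piece concrete, here is a minimal sketch of how a client might call a model hosted by a TensorFlow Serving deployment. This assumes TF Serving's REST predict API; the model name and in-cluster host below are hypothetical placeholders, not taken from the talk.

```python
import json

# Hypothetical placeholders -- adjust to your deployment.
MODEL = "my-model"
HOST = "http://tf-serving.kubeflow.svc.cluster.local:8500"

def build_predict_request(instances):
    """Build the JSON body the TF Serving REST predict API expects:
    a list of input rows under the "instances" key."""
    return json.dumps({"instances": instances})

def predict_url(host, model):
    """Assemble the :predict endpoint URL for a served model."""
    return f"{host}/v1/models/{model}:predict"

# A client would POST `body` to `url` (e.g. with the requests library).
body = build_predict_request([[1.0, 2.0, 3.0]])
url = predict_url(HOST, MODEL)
```

Because the serving deployment is just a Kubernetes service, the same request shape works whether the model runs on-prem, in the cloud, or on a laptop.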
A: You already mentioned, and I want to point out, the great data science being done by Pachyderm, and we're working with them as closely as we can to try and help integrate. I should also say that all of these steps are not necessarily required to run in Kubeflow or on Kubernetes. We are perfectly okay with, and in fact expect, that many of these components will stay outside of our process, as is, for the foreseeable future. Kubernetes doesn't care where the resource ultimately lives; excuse me, Kubeflow doesn't.
A: So, using Kubeflow: this is what's necessary to set it up. There's some boilerplate here; the stuff at the top is basically just initializing a few variables. The "ks" stands for ksonnet. You do have to install that; it's just a packaging system. Then you install the components that you like.
A: First, you just use this registry that we have hosted, and you can install packages. In this case, we're installing three packages: core, serving, and TF Job. Then, using the namespace, you deploy these components to your ultimate destination, whatever Kubernetes cluster you're looking at, whether that's on-prem, in the cloud, or your laptop. You apply it and now Kubeflow is up. Let's say you don't like TensorFlow; that's fine. You cross out one line and you install, say, scikit-learn instead.
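The slide the talk describes might look roughly like the following ksonnet sketch. The registry path, package names, component name, and environment name are assumptions for illustration, not verbatim from the slide.

```shell
# Initialize a ksonnet app and point it at the hosted Kubeflow registry
# (registry path is an assumption for this sketch).
ks init my-kubeflow
cd my-kubeflow
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow

# Install the three packages mentioned in the talk.
ks pkg install kubeflow/core
ks pkg install kubeflow/tf-serving
ks pkg install kubeflow/tf-job

# Generate the core component and deploy it to whatever cluster your
# current kubectl context points at (on-prem, cloud, or laptop).
ks generate kubeflow-core kubeflow-core --namespace=kubeflow
ks env add default
ks apply default -c kubeflow-core
```

Swapping a component is the same motion: remove one `ks pkg install` line and install a different package in its place.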
A: Maybe you don't like TF Serving; that's fine too, you would swap that out. So again, this is the vision, certainly, but unfortunately those packages, and the little pip magic there, don't yet exist. We are working as closely as we can with various folks to get things just like that into the system.
A: So that's it. Yes, that is it for now. Our goal is really to take this in whatever direction the ML community and the OpenShift community would like us to take it. Our goal is really to solve the boring, gross, annoying ML problems that are out there right now, so that folks can work at a higher level. So what's next? We have been doing a number of community meetings.
A: We would love to have you, if you'd like, in those community meetings for Kubeflow, to bring the OpenShift perspective and what it's like running on OpenShift. Really we're just nailing down the governance proposals. We've moved it out of the Google GitHub repo; that was a frequent request. It is not in the kubernetes org; it's in its own organization and repo right now. It is entirely open source. It is Apache licensed.
A: I mentioned already other popular toolkits: Spark ML, XGBoost, scikit-learn. We want to do auto-scaled serving. We want to do TensorFlow Transform and Model Analysis. And we really, really want this third bullet point, which is: if you are trying to use it and finding problems in any way, shape, or form, or you're trying to contribute and it just doesn't line up, please reach out to me, reach out to the community. We want this to be incredibly open and useful from day one.
A: The next major milestone, we think, will be KubeCon EU; that's the beginning of May, and we hope to reach a fairly stable state by that time. We're not using labels like alpha and beta, but I would certainly consider things very early right now, and by May we hope to see much more production use.
A: I personally know of at least eight sessions that have been submitted relative to Kubeflow at KubeCon, and we'd love to meet up and talk. And then finally, tell us whatever direction things are going; we want to do this in a very, very community-driven way. We also want to make sure, and I come from the Kubernetes world, that we do this in a very not-Google way.
A: Nothing made us happier at Google than when more than 50% of the contributions came from outside Google. I would love nothing more than to have the same in Kubeflow as soon as humanly possible, whether that's other cloud providers like Azure or AWS or DigitalOcean, or you name it. It could be ISVs, people building inside Kubeflow, OpenShift, Red Hat, all those various folks.
A: So, for example, with Jupyter, we want to make sure that Jupyter communicates with whatever ML framework is being used in a standard way, and that's not on Jupyter to do. It's up to us as a community to say: okay, if you're going to respond and be an endpoint for Jupyter, you should make this available, or you should use Kubernetes-native service discovery.
B: Yeah, that helps put it in perspective somewhat. I'm going to take another tack and just say: if I'm a machine learning data scientist, I might use PyTorch, scikit-learn, TensorFlow, NumPy, SciPy, pretty much a variety of things. And from an end-user perspective, and this is not the streaming-data, production, Spark kind of workflow, but me at my desk doing analysis on data, there are things like Paperspace where you can pretty much just click a button.
B: You get a whole GPU with everything already pre-configured. That kind of stuff: DigitalOcean, for example, now offers a machine learning instance. I think there are two different groups: there's the machine learning end-user data science folks, and then there's the "okay, I have to stand this up in production" folks, for, let's say, I was Netflix, with movie stuff coming in all the time. I'm just trying to see where Kubeflow fits in all of that.
A: So, explicitly, our goal is to reduce the difference between those two as much as humanly possible, because I've heard exactly what you said many, many, many times. The problem is that when that data scientist finishes their experimentation, they often have to rewrite, throw out, etc., their ML model in order to port it to the production one. PyTorch and Caffe are a perfect example: most folks that do work in PyTorch end up using Caffe as the production ML framework, and we're trying to reduce that as much as possible.
A: But going beyond that, today, I think the standard experience would be, and let me share this again: today a data scientist might execute the same number of commands, but instead of using something like Kubeflow they'll do a bunch of pip installs. They might use some apt-get installs and things like that, just to make sure the right packages are there, and then they have to do their own coordination: oh, things are talking over localhost, they're on this port.
A: They have to do a whole bunch of stuff. Our hope is that we get this level of stuff down so simply and so cleanly that a data scientist, who certainly is not going to spend a lot of time doing this, but is doing that today, can spin up a production stack on their local laptop, such that it does represent what the standard is for the enterprise, or for the particular job they were doing, with all these various components.
A: They don't care if it's TF Job and scikit-learn and PyTorch and so on and so forth, but they're all done in the way that the organization has said: okay, this is what we want to get to when we get to production. That reduces the overall amount of change and friction and pain when it comes to taking that model that they trained and got running locally to production. So that's really our goal.
A: It is not our goal to make every data scientist into an IT ops person. Absolutely not. That is exactly why we're using Kubernetes: to provide that layer of abstraction from everything under the hood, and really get it to nothing more complicated than just "pip install foo".