From YouTube: OpenShift Commons ML Briefing: KubeFlow On OpenShift with Subin Modeel and Will Benton (Red Hat)
Description
Subin Modeel and Will Benton (Red Hat OpenShift) demo Kubeflow on OpenShift to the Machine Learning on OpenShift SIG of OpenShift Commons
A: Recently, some folks have been working to get it up and running on OpenShift, and I've asked William Benton to do a brief introduction for those who aren't aware of what it is and what its relevance is for us, and then Subin to take us through a demo of what he's been able to get going on OpenShift so far. So with that, William, over to you — sorry, Will, over to you.
B: The more people use them, the more data they have to provide improved functionality. And if you think about what these applications look like, they look a lot like contemporary applications, except that they're also dealing with data: they're training predictive models based on that data, and they may be dealing with a wider range of data sources than a conventional database-backed application. So all of the things you have to deal with in a conventional application still apply, but it's not just a matter of scaling out a web proxy or replicating a SQL database.
B: These are going to be turned into apps by machine learning engineers or app developers. Put another way, we see one of the big problems with turning machine learning into a production application: the data scientists are working in one environment and handing off to other teams that are then going to port it over to work in another environment. And I bet those of you who work with data scientists, or who are data scientists, have been in the position in the past of getting a notebook from a colleague that either doesn't run or doesn't produce the same results that your colleague expects — and I can say this because I've gotten those notebooks from colleagues, and I think I've also given those notebooks to colleagues. Another problem with operationalizing machine learning is that it might depend on specialized libraries and frameworks, and not everyone is going to have those installed — frameworks that aren't going to be packaged for distribution.
B: Another problem, in addition to the monitoring and scale challenges you have with conventional apps, is that intelligent apps also need to monitor the performance of models. A trained model captures something about the data it was trained on, but if new data drifts from the training data, we can get silent failures. So we need some way to monitor the performance of our models in addition to the performance of our application.
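The drift monitoring Will describes — detecting when live data has moved away from the training data — can be sketched with a simple distribution-distance check. This is a minimal, illustrative sketch (plain Python, a two-sample Kolmogorov–Smirnov distance, and a made-up alerting threshold), not anything shown in the talk:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov distance: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a) | set(b)))

def drifted(training_sample, live_sample, threshold=0.2):
    # Flag a feature as drifted when the KS distance exceeds a
    # (hypothetical) alerting threshold.
    return ks_statistic(training_sample, live_sample) > threshold
```

In practice a check like this would run per feature on a sliding window of production inputs, alerting before the "silent failures" mentioned above show up in business metrics.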
Now, OpenShift can make some of this easier by providing a really nice workflow for reproducible builds and reproducible deployments, from a Git repository through a source-to-image builder. You can basically ensure that if you use your notebooks with a source-to-image pipeline, your colleagues will be able to reproduce them and run them on OpenShift in the cloud. But you still don't get past the issue of having a common library of these frameworks that are packaged.
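The source-to-image flow described here could be driven by a BuildConfig along these lines. This is a hedged sketch: the names, repository URL, and builder image are illustrative assumptions, not from the talk.

```yaml
# Hypothetical BuildConfig: repo URL, builder image, and names are
# illustrative, not from the talk.
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: notebook-s2i
spec:
  source:
    git:
      uri: https://github.com/example/notebooks.git   # your notebook repo
  strategy:
    sourceStrategy:
      from:
        kind: ImageStreamTag
        name: s2i-minimal-notebook:latest             # an S2I notebook builder image
  output:
    to:
      kind: ImageStreamTag
      name: my-notebook:latest                        # reproducible notebook image
```

Because every build starts from the Git source and a fixed builder image, colleagues pulling `my-notebook:latest` get the same environment rather than a notebook that "doesn't run or doesn't produce the same results."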
B
So
we'll
talk
a
little
bit
more
about
that
in
a
second,
but
I
want
to
introduce
coop
flow
at
this
point,
and
the
idea
is
for
coop
flow
is
to
sort
of
view
for
machine
learning
frameworks
on
kubernetes.
What
kubernetes
does
for
application
registration
in
general?
The
idea
is
that
we
have
custom
resource
definitions
for
Jupiter
hub
and
so
that
you
could
have
a
multi-user
collaborative
notebook
tensorflow
so
that
you
can
train
conventional
or
deep
learning
models
and
tensorflow
serving
so
that
you
can
actually
deploy
those
models
as
components
or
production
application.
B: Kubeflow is a new project that was founded at Google. It was just announced last month at KubeCon, but it's already attracted a ton of excitement and attention. The promise of Kubeflow is really that all of these frameworks — and some of the hardware drivers these frameworks need to get good performance — are going to be packaged up so that you can run them either on your laptop in a local Kubernetes cluster, or in the cloud at scale.
B
So
if
you've
been
following
machine
learning
on
open
ship
for
some
time,
you
might
have
heard
of
efforts
like
rat
and
licks
do
which
a
number
of
people
on
this
call
that
are
involved
with
or
the
piece
a
program
and
the
really
nice
thing
about
coop
flow
is
group
flow
is
another
another
spin
on
these
approaches.
So
kupo
uses
custom
resource
definitions
which
weren't
available
when
we
started
working
on
that
emilich
style,
and
it's
it's
another
way
to
get
this
sort
of
the
sort
of
capability
into
kubernetes
and
open
ship.
B
So
we're
really
excited
that
that
more
people
are
interested
in
machine
learning
entities
and
open
ships
and
we're
involved
in
this
community
and
we're
looking
at
how
we
can
take
the
work
that
it's
already
been
done
and
integrate
it
and
benefit
from
the
Google
community
as
well.
So
with
that
I'll
hand
it
back
over
to
Matt.
C: Okay, so what I have set up is a single OpenShift cluster with a GPU: I have a single node in this OpenShift cluster, and I have an NVIDIA GPU in this cluster. I really wanted to show a demo of one of the components of Kubeflow, the TFJob (TensorFlow job) operator: an application training on the GPU, saving the model into a volume, and then doing TensorFlow Serving from that particular volume.
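The last step of the pipeline Subin describes — serving a saved model out of a volume — could look roughly like this Deployment. It is a sketch under assumptions: the image tag, model name, model path, and claim name are illustrative, not from the demo.

```yaml
# Sketch: serve a saved model from a persistent volume. Image tag,
# model name/path, and PVC name are assumptions, not from the demo.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving
spec:
  replicas: 1
  selector:
    matchLabels: {app: tf-serving}
  template:
    metadata:
      labels: {app: tf-serving}
    spec:
      containers:
      - name: serving
        image: tensorflow/serving:latest
        args: ["--model_name=demo", "--model_base_path=/models/demo"]
        ports:
        - containerPort: 8500            # gRPC prediction endpoint
        volumeMounts:
        - name: model-store
          mountPath: /models
      volumes:
      - name: model-store
        persistentVolumeClaim:
          claimName: model-pvc           # the volume the training job wrote to
```

The design point is that training and serving only share the volume: the TFJob writes a SavedModel under the model base path, and the serving pod picks it up without the two workloads knowing about each other.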
C: You can see that it will create a ConfigMap and a deployment for the TFJob operator, so you get the TFJob operator pod running in OpenShift. So, coming back to OpenShift: you can see here that I have the TFJob operator pod, and if I look at the logs, I should be able to see that the controller has started properly. What I'm going to do next is submit a CRD instance, which is our TFJob.
C: So when you submit a TFJob, this is how the template would look. You need to pass in an image in the replica spec, and you can optionally provide the NVIDIA GPU resource limits. If you don't have a GPU, you don't need to pass this; but if you do have a GPU, just pass it so that the GPU is allocated to that particular pod. I also have with me some templates for a distributed job.
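A TFJob manifest of the shape described here — an image in the replica spec plus an optional NVIDIA GPU resource limit — could look like the following. This is a sketch against the early `kubeflow.org/v1alpha1` API; the job name and training image are assumptions, not from the demo.

```yaml
# Sketch of a v1alpha1 TFJob as described in the demo; job name and
# image are hypothetical.
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: sample-tfjob
spec:
  replicaSpecs:
  - replicas: 1
    tfReplicaType: MASTER
    template:
      spec:
        containers:
        - name: tensorflow
          image: example/tf-matmul:latest   # your training image
          resources:
            limits:
              nvidia.com/gpu: 1             # omit this block on clusters without GPUs
        restartPolicy: OnFailure
```

As Subin notes, the `nvidia.com/gpu` limit is what makes the scheduler place the pod on the GPU node and expose the device to the container; without it, the same job simply runs on CPU.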
C: This is a template for a distributed application doing nothing much: it just has a master, workers, and the default parameter servers. And I have one instance of a sample job, which just has an image doing a matrix multiplication inside. So I will go ahead and just create the job.
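The sample job's workload is essentially a matrix multiplication. A standalone sketch of that kind of toy computation, in plain Python rather than TensorFlow so it runs anywhere (the demo image itself runs TensorFlow on the GPU):

```python
def matmul(a, b):
    """Naive matrix multiply over lists of rows: the kind of toy
    workload the sample TFJob image runs inside its container."""
    rows, inner, cols = len(a), len(b), len(b[0])
    assert len(a[0]) == inner, "inner dimensions must match"
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

# Example: a 2x2 product like the one whose result shows up in the pod logs.
result = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```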
C: Running `oc get tfjob` will show you that our sample job has been created — this is a CRD instance. I come back to OpenShift and see in the operator logs that that particular job has been submitted and the operator is creating the pods. And if I look at the available pods here, I can see that the sample TFJob pod has started and has already computed. If I go to that particular pod and look at the logs, I can see the matrix multiplication output shown here.
C: The distributed job is taking a little bit more time; I think that's related to the computations involved in it. So that's about it for my demo, showing one of the components of Kubeflow, the TFJob operator, working on OpenShift. Internally, it is using a service account which has cluster-admin privileges; that's how it works. Thank you so much.