B
Hello, hello, can you hear me? Good. Hey, hi everyone. I'm a product manager for AI/ML on OpenShift; that's my focus. I'm very pleased to introduce this next topic, which is really about: how do you do machine learning, automated, on top of OpenShift? With us today we have Itai and Guy from the IDF, and they're going to start next. Thank you.
A
Sure. So hello, everybody. We're Itai and Guy from the IDF, and today we're going to talk about a machine learning platform we developed on top of OpenShift and Kubernetes that creates state-of-the-art machine learning models and eases the data scientists' and software engineers' jobs in our organization.
So, a little bit about us. I'm Itai, the machine learning team leader in the IDF, and actually everything you see here in the demo is something that we built in our team.
C
I'm Guy. I'm the manager of a private cloud managed-services project, especially the OpenShift that we're going to see today. So, a little bit about the IDF, the Israel Defense Forces, where Itai and I come from. Right now the Israel Defense Forces are in the process of a digital transformation whose main goal is to accelerate the delivery and development of applications inside our systems.
A
So now let's move on to the main topic: how we can make each and every one of you here in this room a machine learning expert who builds really state-of-the-art machine learning models in just minutes or hours. But first we need to understand a little bit about machine learning and its basics.
Actually, machine learning is just learning from previous data in order to predict the future, and we'll follow an example that will demonstrate the whole process throughout the presentation.
A
So, say we have a data set of medical diagnoses, and our mission is to predict whether James will have the flu or not, based on the parameters that we can see here. How can we do it?
Actually, we have a wide variety of machine learning models and algorithms. This is really a small sample of them, and we can use each and every one of them to create models that help us predict the future.
A
But each model here has its own configuration, its own parameters, and it's usually a very exhausting task just to select the best one and fit it to the data.
So we need help, because it can actually take a lot of work and a lot of time to find the best solution to the problem. So let's deep-dive into what a data scientist really does when they get a new data set and start building models. First, we start with the data.
A
We have a step called data engineering, which includes removing irrelevant columns and making the data more predictive. In the flu example we will obviously remove the patient-name column, because it obviously will not predict whether someone will get the flu or not; but we assume that fever, for example, will give us more predictive power, so we'll obviously give it more weight.
A
After that we start the machine learning task, which is just taking a lot of algorithms and models and fitting them to the data set, to the problem, in order to get a predictive model. You can see that it's a cycle, and it's almost always an exhausting process that takes a lot of time.
A
It's based on trial and error, so it can take even a few months for a specific data set. And after that we have the operations: we need to serve this model as a production service to different applications and consumers that consume the result of the model to predict the future. As a rough estimate, in an organization, if we take 100 new data sets that come into the organization, then in the final count,
A
only five really make it to production, and that's because of this exhausting process of building machine learning models. Some of them fail because of really, really small things, and it's just a waste of data and a waste of knowledge, really. So from this pipeline we can gather four top challenges that we set out to solve. The first one is environment.
A
The second one is history. We saw that we have a cycle, the machine learning building process, and each time this cycle includes a lot of model evaluations that are usually not stored anywhere. So we're not keeping track of the results we got for each model and its configuration, and this is not good, because it's not only a waste of a lot of time: we could use this history later in other projects and experiments to make the building process more efficient, and we're going to see that in the demo as well.
A
The third one is optimization. We look for a tool that just takes a wide search space and gives me the best combination of the search, just gives me the best model, obviously. We'll see how we solved it, distributed on OpenShift, and how we utilized OpenShift for it efficiently. And the last one is deployment. Here we have a gap between the data scientist's knowledge and the software engineer's knowledge, because the data scientist doesn't know enough about Docker, writing applications, and operations.
A
So we can't just take his mathematical model and deploy it to production as a REST API; and, on the other hand, the software engineer doesn't know how to handle this mathematical model that the data scientist built and expose it as a REST API. This is why a lot of models really just don't make it to production, and we're going to see how we solved it.
So now, after we've understood the challenges, let's see how we solved them. Let's go back to the first challenge.
A
It was environment. We deployed JupyterHub, which is a common tool among data scientists these days, and our resources are allocated dynamically. So let's just spawn a new notebook; here you can see a variety of machine learning environments.
A
Actually, each notebook is just a Docker image that is controlled by us, and when the research is over, when the data scientist goes to sleep at the end of the day, he just turns off the computer, turns off the notebook, and the resources are freed for other data scientists. This includes GPUs as well, and nobody has enough GPUs today.
A
So this is how we solved this. Now let's move on to my environment that I prepared before. You can see here Jupyter with the flu data set we have just generated: we have here the same data that we saw in the example, and a notebook that includes our demonstration. So what am I going to do now? We're just trying to fit the data to machine learning models, so we're going to run the basic data science operations in order to fit the data to them.
A
So now we're just reading the data; we are dropping irrelevant columns, like the name column; we are converting some columns to numbers and dates, so the mathematical algorithm can understand what's in the data; and we are splitting it so that we can evaluate it. Now, remember the second problem, the history problem. I'm going to build three decision-tree models. A decision tree is just a predictive model that learns from the data and knows how to predict future data.
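The preprocessing steps just described (dropping the name column, encoding values as numbers, and splitting for evaluation) can be sketched in plain Python. The column names and the 80/20 split below are illustrative assumptions, not the exact code from the demo:

```python
import random

def preprocess(rows):
    """Drop the irrelevant 'name' column and encode yes/no fields as 0/1."""
    encoded = []
    for row in rows:
        encoded.append({
            "fever": 1 if row["fever"] == "yes" else 0,
            "cough": 1 if row["cough"] == "yes" else 0,
            "flu":   1 if row["flu"] == "yes" else 0,
        })
    return encoded

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle and split the rows so the model can be scored on held-out data."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

data = [
    {"name": "James", "fever": "yes", "cough": "yes", "flu": "yes"},
    {"name": "Dana",  "fever": "no",  "cough": "yes", "flu": "no"},
    {"name": "Lee",   "fever": "no",  "cough": "no",  "flu": "no"},
    {"name": "Noa",   "fever": "yes", "cough": "no",  "flu": "yes"},
    {"name": "Omer",  "fever": "yes", "cough": "yes", "flu": "yes"},
]

train, test = train_test_split(preprocess(data))
```

The point of the split is exactly what the talk says: the held-out rows are the only honest way to evaluate the fitted model.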
A
We can control the depth of the tree, and this parameter really influences the performance of the model. So I want to create three different decision-tree classifiers, and we're going to change this max-depth parameter every time. We start with a depth of three, and we got 0.54 accuracy.
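In scikit-learn, which the demo's decision trees suggest, the experiment looks roughly like this. The synthetic data is an illustrative stand-in for the flu data set, so the printed accuracies will not match the 0.54 from the talk:

```python
import random
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the flu data set: temperature and cough -> got flu.
rng = random.Random(0)
X = [[rng.randint(35, 41), rng.randint(0, 1)] for _ in range(200)]
y = [1 if temp >= 38 and cough else 0 for temp, cough in X]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Three classifiers, changing only max_depth -- the parameter tuned in the demo.
for depth in (3, 5, 10):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    print(f"max_depth={depth}: accuracy={clf.score(X_test, y_test):.2f}")
```

Each run prints its accuracy and is then forgotten, which is exactly the history problem the talk turns to next.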
A
So now, who remembers the result for depth equal to three? I'm sure that those of us who concentrated enough remember it now, but if I run another 1,000 experiments, I assure you that no one will remember what the result was. And maybe it was a good result, and we are losing a lot of data that way. So right now I'm going to show you how we solved it using our ML tracker platform, which is also hosted on OpenShift. So let's just go.
A
This is its UI, and we are going to create a new project. You can see I just need to specify the project name and a description, and, using the node-selector feature of Kubernetes, I can also control which resources will be running the pods and the workloads that my experiment needs.
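The node selector mentioned here is a standard Kubernetes feature: the pod spec carries a `nodeSelector` map, and the scheduler only places the pod on nodes whose labels match. A minimal sketch of such a manifest, built as a plain Python dict (the label key and values are hypothetical):

```python
def pod_spec(name, image, node_labels):
    """Build a minimal pod manifest that is only schedulable on matching nodes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": dict(node_labels),  # e.g. pin experiment pods to GPU nodes
            "containers": [{"name": name, "image": image}],
        },
    }

spec = pod_spec("experiment-worker", "ml-worker:latest", {"hardware": "gpu"})
```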
A
You see an empty project right now. Let's return to Jupyter and type in the newly created project here, and we are doing the same thing: we're just running decision-tree algorithms with different parameters. You can see that I'm importing here a Python package we wrote, and we will use it during the whole presentation. We're just creating a tracker here, and when we give it the model object as input, we also give it the accuracy score and any key-value metrics.
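The tracking package itself is internal to the team and not public, so the sketch below only illustrates the idea with hypothetical names: every run's parameters and metrics are recorded under a project, so results outlive the notebook session:

```python
class ExperimentTracker:
    """Toy stand-in for an ML experiment tracker: keeps every run's
    parameters and metrics so results survive beyond one notebook session."""

    def __init__(self, project):
        self.project = project
        self.runs = []

    def track(self, params, metrics):
        # Record one experiment: its configuration and its scores.
        self.runs.append({"params": dict(params), "metrics": dict(metrics)})

    def best_run(self, metric):
        # Later, any project member can ask which configuration won.
        return max(self.runs, key=lambda run: run["metrics"][metric])

tracker = ExperimentTracker("flu-prediction")
for depth, acc in [(3, 0.54), (5, 0.61), (10, 0.58)]:
    tracker.track({"algorithm": "decision_tree", "max_depth": depth},
                  {"accuracy": acc})

best = tracker.best_run("accuracy")
```

Real trackers persist this history to a database rather than memory, which is what lets the team reuse it across projects.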
A
Moving back to the tracker, we can see that we have three different experiments with their results and metrics. We can see that we know how to extract each and every parameter of the model, and that helps us in the future to understand which model was really the best one. So this is the history part.
A
Now let's actually run my models. Here I'm defining a search space: right now I'm searching between two different algorithms, decision trees and k-nearest neighbors, and each algorithm has its own parameters. Right now we're just testing two different parameters for each algorithm, and you can see the ranges here.
A
We are actually defining a really large search space for our optimization problem. For example, the max depth that we tested before: right now we're testing it in a range of 3 to 20, and we're going to define it and then deploy the optimization task to OpenShift through this platform. A good thing to mention is that we are not just running through all the combinations; we have a smart search algorithm that knows how to do it really, really fast. And pay attention to what we need to specify here.
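A search space of this shape (two algorithms, each with its own parameter ranges, explored by a search strategy rather than exhaustively) can be sketched in plain Python. Random sampling stands in for the talk's smarter search algorithm, and the space and objective below are invented for illustration:

```python
import random

# Hypothetical search space: two algorithms, each with its own ranges.
SEARCH_SPACE = {
    "decision_tree": {"max_depth": range(3, 21), "min_samples_split": range(2, 11)},
    "knn": {"n_neighbors": range(1, 31), "p": range(1, 3)},
}

def sample(space, rng):
    """Pick one algorithm and one value for each of its parameters."""
    algo = rng.choice(sorted(space))
    params = {name: rng.choice(list(values)) for name, values in space[algo].items()}
    return algo, params

def search(objective, space, evaluations=100, seed=0):
    """Evaluate `evaluations` sampled configurations and keep the best score."""
    rng = random.Random(seed)
    best = None
    for _ in range(evaluations):
        algo, params = sample(space, rng)
        score = objective(algo, params)
        if best is None or score > best[0]:
            best = (score, algo, params)
    return best

# Toy objective standing in for "train the model and return its accuracy".
def toy_objective(algo, params):
    return params.get("max_depth", params.get("n_neighbors", 1)) / 30

best_score, best_algo, best_params = search(toy_objective, SEARCH_SPACE)
```

Swapping random sampling for Bayesian optimization or a bandit strategy changes only the `search` internals; the space, objective, and evaluation budget are the same knobs the talk's platform exposes.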
A
We just specify the objective function we defined, which search algorithm I'm going to use, the number of workers (we'll obviously see later what that means), and how many evaluations I want to do, meaning how many models I will actually build. The more the better, but it's also power-consuming. So now let's run it and see, behind the scenes in OpenShift, what's going to happen.
A
Moving back to OpenShift, you can see here that a new pod, the manager pod, is being created, and after it three different machine learning workers are created, because that's what we specified before. Their responsibility is just to run models with the different combinations that the manager is responsible for providing. You can see we just specified 100 evaluations, and it starts in about five seconds.
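The manager/worker pattern described here (one manager handing parameter combinations to N workers) can be sketched with threads and a queue standing in for the pods and their message broker; the talk's real system runs the workers as OpenShift pods communicating over RabbitMQ:

```python
import queue
import threading

def run_trial(params):
    # Stand-in for "fit a model with these parameters and score it".
    return params["max_depth"] / 20

def worker(tasks, results):
    while True:
        params = tasks.get()
        if params is None:          # poison pill: manager says we're done
            break
        results.put((params, run_trial(params)))

tasks, results = queue.Queue(), queue.Queue()

# The "manager": enqueue 100 evaluations, spawn 3 workers, collect results.
for depth in range(100):
    tasks.put({"max_depth": depth % 20 + 1})

workers = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(3)]
for t in workers:
    t.start()
for _ in workers:
    tasks.put(None)
for t in workers:
    t.join()

scores = [results.get() for _ in range(100)]
best_params, best_score = max(scores, key=lambda pair: pair[1])
```

The queue is why the worker count is a free parameter: adding workers drains the same task list faster without changing the search logic.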
A
I'm actually not going to show you how we deploy real models, but I'm going to show you what is needed in order to deploy one. We only need to specify which framework the model was built in; the path to where it is saved as a file (we support object storage today); and the pre-processing function name, which is responsible for converting the data we want to predict on into an input the model can understand, using the same basic data science operations we did earlier. Then we run it.
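The three deployment inputs just listed (framework, path to the saved model in object storage, and the pre-processing function name) can be captured in a small spec. The field names and supported frameworks below are illustrative assumptions, not the platform's actual schema:

```python
def make_deployment_spec(framework, model_path, preprocess_fn):
    """Describe everything the serving layer needs to expose a model as a REST API."""
    supported = {"sklearn", "tensorflow", "pytorch"}
    if framework not in supported:
        raise ValueError(f"unsupported framework: {framework}")
    return {
        "framework": framework,
        "model_path": model_path,        # e.g. an object-storage key where the file lives
        "preprocess_fn": preprocess_fn,  # converts raw request data into model input
    }

spec = make_deployment_spec(
    framework="sklearn",
    model_path="s3://models/flu/model.pkl",
    preprocess_fn="preprocess_flu_record",
)
```

Packaging the pre-processing function with the model is what closes the data scientist vs. software engineer gap from the challenges section: the serving layer, not a human, wires them together.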
C
The second and main component of this architecture is the GPU compute nodes. Each compute node has two V100 NVIDIA cards on it, especially for all the GPU workloads that the system needs to run. And actually we had quite a problem there, because right now each pod can be assigned a GPU, and no other pod can share that GPU with it. When you assign one GPU to one pod, the GPU is only for that pod, and actually with a production workload you don't utilize 100% of the GPU, so it's quite a problem.
C
This is the NVIDIA device plugin in use right now. We forked that plugin and wrote our own plugin for GPU scheduling that actually splits the GPU and time-shares it with the other pods on the system, and this is our main GPU component. Right now we are adding four multi-GPU nodes based on NVIDIA GTX cards. Actually, when you want to run more complex machine learning models, you need to run them on multiple GPUs, more than one or two, maybe six or eight at a time.
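The problem and fix described here (the stock device plugin hands a whole GPU to one pod, while a forked plugin can advertise each GPU as several shareable slices) can be illustrated with a toy allocator. The slice count and device names are invented for illustration:

```python
class SharedGpuAllocator:
    """Toy model of a device plugin that advertises each physical GPU
    as several time-shared slices instead of one exclusive device."""

    def __init__(self, gpus, slices_per_gpu):
        self.capacity = {gpu: slices_per_gpu for gpu in gpus}
        self.assignments = {}

    def assign(self, pod):
        # Find a GPU with a free slice; with slices_per_gpu == 1 this
        # degenerates to the stock exclusive-assignment behavior.
        for gpu, free in self.capacity.items():
            if free > 0:
                self.capacity[gpu] -= 1
                self.assignments[pod] = gpu
                return gpu
        raise RuntimeError("no GPU slices left")

# One V100 exposed as 4 slices: four pods share it instead of one owning it.
alloc = SharedGpuAllocator(["v100-0"], slices_per_gpu=4)
pods = [alloc.assign(f"pod-{i}") for i in range(4)]
```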
A
So, AutoML. Until now we just saw optimization, smart development, how we optimized the data scientist's life in our organization; but you actually still needed some knowledge of coding and some knowledge of data science in order to use it. Right now I'm going to show you how we build machine learning models using only a data set, making this whole process automatic.
A
You can see that we know how to extract the relevant columns from the file. We just need to specify a project name again and click the column of interest; in our case, it's the got-flu column. What is happening behind the scenes? We are trying a lot of different machine learning processes, really complex ones, that scale the data and make it more predictive.
A
After that, again, we are using the same optimization tool we used earlier, but right now we're running really complex models that usually require a lot of optimization and consume a lot of time. If we move back to OpenShift, we can see that right now we have only two workers running this task, but each one of them is really strong.
A
Actually, each one has eight cores and 16 GB of RAM for this task, because really complex models are being built inside it, and it may take a while: an hour or even more, depending on the size of the data and its complexity. But we already prepared the deployment that we ran earlier, and we're going to show a REST API that is going to predict for us whether one will get the flu or not, based on only the relevant columns.
A
Right now we are going to give it different input values and see the predicted result. So let's type in some values. Actually, this UI is for testing purposes only; in real applications, the other applications hosted on OpenShift also just request this prediction using the REST API. This is how we really make applications smart.
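A client application would call the model's REST endpoint roughly like this. The endpoint URL, field names, and body shape are illustrative assumptions, and the request is built but deliberately not sent here:

```python
import json
from urllib import request

def build_prediction_request(base_url, record):
    """Serialize one patient record as the JSON body of a prediction call."""
    body = json.dumps({"instances": [record]}).encode("utf-8")
    return request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_prediction_request(
    "http://flu-model.example.svc",            # hypothetical in-cluster service URL
    {"fever": 39.1, "cough": 1, "age": 34},
)
# An application would then send it with: urllib.request.urlopen(req)
```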
A
Okay, let's click Predict and see: this person will get the flu, we think. But let's see how it is shown in our platform. So let's move to the tracker again (we're still at 0%; it may take a while) and move to our preferred project, and you can see here that we got a new deployment with a green status. We can also choose its URL, because we support a kind of A/B testing for models for each project.
A
That's a point worth mentioning. Now let's see what's really behind the scenes and what we have in our OpenShift project. So actually we have scalable deployments of models; we have JupyterHub and a lot of notebooks being spawned on it; we have MinIO, which is an object storage that helps us save models and keep track of experiments; and we have a PostgreSQL that hosts the database.
A
This is actually the database of our application. And we have a RabbitMQ cluster: the master and the workers from the optimization task actually communicate through the RabbitMQ queue, so that we keep stable communication between them. So this is actually our deployment. Now let's talk a bit about the impact and what we did in our organization.
A
Each machine learning model that was built manually, when it got into our platform, improved on average by about 30%; by improvement I mean performance, or other metrics that we evaluate a model by. We increased the number of machine learning models by about 70%, and we got a huge growth in users in the past half year, about six hundred percent overall. And we still face a lot of challenges today. One of them is remote code debugging: not every data scientist works on Jupyter.
C
We actually don't want to keep on with our own written device plugin. We want a more upstream and standardized device plugin that everyone can use, one that shares the GPU the way we want across multiple pods, as we explained earlier in the architecture. And we have a challenge to run and manage multiple clusters of OpenShift and Kubernetes: to manage, operate, and monitor a lot of them in different locations. So these are our current challenges.
B
So, that's great, right? Did you guys like that? That's great. So what I thought I'd do here is take a step back and see what's happening. You are already familiar with OpenShift, so I thought we'd start with the OpenShift architecture. You can see your master and your worker nodes here, and your pods running; they are being exposed as services, either within the cluster itself or through the routing layer. And then underneath all of this you've got the storage.
B
How do you use the best of the software development lifecycle, which we saw earlier with Maguire Bank, for example? How do you bring that to machine learning workflows, and how do you bring that into production? I think that's the first step. The next step really is some of the stuff that Itai showed: you know, the Jupyter notebooks, for example, that he showed, wherein you can spawn off multiple workers that can use GPUs, etc.
B
Those are things that we are trying to do; that's one example. But then there was also an example of how workers and masters communicate with each other using some kind of messaging; in that particular case it was RabbitMQ. There was also the storage that he showed, some kind of interface to an object storage where all this is stored. So what we are really doing, in collaboration internally and with other external partners, is we have created this project called Open Data Hub, and Open Data Hub really is a reference architecture.
B
So the Open Data Hub has two aspects. One is that we have used that reference architecture internally to build machine learning as a service on OpenShift at Red Hat, and we are using it internally to do optimizations for our data scientists; and then we are open-sourcing all those things as a reference architecture. These are some examples of the things shown here. There is, for example, Kafka to do streaming and messaging of data.
B
There is real-time data processing. There is Jupyter itself, JupyterHub, wherein it has these pre-built notebook images, and you can add more as you need. And then we have an AI library, which has optimized frameworks such as TensorFlow, built on the RHEL and Red Hat stack, which includes things such as UBI, the Universal Base Image. So that's basically, in a nutshell, what the Open Data Hub is.
B
It is about saying that it's a reference architecture, all open source, that you can use to create these end-to-end workflows on OpenShift and Kubernetes for building ML as a service and deep learning. That's one aspect. The second aspect really is also to help partners, right? You know, if there is a partner, for example a partner such as Anaconda, for example, for Jupyter.
B
They could use this reference architecture to bring their products and services on top of OpenShift, using operators and the Operator Framework. Itai showed AutoML, and there are other vendors, such as H2O Driverless AI, which can do the same thing. So, all in all, we are excited about this. It's open source, obviously, so go to opendatahub.io, provide pull requests, etc., and give us input and feedback.
B
The other thing that I wanted to quickly highlight before ending is our continued partnership with NVIDIA around this stuff. At Summit we introduced a program called Accelerated AI, and what it is, is really an easy button for bringing AI and creating ML as a service in the enterprise data center.
B
You know, with libraries for machine learning. So what you get with the so-called Accelerated AI program is the NGC containers on OpenShift on x86 servers with GPUs, fully supported end to end from Red Hat, obviously, and from NVIDIA and from the OEM. That's the program that we announced. We're very excited about it, and there is a lot of buy-in at both companies, as well as at the OEM level.
B
So, you know, you can find more about those; there are resources, such as a blog on it that we collaborated on with NVIDIA, and we are also inviting customers who are interested in participating to sign up for this early-access program. So anyway, that's where I wanted to conclude: to thank Itai and Guy for coming over and talking. We are very excited about what we're going to do with respect to AI/ML on OpenShift. We'll handle questions offstage.