From YouTube: OpenShift Case Study cnvrg.io ML Platform | OpenShift Commons Gathering | Virtual Red Hat Summit 2020
Description
OpenShift Case Study cnvrg.io
Building ML Platform on OpenShift
OpenShift Commons Gathering
@ Virtual Red Hat Summit 2020
Hi everyone, my name is Yochay and I'm CEO and co-founder of cnvrg.io. cnvrg.io is a machine learning platform built on top of OpenShift and Kubernetes. We help teams manage, build, and deploy machine learning all the way from research to production. We help bridge data science and engineering teams, and we provide IT with an environment to manage all machine learning resources, utilization, infrastructure, and more. We started cnvrg.io because data scientists are spending 65 percent of their time on DevOps, and 85 percent of models don't get to production.
This is happening because there are two different audiences in the machine learning world. First is the IT side, focused on production machine learning and on infrastructure, opex and capex. On the other side you have the data science team, usually focused on algorithms and insights, and they spend 65% of their time on DevOps.
Now, what cnvrg.io and Red Hat provide is a solution to solve exactly that. We provide everything data scientists and DevOps need out of the box: a managed Kubernetes deployment on any cloud or on-prem environment, fully automated installation and lifecycle management of the application, and all the tools data scientists need for machine learning and AI from research to production, in an open, flexible, container-based, code-first data science platform that integrates with any tools you already have in your ecosystem.
Machine learning today is fragmented, broken between a lot of different tools, scripts, plugins, and disconnected stacks. On the left side you have the MLOps and DevOps work: a lot of effort around configuration and installation, scheduling, resource management, lifecycle, collaboration, and more. On the right side you have the whole data science workflow, from data selection to data preparation to model research, which probably needs to be versioned in the middle. Then you have a lot of experimentation: training different models, visualizing models, validating models, tuning, and deployment.
Once you deploy the model, it doesn't stop there. You also need to monitor and proactively iterate: if there is some sort of model decay, or new data coming into the model, how do you re-trigger this kind of pipeline? This pipeline involves both research and production deployment in a single continuous training and continuous deployment mechanism. Because of this complex environment and these procedures, 85% of models don't get to production. This is a problem identified by a lot of different companies.
This is a paper published by Google a few years ago describing the hidden technical debt in machine learning systems. What happens when a company tries to get machine learning from prototyping to real production scale? Suddenly they face a lot of different challenges: challenges around resource management (who is using which GPU), infrastructure, and monitoring of models both in training and in production. There is a lot of plumbing and very little actual machine learning magic.
cnvrg.io is full stack and container based, just like OpenShift, and it's open, meaning that you can use any kind of framework and any kind of container to build your models. cnvrg.io accelerates everything from research to production across any infrastructure. The OpenShift and cnvrg.io solutions work side by side: we use cnvrg.io on OpenShift to distribute jobs across the different compute resources. Think of it as having one control plane for all AI, to which you can attach different compute resources.
So if you have OpenShift on-prem, OpenShift in the cloud, or a hybrid mix of the two, you can have all the different clusters unified in one environment, and then your data scientists can run machine learning workflows on any of the workers, on any of the pods or containers that they have access to.
So you get one platform to manage training of models and to manage research, in Jupyter notebooks, in VS Code, or in something else. You can use auto-scaling, cloud bursting, and a lot of nice features built in.
In terms of pipelines, cnvrg.io provides a solution to build machine learning pipelines across all your different machine learning compute and jobs. Each component in the graph built with cnvrg.io can run on a different OpenShift cluster.
So you can have the preparation step running on an on-premise Spark cluster or a CPU cluster built on top of OpenShift. Then you can have GPU training in the cloud or on-prem, also using the OpenShift platform, and the last step, the deployment, can be deployed on a public cloud cluster. Flows are also an automation tool for building models, so you can run this pipeline of pre-processing, model selection, and model deployment every day, every week, or even based on new data coming into the flow: whenever there is a new version of the data, the flow can be triggered automatically. Flows are versioned and tracked at runtime and also while the flow is being built, so for every model you build you can always see exactly how it was built: with what data, what metrics, what hyperparameters, what algorithms. Everything is centralized.
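The version-triggered flow described above can be sketched in plain Python. This is only an illustration of the idea, not cnvrg.io's actual API: the `run_flow` callback and the integer version numbers are hypothetical stand-ins.

```python
# Illustrative sketch: re-run a pipeline whenever a newer data
# version than the last one seen appears. The run_flow callback
# and version numbers are hypothetical, not a real platform API.

def make_watcher(run_flow):
    """Return a check() that runs the flow on any new data version."""
    last_seen = {"version": 0}

    def check(current_version):
        if current_version > last_seen["version"]:
            last_seen["version"] = current_version
            return run_flow(current_version)
        return None  # nothing new, nothing to do

    return check

runs = []
check = make_watcher(lambda v: runs.append(v) or f"flow run on v{v}")

# Simulate polling: versions 1 and 2 repeat, so only 1, 2, 3 trigger runs.
outcomes = [check(v) for v in [1, 1, 2, 2, 3]]
```

A real deployment would poll (or subscribe to) the dataset's version metadata instead of being handed integers, but the trigger logic is the same.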
One of the nice, unique things that we have together with OpenShift is that, besides the fact that each node on the graph, each component, can run on different compute resources, cnvrg.io automatically scales up the cluster and frees the resources when the job is over. In this case I'm using multiple clusters, but think of it this way: you can have one cluster for Spark, one for deep learning, and one for classic machine learning, and cnvrg.io will orchestrate all the different jobs with the help of OpenShift.
Okay, so this is the cnvrg.io UI. You can think of cnvrg.io as sort of a GitHub designed for data science; it makes everything really simple. You can share models, resources, experiments, research; everything can be shared, and you can have your whole data science team on one single platform: data scientists, data engineers, and IT. Now, before we dive into one of the use cases here: cnvrg.io relies on OpenShift compute, so you can attach multiple clusters.
cnvrg.io itself can be installed on OpenShift, and then you can attach OpenShift and Kubernetes clusters directly to the platform from the UI. One of the nice things here is that you can track utilization, so you can make sure you're using all the resources, and if you're not, see exactly what the holdup is. We also provide Grafana, Kibana, and other nice open-source tools built into the platform and into each cluster that you connect.
All right, we'll go into MNIST to show you how you build and deploy a model in cnvrg.io. At cnvrg.io we support a lot of different ways to start building models. We even support the most basic way, which is spinning up a Jupyter notebook or even VS Code. You can choose to run VS Code on any of the OpenShift clusters that you have, and you can choose any compute template you want.
Once you spin up a resource, cnvrg.io will allocate the CPU and memory that you need, spin up the container, and get your code from Git or from cnvrg.io, and you'll have a working environment up and running in a few seconds instead of a few hours of setup. This is all within the high security standards that OpenShift provides, and it's the fastest way to get VS Code running on a remote machine.
Alright, next up is Flows, which is what I showed in the slides earlier. This is a really fast way to build any kind of machine learning pipeline you want. In this case I'm loading data from an object storage; it could be MinIO, your own object store, a cloud object store, or anything you want. Then I'm running it through a Spark pre-processing step.
Each of those components, like I said, can run on a different compute resource: this one can run on a Spark cluster, this one on a GPU in the cloud, this one on my on-premise GPU, this one on a CPU. So I get the flexibility to run any kind of task on any kind of compute node that I want. I can have one pipeline spread across all the different compute resources I have, which guarantees high utilization and the best tool for each job.
So what's going to happen here: I'm loading data from an object store, then running a pre-processing step using Spark, and cnvrg.io automatically passes the output data as input to those three different models. Each of the models will run as many times as I set in its internal parameters. Then I'm going to automatically pick the best model I have, based on accuracy, and deploy it as a web service.
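The pick-the-best-model step described here can be sketched generically in Python. To be clear, this is an illustration of the selection logic only; the candidate models, the accuracy values, and the `deploy` stub are made up, not cnvrg.io's real interface.

```python
# Illustrative sketch of "train candidates, pick best by accuracy,
# deploy". The candidates, scores, and deploy() stub are hypothetical.

def evaluate(candidates):
    """Return (name, accuracy) pairs for each trained candidate."""
    return [(name, acc_fn()) for name, acc_fn in candidates.items()]

def pick_best(results):
    """Select the candidate with the highest accuracy."""
    return max(results, key=lambda r: r[1])

def deploy(name):
    """Stand-in for deploying the chosen model as a web service."""
    return f"endpoint for {name}"

# Three mock candidates, as in the three-branch flow; each lambda
# stands in for a training run that reports test accuracy.
candidates = {
    "logreg": lambda: 0.91,
    "random_forest": lambda: 0.94,
    "neural_net": lambda: 0.93,
}

best_name, best_acc = pick_best(evaluate(candidates))
endpoint = deploy(best_name)
```

Swapping accuracy for any other metric, as mentioned next, just means changing the key passed to `max`.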
Of course, I can customize it based on any metric I want, and I'm going to pick the best model and deploy it as a web endpoint. This is really cool, because in one pipeline I've done pre-processing on Spark, training on GPUs or CPUs, and then deployment to a remote OpenShift cluster. Once this pipeline is triggered, cnvrg.io automatically tracks everything: all the input and all the output are automatically versioned.
Data is versioned across the different components, and you get a table that you can track in real time, where you can see all the different executions of the graph. You can go into a specific experiment of the graph and see resource utilization; hyperparameters and metrics are automatically tracked. You can see metadata about the run, and we also automatically plot your accuracy and other metrics.
So it's great for research too: you can track and monitor the different models and see what kind of hyperparameters work best with what kind of models. You can even select a few models like this and click compare, and then you see all the different models side by side and exactly what happened in each model at every step. In terms of serving: we did model selection, and the best model was automatically deployed as an endpoint, so we use OpenShift as the backend for this as well.
We deploy your file and function as a web service on OpenShift, and the cool thing is, besides the fact that we completely automated the DevOps, we help data scientists and data engineers get sort of an X-ray of the model. You can see everything that's happening; all the input and all the output are automatically tracked, so you can build new datasets using this data, or you can see the activity of the model.
You can see what happened recently, and you can also deploy new versions of the model using a canary release, which helps you gradually roll out new models and continuously test them as they're being deployed.
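The canary idea, sending a small, growing share of traffic to the new model version while the old one keeps serving the rest, can be sketched like this. The routing function, version names, and the 10% weight are illustrative assumptions, not the platform's actual mechanism.

```python
# Illustrative sketch of canary routing: each request goes to the
# new version "v2" with probability canary_weight, else to "v1".
# Weight, names, and seeded RNG are hypothetical choices.
import random

def make_router(canary_weight, rng=random.Random(42)):
    """Return a per-request router for a weighted canary rollout."""
    def route(_request):
        return "v2" if rng.random() < canary_weight else "v1"
    return route

# Start the rollout with 10% of traffic on the canary model.
route = make_router(0.10)
versions = [route(i) for i in range(1000)]
canary_share = versions.count("v2") / len(versions)
```

Gradual rollout then just means raising `canary_weight` step by step while monitoring the canary's metrics, and rolling back to 0 if they degrade.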
We also have Kibana and Grafana, which are great tools for IT and DevOps to maintain and monitor the endpoint. And the last part is continual learning. This allows data scientists or engineers to monitor models in production. Today, let's say you have five or ten data scientists.
If each of them has five models in production, very soon you'll have around 50 models in production to monitor. That's quite hard, and each model requires a different kind of monitoring.
What we did is make it extremely simple for engineers to add alerts to their models. You can track model confidence, data quality, or any parameters you want, and then trigger an email to the data scientist. So let's say your model receives bad input: you can automatically send an email to the data scientist.
Another cool example that we see a lot with our customers is being able to track the confidence of the predictions: if the prediction confidence drops below 0.5, for example, over a specific period of time, then automatically retrain the model. It will basically trigger the pipeline that we just built before and make sure the new data is fetched.
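The continual-learning mechanism described, retraining when prediction confidence stays low over a window of time, can be sketched generically. The 0.5 threshold comes from the talk; the sliding window size and the `on_decay` callback are hypothetical stand-ins for the real retraining trigger.

```python
# Illustrative sketch of a confidence monitor that fires a retraining
# trigger when average confidence over a sliding window drops below a
# threshold. Window size and callback are hypothetical.
from collections import deque

class ConfidenceMonitor:
    def __init__(self, threshold=0.5, window=5, on_decay=None):
        self.threshold = threshold
        self.scores = deque(maxlen=window)
        self.on_decay = on_decay or (lambda: "retraining triggered")
        self.triggered = False

    def observe(self, confidence):
        """Record one prediction's confidence; fire when decay is seen."""
        self.scores.append(confidence)
        window_full = len(self.scores) == self.scores.maxlen
        if window_full and sum(self.scores) / len(self.scores) < self.threshold:
            self.triggered = True
            return self.on_decay()
        return None

monitor = ConfidenceMonitor(threshold=0.5, window=5)
# Confidence decays over time; the trigger fires once the 5-observation
# average falls below 0.5.
results = [monitor.observe(c) for c in [0.9, 0.8, 0.4, 0.45, 0.4, 0.35, 0.3]]
```

In the setup described in the talk, `on_decay` would kick off the flow built earlier (fetch new data, pre-process, retrain, redeploy) rather than just return a string.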