Description
OpenDataHub: The Open Source Toolbox for Data Scientists
Audrey Reznik (Red Hat)
http://opendatahub.io/
OpenShift Commons Gathering on Data Science
January 28, 2021
https://commons.openshift.org/gatherings/OpenShift_Commons_Gathering_on_Data_Science.html
To find out more about OpenShift Commons, please visit: https://commons.openshift.org
So hello, everyone. My name is Audrey Reznik, and I'm part of the Red Hat OpenShift Data Science team. As a data scientist, I've had the pleasure, or maybe the torment, of delivering AI/ML models to production in a world of Jupyter notebooks, terminal servers, GitLab runners, s2i containers, and OpenShift.
You don't know how glad I am to have discovered Open Data Hub. In this presentation, I'll give you a brief background on how Open Data Hub started and what Open Data Hub is, and hopefully I'll have enough time to conclude with a quick demo. It won't be a live one, it'll be done with slides, but the demo will show you how to deliver an ML model that dwells in the world of fraud detection.

A bit of history on how Open Data Hub started.
It started internally within Red Hat as a platform on OpenShift for data scientists to store their data and run their data analysis workloads, hence the phrase "data hub." Fairly early on, it was realized that data scientists' and data engineers' requirements for tools, and really anything to do with AI/ML components, were pretty different from DevOps requirements. Data scientists, and I can attest to this as a data scientist, are mostly UI driven.
So the main points are sharing machine learning workflows done in notebooks, moving a model to production, and managing the model while in production: monitoring it, making sure that your predictions are accurate, and watching for any data drift, resource usage, GPU memory, and so on. Those are all very important to us, and these are the things that were combined, as multiple tools and components, to obtain an end-to-end AI/ML platform. Hence Open Data Hub is not a single application but really a platform with multiple tools running on OpenShift.
So Open Data Hub is really how Red Hat does artificial intelligence and machine learning internally on OpenShift, and we've learned quite a lot from running machine learning workflows on OpenShift. We faced, and still face, a lot of challenges and issues that we try to resolve and provide solutions for in Open Data Hub. There are maybe three or four issues and challenges. First of all, there are the people: in AI/ML projects there's always a team of data scientists, data engineers, DevOps, product owners, and business developers who need to collaborate and work together.
Secondly, sharing and collaborating around ML development is difficult; sometimes, in fact most of the time, it can be manual and really error prone. Thirdly, another important challenge is just the compute resources themselves. AI/ML workloads are compute heavy, and CPU, memory, and storage are not unlimited resources. I think we all know that, and they're definitely not unlimited in any development or production environment that we're working with. And fourthly, the final challenge, and one that is very critical, is delivering to production and the production development life cycle.
Sometimes that's not as easy as it sounds. So today, Open Data Hub internally runs AI/ML workloads such as application logs: in our internal Open Data Hub clusters, we run anomaly detection on multiple Red Hat application logs. We also have cluster metrics: we gather and analyze the cluster metrics, or sorry, the cluster logs, from OpenShift clusters, and we have an AIOps team dedicated to finding or predicting any issues that may occur there. And finally, we have customer support data.
When we look at what the data scientists do, we're really looking at the next phase, which is model development. What we're doing is analyzing our data, picking certain features, creating a model, training it, and then doing some model validation.
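As a rough sketch of that loop, assuming a hypothetical transactions.csv and scikit-learn (the column names and the choice of RandomForestClassifier are illustrative, not from the talk), it might look something like this:

    # Sketch of the model development phase: analyze data, pick features,
    # create a model, train it, and validate it. File and column names are
    # hypothetical placeholders.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("transactions.csv")  # hypothetical transaction data

    # Pick certain features and the label column.
    features = ["amount", "merchant_risk_score", "hour_of_day"]
    X, y = df[features], df["is_fraud"]

    # Hold out part of the data for model validation.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # Create and train the model.
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Validate: compare predictions against the held-out labels.
    print(classification_report(y_val, model.predict(X_val)))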
The very last phase goes back up into the DevOps realm, and that's really moving and serving the model into production. This phase is not a static, one-stop model serving delivery phase but a constant optimization phase: the cycle of monitoring, optimizing, and serving is a constant cycle that happens for the lifetime of your model. And again, at the end of the day, it's that collaboration between your data engineers, your data scientists, your DevOps, and any business developers that you have.
So next, what I wanted to show you is just a diagram of where you can actually find Open Data Hub. First and foremost, Open Data Hub is an operator that's installed from the OpenShift OperatorHub. You can see that I have an OpenShift screen here; we can go ahead and choose Open Data Hub, and when you look at it, you're able to see the various tools that you might be able to use.
So we take all these different open source projects, such as Kubeflow, adapt them to run on OpenShift, package them within an operator, and then offer it on OperatorHub. Of course, Kubeflow is pretty big and is the central component in Open Data Hub, and we add other components, which you can see on the screen there: things such as Grafana, Spark, Prometheus, JupyterHub, Kafka, etc.
This slide really shows all the different tools and components that are provided by the Open Data Hub platform; each addresses a specific functionality in the end-to-end AI/ML workflow. Again, this will look very similar to the slide that we saw just two slides ago, where, first of all, we focus on data analysis.
We have storage integration, which could be our Ceph storage, working with PostgreSQL or MySQL. We have to have some way of doing data exploration, so we might use Superset or Hue. If we're interested in our metadata, we might have something like the Hive Metastore. Then, for big data processing, we may use something like Spark.
Those are things that the data engineer and the business analyst are very interested in. Then we move on to the artificial intelligence and machine learning, the data scientist's domain. A data scientist may jump into an interactive notebook such as Jupyter and do some of their work in there. If they want to train or fine-tune their model, or work with a distributed model, they may use something like PyTorch. They may use something like Spark for the machine learning applications themselves.
There might be various libraries that they're interested in; in that case they can use the Open Data Hub AI Library. And then finally they're going to look at how they can deliver some of the services for their model, or deliver the model itself, through Kubeflow Pipelines or maybe Airflow.
Some of those services, again, might use pipelines such as Kubeflow Pipelines or maybe Argo, and then finally, if we want to actually take a look at what's going on with our model, we'll use some sort of monitoring tool such as Grafana or Prometheus. So Open Data Hub comes with an ecosystem, and again this is provided by Red Hat and certified partners, basically to help enable our customers. We built this ecosystem around Open Data Hub, and we feel that it provides our customers with a faster go-to-market strategy.
We work with third-party vendors to get them certified to use UBI images and certified operators. These partners then become certified partners that provide support for their tools, integrated with Open Data Hub, and we can look at things such as Seldon or Anaconda, anything that we might use for model serving, etc.
So the first thing that we're going to do is just log into your OpenShift account. From there, we're going to proceed to the Open Data Hub dashboard. We're logged in as a developer, and to do any of the navigation we would use the left navigation panel; right now we're looking at the topology.
So I would just proceed to the Open Data Hub dashboard by clicking on the ODH dashboard operator and then clicking the Open URL button. What'll happen is we'll be presented with the Open Data Hub screen, and we'll have a large choice of options to choose from. As I mentioned, ODH contains a number of tools with which you can build, manage, and deploy your models. We're going to take on the role of a data scientist and work on a fraud detection model.
So when we open JupyterHub, we're first going to have the option to determine the type of notebook that we're going to use. We're just going to use a basic machine learning workflow notebook that we can use to deploy a fraud detection model, and again, just a reminder, we're looking at legitimate and fraudulent transactions in a bank. So we would go ahead, accept the other defaults, and choose Spawn to continue.
When we put the model into production, we actually go back to the OpenShift side and use pipelines. So we're deploying the machine learning pipelines into production with OpenShift Pipelines, and we'll see how we can use the services to make predictions. When we go back to the main OpenShift console and select Pipelines, you'll see that in this case there's a pipeline that we've already created. What we can do is click on the pipeline and see the pipeline details; remember, this pipeline is going to help deliver our model.
So once the pipeline is finished, we have a model, or rather a REST service, that's built with Source-to-Image (s2i). At this point what we'll want to do is take that pipeline service, more specifically its URL, because we're going to be using that URL. You'll see at the bottom I have a service URL (the pipeline operator one for data hub user one, etc., etc.), and we'll be using the requests library in Python to interact with the REST service that we've just managed to deploy.
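To give a rough idea of what such an s2i-built prediction service might look like, here is a minimal sketch assuming Flask and a serialized model file; the /predict route, payload shape, and model.joblib are assumptions, not the actual code behind the demo service.

    # Minimal sketch of an s2i-buildable Python REST service for predictions.
    # Route name, payload fields, and the model file are illustrative assumptions.
    from flask import Flask, jsonify, request
    from joblib import load

    app = Flask(__name__)
    model = load("model.joblib")  # hypothetical serialized model

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expect a JSON body such as {"features": [[...], [...]]}.
        payload = request.get_json(force=True)
        predictions = model.predict(payload["features"]).tolist()
        return jsonify({"predictions": predictions})

    if __name__ == "__main__":
        # The Python s2i builder image typically runs app.py; OpenShift routes
        # traffic to port 8080 by convention.
        app.run(host="0.0.0.0", port=8080)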
So if we jump into the Jupyter notebook to interact with our model service, we'll just replace our default host with the generated URL from the pipeline services that we have running. Then, if we run our services, make the request, and run our model, we'll have the model making its predictions.
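A minimal sketch of that interaction with the requests library, assuming a hypothetical service URL, a /predict route, and the same illustrative feature columns as above:

    # Call the deployed REST service from the notebook with the requests library.
    # The host URL, route, and payload fields are illustrative assumptions;
    # substitute the URL generated by your own pipeline service.
    import requests

    MODEL_HOST = "http://pipeline-operator-odh.apps.example.com"  # hypothetical

    payload = {"features": [[120.50, 0.12, 14], [9800.00, 0.91, 3]]}

    response = requests.post(f"{MODEL_HOST}/predict", json=payload, timeout=30)
    response.raise_for_status()

    # Expect something like {"predictions": [0, 1]}, where 1 would flag a
    # fraudulent transaction.
    print(response.json())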
In this case, we have a lot of legitimate predictions; on the right-hand side of the screen you can see them under Predictions. That could mean that we're doing very well. If we ran this model a little bit longer, we'd probably see some fraudulent predictions coming up. All in all, that looks very good.
So then, what we want to do is take a look, graphically, at what our legitimate and fraudulent transactions look like over time. We can go back to ODH and launch Grafana.
I apologize for the screen capture being sort of fuzzy, but what it's showing you is, over the course of a day, the number of legitimate transactions, which should be a lot larger than the number of fraudulent transactions we are detecting.
Now we have the ability to visually monitor our service for fraudulent and legitimate transactions, and that's all done using the Open Data Hub services: we were able to deploy a Jupyter notebook, get our model running within that notebook, and then go back into Open Data Hub for another tool that lets us actually see some of the services that we have running from our model in the back end.