OpenDataHub Fraud Detection
ML SIG
OpenShift Commons
July 12 2019
Open Data Hub project is a reference architecture for an AI and Machine Learning as a service platform for OpenShift built using open source tools.
A: All right. In this presentation I'll give a little introduction to what Open Data Hub is, then dive deeper into the fraud detection use case that we implemented using Open Data Hub, and at the end I'll have a demo. It is a recorded demo, not a live demo, just because I don't have the cluster with everything running at the moment. So let's start.
A: We also run this platform internally within Red Hat, and we have a lot of internal customers who use it. That's where we learn what the pain points are, and we try to bring in the tools to solve any pain points or issues in an end-to-end AI platform. I just want to say that end-to-end AI/ML is very complicated. It is not like any other simple software engineering platform.
A: It involves many kinds of users of the platform, a lot more tools, and a lot more complexity than the software engineering problems we're used to. So with this in mind we came up with the Open Data Hub project, and today it is an operator that you can download from the community operators on OpenShift. It's basically a one-stop, easy install for all the tools that you need to run your AI/ML.
A: As for the tools that we bring in, we started out with the basics, and I'll show you in a little bit what our roadmap looks like. Basically, we thought about the users of AI/ML. We started with the data scientist, so we provided Jupyter notebooks. We thought of the data engineer, and we provided Ceph storage. Then we also thought about the processing part, which is Spark, and I'll talk in depth about this in a second. We also have tools for DevOps.
A: AI/ML is not just training models. It's also serving them, monitoring them, and being able to have that monitoring feedback flow back into the AI/ML models. All right, let's move on to the next slide. This is kind of a busy slide; it is a high-level reference architecture for end-to-end AI.
A: Everything is OpenShift and Kubernetes native, so everything runs on that platform. As you can see, and as I mentioned earlier, we have multiple users for the platform: the data scientists, the business analysts, the data engineers, the DevOps engineers. All these people need to use the tools in the platform. Starting from the bottom: as with any software engineering problem in AI/ML, we always start with data.
A
Your
data
can
be
in
motion
or
can
be
static
somewhere
in
storage
such
as
in-memory
databases
or
data
lakes,
or
a
relational
databases,
or
it
could
be
in
motions,
for
example,
it
could
be
in
Kafka
or
coming
on
an
SVG
interface
from
self
etc.
So
after
we
get
the
data,
we
can
tag
the
data
or
data
clean,
the
data
etc
and
then
give
it
to
the
data
analyst,
which
is
the
next
green
level.
Here
that
we
see
the
data
analyst
will
take
the
data
and
they
will
create
the
models
they
will
analyze
it.
A
They
will
try
to
make
meaningful
predictions
or
meaningful
studies
out
of
it.
Once
that's
done,
then
you
basically
have
models
that
you
need
to
serve
and
that's
where
Selden
comes
into
place
for
the
model,
lifecycle,
ml
flow
or
I
think
we
demoed
that
here
previously
for
stats
coming
out
of
the
model,
the
applications
that
use
the
model.
A: We also have to secure the model, so network security and governance come into play. As you can see, it's a pretty complicated end-to-end platform, all the way from the platform at the bottom up to actually serving the model and monitoring it. On the left side I have just a very simple sample of a workflow that we use most of the time, and it's very related to the fraud detection use case that you will see coming up here. We have the data stored in Ceph; we take that data from Ceph.
A
We
transform
the
data
who
create
the
models
using
Jupiter,
hub,
Jupiter,
notebook
and
spark
or
tensorflow.
We
run
experiments
on
it
and
get
data
out
of
Emma
flow.
Then,
after
we're
happy
and
we're
satisfied,
we
deploy
using
Selden
and
openshift,
and
now
the
model
is
served
and
once
it's
served
together,
metrics
and
we
display
it
on
the
graph
on
our
dashboard
and
sultan,
has
interfaces
for
Prometheus
so
that
you
can
extract
a
metrics
from
the
southern
system
itself
and
the
model
itself
and
you'll
see
that
in
the
demo,
that's
coming
up
for
fraud
detection.
A: On our next slide, just a brief outline of the roadmap for Open Data Hub: what we have today and what's coming down the pipeline. The initial release, which was earlier this year, included, like I said, the basics for a data scientist to grab data and do some analysis. It included JupyterHub, and it's multi-user: multi-user JupyterHub, multi-user Spark clusters. If you have multiple users using the same Open Data Hub installation, they can each have their own JupyterHub and their own Spark cluster. It also included Ceph Nano.
A
The
latest
release
that
we
have
that
we
released
a
couple
of
weeks
ago
included
Selden
for
serving
beaker
X,
which
is
a
notebook
that
has
the
notebook
image.
It
has
a
lot
of
good
tools
for
better
easier
data
analysis.
It
also
includes
included
GPU
support
and
Jupiter
hub
and
we
added
prometheus
and
go
fauna
so
that
you
can
do
your
monitoring
and
came
out
of
the
box
already
monitoring
the
spark.
That's
one
coming
down
a
lot,
a
flying
and
August
and
of
August
release,
we're
gonna,
add
a
lot
of
really
interesting
tools.
A
We're
gonna,
add
the
AI
library,
open
data.
Have
the
air
library
which
I
did
not
talk
about
this
here,
but
I
think
we
demoed
this
some
time
and
the
ml
stick
will
come
again,
probably
demo.
It
again
include
our
cargo,
which
is
the
native
workflow
for
AI
ml.
Many
is
offensive,
will
also
have
stuff
installed
by
rook
and
that's
basically
at
a
high
level.
That's
it
for
the
roadmap.
B: I'm just unmuting everybody now, so you're all unmuted; you were muted before. I wanted to ask a couple of questions. I know that we've created this open platform, Open Data Hub, but I've heard it is already being used; you mentioned the Mass Open Cloud. Are there other places where it's being used in production?

A: Yeah.
A: And also, on the S3 interface: we already have that. We have examples for you to use the S3 interface with any storage, not necessarily Ceph; S3 is the interface that we use right now. So we have Ceph: in the first release we had a Ceph Nano pod running, and the interface we used was S3. So if you have another pod there running other storage that exposes S3, you can do the same thing.
A: So what is the fraudulent credit card transaction use case? We wanted to come up with a use case that captures the whole end-to-end AI/ML. In this case, the other end, which is the feedback loop back to where you serve the model, isn't really covered, but we capture the beginning: getting the data, getting the data scientists to explore the data, and then, after the scientist decides this is the best model, we serve the model.
A
Well,
what
we
did
is
we
grabbed
some
data
from
Cagle.
It's
credit
card
transaction
data.
This
data
set
included
time
of
the
transaction
amount
of
the
transaction
and
21
hidden
features
of
the
transaction
and
they're
hidden
to
protect
consumer
neighbor.
So
we
took
that
data
and
what
we
did
is
we
used
all
the
tools
that
we
have
in
the
open
data
hub
to
kind
of
flow.
Through
this
exploring
the
data
fixing
the
data
baiting
the
model
and
serving
tomorrow,
we
wanted
to
create
a
model
that
can
predict
a
fraud
transaction,
so
you
feed
it.
A
One
of
these
credit-card
transaction-
it
will
tell
you
this
is
fraud
or
this
is
not
fraud.
You
also
wanted
to
monitor
the
model,
so
we
collected
models
on
the
model
and
we
showed
these
metrics
using
a
graph
on
the
dashboards.
Of
course,
metrics
were
collected
from
Prometheus
alright.
So
let's
move
on
to
a
high-level
architecture
slide
again
a
little
bit
more
busy
that
I
want
to
talk
to
it.
A: On the left side you'll see the users of this use case. We have the data scientist and the end user. The data scientist is the person who is creating the models; the end user, whom I'll talk about in a little bit, is the person doing the credit card transactions. Then we have the DevOps person, who is monitoring and making sure everything is running. So we start with the data.
A: We downloaded the data and saved it in Ceph. That's the credit card transaction data, around two hundred thousand transactions. Then we gave it to the data scientists and said: here, take this data and tell us how we can predict a fraudulent transaction. The data scientists used the JupyterHub notebooks and did their analysis. They used Spark for some of the analysis to get the data; as you can see here in the gray Spark box, they have their own Spark cluster.
A
They
have
their
own
dripper
hub,
notebook,
that's
fun
to
play
with
and
then
after
they
came
up
with
the
best
model,
the
way
they
think
and
they
analyze
and
I'll
show
you
an
example,
notes
book
that
we
have
here.
They
know
they
came
up
and
said.
Ok,
this
is
the
best
model
that
we
can
come
up
with.
You
took
that
model.
We
saved
it
as
a
file
called
model
pickle
and
we
save
it
in
ourselves.
A: Then comes Seldon: we created a Seldon custom resource. What the custom resource does is grab that model from Ceph and serve it, and it serves it as an endpoint. Now, to simulate: we don't have real transactions coming into the platform, right? So we had to simulate that.
A: What we did is we created a Kafka producer. That Kafka producer reads one of the credit card transactions every one to five seconds, randomly, and hits the Seldon REST interface for the model, which brings back a prediction saying: OK, this transaction was fraud, or this transaction was not fraud.
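A minimal, stdlib-only sketch of what such a producer loop might look like. The endpoint URL and transaction layout are assumptions; the `{"data": {"ndarray": ...}}` envelope is the shape Seldon's REST protocol expects for prediction requests.

```python
import json
import random
import time
import urllib.request

SELDON_URL = "http://model-endpoint/api/v0.1/predictions"  # hypothetical route

def build_payload(transaction):
    # Seldon's REST protocol wraps feature rows in data.ndarray
    return {"data": {"ndarray": [transaction]}}

def score(transaction, url=SELDON_URL):
    """POST one transaction to the model endpoint and return the response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(transaction)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def produce(transactions, send=score):
    # Replay one transaction every 1-5 seconds, as in the demo
    for tx in transactions:
        prediction = send(tx)
        print(prediction)
        time.sleep(random.uniform(1, 5))
```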
A
But
all
this
is
happening
all
this
metrics
and
data
is
collected
by
Prometheus,
and
it's
shown
in
Griffin
our
battery
boards
for
the
dev
ops
to
kind
of
watch
and
see
how
things
are
operating
and
that's
at
a
high
level
of
what
this
use
case
is,
and
it's
a
that
I
have
next.
Basically,
like
I
said
these
are
the
transactions
most
of
the
stuff
that
needed
storage
was
and
stuff.
A: We used a notebook for data exploration, used Spark for bringing the data into a data frame, and used scikit-learn to create the model, which was a random forest classifier.
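That modeling step might look roughly like this; a sketch on synthetic data rather than the Kaggle set, assuming scikit-learn's `RandomForestClassifier` and the pickle serialization the demo describes.

```python
import pickle
import random

from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the credit-card features: label rows whose two
# feature values sum high as "fraud" (1), the rest as "not fraud" (0).
random.seed(0)
X = [[random.random(), random.random()] for _ in range(200)]
y = [1 if row[0] + row[1] > 1.5 else 0 for row in X]

model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(X, y)

# Serialize the way the demo does: a model.pkl-style blob that a server
# (Seldon, in the talk) can load and call predict() on.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
```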
We saved the model in a file, used Seldon to serve it, used a Kafka producer and consumer to simulate the transactions, and then Prometheus and Grafana for metrics display.
A: I think that's all I have right now. If there are no questions, I can move on to the video of the demo, so I'm going to switch over here. What we should be seeing right now is the OpenShift portal. Yes? Right, all right. We'll start by quickly showing the pods in the platform that we just described, and then we'll move on from there.
A
So
you'll
see
here
that
we
have
the
Kafka
operator
and
you
will
see
in
a
little
bit
that
we
also
have
multiple
Kafka
pas
they're
all
running
you'll
see
the
girl
on
a
pod
in
the
Jupiter
hub.
Jupiter
have
database
I
just
want
to
point
out.
There's
this
to
Jupiter.
We
had
two
users
using
it
at
this
point.
This
was
an
open
to
TLC's,
a
user
and
user
11,
so
they
both
have
their
own
Jupiter
hub.
You'll
see
the
model.
There's
two
models
are
being
served.
A
One
is
on
the
full
200
K
and
one
is
just
the
example
that
I'll
show
just
thinking.
Let's
see
the
seldom
core
and
the
Prometheus
in
Southern
Cross
we're
here
the
spark
cluster
and
again
same
thing,
the
spark
cluster.
We
have
two
users,
you'll
see
to
spark
clusters
with
workers
and
masters
for
each
cluster
and
then
at
the
end,
the
sisters,
the
string
operator.
So
that's
it
for
alright,
let's
move
on
so
this
is
a
notebook
that
our
data
scientists
use
to
kind
of
explore.
A
What's
the
best
way-
and
this
is
just
a
sample-
I-
wouldn't
say
that
this
is
you
know,
production
already
or
anything
like
that,
so
we
uploaded
the
credit
card
data
to
that
nano
on
a
bucket
called
open.
We
get
a
200
request
back
for
only
a
sample
of
the
data
which
is
tanking
over
there
and
that's
just
the
exploration
part,
but
not
the
actual
production
part.
A
So
we
uploaded
this
bucket
called
open
and
we
used
spark
our
spark
cluster
and
you'll
see
that
once
you
open
this
notebook,
you
already
have
a
pointer
to
your
to
your
own
spark
cluster
in
iOS
environment.
Very
well,
then
you
just
connect
to
that
spark
cluster
and
you
get
a
handle
and
a
session
to
the
spark
cluster.
A: Here we're just reading the CSV file from our Ceph storage. It will take some time, and you'll see that we read only 10K out of the 200K records. It will show you here in a little bit that only a small fraction of these transactions are actually fraud, which makes this data set skewed; but that's okay for this little demo. Reading the transactions will take a little time here.
A: We take only the feature columns, dropping time and class, and use the class column, which says fraud or not fraud, as the prediction vector. We do the model fit, creating the model using random forests, and you'll see all these V features; these are the features that are hidden. You'll see that it trained on 7,500 transactions and the remaining 2,500 were left for test, and you'll see that in a little bit, once it's done creating the model.
A
Number,
so
we
took
all
the
features
at
first,
that's
what
the
data
scientist
says
did
they
took
all
the
features?
First
indicated
that
model,
which
is
pretty
big.
Normally
you
don't
have
all
these
features.
You
want
to
pick
the
most
important
features
and
you'll
see
that
so
we
did
the
confusion.
Matrix
and
I
won't
go
through
this
deeper.
Basically, the confusion matrix shows you what was predicted as fraud versus what's really fraud, and what was predicted as not fraud, with the counts for each. You'll see those matrices here; it's just one way of seeing how good or how bad your model is.
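To make that concrete, a small stdlib-only sketch of computing those counts (the labels here are invented, with 1 meaning fraud):

```python
def confusion_matrix(actual, predicted):
    """Return counts keyed by (actual, predicted); 1 = fraud, 0 = not fraud."""
    counts = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts

actual    = [0, 0, 1, 1, 0, 1, 0, 0]
predicted = [0, 0, 1, 0, 0, 1, 1, 0]
cm = confusion_matrix(actual, predicted)
# cm[(1, 1)] true positives, cm[(1, 0)] missed fraud,
# cm[(0, 1)] false alarms,   cm[(0, 0)] true negatives
```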
A: And again, this is just testing for this use case. It's not an in-depth test, just to show what you can do and what the tool is. So we check the important features here.
A
We
took
the
model
and
we
plotted
the
features
based
on
importance,
and
you
can
see
here
that
at
the
top,
seven
or
top
nine
are
important
and
then
the
tailor's
off
the
rest
were
not.
So
we
took
the
seven
important
features
that
you
see
here:
the
V
numbers
and
then
we
recreated
the
model
again.
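A sketch of that selection step. The importance scores below are invented for illustration (in the real notebook they come from the trained model's feature importances):

```python
def top_features(importances, k):
    """Keep the k features with the highest importance scores."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical scores in the spirit of the demo's hidden V features
importances = {"V17": 0.21, "V10": 0.17, "V12": 0.14, "V14": 0.12,
               "V16": 0.08, "V11": 0.06, "V9": 0.05, "V3": 0.01}
selected = top_features(importances, 7)  # then refit the model on these only
```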
A: So we recreated the model using the important features, and we did play around with the parameters that you see, the estimators and the depth; the data scientists did some tweaking here and there. So we do the model fit again.
A
Well,
the
output
is
modeled
optical
and
the
number
of
features
this
time
are
only
eight,
well
the
confusion
matrix
again.
We
do
it
again.
This
is
just
calling
up
and
doing
here,
but
this
is
just
to
show
that
you
can
do
it
and
look
at
it
again
and
then
before
we
serve
the
model,
we
do
like
a
little
test.
We
give
it
not.
A: We filter the test data for not-fraud and send it a not-fraud transaction, and the output is zero, as you see. We do it again with fraud.
A
You
can
see
one
here
for
all
the
fraud,
it's
being
son,
so
we're
happy
with
the
model
data
scientists
happy
with
the
model.
What
he
does
is
ok,
I'm
gonna
load
it
to
tough
again
and
they
put
it
in
a
specific
cold
model,
and
now
the
dev
ops
part
comes
may
be,
or
it
could
also
get
a
data
science
student
it
for
for
actually
serving
the
world,
so
we're
just
showing
here
that
it's
being
uploaded-
and
it's
successful
this
next
steps
that
we
she
will
be
shown
which
is
logging
into
the
open
shift
cluster.
A
You
can
do
that
on
terminal,
we're
just
showing
it
here
in
the
notebook
just
to
make
it
easier
for
us,
so
you
log
into
the
cluster
and
you
create
a
new
project
and
we
create
a
new
custom
resource
for
Sheldon
called
seldom
deployment.
You'll
see
it
here
and
that's
all
them
deployment.
What
it
does
is
it
grabs
the
model,
that's
incest
and
serves
it,
and
it
exposes
a
rest
interface.
So
you
can
see
we
have
two.
A: All right, so that's it for this notebook; not-fraud is giving a zero. Now for the dashboards that we have running on the cluster. This is the first dashboard that we see in Grafana, and it is actually showing all the metrics coming out of the model. The first graph is graphing probability of fraud versus amount; nothing interesting there. You'll see the red spikes are the spikes that are saying fraud. The next one is the probability of fraud versus V17, and here I think there's something interesting.
A
We
see
dips
for
317
every
time,
we're
from
fraud
same
with
the
second
one,
which
is
v10.
We
also
see
dips.
So
this
is
just
some
interesting
things
that
you
know
you
can
look
at
for
a
lot
of
time
and
try
to
come
up
with
something.
This
is
the
core
metrics
for
Selden.
It
just
shows
what
the
errors
are:
HTTP
errors
and
the
success
rate
and
requests
per
second
to
you,
the
model.
A
Then
we
move
on
to
another
dashboard,
which
is
the
calf
cow.
So,
like
I
said,
we
used
cough
a
cough,
got
to
kind
of
simulate
the
transactions,
and
here
it's
showing
us
how
many
brokers,
how
many
partitions,
how
many
messaged
rates
are
coming
in
and,
like
I
said
we
are
randomly
generating
messages
between
1
and
second
to
5
seconds.
A
Moving
on
to
the
cluster
monitoring
board,
and
here
you'll
see
all
the
monitoring
coming
from
open
shift
clusters
such
as
memory
usage,
much
memory,
you
are
using
how
much
CPU
were
using
there's
interesting
part
here,
but
CPU
pod
usage
per
pods,
either
operator
hub
using
the
top
CPU
and
then
I
think
this
is
really
interesting
and
then
pod
memory
usage.
You
can
see
the
spark
cluster
here.
We
use
really
love
and
using
a
lot
of
it,
not
a
lot
of
money
but
top
memory.