Description
"Data Hub" is a collection of open source and cloud components deployed as a "machine learning-as-a-service" platform to solve internal business problems at Red Hat, enabling teams to build, deploy, and execute analytic, machine learning, and AI models.
Repeatable human tasks are being replaced by automation, creating significant opportunity and risk for Red Hat. AI can be applied to our core business and direct customer services. To do so, data must be seamlessly unified from a broad range of sources and made accessible to analytic models.
This presentation is about how Red Hat runs AI and machine learning workloads on OpenShift.
A: So, thank you, everyone, for inviting me to speak on what we're doing with AI and machine learning at Red Hat. I wanted to introduce you to a project that we have going on here called the Data Hub. The Data Hub started off as a reference architecture for how we are doing AI on OpenShift, along with some other open source technologies, and it has since spawned off into also solving internal problems at Red Hat.
A: You can think of the Data Hub as a collection of open source components, with OpenShift and Kubernetes as its foundation. Some of the things we're tying into it are data streams and big data, model training, execution of those models, basic ETL requirements, providing APIs, and then visuals and reports.
A: On top of that, in terms of why we created the Data Hub: we were initially tasked with solving some of the interesting build issues we were having at Red Hat around continuous integration and continuous delivery. The Data Hub actually spawned off as a way for us to aggregate and collect all of that information, and it quickly moved into the AI-as-a-service category.
A: We started to think about, well, as we're collecting all of this data, what can we start to infer from it, and how can we enable other teams to do the same? So what we tried to do is provide a platform that takes some of those mundane, repeatable human tasks, such as reacting when a build fails or deriving insights from log data or metrics data, and starts to automate them. What we found is that's a great way to augment our core business: giving developers the ability to automate a lot of these tasks just makes things a little more efficient on our end. But in order to do that, the first thing we had to do was unify the data, which is why it's called the Data Hub.
A: Some of the things that we talk about when we start to describe AI and ML: AI is like the new BI, right? Everyone says it, but depending on who you're talking to, it has different connotations. So one of the things we did in the Data Hub team and the AI Center of Excellence, which is what the Data Hub team reports into, is level set on this.
A: What do we mean when we say machine learning and artificial intelligence, and how does that stack up against some of the other things like statistics and predictive analytics? As for the bread and butter of the Data Hub, we focused more on the right side of this chart, where we do things like natural language processing, autonomous decisions, and anything from anomaly detection to those types of pattern recognition. That's really where we're focusing, and where our data scientists are spending a lot of their time.
A: To put this in perspective, this is where we get into what I mentioned about using the Data Hub as a reference architecture. At the core of the Data Hub, you'll see on the left side of this, is Ceph object storage. We're using that much like you would use S3 in Amazon's infrastructure: it is our data lake. We have lots of different types of data stored there, anything from Red Hat cloud services; we have data pumping into that.
A: We also have metric data coming in from services such as Prometheus. We also have data that's more operational in nature, and then we also have basic customer information that we store there: support tickets, feedback loops, things like that. We use Ceph as the way to collect all of that data, as Ceph is great for streaming data into those systems. We also use Elasticsearch for more of our raw log analysis.
A: Sometimes, when you have just terabytes of log data coming in, you need an engine that lets you quickly sift through that information and do some kind of visualization, so we use Elasticsearch for that use case. And then we use JanusGraph, a graph database (depending on what part of the world you come from, you might pronounce that differently), for some of the work that we're doing with stacks and intelligent stack recommendations. If you are deploying a stack that's focused on artificial intelligence, it gives you recommendations like: hey, you might want to add these packages, or your packages may be becoming out of date, and here's the impact on your system. We use a graph database to handle those types of use cases. On the ingestion side of things, we are using Kafka; there's a project called Strimzi that will be part of AMQ as well, and the Strimzi project runs all on Kubernetes too. We also use Logstash.
A: We have a homegrown shipping service which basically takes Jenkins artifacts from our build systems and pumps them into our system so that we can analyze those artifacts. We also use OpenWhisk: if you're familiar with serverless actions, OpenWhisk is an open source technology for those serverless actions, and that allows us to do things.
A: We use Spark, again on Kubernetes; there's a project called radanalytics, and we're leveraging their technology for Spark on Kubernetes. We also have JupyterHub deployed, and JupyterHub allows our data scientists to get access to the data in Ceph and Elasticsearch using Spark as the processing engine, but they can also use other images and other types of notebooks as well, with scikit-learn, PySpark, and things like that. On the reporting end, we have Kibana; we use that for our basic visualizations, and it's hooked into Elasticsearch.
A: All of that rolls into our service layer, where we have something called a common AI library. Internally, our use case for that is this: as the data scientists on our team and on other teams create these analytical models, there's a lifecycle that has to happen. They may play around with things, but then they publish a model into the execution engine, and we say, well, you know what, that was actually pretty cool; I think other teams might like this anomaly detection. So we're building out an AI library that allows the data scientists to take those models and put them in a place where other teams can leverage them: take that model, deploy it, but then pass their own data through it and get some results out. We're doing that in a number of use cases that we've just started, and we'll be publishing that AI library pretty soon. All of that sits on top of monitoring and alerting.
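The publish-then-reuse pattern described above could be sketched roughly as follows; every name here (register_model, score, the "anomaly-demo" model) is a hypothetical illustration, not the actual common AI library API, which had not been published at the time of this talk.

```python
# Minimal sketch of a shared "AI library" registry: data scientists publish a
# model under a name, and other teams pass their own data through it.
_registry = {}

def register_model(name, predict_fn):
    """Publish a trained model's predict function for other teams to reuse."""
    _registry[name] = predict_fn

def score(name, records):
    """Run someone else's published model against your own data."""
    return [_registry[name](r) for r in records]

# A toy "anomaly detection" model: flag values far from an expected baseline.
register_model("anomaly-demo", lambda value: abs(value - 100) > 20)

print(score("anomaly-demo", [95, 180, 104]))  # → [False, True, False]
```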
A: I'm going to skip this slide; it kind of shows the OpenShift side of things. A basic workflow that I'll show really quickly is how we have several different data sources. The top part here shows a little bit of how we consume data into Elasticsearch and Kibana: we have various data ingestion services that pull in data from various build systems, we take those logs and pump them through Kafka, again with the Strimzi project, and then we do some kind of normalization on that data.
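The normalization step just mentioned can be sketched in pure Python: raw build records arrive with inconsistent field names, and they are mapped onto one common schema before indexing. The field names here are illustrative assumptions, not the actual Data Hub schema.

```python
# Hedged sketch: map a raw Jenkins-style build record onto a common schema
# before it is indexed into Elasticsearch.
def normalize_build_record(raw):
    """Map a raw build-system record onto a common schema."""
    return {
        "job": raw.get("jobName") or raw.get("job_name", "unknown"),
        "build_id": str(raw.get("buildNumber") or raw.get("id", "")),
        "status": (raw.get("result") or raw.get("status", "UNKNOWN")).upper(),
        "duration_ms": int(raw.get("duration", 0)),
    }

raw = {"jobName": "rpm-build", "buildNumber": 42, "result": "failure", "duration": 91000}
print(normalize_build_record(raw))
# → {'job': 'rpm-build', 'build_id': '42', 'status': 'FAILURE', 'duration_ms': 91000}
```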
A: The model-designing side of this just shows a basic use case: we have data coming in and landing in Ceph, and then the data scientists use JupyterHub right now, along with Spark, to get access to the data that's stored in Ceph. We use a combination of Hadoop and Amazon drivers to get access to that data. We don't actually use Hadoop underneath the covers; we just use their jar files to get access to it using Spark.
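The access path described above (Spark reaching Ceph through the Hadoop S3A connector, the "Hadoop and Amazon drivers") is typically wired up with configuration along these lines; the endpoint and credentials below are placeholders for illustration, not the actual Data Hub values.

```properties
# spark-defaults.conf fragment: point the S3A connector at a Ceph
# S3-compatible RADOS Gateway instead of Amazon S3 (illustrative values).
spark.hadoop.fs.s3a.endpoint           http://ceph-rgw.example.com:8080
spark.hadoop.fs.s3a.access.key         <ACCESS_KEY>
spark.hadoop.fs.s3a.secret.key         <SECRET_KEY>
spark.hadoop.fs.s3a.path.style.access  true
spark.hadoop.fs.s3a.impl               org.apache.hadoop.fs.s3a.S3AFileSystem
```

With settings like these, notebooks can read `s3a://bucket/path` URIs from Spark without running any Hadoop cluster, which matches the "just their jar files" point above.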
A: On the deployment and execution side: what happens after the data scientists take their data model and deploy it? Usually they team up with a data engineer, and we have a number of different environments that they can deploy it on. If it's more streaming data or ad-hoc requests coming in, we normally push that to our serverless actions. An example of that: we just launched, actually next week we'll be launching, a sentiment analysis service. That sentiment analysis service will exist in OpenWhisk actions, so that as data streams in from various systems, we can do an analysis on that data in real time, process the results, and return the results back to users. But for some of the batch jobs that we have, such as model training or doing a batch execution of a model, that's when we go back to the workflow engine, and the data usually comes in off of Ceph to do that training.
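OpenWhisk Python actions expose a `main(args)` entry point that takes and returns a dict, so the sentiment service described above might look something like this sketch. The tiny word-list "model" is purely illustrative; it stands in for the real trained model the talk describes.

```python
# Sketch of a sentiment-analysis serverless action in the OpenWhisk style.
POSITIVE = {"great", "good", "love", "works"}
NEGATIVE = {"bad", "broken", "fails", "hate"}

def main(args):
    """OpenWhisk-style entry point: dict in, dict out."""
    words = args.get("text", "").lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"sentiment": label, "score": score}

print(main({"text": "The install fails and the docs are bad"}))
# → {'sentiment': 'negative', 'score': -2}
```

Because each invocation is stateless, actions like this can be triggered directly off the Kafka stream for the real-time analysis mentioned above.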
A: There's a little bit of a combination of both. The other thing we're working on, which would be part of the serverless actions as well, is the feedback loop. As models need to be revisited and corrected for accuracy, that's going to be done through the serverless actions too. An example of that is the sentiment analysis: if the entity detection or the sentiment analysis of that data comes back incorrect, then we have mechanisms, APIs that are going to be hooked into a UI, that allow the end users to modify that information, and then we retrain the model based on the information that came in. Then, on the tail end of this, is where we get into the AI and ML side of things.
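The feedback loop just described (users correct a prediction through a UI-backed API, corrections accumulate, the model is retrained) could be sketched like this; the function names and the threshold are assumptions for illustration only.

```python
# Minimal sketch of a correction-driven retraining loop.
RETRAIN_THRESHOLD = 3
corrections = []

def submit_correction(text, predicted, corrected):
    """Record a user-supplied label fix; retrain when enough accumulate."""
    corrections.append({"text": text, "was": predicted, "now": corrected})
    if len(corrections) >= RETRAIN_THRESHOLD:
        return retrain(corrections)
    return {"status": "queued", "pending": len(corrections)}

def retrain(batch):
    # Stand-in for kicking off a batch training job on the workflow engine.
    n = len(batch)
    batch.clear()
    return {"status": "retraining", "examples": n}

print(submit_correction("works great", "negative", "positive"))    # queued
print(submit_correction("totally broken", "positive", "negative"))  # queued
print(submit_correction("love it", "neutral", "positive"))          # triggers retraining
```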
We provide a number of different libraries for the data scientists, including the Spark ML library, scikit-learn, NLTK, and Keras, and we're also going to be rolling out TensorFlow with GPU enablement very soon, hopefully in the next month or so.
A: We'll have that available for the data scientists to work with, again all of that being part of OpenShift, and as more types of AI and ML models become available, we'll continue adding those to the images that we have, to make them available for the data scientists. And to kind of wrap this up: I talked a little bit about some of the services that we have, so this just goes through some of those again. For our cloud services, we're doing anomaly detection for infrastructure.
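A hedged sketch of what infrastructure anomaly detection on metric data can look like at its simplest: flag points that sit far from the series mean in standard-deviation terms. A real deployment would use a trained model; this z-score rule is only a stand-in to show the shape of the check.

```python
# Flag metric samples whose z-score exceeds a threshold.
import statistics

def anomalies(series, threshold=3.0):
    """Return indices of points more than `threshold` stdevs from the mean."""
    mean = statistics.mean(series)
    stdev = statistics.pstdev(series)
    if stdev == 0:
        return []  # flat series: nothing stands out
    return [i for i, x in enumerate(series) if abs(x - mean) / stdev > threshold]

cpu = [48, 52, 50, 49, 51, 50, 95, 50, 49]  # one obvious spike at index 6
print(anomalies(cpu, threshold=2.0))  # → [6]
```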
A: For some of our customers that are on the cloud services, we're actively monitoring the collective nature of all of our customers to spot interesting patterns, and then we're working with the service teams to help either resolve issues or offer up new opportunities for customers by analyzing that information. Then on the customer side, on the sentiment analysis and entity detection, we have an ongoing project with a number of teams where we're looking at support tickets, looking at feedback from engagements with customers and deployments of customer environments, feeding all of that data back into the Data Hub, and providing insights on what customers are talking about: trending information, what's working, what's not working, any kind of information we can use to help out on the support side of things. And then we also provide visualizations on top of the Data Hub to analyze the build information that comes into the system.
A: It becomes very important to add the right level of security, metadata, lineage, and auditing capabilities on top of that, so we're actually working with a number of customers to identify commonality across all the customers and provide recommendations, and also to loop that back into the internal Data Hub so that we can add to the governance. We're looking at things like OPA, as well as some of the Apache products, to help us with the data governance side of things. And then, on the AI model lifecycle, we need to elaborate on the use case of storing the models: giving them a proper repository, working on promotion from dev to test to prod, providing performance monitoring of how a model is actually executing in production, whether it's accurate or inaccurate and how it can be made more efficient, and then also backup of those AI models.
A
That's
that's
a
overview
of
where
we
are
with
the
data
hub
and
AI
as
it
as
it's
being
worked
on
in
Red
Hat,
and
certainly
it's
a
very
challenging
but
interesting,
and
we
will
be
publishing
very
shortly.
Not
only
articles
on
on
what
we're
doing,
with
with
the
different
deployments
of
the
data
hub,
but
then
also
providing
all
of
a
public
git
repository
as
well
as
quai
images,
so
that
anybody
can
take
this
deployment
of
the
data
built
on
top
of
open
ship
and
put
that
in
their
own
environments.
And
that
is
it.
B: Hey, really, really cool stuff, and certainly I've heard a lot of these problems over and over again, so I'm really excited to hear you're working on them. I was wondering: is there something we can do in Kubeflow to make doing this easier, or to make integration with the services that you're providing easier?
A: Yes, I'm glad you mentioned it; I can't believe I didn't bring up Kubeflow in all of this. Kubeflow is very much at the forefront of what our team is looking at as well. We're doing a lot of investigation with TensorFlow and getting an integration with Kubeflow going, so I don't have any strong answers as to how that's going to turn out just yet.
A
But
we
do
have
lots
of
Engineers
working
on
integrating
that
with
the
rest
of
the
ecosystem,
especially
as
it
deals
with
in
the
first
passages
is
making
making
cue
flow
available
for
the
data
scientists.
You
know
through
the
UI
and
then
once
we
do,
that
working
on
the
deployment
side
of
it
and
we've
looked
at
a
number
of
different
schools
to
help
out
with
the
promotion
of
models.
We
just
haven't
narrowed
down
on
one
that
we
really
like
just
yet
yeah.