Description
Log Anomaly Detector Service - Zak Hassan @RedHat
@OpenShift Commons AIOps Briefing
April 2019
A: Okay, so my name is Zak and I work in the AI Center of Excellence, under Marcel, in the AIOps group. I'm currently the lead for the log anomaly detection project, and I wanted to present some of the things that we've been building internally, starting with the use case. We're building this log anomaly detector to help our internal customer with root cause analysis.
A: We built it with unsupervised machine learning to identify the anomalies. What we found with machine learning is that it gives you a probabilistic answer rather than a deterministic answer, where probabilistic generally means there's a percentage chance of being correct and a possibility of having false positives, which we'll talk a little bit about.
A: We also generate a prediction ID so that we can reference it later, which I'll talk about more in depth in the next slide, about the fact store. And then we have some monitoring and dashboard tools: we're using Prometheus and Grafana to get a dashboard to see the feedback that we're getting back from the fact store, as well as the metrics that are being generated from the model training.
A: Also, in the inference step, we send out messages to Elasticsearch, which triggers an alert to be fired off and an email sent to the user. In that email we embed a link that uses the prediction ID that was generated. So when the user says, okay, this is not an anomaly, this is a false positive, they can report that and give us feedback on it.
A: The system itself is basically a Python program, and you can specify whether you just want to train, or whether you want to run the train and inference steps together. You can provide a ConfigMap or a YAML file to inject into the program, or you can specify environment variables for the options that are supported. You can either pull data from Elasticsearch or from a local file, and the sink that you write to can likewise be either Elasticsearch or a local file.
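The configuration surface described above might look something like the following sketch. The key names and layout here are illustrative assumptions for the talk, not the project's actual schema.

```yaml
# Hypothetical config sketch for the log anomaly detector.
# Key names are assumptions for illustration, not the real schema.
storage:
  source: elasticsearch        # or "local" to read from a local file
  source_path: /data/logs.json # used when source is "local"
  sink: elasticsearch          # or "local"
model:
  mode: train_and_infer        # or "train" to only run the training step
anomaly:
  threshold_stddevs: 3         # threshold = mean + 3 * stddev of scores
```

The same options could equally be supplied as environment variables, as mentioned above.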
A: I can't take the credit for this whole system myself. I had a great data scientist on the team, Michael Clifford, and an awesome architect, Václav Pavlín, who were also involved in the design of the model training of the SOM model and the overall system. I added the fact store, which was an additional component to this system. So I'll talk a little bit about the model training part.
A: So there are two models. There is a Word2Vec model, and the Word2Vec model does the machine learning: basically, it models the probability of words occurring with each other. We're using it in a somewhat unconventional way, essentially as a pre-processing step: it converts the raw log messages into vectorized representations. Once we're done with that in the training step, the next thing that happens within the training step is we generate another model.
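A real Word2Vec model needs a library such as gensim to train; as a self-contained stand-in, the toy vectorizer below only illustrates the idea from the step above: turning raw log lines into fixed-length numeric vectors. The bag-of-words counting here is an assumption for illustration, not the project's actual embedding.

```python
from collections import Counter

# Toy stand-in for the Word2Vec pre-processing step described above.
# The real project trains an actual Word2Vec model; this sketch only
# shows the idea of mapping raw log lines to fixed-length vectors.

def build_vocab(log_lines):
    """Map each token seen in the training logs to a vector index."""
    tokens = {tok for line in log_lines for tok in line.split()}
    return {tok: i for i, tok in enumerate(sorted(tokens))}

def vectorize(line, vocab):
    """Bag-of-words count vector; tokens not in the vocab are ignored."""
    vec = [0.0] * len(vocab)
    for tok, n in Counter(line.split()).items():
        if tok in vocab:
            vec[vocab[tok]] = float(n)
    return vec

logs = ["connection accepted", "connection refused", "disk error"]
vocab = build_vocab(logs)
print(vectorize("connection refused", vocab))
```

These vectors are what the second model, described next, consumes.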
A: Another model, called a SOM model, a self-organizing map. What that does is compare the vectorized representations of the words in the inference step: basically, we train the map, and then you measure the distance of each inference message to the different nodes on the map, and then determine how like or unlike they are to the nodes on the map.
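The distance measurement described above can be sketched as follows: each node on the trained map holds a weight vector, and an incoming log vector is scored by its distance to the closest node. The node values below are hand-set for illustration; a real SOM would learn them during training.

```python
import math

# Sketch of the SOM scoring step: score an inference vector by its
# distance to the best-matching node on the (here, hand-set) map.

som_nodes = [
    [0.0, 0.0],
    [1.0, 1.0],
    [0.5, 0.5],
]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def anomaly_score(vec, nodes):
    """Distance to the closest map node; larger = more unlike the map."""
    return min(euclidean(vec, node) for node in nodes)

print(anomaly_score([0.4, 0.6], som_nodes))  # near a node: small score
print(anomaly_score([5.0, 5.0], som_nodes))  # far from every node: large
```

A message that lands far from every node is "unlike" everything seen in training, which is what gets flagged.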
A: Basically, it generates a real number; if it's normalized, it would be between 0 and 1. We have a threshold value, which is configurable. For example, three times the standard deviation plus the mean is one way of calculating the threshold value, and this is a standard way to determine whether something is outside the bounds of normal behavior.
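The threshold rule from the talk, mean plus three standard deviations, can be written directly. Whether the population or sample standard deviation is used is a detail the talk doesn't specify; this sketch uses the population form.

```python
import statistics

# Threshold rule described above: a score counts as anomalous when it
# exceeds the mean of the training scores plus N standard deviations.

def threshold(train_scores, n_stddevs=3):
    """mean + n_stddevs * stddev over the training-time scores."""
    return statistics.mean(train_scores) + n_stddevs * statistics.pstdev(train_scores)

scores = [0.10, 0.12, 0.11, 0.09, 0.13]
t = threshold(scores)
print(0.50 > t)  # a score of 0.50 would be flagged as an anomaly
```

Making `n_stddevs` a parameter mirrors the configurable threshold mentioned above.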
A: And then we want to check: has this message been seen before? Has it been reported as an anomaly? That way we try to prevent the user from getting duplicate emails and messages, or, if they reported that it was a false positive, they should never get an email for it again. The next step after this is to pull in that data and retrain the model with the false positives; that's still in development, something we're working on to improve our model.
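The de-duplication check above can be sketched as a simple lookup before alerting. The in-memory sets here stand in for the fact store; the function and names are illustrative assumptions.

```python
# Sketch of the alert de-duplication described above: suppress the email
# if the message was already alerted on, or if the user already marked
# it as a false positive.

seen_anomalies = set()   # messages we have already alerted on
false_positives = set()  # messages the user said are not anomalies

def should_alert(message):
    """Return True only for a first-time, non-false-positive anomaly."""
    if message in false_positives:
        return False  # user flagged it: never email again
    if message in seen_anomalies:
        return False  # already alerted: avoid duplicate emails
    seen_anomalies.add(message)
    return True

print(should_alert("disk error on node-3"))   # first sighting: alert
print(should_alert("disk error on node-3"))   # duplicate: suppressed
false_positives.add("cache miss ratio high")
print(should_alert("cache miss ratio high"))  # known false positive
```

The retraining step mentioned above would then consume `false_positives` as extra training signal.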
A: That is the flow of how this system works. All right, the next thing I'm going to show you is a live demo, because I think slides are cool, but what's cooler than a live demo? So I've prepared an environment here for us, with everything deployed.
So, for the database that's being used: we're using SQLAlchemy, and the database itself is MySQL. Then we have the fact store, which is a Flask application that connects using SQLAlchemy.
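The real fact store is a Flask application backed by MySQL through SQLAlchemy, as described above; as a self-contained stand-in, this sketch records user feedback keyed by prediction ID using sqlite3. The table and column names are illustrative assumptions, not the project's actual schema.

```python
import sqlite3

# Minimal stand-in for the fact store: persist user feedback on a
# prediction, keyed by the prediction ID embedded in the alert email.

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE feedback (
        prediction_id TEXT PRIMARY KEY,
        is_anomaly    INTEGER,   -- 0 means the user says false positive
        notes         TEXT
    )
""")

def record_feedback(prediction_id, is_anomaly, notes=""):
    """Insert or update the user's verdict for one prediction."""
    conn.execute(
        "INSERT OR REPLACE INTO feedback VALUES (?, ?, ?)",
        (prediction_id, int(is_anomaly), notes),
    )

record_feedback("pred-42", False, "it is just wrong")
row = conn.execute(
    "SELECT is_anomaly, notes FROM feedback WHERE prediction_id = ?",
    ("pred-42",),
).fetchone()
print(row)  # (0, 'it is just wrong')
```

The dashboards mentioned earlier would read this table to chart how many predictions were marked wrong.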
A: We can see that it did record some anomalies, and it sent some "anomaly was found" messages; this is the score that it found. The data scientists can also look at a graph on a dashboard, which is a little bit easier to look at, to see how we're performing in terms of the system itself. You can also view some of the log messages that were recorded, the prediction ID, the score, and then the internal customer can give us feedback over here.
A: We use a query string, and then it auto-fills these values here. Basically I would say: is this an anomaly? False. And why is this not an anomaly? Because it is just wrong. Once I submit that, the feedback goes in, and it's now in the dashboard; we can see this anomaly marked wrong. And that's it: we iterate through this as we continue developing, and that's pretty much it for the demo.
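The query-string mechanism described above can be sketched with the standard library: the alert email embeds a URL whose query string carries the prediction ID, and the feedback form parses it back out to auto-fill its fields. The base URL and parameter names here are assumptions for illustration.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Sketch of the feedback link: embed the prediction ID in a query
# string so the fact store's form can auto-fill its fields.

def feedback_link(base_url, prediction_id):
    """Build the URL embedded in the alert email."""
    return base_url + "?" + urlencode(
        {"lad_id": prediction_id, "is_anomaly": "false"}
    )

link = feedback_link("http://factstore.example.com/feedback", "pred-42")
print(link)

# The fact store side parses the same query string back out:
params = parse_qs(urlparse(link).query)
print(params["lad_id"][0])  # pred-42
```

Keeping the ID in the URL is what lets a one-click email link land on a pre-filled form.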
A: Are there any questions? Anything that anybody wants me to dive into more in depth? Glad to take any questions.
A: So we have this internal machine learning platform where we have OpenShift, and we have deployed more components like Elasticsearch, Kibana, as well as Logstash and Kafka. Some systems are deployed up there, and their logs are streaming through Kafka and then into Elasticsearch. Some customers would have a requirement for the machine learning; for them, we deployed this as a service that they can utilize.
C: Yes, so the internal customer, and I think that was also your question, is a team within Red Hat that is maintaining build pipelines for containers, and they are using this tooling to identify anomalies in logs of their deployed system. They are running on an internally deployed OpenShift instance. So, basically, you could deploy the same setup that Zak just highlighted for any other team that also runs on top of OpenShift, and I think that's it.
C: Where we're trying to go: this started as a prototype that Michael Clifford did during his internship last year; then Zak took it and sort of productionized it, and now we're trying to actually apply it to a running customer. And it's, again, all open source: you can find it on GitHub and deploy it for yourself.
C: Obviously it has some prerequisites to set up: it needs to be on top of OpenShift, and it needs access to Elasticsearch, where the logs are stored. But other than that, it's an open source project, and we're also in the process of collaborating with a Czech university in Prague to extend at least the research part of it. It's coming out of the CTO office of Red Hat, so it's in no way a productionized system, but it's one of the internal research topics that we're working on. Zak?
B: Are there any other questions or thoughts from folks on the call, for Zak, or for Brian, or for Phil? Well, we have them here, and if there are other topics you folks would like to hear about, we will be meeting again in another month, post Summit and KubeCon, so there'll be lots of busyness in between. So if you've got topics that you learn about there that you want to talk about afterwards, just ping Marcel or myself and we'll get them added to the agenda for the next meeting.