Introduction to ARCANNA - Automated Root Cause Analysis Neural Network Assisted – Bogdan Sass (Siscale)
Recorded at the OpenShift Commons AIOps SIG meeting on March 25, 2019.
A
So
hello,
everyone,
my
name,
is
Bob
dances.
I
am
a
principal
solution,
architect
with
Sai,
Stella
and
I'm.
Here,
to
talk
to
you
about
a
solution
that
we
sigl
have
been
developing.
It's
called
Arcana,
it's
a
short
name
for
a
very
long
duration
for
a
very
long
name.
Actually,
automated
the
root
cause
analysis.
Neural
network
existed
before
we
discussed
what
I
cannot
does
I
want
to
tell
you
why
we
started
work
on
this
project
and
just
to
give
you
a
little
bit
of
a
background.
When something breaks, nobody knows where the issue is; maybe hours later, nobody even knows where to get started on fixing that issue. And the problem here was very well pointed out by Marcel earlier. I wanted to use this image, I wanted to talk about searching for a needle in a haystack, but I think Marcel put it much better: it's like a cat-and-mouse game, and the mice are multiplying like crazy. Just a few years ago, you had your physical server and you had your application.
Today you have many more places in which something can go wrong, and identifying the true culprit when something does go wrong is becoming a more and more difficult task. But we also have some very nice, very useful technology that can help us. Since I don't know if everybody here is familiar with Elasticsearch and the Elastic Stack, I will just do a very quick presentation of them.
First of all, Elasticsearch started as a tool for searching through huge amounts of text. It is also a very powerful way of dealing with time-series data, and nowadays we are seeing Elasticsearch being used more and more for monitoring, because it works very well as a kind of NoSQL database. You can just populate it with time-series data, the metrics that you want to collect, and then aggregate, correlate, and work with those metrics. Also, around Elasticsearch we have an entire ecosystem: the Elastic Stack.
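[Editor's sketch] To make the metrics pattern he describes concrete, here is a minimal sketch, my own illustration rather than anything shown in the talk, of indexing a metric document and aggregating over it. It assumes the elasticsearch-py 8.x client and a cluster at localhost:9200; the index and field names are hypothetical.

```python
# Minimal sketch: Elasticsearch as a time-series store for metrics.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Populate the index with one metric document.
es.index(index="metrics-cpu", document={
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "host": {"name": "web-01"},
    "system": {"cpu": {"total": {"pct": 0.42}}},
})

# Aggregate: average CPU per minute, the kind of correlation work
# the talk describes doing on collected metrics.
resp = es.search(index="metrics-cpu", size=0, aggs={
    "per_minute": {
        "date_histogram": {"field": "@timestamp", "fixed_interval": "1m"},
        "aggs": {"avg_cpu": {"avg": {"field": "system.cpu.total.pct"}}},
    },
})
for bucket in resp["aggregations"]["per_minute"]["buckets"]:
    print(bucket["key_as_string"], bucket["avg_cpu"]["value"])
```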
If you've ever had to collect information from multiple devices belonging to multiple vendors, you already know this issue. I want to know which user has performed a specific action, and all the actions are logged, but what is the field for the user? Is it user, username, user.name, nginx.access.user_name? It's very difficult to correlate data when the fields that are being used differ between tools and vendors, and this is where Elastic has come up with a very nice idea. It's called the Elastic Common Schema.
It's an open-source specification that defines a common set of document fields for data. Once you apply this Elastic Common Schema, once all your data is indexed in the same way, it becomes easy to correlate data from different data sources. So that's one problem that, I won't say is solved, but is in the process of being solved.
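[Editor's sketch] One way that normalization can be done in practice, again my own illustration rather than part of the talk, is an ingest pipeline that renames vendor-specific fields onto the ECS field user.name. It assumes the elasticsearch-py 8.x client; the source field names are just examples of vendor variation.

```python
# Minimal sketch: normalize vendor user fields to ECS with an ingest pipeline.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.ingest.put_pipeline(
    id="normalize-to-ecs",
    description="Map vendor user fields onto the ECS field user.name",
    processors=[
        {"rename": {"field": "username", "target_field": "user.name",
                    "ignore_missing": True}},
        {"rename": {"field": "nginx.access.user_name",
                    "target_field": "user.name", "ignore_missing": True}},
    ],
)

# Documents indexed through the pipeline end up with the same field,
# so queries and correlations work across data sources.
es.index(index="logs-ecs", pipeline="normalize-to-ecs",
         document={"username": "alice", "message": "login ok"})
```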
The second problem when collecting data is this one. This is an actual demo created by the people at Elastic. It shows a real-life troubleshooting scenario using Elasticsearch: a problem that occurred in an application. It's a basic application with multiple processes, multiple micro-services making up the application, and at some point we get an alert. As usual, the alert doesn't say too much: poor performance on the server. From there we go to the dashboards, and without going into the details, we start to dig. We start to look at what has happened: when did the problem start?
We see that there seems to be a problem with one of the containers running on one of the nodes. We go into that container and we see some spikes in the CPU usage. We go there, and finally we look at the processes. We see that there is a backup process that actually runs at a certain interval, and everything becomes slow while the backup is running.
Sorry, I went very quickly through all of this, but the problem here is that there are many sources of data, many places where something could go wrong, and many times we do not know where to start. We start digging: look at the servers, look at the network, look at the application. In the end we will manage to isolate the problem, but it takes a lot of work, and it takes a lot of time. And the question was: can we do things better? Can we improve the time it takes to identify the actual root cause?
Then we try to identify the probable root cause for those events, and with that we can engage the appropriate team. Once the problem has been solved, the feedback actually goes back into ARCANNA. We tell the system what has happened, whether the determination was correct or not, and the system learns from our feedback.
Again, we have our system: we have Elasticsearch with all the data, and we have ARCANNA, which is basically a plugin for Kibana, the data visualization console for Elasticsearch. Inside, we are adding a TensorFlow machine learning model that actually gets access to all the data. So the machine learning system looks at the data and tries to identify what the root cause might be.
Is that enough? Is the determination correct? We don't know; right now, maybe it is not. But we provide feedback. After the troubleshooting steps have been completed, after the root cause has been positively identified, the user provides feedback for ARCANNA. The user tells the system: yes, you're right, this was the actual root cause; or: no, that was not correct, the actual root cause was something else. And the system learns, and all the data also goes back into the Elastic Stack, into Elasticsearch, and with this information the system continually improves over time.
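[Editor's sketch] As an illustration of that feedback loop, here is a minimal Python sketch. It is my own reconstruction, not ARCANNA's code: the index name, feature layout, and model shape are all assumptions. The point is only that analyst verdicts are stored back into Elasticsearch and the classifier is periodically retrained on them.

```python
# Minimal sketch: supervised feedback loop over Elasticsearch + TensorFlow.
import numpy as np
import tensorflow as tf
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def fetch_labeled_events():
    """Events whose root-cause verdict an analyst has already confirmed."""
    resp = es.search(index="arcanna-feedback", size=1000,
                     query={"exists": {"field": "label"}})
    hits = [h["_source"] for h in resp["hits"]["hits"]]
    X = np.array([h["features"] for h in hits], dtype="float32")
    y = np.array([h["label"] for h in hits], dtype="float32")  # 1 = root cause
    return X, y

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

X, y = fetch_labeled_events()
model.fit(X, y, epochs=10, verbose=0)  # learns from accumulated feedback

def record_feedback(event_id, was_root_cause):
    """The analyst's yes/no verdict goes back into the index."""
    es.update(index="arcanna-feedback", id=event_id,
              doc={"label": 1.0 if was_root_cause else 0.0})
```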
It learns to identify the root cause correctly, and this is what we have now. But what about the future? How could this system be used in the future? I need to specify that we are not there yet, but think about a future in which we can actually take action when we are reasonably confident that the root cause has been correctly identified.
What if we are more than eighty percent sure that the issue was a backup process running on the database server? Can we go in and automate the solution? We believe we can. If we have a certain confidence threshold, and we are above that threshold, we just go in: we have an Ansible script, the script goes to the server and takes corrective action.
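[Editor's sketch] What such confidence-gated remediation could look like, as a minimal sketch: the threshold matches the eighty percent he mentions, while the playbook name, host field, and prediction structure are hypothetical.

```python
# Minimal sketch: run an Ansible playbook only above a confidence threshold.
import subprocess

CONFIDENCE_THRESHOLD = 0.8

def remediate(prediction):
    if prediction["confidence"] < CONFIDENCE_THRESHOLD:
        return  # below threshold: leave it to a human investigator
    subprocess.run(
        ["ansible-playbook", "fix-backup-overload.yml",
         "--limit", prediction["host"]],
        check=True,
    )

remediate({"root_cause": "backup_process", "host": "db-01", "confidence": 0.87})
```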
We might be headed to a point where the problem is solved before the users even notice it. It will not apply to all problems, but if it applies to 50, 60, 70 percent of the problems, it will free up a lot of time and a lot of resources for the people actually doing the investigation. Now, just to show you what the interface looks like: this is the interface. If you have ever worked with Elasticsearch, it will look very familiar, because it is nothing more than another plugin for Kibana.
This is where you define the machine learning jobs. This is where you tell it what fields to take into consideration for the ML job, and, of course, you can also rename some of the fields; if you need to do so, you can rename them from the interface. The ML job starts running, and in the end we get an output like this.
These are the events that were identified, and ARCANNA believes that these three are part of the same set of symptoms: they have the same underlying root cause. We have a web server reporting an internal server error, a 500 error message; we have a SQL server saying that it's unable to write to disk; and we have a server that is out of memory. ARCANNA believes that this out-of-memory condition was the root cause, and that we should investigate this particular server first. Is it correct? Is it not?
We go in, we investigate, we perform our investigation and troubleshooting steps as usual. These are actually toggles: you can switch them between root cause and symptom. In the end, you can go in and tell the system: yes, good job; or: no, that was not correct, try to do better next time. And the system will improve.
Keep in mind that there already is a level of machine learning in the Elastic Stack. Elasticsearch already has unsupervised machine learning that can reduce some of the noise: it can detect anomalies, it can detect when something deviates from normal. We are adding on top of that. We are adding the supervised machine learning component and the automated root cause analysis. So the tools that we have today go up to step 3 that Marcel mentioned earlier.
Now we are adding step 4: automated RCA, automated root cause analysis. And, of course, on top of that you can add plays: you can notify the correct teams, you can add playbooks for automatic remediation if the root cause identification is reasonably confident, and you can always provide feedback.
So that's it for ARCANNA. Of course, if anybody has any questions about the system, I will be glad to answer them. Just please don't ask me too much about the machine learning part: I am not a developer, and a lot of that is magic to me. I would have to ask my colleagues who have actually written the code for that.
Bogdan Sass: But I do have some good news here, and I forgot to tell you about that in the presentation: this technology will be open source. And, as Colleen has said, everything will depend on the size of your network, the complexity of your network, the type of data you're collecting, the type of issues you're encountering, how many of them are repeated, how many of them are new, and so on. But everything, all the code, will be open sourced.
Marcel: That's very, very good news, and I saw that on your other talk. Actually, one of our team members has also prototyped a similar solution, and I already see some room for collaboration there. We also plug into Elastic and we train a model, not a neural network model, but a self-organizing map, to flag anomalies in log files. You're going one step further by actually pinning down some root causes; we're only looking at a stream of log file messages, and we want to detect when something abnormal is in the content of those messages.
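[Editor's sketch] An approach like the one Marcel describes could look roughly like this minimal sketch, assuming the MiniSom library and a hashed bag-of-words featurization; it is my own illustration, not the actual Red Hat prototype, and the sample log lines and threshold are made up.

```python
# Minimal sketch: flag anomalous log messages with a self-organizing map.
import numpy as np
from minisom import MiniSom
from sklearn.feature_extraction.text import HashingVectorizer

logs = ["connection accepted from 10.0.0.5",
        "connection accepted from 10.0.0.7",
        "disk write failed: no space left on device"]

vec = HashingVectorizer(n_features=64, alternate_sign=False)
X = vec.transform(logs).toarray()

som = MiniSom(5, 5, 64, sigma=1.0, learning_rate=0.5, random_seed=42)
som.train_random(X, 500)

# Quantization error: distance from each message to its best-matching
# codebook vector. Messages far from anything the map has learned on
# normal traffic look anomalous.
errors = np.linalg.norm(X - som.quantization(X), axis=1)
threshold = errors.mean() + 2 * errors.std()
for msg, err in zip(logs, errors):
    if err > threshold:
        print("anomaly?", msg)
```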
Ryan: So in 2016 we partnered with another big company, which actually presented at Red Hat Storage Day in Seattle, that wanted to do a petabyte Ceph cluster for an OpenStack cloud, and they found there were three major stability issues with the Ceph cluster that were sort of blocking their project. The first one was that every time a disk failed, or, you know, an OSD failed, the map would change, the CRUSH map, which would cause placement group peering and backfilling, where the cluster would rebalance to heal itself.
But it essentially did the same thing. We could predict disk failures six weeks in advance. And then they drew out all this architecture stuff, but the most important thing is this graph at the bottom right. You can see that there's a normal workload here of around 400 or so IOPS, and then, when they simulated a disk failure by just pulling a disk, they found that the cluster performance dropped below 200, so it dropped around 40 to 50 percent.
And it persisted that way for the whole duration of the test: 800 minutes, around 12 hours or so. Versus with our disk prediction, you can see that, by being able to know a disk is about to fail in advance, we can take pre-emptive measures: we can disable the cluster rebalancing, then remove the disk and replace it within an hour, and the performance goes back up in a fraction of the time. And then the same company tested our prediction engine against 20,000 drives over the course of 90 days, and they found that we had an accuracy rate of 96% and a recall rate of 97%. The recall rate is actually the more important statistic here: it's the number of correctly predicted failed disks over the total number of failed disks. So out of every 100 disks that failed, we would correctly predict 97 of them. And then this just shows that we're already integrated in the Ceph community: we're called the diskprediction plugin.
You can just enable us through the manager daemon, and then you can just use Ceph-native commands to access our predictions.
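[Editor's sketch] Based on the upstream Ceph documentation for the Nautilus-era module, not on Ryan's slides, enabling and querying the predictor looks roughly like this; it is wrapped in Python here for consistency with the other sketches, and the exact command names are worth verifying against your Ceph version.

```python
# Minimal sketch: enable the diskprediction module via the mgr daemon
# and read a prediction with native Ceph commands.
import subprocess

def ceph(*args):
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout

# Enable local disk-failure prediction on the manager daemon.
ceph("mgr", "module", "enable", "diskprediction_local")

# List tracked devices, then ask for one device's life expectancy
# ("<devid>" is a placeholder for an id from the device list).
print(ceph("device", "ls"))
print(ceph("device", "predict-life-expectancy", "<devid>"))
```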
So we released with Nautilus; for older versions of Ceph, you would use this one-line installation, and you can use that with Ansible, Chef, Puppet, any kind of automation software, to make it simple for a mass deployment. And our biggest account right now is actually in Michigan. There are three universities, Wayne State, Michigan State, and the University of Michigan, and what their setup is: all three of these campuses share a single giant Ceph cluster, and they put all their research data on this Ceph cluster. So they have to make this Ceph cluster as resilient as possible, and what we provide is just the disk predictions, allowing them to monitor the health of their disks before they fail. Right, and I'm just going to go through a quick live demo. I'm going to switch screens here. Can you guys see my web browser?
You can see how many disks are bad, how many are going to fail in less than two weeks or less than six weeks, and you can go to the disk health list here to get a list of every single disk that's being monitored. Then you have all the unique identifiers, which node it's on, the size, the serial number, the vendor, all over here. Sorry, all over here, so you can see that; you can easily identify the disks.
This would be where you would go for the disk details. And then, as we alluded to earlier, we also have prediction for capacity and performance. So over here we have the cluster capacity, but we also go down to the OSD level. I'll just use pools, because it's more interesting, and then we can predict future capacity for up to the next ninety days. But of course this depends on how much data you have, so the general rule of thumb is: for every cycle that we predict...
Ryan: Yeah, because they wanted a lightweight version of our predictor, and so we just gave them one with less baggage, one that would be only 70% accurate, that they could enable locally. But it wouldn't use all the metrics that are provided for the prediction. It was requested by them to have a local, lightweight package. Okay, yeah.
Moderator: We have almost ten minutes left, so I'd like to talk a little bit about some of the goals for this group. One is that we're just trying to reach out and build the community around AIOps, and make sure that we have some of the resources that people are looking at and requiring. So thank you both, Bogdan and Ryan, for sharing your insights and your tooling.
That's a great start, and if there are other topics that people want to talk about or present on, or questions you have, please reach out to us. Again, sign up through the Google Groups and ask for those. Is there anyone here, or in the chat, that has any questions? Not seeing any. I'm hoping that some of you will have some suggestions for upcoming topics, and we can move forward. We were planning on doing this on Mondays at 9 o'clock.
So if you're interested in getting together, then please reach out to Marcel or myself, and we'll start coordinating a face-to-face, sometime probably in September. So, Marcel, if you wanted to add a few words in here? I've added a few resources down at the end. If everybody could send me PDF versions of their slide decks, to dmueller at redhat.com, that would be great, and I'll add them in as well. Marcel? Yeah.
I'll post the video of this session to the Google Groups list, and I'll create a YouTube playlist for these topics, edit them, and get them up, hopefully in the next 24 hours or so. Is there anything else anyone would like to add while we're here? I'm just checking the chat again, and no. So hopefully I've gotten everybody's affiliation correct; if not, I posted the link already into the Google group, and we can correct it from there. Thanks again, everybody, for attending, and we'll be back again in another month.