From YouTube: OpenShift Commons AIOps SIG Talk on DiskProphet: Disk Health Prediction, Brian Jeng (ProphetStor)
Description
DiskProphet: Disk Health Prediction, Brian Jeng (ProphetStor)
Recorded on March 25, 2019 at the OpenShift Commons AIOps SIG
So, I'm Brian, I'm an SI over at ProphetStor, and I'm going to be talking about DiskProphet. What DiskProphet does is use machine learning AI to predict future disk failures up to six weeks in advance. Additionally, we can also predict performance and capacity for up to 90 days into the future, as well as give correlation for the effects of a disk failure between the node, application, and cluster. And our biggest use case right now is for Ceph.
So in 2016 we partnered with another big company, one that actually presented at Red Hat Storage Day in Seattle, that wanted to do a petabyte Ceph cluster for OpenStack cloud, and they found there were three major stability issues with a Ceph cluster that were sort of blocking their project. The first one was that every time a disk failed, or, you know, an OSD failed, the CRUSH map would change, which would cause placement group peering and backfilling, or the cluster would rebalance to heal itself.
But it essentially did the same thing: we could predict disk failures six weeks in advance. And then they drew out all this architecture stuff, but the most important thing is this graph at the bottom right. You can see that there's a normal workload here of around 400 or so IOPS, and then, when they simulated a disk failure by just pulling a disk, they found that the cluster performance dropped below 200, so they dropped around 40 to 50 percent of IOPS, and it persisted that way.
Oh sorry, it persisted that way for the whole duration of the test, so 800 minutes, around 12 hours or so. Versus with our disk prediction: you can see that, by being able to know a disk is about to fail in advance, we can take pre-emptive measures. We can disable the cluster rebalancing, and then we can remove the disk and replace it within an hour, and the performance goes back up in a fraction of the time, right?
And then the same company tested our prediction engine against 20,000 drives over the course of 90 days, and they found that we had an accuracy rate of 96% and a recall rate of 97%. And the recall rate is actually the more important statistic here: it's the number of correctly predicted failed disks over the total number of failed disks. So out of every 100 disks that failed, we would correctly predict 97 of them.
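The recall figure quoted here can be reproduced with simple arithmetic; a minimal sketch, with the counts (97 correctly predicted failures, 3 missed) taken from the 97-out-of-100 example above:

```shell
# Recall = true positives / (true positives + false negatives).
# Counts follow the 97-out-of-100 example from the talk.
awk -v tp=97 -v fn=3 'BEGIN { printf "recall = %.0f%%\n", 100 * tp / (tp + fn) }'
```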
And then this just shows that we're already integrated in the Ceph community; we're called the diskprediction plugin.
You can just enable us through the manager daemon, and then you can just use Ceph-native commands to access our predictions.
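As a sketch of what that workflow looks like on a Nautilus-era cluster (module and command names as documented for that release; they may differ on other versions, and `<devid>` is a placeholder for a real device ID):

```shell
# Enable the built-in disk-prediction manager module
ceph mgr module enable diskprediction_local

# Ceph-native commands for device health and predictions
ceph device ls                               # list monitored devices
ceph device get-health-metrics <devid>       # raw SMART data for one device
ceph device predict-life-expectancy <devid>  # the module's failure prediction
```

These commands require a running Ceph cluster with an active manager daemon.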
So we released with Nautilus; for older versions of Ceph, you would use this one-line installation, and then you can use that with Ansible, Chef, Puppet, any kind of automation software, to make it simple for mass deployment.
And our biggest account right now is actually in Michigan. There are three universities, Wayne State, Michigan State, and the University of Michigan, and what their setup is, all three of these campuses share a single giant Ceph cluster, and they put all their research data on this Ceph cluster. So they have to make this Ceph cluster as resilient as possible, and what we provide is just the disk predictions, allowing them to monitor the health of their disks before they fail. All right.
How many are bad, are going to fail in less than two weeks, less than six weeks; and you can go to the disk health list here to get a list of every single disk that's being monitored. And then you have all the unique identifiers, and, you know, the size, the serial number, the vendor, all over here, so you can easily identify the disks.
This would be where you would go for the disk details. And then, as we alluded to earlier, we also have prediction for capacity and performance. So over here we have cluster capacity, but we also go down to the OSD level. I'll just use pools, because it's more interesting. And then we can predict future capacity for up to the next ninety days, but of course this depends on how much data you have, so the general rule of thumb is, for every cycle that we predict...
One interesting thing to note is that you can run the prediction in the cloud or locally, so you could also run this setup in a completely non-software-as-a-service environment as well. Yeah.
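On Nautilus, that cloud-versus-local choice surfaced as two flavors of the manager module (names per that release's documentation; later Ceph releases dropped the cloud variant):

```shell
# Local: predictions computed on the cluster itself, no SaaS dependency
ceph mgr module enable diskprediction_local

# Cloud: ship SMART metrics to the DiskProphet back end for prediction
ceph mgr module enable diskprediction_cloud
```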
Because, if they wanted like a lightweight version of our predictor, then we just gave them one with less baggage, that would be only 70% accurate, that they could enable locally. But it wouldn't use all the metrics that were provided for the prediction. It was requested by them to have a local, lightweight package. Okay, yeah.