From YouTube: Illumos Brings the SAS by Kody Kantor
Description
From the 2019 OpenZFS Developer Summit
slides: https://drive.google.com/open?id=14KbyfOcf23rhatgypG3ZxIGSfBGbHNlp
So I'm going to tell you what everybody's favorite subject, SAS topology, is. We already touched on it a little bit during Alan's talk, when he mentioned maybe storing the initiator in a vdev property, so we know what that thing is actually connected to. We talked about racing device names on Linux, and we don't worry about that on illumos, which is cool, so I'm glad that's one of your problems. But we'll talk about our problems instead.
We all know that there are ways to tolerate failures, to forecast failures (to see them coming), and then to avoid failures. Tolerance is pretty well understood: we have things like RAID in ZFS. For forecasting, we talked a little bit about SMART data earlier: a disk will predict when a failure might be coming and proactively notify the operating system, or you can use a CLI tool to see that it might happen.
We'll also talk about the really bad stuff, which is operator error, and how things can go totally wrong. Let's take a situation where ZFS identifies a bunch of checksum errors on a bunch of disks. ZFS, using the ZED, or FMA on illumos, will automatically swap in a hot spare, and the operator is going to be notified: hey, you should replace this disk, because we found a bunch of checksum errors on it.
You should replace these disks, and if an operator doesn't know any better, they'll replace the disks. But sometimes an operator does know better, and sometimes it's even impossible to know what the right thing to do is, because hardware failure is really complicated once you take into account where things are in the box.
So here's just a picture of one of our boxes, a system that we use; the BOM is online, so if you want to check that out later, you can. I don't know how many folks have looked at hardware diagrams, but this is an HBA here, and then we have a pair of backplanes that the disks actually plug into. In our BOM, what we do is we have two cables that go to one expander, and then two cables from that expander go to the other expander. So the HBA in that architecture is a single point of failure.
If you lose the HBA, all your disks basically just disappear. Or, if there are errors that only occur sometimes because the HBA is slowly failing, then it'll look like every disk in the system is throwing errors. If you think that doesn't actually happen: it totally does happen. This is an eye chart here, and even if you could see it, it's basically unreadable, but the system started out with just two hot spares and somehow we have like ten of them.
These are all worldwide names, and you sort of have to do this mapping of, okay, well, the one ending in five-zero had a bunch of errors, so I think that's this one here. You're not really sure; you're just matching up numbers at this point. And this is a really simple topology: we have two HBAs and, I don't know, sixteen disks or something. You can imagine doing this on a 124-disk system with a fan-out expander and two edge expanders.
So really we need better tools, because this is hard to get right. The multipathing people, I don't know how you live, but you probably have the most crazy SAS topologies on the planet. We use pretty simple ones, but it's really hard to get this mapping of hardware topologies right, because there are just so many different ways that are valid according to the SAS documentation, and it's really confusing even to picture in your mind.
So we started some work in illumos to try to solve this problem. The first thing we're trying to do is make things better for operators. The first thing that Rob Johnson, a co-worker of mine, did was implement prototype support for directed graphs in FMA, because not all SAS topologies look like a tree; with multipathing, things can get really complicated. And then we found this thing called the SM-HBA API to go along with it.
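To make the multipathing point concrete, here is a rough Python sketch (all node names are invented; this is not the illumos FMA code) of why a directed graph is needed: a dual-ported disk is reachable through more than one expander, which a tree cannot represent.

```python
from collections import defaultdict

# A toy SAS topology as a directed graph: node -> downstream nodes.
edges = defaultdict(list)

def link(a, b):
    edges[a].append(b)

link("initiator0", "expander0")
link("initiator0", "expander1")
link("expander0", "disk0")
link("expander1", "disk0")  # second path to the same dual-ported disk

def paths(src, dst, seen=()):
    """Enumerate every path from src to dst through the graph."""
    if src == dst:
        return [[dst]]
    return [[src] + p
            for nxt in edges[src] if nxt not in seen
            for p in paths(nxt, dst, seen + (src,))]

# Two distinct paths reach disk0; a tree could only represent one.
print(len(paths("initiator0", "disk0")))  # -> 2
```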
So we can get some information from the SM-HBA API, and then what we did is we wrote a utility to run a bunch of SMP commands (SMP is the SCSI Management Protocol) against any expanders that have SMP ports available. By doing that, we can sort of figure out the worldwide names of the disks that are behind an expander, and then what those are attached to, using worldwide names. We can also get phy link error state counters through SMP.
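As a hedged illustration of what those counters enable: the counter names below follow the SAS spec (the kind of data SMP's REPORT PHY ERROR LOG returns), but the expander names and values are made up.

```python
# Per-phy link error counters, keyed by (expander, phy number).
# Counter names are from the SAS spec; the values are invented.
phy_counters = {
    ("exp0", 4): {"invalid_dword": 950, "running_disparity": 40},
    ("exp0", 5): {"invalid_dword": 3,   "running_disparity": 0},
}

def port_errors(counters):
    """Sum all link error counters per (expander, phy)."""
    return {phy: sum(c.values()) for phy, c in counters.items()}

# A flaky phy stands out immediately once the counters are totaled.
totals = port_errors(phy_counters)
worst = max(totals, key=totals.get)
print(worst, totals[worst])  # -> ('exp0', 4) 990
```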
So, hopefully... oh yeah. This first tool, sastopo, is like fmtopo, if you've ever used that, crossed with the LSI utils. It'll print out paths from initiators to targets in the SAS topology, and you can tell it to print out specific properties, like maybe the host device name or the chassis location, that sort of stuff; maybe the chassis location is "front disk zero".
You can tell a datacenter operator: yeah, it's front disk zero that we need to replace. And you can optionally serialize this directed graph into an XML document, because XML is the format of the future, and it can handle 64-bit numbers, which things like JSON can't really.
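The 64-bit point can be demonstrated: a SAS WWN is larger than the 53-bit integer range of an IEEE-754 double, which is how JavaScript-style JSON consumers store numbers. A small Python sketch, with a made-up WWN:

```python
# A made-up 64-bit SAS WWN, well above 2**53.
wwn = 0x5000C500A1B2C3D4

# Round-tripping through a double (what a JavaScript JSON parser
# does to every number) silently corrupts the low bits.
assert wwn > 2**53
assert int(float(wwn)) != wwn

# Encoding as a string, which is effectively what an XML attribute
# does, is lossless.
assert int(f"{wwn:016x}", 16) == wwn
print(f"{wwn:016x}")  # -> 5000c500a1b2c3d4
```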
So here's just a sample of it. This is on that system; the first invocation was just with no arguments. You can see that we've created a bunch of nodes in this directed graph.
We have an initiator, which is connected to an expander; each of those has a port (this is a wide port), and then the expander port is connected to a target. If we wanted some more detail, we could just run it with the verbose flag and get some details on the port. In the future this is going to include phy link state error counters, so maybe we'll see a thousand errors on this target phy (this is phy 4, and this is phy 5), and then this is the FMRI.
You can use that to look up the resource in the SAS scheme. And then here on the bottom we have a target. This long block here is the hardware component FMRI, which is used to look up the actual physical device in FMA on illumos. And then you can see that we discovered some information about the device: where it is (slot 0), the manufacturer, the model number, the serial number, all that sort of stuff.
The second thing is that we wrote a tool to convert the XML output from the previous tool; this one is written in Rust, and it produces a website bundle. So you can actually open up this topology in a web browser, and it makes it really easy to see how things are connected in a system, no matter how complex it is.
So we can see each HBA is attached to eight disks, and you can click on these and get more information about them. Here we clicked on the HBA, and we have the hardware component FMRI. We can look that up, and it'll give us the model number, the serial number, the device label, that sort of stuff. And we'll be able to put things like phy link state error counters in these port nodes as well.
So in the future, what we'll be able to do is, if we find a port that's throwing a bunch of errors, we can automatically color a box red or something, so that an operator can quickly look at this and be like: oh okay, the HBA is throwing a thousand errors, this is red, and my ZFS is also identifying checksum errors on all eight of these disks, but these eight disks are totally okay. So then we know that the HBA has gone wrong, or a cable has gone bad.
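That diagnosis idea can be sketched in a few lines. This is only an illustration of the reasoning, not the FMA retire logic: if every disk behind one shared component reports checksum errors, the common parent is a more likely culprit than all of its disks at once. The topology and names are invented.

```python
def diagnose(topology, failing_disks):
    """topology: {component: [disk, ...]}; failing_disks: set of disks.
    Returns components whose entire set of child disks is erroring."""
    suspects = []
    for component, disks in topology.items():
        behind = set(disks)
        if behind and behind <= failing_disks:
            # Every disk on this path is erroring: suspect the shared
            # parent (HBA, cable, expander) rather than the disks.
            suspects.append(component)
    return suspects

topology = {
    "hba0": ["disk0", "disk1", "disk2", "disk3"],
    "hba1": ["disk4", "disk5", "disk6", "disk7"],
}
print(diagnose(topology, {"disk0", "disk1", "disk2", "disk3"}))
# -> ['hba0']
```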
So it's super useful. Here's a slightly more complicated picture, which is even harder to see. In this case we have a single HBA which has a fan-out expander. We clicked on it, and we have the same hardware-specific information over there in the corner. It has a bunch of disks attached to it, and then there's also another expander here that's attached to a few more disks.
So maybe this expander is going bad, and ZFS has identified checksum errors on all of these disks, but all the other disks are okay. Even having trivial tools like this, if you've ever had to dive into lsiutil for this kind of thing, is a game changer; it's saved me a lot of pain. Alright, so.
Like I said, our short-term goals are just to have better tooling for our operators. Longer term, though, since illumos has FMA, which is really good fault management for hardware, we would like to enhance FMA to actually provide more targeted diagnoses when things start going wrong in the chassis. So what might that look like?
Maybe ZFS stops swapping in disks all the time when it sees these checksum errors. And we also want to be able to make better pools, because, I don't know about you guys, but when I make a pool I'll just go "zpool create mirror" or "raidz1" followed by a list of disks, without really taking into account where those disks actually are, or what the fault domains are. So that's really what we'd like to do with this work.
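A minimal sketch of what fault-domain-aware pool layout could mean, assuming we already know which HBA each disk sits behind (the disk and HBA names here are invented): pair mirror halves across HBAs, so losing one HBA leaves every mirror with a surviving side.

```python
def mirror_pairs(disks_by_hba):
    """disks_by_hba: {hba: [disk, ...]} -> list of cross-HBA pairs."""
    hbas = sorted(disks_by_hba)
    if len(hbas) < 2:
        raise ValueError("need at least two fault domains")
    a, b = (list(disks_by_hba[h]) for h in hbas[:2])
    # Each pair takes one disk from each HBA, so no mirror has
    # both halves behind the same single point of failure.
    return list(zip(a, b))

pairs = mirror_pairs({
    "hba0": ["c0t0d0", "c0t1d0"],
    "hba1": ["c1t0d0", "c1t1d0"],
})
print(pairs)
# -> [('c0t0d0', 'c1t0d0'), ('c0t1d0', 'c1t1d0')]
```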
So the question was: how can we improve the retire agent with this information, and do we have any prototype? We don't have any prototype code for that, but one thing that we were thinking is that, in these device-specific properties, we could mark this disk as being part of pool GUID whatever, with vdev GUID whatever.
Question: do you think that you could take the logic of drawing this and pare it down to the error case, and make an ASCII art output that could eventually be part of some sort of command line diagnostic? Right now ZFS tells you that the disk has checksum errors, but you could roll that up and show the errors on the full path, because you're generating that long path string. I'm just wondering, how feasible is that?
I mean, I think that we could certainly do stuff like that. You can look up these individual nodes just by using the FMRI, so you could conceivably do something like that, sort of like what fmadm faulty gives you. I don't know, I'm not the right person to ask, but you could probably do something like that.