OpenZFS 2019 OpenZFS Developer Summit, 13 Nov 2019

From YouTube: Debugging ZFS: State of the Art on Linux by Tom Caputi

Description

From the 2019 OpenZFS Developer Summit
slides: https://drive.google.com/open?id=1YZ1RW13yY8umhQF5CQ82zzk_mVUodYbT

A

So our next presenter is Tom Caputi. You might be familiar with Tom for his work on ZFS at rest, sorry, encryption at rest for ZFS. Today Tom is going to talk about something different. It's going to be related to debugging on Linux. So please welcome Tom.

B

Hi everybody, I'm Tom Caputi, and I'm here to talk about debugging techniques. I'm sorry, I

B

Was out really late, and between the time zones and eating pretty much only junk food, I'm sorry. Just stay with me and we'll all get through this. Okay, anyway, I'm here to talk about ZFS debugging techniques. I've been spending a lot of time over the past six months to a year or so just doing a bunch of debugging. So, I don't know, let's just get into it. Okay, so here I have a system. I created a pool here.

B

Let me make sure that this is going to be there. Yeah, so I have a pool that I created. If I do zpool status, there it is, hooray. And look, I don't know, let's just start by making a dataset. Should be pretty easy, right? zfs create pool/test. Crap.

B

So this is a problem. People should be able to make datasets, last time I checked. So I guess the first thing that I try to do when I hit something like this is I'm going to... That's amazing! Right there, that worked. Okay, usually when I break ZFS, or most of the people in this room break ZFS, you can't Ctrl-C it. If you are able to Ctrl-C the process,

B

What that basically means is that your process is busted in user space somewhere, for the most part, and that means that this is probably a lot easier to debug than anything else that could possibly happen. So let's try this. Okay, I have tmux loaded up here. So let's go back to this. Let's try this again, since obviously this is a reproducible problem.

B

Let's go over here and let's just run htop. Okay. This is one of the first things I do whenever I encounter a problem, and you'll notice nothing's happening. Let's search around for the zfs process and see if it is... okay, so there it is, and it's asleep. It's not doing anything. It's not really using a whole lot of memory. But this is good, this is all information, and we know that we can Ctrl-C it.

B

So let's quickly try one other thing. Let's go back to the original window, and I'm sorry if I cut in and out a lot.

B

I'm trying to keep my face as close to this microphone as possible, but it's kind of over here. And let's strace this process. How many of you are familiar with strace? Oh, thank God!

B

This is a wonderful tool whenever you're debugging anything, because it will usually at least tell you where the problem that you're trying to debug started: it will tell you the system call that it's stuck trying to execute. So let's try this. Let's do strace, and you see it did a whole bunch of nonsense, getting all kinds of things about what kind of features the pool supports and all these other things, but that last one is weird. Okay, so it's a sleep!

B

So let's start with that, shall we? That sounds like a good place to start. Let's go into the code, and let me bring this over. I apologize, this is a little bit of back-and-forth over here. Let me just switch; it'll be easier if I just do it like this.

B

Okay, hopefully that should be easier. Anyway, let's open up zfs_main.c and take a look at what's going on here. We know that the program didn't really call anything before it got to that sleep. It basically just loaded a bunch of info about the features, but it didn't really get to any of the code.

B

So let's just start at zfs_main.c and try to follow it around and see what might be the problem here. We're going to go to cmd, then zfs, because that's the command we ran, and zfs_main.c. Over here, let's search for zfs_do_create. Pretty easy: look around, we find it. Let's see if there's anything suspicious here.

B

"Dear future Tom, I know how much you love fixing bugs in ZFS. I also know your birthday is coming up in five months," that's not even accurate, "so I wanted to give you an early birthday present. I added even more bugs to ZFS. I didn't think it was possible to add more. I'm the best. Past Tom." Great. Okay, so we have some things here that we need to fix. Let's get rid of this one, because this one's obviously a problem.

B

Let's get rid of this and we'll go back over here, and this is now a problem, because that's how screen resolutions work. Can everybody see that okay? Okay.

B

Let's do this: we'll quickly rebuild the code. We'll just do make -j7 install, and while that's happening, let's talk a little bit about... if my menu existed.

B

Let's talk about user space lockups. This was the first problem that we just saw here. The whole process was stuck in user space, but we were able to get out of it just by hitting Ctrl-C. That's probably one of the first things you'd try as somebody who's familiar with a UNIX system, but it's actually a really valuable piece of debugging information: whether or not you can cancel it. After you've determined this,

B

You can use programs like top or htop, depending on what your preference is, or ps aux, which gives you the same information but more like a snapshot, and you can check some really basic things. Here we saw that the process was not stuck in D, which means that it's not stuck in the kernel.

B

It was stuck in S, which is the sleep state, which basically means that we can cancel it whenever we want to. It's also important to keep in mind that the process might just be slow. It might be doing its job, but taking a lot longer than you expect. If that's the case, then you have more work to do, but for right now we've gotten to something pretty easy. So let's go back here. It's rebuilt. Let's try this again.
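The S-versus-D check described here can also be scripted. The following is a minimal Python sketch, assuming a Linux system where the state letter is read from /proc/<pid>/stat as documented in proc(5); the helper names `parse_state` and `describe_pid` are made up for illustration and are not part of any ZFS tooling.

```python
from pathlib import Path

# One-letter codes from the Linux proc(5) man page; only common ones are listed.
STATE_NAMES = {
    "R": "running",
    "S": "interruptible sleep (a signal such as Ctrl-C can wake it)",
    "D": "uninterruptible sleep (usually blocked inside the kernel)",
    "T": "stopped",
    "Z": "zombie",
}


def parse_state(stat_line: str) -> str:
    """Return the state letter from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain spaces
    or parentheses, so split after the last ')' rather than on whitespace.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def describe_pid(pid: int) -> str:
    """Read /proc/<pid>/stat (Linux only) and describe the process state."""
    state = parse_state(Path(f"/proc/{pid}/stat").read_text())
    return f"{state}: {STATE_NAMES.get(state, 'other')}"
```

Calling `describe_pid` on the PID of a hung command gives the same letter that htop or ps aux would show in the state column.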

B

Let's do zfs create pool/test. Ah, crap.

B

So this is what a crash looks like. Pretty obvious; I'm sure most of you have seen this kind of thing before. If you're using a UI, you will see this message. Otherwise you will just see this message, which is very helpful and tells you almost nothing: "Segmentation fault (core dumped)". And it's kind of a lie, because if you look around... this one is here from before, so ignore that.

B

But you know after you remove it, there's no file there.

B

rm does still work. So you run this and it segfaults. Let's take a look at how you debug this. This is also pretty easy to debug, because there are all kinds of tools that have existed since the beginning of time to help you figure this stuff out. So let's use the biggest one right now. The biggest one is obviously gdb.

B

So let's get rid of this message, if I can, or maybe it's just there forever. Oh yeah, it is, but it's not letting me alt-tab to it. So, go away. There you go. Okay, so there now should be a core file here, and there's not. I'm sorry, I brought this up before, but usually when you crash a program it will tell you that the core is dumped, and that is a lie. What you need to do first is set

B

ulimit -c unlimited.

B

What this does is allow your operating system, Linux in this case, to generate a core file. This is kind of a Linux-ism; I'm not exactly sure how this works on FreeBSD or any other platform, but on Linux at least you need to set this. And a core file is one of the most useful tools that you have when trying to debug things like crashes.
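For a test harness that launches the crashing command itself, the same limit can be raised programmatically. This is a rough Python sketch using the standard `resource` module; it raises the soft core-size limit to the current hard limit, which only matches `ulimit -c unlimited` when the hard limit is already unlimited. The function name `enable_core_dumps` is an illustrative assumption.

```python
import resource


def enable_core_dumps() -> tuple:
    """Raise the soft core-file size limit to the current hard limit.

    This is the closest unprivileged analogue of `ulimit -c unlimited`:
    if the hard limit is already unlimited, the result is the same.
    Child processes started afterwards inherit the new limit.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

A returned soft limit of `resource.RLIM_INFINITY` means core files will not be truncated by this limit.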

B

So now that we've done this, it will actually create a core file when we do this, and if we do ls core, there it finally is, for real this time, unlike before, which was just a figment of your imagination. Now we can use gdb to solve it. What you'd normally type is gdb zfs core: zfs is the name of the binary that you're trying to work on, and core is the name of the core file.

B

What the core file represents is basically the state of your program at the moment that it crashed. By doing this, you can see right here the stack trace of exactly where it crashed, conveniently with no line numbers. Now we can see that right here at the top of it is this nice, convenient "crash the program" function, which might have something to do with the problem.

B

Let's take a look at that. If I just search around here for "crash the"... yep, there it is, and it says, "This will crash the application." Wonderful. Wonder what it does. "This function serves two purposes. First, it crashes the program. Second, it crashes the program. It's worth repeating for emphasis." Okay, that's not good. Let's just get rid of this. Okay, so now we've gotten rid of this, and hopefully now we should be able to create a dataset and move on to the actual hard stuff to debug.

B

So let's quickly get out of this. We found this one. Do make -j7 again, and I can just hit up if I know what I'm doing. There we go. This brings us to the second kind of thing that can happen when you're debugging issues, and this one's a lot more common: a user space crash. Usually when you find a problem, it won't just be that somebody put a sleep somewhere, because most of the time people don't want their programs to just go to sleep. Most of the time,

B

They want them to do stuff. So crashes are way more common in programming in general, and they are pretty easily identifiable by the messages that come out.

B

When that happens... and this is the wrong slide, and I deleted the right slide, so anyway. Basically, when the program crashes, the operating system will tell you that it crashed. It will usually say it was killed with a SIGSEGV or a SIGABRT, or some other kind of signal like that which you did not issue.

B

You know, if you have ulimit -c set, it will dump out the core file and allow you to debug it with gdb. Let's go back to this. Okay, let's try this one more time. I have a good feeling about this. And still nothing, but this is something different. This time we've tried to Ctrl-C the program and it hasn't returned.

B

This is more often than not what I see whenever I'm debugging ZFS stuff, because most of the stuff that I, and I'm sure a lot of people here, work with, you know, we write code for the Linux kernel or for the BSD kernel or for any other kernel that there is, and usually when the kernel crashes there's not a lot you can do. As soon as I see this, the first thing that I do is go to dmesg. So I come over here and type

B

dmesg, and very helpfully ZFS has the stack trace right here. I can see that. I don't really understand what this is, but I'm sure we can figure that out.

B

If we go to the thing, we called spl_panic, and here, if you look, this is what a standard ZFS assert looks like. So this is for any of you who may be watching on the live stream, or for those of you who are more users of ZFS and see these kinds of messages. I'm sure the developers have seen these all the time.

B

But when you see a message like this in dmesg that says VERIFY and then some statement, it is usually indicating that that statement is false. In this case, it's asserting that the current time minus some start time is greater than or equal to five minutes, and right now it's saying that no time has passed. So that's weird, and I also don't know what would be doing this, but conveniently it gives us a line number. So again, whenever you're creating an issue report for ZFS developers,

B

This is a very important thing to include, not just the stack trace. A lot of times this is what ends up in the issue report, but this is the important stuff. So let's go take a look at that. This is in zfs_ioctl.c, and it says line 3349. So come over here and we'll go to line 3349, and it says, "ZFS allows us to create too many datasets," and I misspelled "too".

B

"We should make sure that we don't create more than a dataset every five minutes." Then right here it says, "In debug mode, you can create datasets every hour. This will give you an excuse to go get some lunch." At least it was thoughtful. That's good. Okay, so anyway, now let's get rid of this. We found this problem. We know that this isn't actually true.

B

So in this case we can just remove it, but normally you would want to very much consider why those asserts were added, and for what reason, because usually when somebody puts those in it means that you will break some piece of code somewhere else very, very badly, and in a way that will be very, very hard to find.
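When collecting details for an issue report, the failed check and the file, line, and function can be pulled out of the kernel log mechanically. This is a small Python sketch that assumes the panic is logged as a `VERIFY...(...) failed` line followed by a `PANIC at file:line:function()` line; the exact wording can differ between versions, and the sample text in the usage note is made up for illustration rather than copied from the demo.

```python
import re
from typing import Optional

# Assumed log shapes; adjust the patterns if your kernel log differs.
PANIC_RE = re.compile(r"PANIC at (?P<file>[^:\s]+):(?P<line>\d+):(?P<func>\w+)\(\)")
ASSERT_RE = re.compile(r"(?P<macro>VERIFY\w*|ASSERT\w*)\((?P<expr>.+)\) failed")


def summarize_panic(log_text: str) -> Optional[dict]:
    """Extract where an assertion fired and which check failed, if present."""
    panic = PANIC_RE.search(log_text)
    if panic is None:
        return None
    check = ASSERT_RE.search(log_text)
    return {
        "file": panic["file"],
        "line": int(panic["line"]),
        "function": panic["func"],
        "failed_check": check["expr"] if check else None,
    }
```

Feeding it a log containing `PANIC at zfs_ioctl.c:3349:some_function()` would return the file name and line number that the talk jumps to in the source.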

B

But let's take this and get rid of it for right now. Now, interestingly, we have crashed the kernel, so the only way to actually get back to a state where we can retest this is to restart the whole thing.

B

I have this set up so that I go back to a snapshot of the VM every time, but whenever you hit a kernel space crash or anything like that, you do need to reboot and, you know, apologize to your local cloud engineering team. While we're waiting for that to reboot, let's talk about...

B

Oh, there was user space crashes; it just moved a slide down. Let's talk about kernel crashes. The processes basically appear to be stuck, and they cannot be terminated with Ctrl-C. If you look at them in ps aux or htop or anything like that, they'll be stuck in the D state, and they may or may not say just the word "Killed" if you're using a bash shell.

B

The reason that they say "Killed" is that if you cause a real problem that actually causes the kernel to crash, that is how the kernel is built to handle it: it literally just sends the kill to the process and kind of hangs it there forever. In ZFS we have some debugging stuff for the asserts that's a little bit nicer, and basically we just hang the process forever

B

Ourselves, quote, "for inspection". That's what it says in the comment. Whenever you hit a problem like this, just remember that the most important thing that you can report to developers, and use yourself if you're trying to debug these issues, is what's in dmesg. Depending on how the thing crashed and what other problems may have arisen, the system may have become completely unresponsive, and you might really just be stuck.

B

Let's come back here. So we rebooted the VM, and now I'm going to come over here and paste in just a quick thing. Let's just rebuild this pool real quick and rebuild the code. Does anybody have any questions on this stuff while this is rebuilding?

B

Okay, I'm assuming most of you have probably seen a lot of stuff like this, because this is the ZFS developer summit, so I kind of assumed that that's the basic stuff. So let's look at something a little bit different after we create our dataset.

B

This screen is a little bit more cramped than when I did it in rehearsal, so I apologize. But let's create this dataset: zfs create pool/test. And now let's paste in this quick little line of code that I have here, that's just going to...

B

It's just going to touch a file and do a zpool sync. For those of you who don't know, zpool sync basically just makes sure that all of your transaction groups are synced and that everything's out on disk, so that you're sure that your data is safe. And here again, we've seen a crash. Now, I wonder why that is. If we come here to dmesg, we don't have any information. That's not very helpful, and it can be pretty hard to figure out what's going on.

B

The interesting thing about problems like this is that... thank you, ten minutes left. The interesting thing about problems like this is that a lot of times you will end up having to wait and wait, because in exactly two minutes, theoretically, the kernel will print out a stack trace of what is stuck, and that can usually help out a lot. But there is a slightly faster way to do it

B

If you don't want to sit around and wait, and you can do that with this handy-dandy, quick little, very long bash script, which is included at the end of this presentation so that you can copy and paste it. Basically, what this thing does is print out all of the stack traces in the system and de-dupe them, so that you can see which ones may or may not be stuck. Now, a lot of times

B

You will get some that are pretty standard, especially if you scroll to the bottom. ZFS does spawn a whole bunch of taskqs, so you will find a lot of threads which are normally just sitting here like this, waiting for some work to do. So in this case, taskq_thread: these are all just threads that are waiting for more work to do, and there's nothing really there.

B

But if you scroll up and look at the unique ones here, we can see that we're stuck in txg_wait_synced. Now that's kind of interesting, because usually a txg sync should happen every 5 seconds or so. So there's no reason that this should have needed to wait for it. Or, I'm sorry, the zpool sync command is waiting for that to finish, because that's its entire job, but there's no reason that it really should have been stuck.
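The de-duplication idea can be sketched in a few lines. This Python version is not the bash script from the slides; it is an illustrative stand-in that assumes Linux exposes each thread's kernel stack at /proc/<pid>/task/<tid>/stack (readable by root), and the function names are made up.

```python
from collections import Counter
from pathlib import Path


def dedupe_stacks(stacks: dict) -> list:
    """Group identical stack traces.

    `stacks` maps a thread id to its stack text. Returns (count, stack) pairs,
    rarest first, because the interesting stuck thread is usually unique while
    idle worker threads share one common trace.
    """
    counts = Counter(text.strip() for text in stacks.values())
    return sorted(((n, s) for s, n in counts.items()), key=lambda pair: pair[0])


def read_kernel_stacks() -> dict:
    """Collect /proc/<pid>/task/<tid>/stack for every thread (Linux, needs root)."""
    stacks = {}
    for path in Path("/proc").glob("[0-9]*/task/[0-9]*/stack"):
        try:
            stacks[path.parts[-2]] = path.read_text()
        except OSError:
            continue  # thread exited or permission denied
    return stacks
```

Running `dedupe_stacks(read_kernel_stacks())` as root would list the one-off traces first, which is where a thread waiting in something like txg_wait_synced would be expected to show up.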

B

Let's take a quick look. The only thing that I did before that was touch a file. So let's go take a look at that code real quick. We're going to search through the code for zfs_create, because that is the function that is responsible for creating a file, and we are going to search in the right directory.

B

So let's take a look around here. Here's the function, and let's see if we see anything suspicious in here, because helpfully I've been very nice and commented all of the bugs that I've added. This function does do a lot, but I think pretty soon... whoa, okay: "I'm just not ready for that kind of commitment."

B

Okay, so what happened here, if you look at the code, is that I commented out this line, which is dmu_tx_commit. What this represents is effectively what's called a kernel deadlock. The way that this works is that the zpool sync thread is waiting on some resource that will never come, and this can happen for a couple of different reasons. One of the big reasons is that the thread holding that resource is itself waiting on something else.

B

So if we come back to the slide, if I can get to it...

B

So basically, this picture right here which has been helpfully provided by Wikipedia, because I didn't want to redraw it.

B

Basically, this explains what a standard kernel deadlock looks like. The idea is that you have two threads, each one of them is holding on to a resource and wants the other one, and they didn't ask for them in the same order. So you have one thread that's holding resource A, or in this case it says resource 1, and it is waiting for resource 2, but it can't get it because there's another thread which already owns it, and that thread is waiting for resource 1.

B

So you end up in this cycle of sadness where nobody can get access to the resources that they need in order to proceed. That's one kind of deadlock. Another one is the kind that we've encountered here, and the way that this one works is that we simply have never given that resource back. We created a transaction and we decided never to call commit on it, or abort, or anything else like that, so that transaction is lost in limbo, and it will never be returned to the main sync thread.
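The two-thread, two-resource cycle can be reproduced safely in user space. The sketch below is a Python illustration, not kernel code: two threads take two locks in opposite order, and a timeout stands in for "waiting forever" so the demo terminates. All names are hypothetical.

```python
import threading


def abba_demo(timeout: float = 0.2) -> dict:
    """Show a lock-ordering deadlock: each thread holds one lock and wants the other."""
    lock_1 = threading.Lock()
    lock_2 = threading.Lock()
    both_hold_first = threading.Barrier(2)    # both threads now own their first lock
    both_tried_second = threading.Barrier(2)  # nobody releases until both have tried
    got_second = {}

    def worker(name, first, second):
        with first:
            both_hold_first.wait()
            # A real deadlock would block here forever; the timeout lets us report it.
            acquired = second.acquire(timeout=timeout)
            got_second[name] = acquired
            both_tried_second.wait()
            if acquired:
                second.release()

    threads = [
        threading.Thread(target=worker, args=("thread_a", lock_1, lock_2)),
        threading.Thread(target=worker, args=("thread_b", lock_2, lock_1)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return got_second
```

Neither thread obtains its second lock. If both workers asked for `lock_1` before `lock_2`, the cycle could not form, which is why a consistent lock order is the usual fix.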

B

Let's just get rid of this and we'll rebuild. Now, as before, in order to make this work we need to reboot the operating system, because there's no real way to get yourself out of this kind of thing. But while we're waiting for it to reboot,

B

Basically, let's just go over the symptoms that we see here. Whenever you have a kernel lockup, it can be the result of one of a number of things. It could be a rogue call to sleep. It could be really simple like that, but it almost never is. It could be waiting on something from another thread, like in this case, where it was waiting on the tx commit. It could also be waiting on a resource that is simply not there or not available, or a signal

B

That's not ever going to come for some reason, either because it didn't listen in time and the signal was already issued before it was listening, or something else of that nature. So, okay, we're back here. Let me quickly get us rebuilt and we will try one last time.
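The "resource that is never given back" case from the demo can be modeled with a condition variable. This Python sketch is a toy model under stated assumptions, not how ZFS implements transaction groups: `ToyTxg` and its methods are invented names, and the timeout only exists so the example returns instead of hanging.

```python
import threading


class ToyTxg:
    """Toy model of a transaction group; not ZFS code."""

    def __init__(self):
        self._cond = threading.Condition()
        self._open_tx = 0

    def tx_begin(self):
        with self._cond:
            self._open_tx += 1

    def tx_commit(self):
        with self._cond:
            self._open_tx -= 1
            self._cond.notify_all()

    def wait_synced(self, timeout: float) -> bool:
        """Return True once every open transaction has been committed."""
        with self._cond:
            return self._cond.wait_for(lambda: self._open_tx == 0, timeout=timeout)
```

A caller that does `tx_begin()` and skips `tx_commit()` leaves `wait_synced` waiting until the timeout; without a timeout it would wait indefinitely, which mirrors the stuck zpool sync in the demo.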

B

So does anybody have any questions on that while this is rebuilding? Should just take a second, if I saved it. I believe I saved it. Yes, I did. Any questions?

B

So now that we have that, let's do one last thing. Let's try to create a bunch of files. Let's see what the I/O looks like if we try to write a whole bunch of data. I guess in this case we'll just do one file. So I'm going to take my convenient little command right here. Now, for those of you who are not super familiar with some of the things that I did here,

B

Basically, all this is doing is writing data as quickly as I can muster to this file, which is pool/test/yes.txt. And first, let's create that dataset again, because we still have not been able to yet.

B

And I did that.

B

Okay, so we're writing this data out and I'm.

B

Looking at the rate that this is getting written out, because the thing I'm worried about here is performance, and it's really spiky. If you notice, it started out real slow, and now it's going really fast, and now it's back to slow again. That's kind of weird. Usually in ZFS, or any file system, or really any application that you'd write, you'd expect it to have pretty even performance, especially for something like this, where we're just writing data to a file.

B

So let's go take a look. First thing I want to do is run htop, and that's kind of weird. For those of you who didn't immediately see it, we're writing data to a file, and what's interesting here is that we're using a whole lot of CPU. In htop they helpfully color-code this, so you can see that this is red CPU time, which means time stuck in the kernel.

B

And if you look around at some of the options, you can go here to "Hide kernel threads" and disable that, and you can see that we are using 84, you know, a whole lot of percent, way more than you'd expect for writing data, in these z_wr_iss threads, which is kind of interesting. Now, those threads do a lot: they are the threads that do all of the issuing of I/O and the calculations related to doing I/O, and things like that.
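The user-versus-kernel split that htop color-codes can also be computed from two samples of the aggregate `cpu` line in /proc/stat. This is a minimal Python sketch assuming the field order documented in proc(5) and a kernel that reports at least eight fields; `cpu_split` is an illustrative name.

```python
def cpu_split(before: str, after: str) -> dict:
    """Percent of CPU time spent in user, kernel and idle between two samples.

    Each argument is the aggregate "cpu ..." line from /proc/stat. Fields, in
    order: user nice system idle iowait irq softirq steal (see proc(5)).
    """
    def fields(line):
        return [int(x) for x in line.split()[1:9]]

    delta = [b - a for a, b in zip(fields(before), fields(after))]
    total = sum(delta) or 1  # avoid dividing by zero on identical samples
    user, nice, system, idle, iowait, irq, softirq, _steal = delta
    return {
        "user": 100.0 * (user + nice) / total,
        "kernel": 100.0 * (system + irq + softirq) / total,
        "idle": 100.0 * (idle + iowait) / total,
    }
```

A high "kernel" share during a plain file write is the same hint the red bars give: the time is going to kernel threads rather than to the writing process itself.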

B

So let's do a little bit more analysis here. Earlier, I forget who showed it, but we saw what a flame graph was, and that's very helpful, because now I don't really need to explain it. But just as a quick recap, a flame graph is a visualization of how much time your CPU has spent doing any given task. So let's restart this. I'm just restarting it because I want to make sure that I get the same picture

B

Again. So we'll just rm the text file under pool/test and we will start again. While we're doing that, I'm going to paste in right here this code that I have, and again, this is all at the end of the presentation. All it's going to do is wait 10 seconds and take a picture of what the CPU was doing in those 10 seconds. Should be done in just a second. There we go. Let's go take a look. Come here, and it's not that one, and it should be this guy right here.

B

So this is what it was doing. You can see it spent about half of its time idle. That's what all this stuff is right here, and you can see it says do_idle, so that's a good indication. But you can see here that the z_wr_iss threads are spending pretty much all of their time here in zio_checksum_compute. That's interesting, because usually in ZFS we use fletcher4 for checksumming, and that's pretty non-intrusive. It doesn't really take a whole lot of CPU to do that.
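Behind a flame graph is a simple counting step: identical sampled stacks are merged and counted, and the width of each box is proportional to that count. The Python sketch below shows that folding step only; it is not the capture command from the slides, and the sample frames in the usage note are hypothetical.

```python
from collections import Counter


def fold_stacks(samples: list) -> list:
    """Collapse stack samples into "root;...;leaf count" lines.

    Each sample is a list of function names ordered from the outermost caller
    to the function that was running. Identical stacks are counted together;
    this one-line-per-stack text is what flame graph renderers take as input.
    """
    counts = Counter(";".join(frames) for frames in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]
```

For example, two samples of `["taskq_thread", "zio_execute", "zio_checksum_compute"]` and one of `["do_idle"]` fold into two lines, and the checksum stack would be drawn twice as wide as the idle one.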

B

So let's take a look at this. Let's cancel this and go back to here, and we will go to...

B

Okay, right here, and okay, this is immediately suspicious: "Find out if anyone would notice if I started mining bitcoins here. Mine bitcoins for a bit every time we do a checksum." That would probably do it. So here we have this thing which is obviously eating up a whole lot of CPU right here. This is just a hard loop. And let's go see... oh, so here's another interesting thing: helpfully, I decided to dump the bitcoins to this zfs_dbgmsg. For those of you who are not familiar with this,

B

This is another really helpful thing for finding any issues that might be happening with ZFS. For those of you on Linux, it's available through /proc/spl/kstat/zfs/dbgmsg, and as you can see, we found all kinds of bitcoins. So many bitcoins.

B

Now, this is kind of a virtual file. It's like a self-updating list that gets recycled circularly, so it does not have everything in it all the time. You can go through the whole thing, but as of 0.8.0 this is now enabled by default, so you should always be able to at least see these messages and get an idea of what's going on.
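Because old entries are recycled, it can help to copy the tail of the log out soon after the event of interest. This is a small Python sketch that reads the path named in the talk; the function name `tail_dbgmsg` is an assumption, and the file only exists on Linux systems with the ZFS module loaded.

```python
from collections import deque
from pathlib import Path

DBGMSG_PATH = "/proc/spl/kstat/zfs/dbgmsg"  # Linux path named in the talk


def tail_dbgmsg(n: int = 20, path: str = DBGMSG_PATH) -> list:
    """Return the last n lines of the ZFS debug message log, oldest first.

    The kernel recycles old entries, so copy out what you need soon after the
    event you are investigating.
    """
    with Path(path).open(errors="replace") as f:
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]
```

The `path` parameter makes it possible to point the same helper at a saved copy of the log attached to an issue report.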

B

So, performance issues: what do they look like? Basically, in general, it's whenever somebody complains. It's whenever a process isn't moving as quickly as it should be, whenever it's using up more resources than it should be, or anything else. Whenever you're looking to debug something that's a performance problem and not just a hard crash, the important thing is to try to figure out what your bottleneck is first. It could be CPU, in which case the best things to check are top or htop.

B

If it's RAM, you can also check top or htop and look at how much RAM it's using, or free -m. For disk I/O, you can check iostat -mx 1 or iotop. Those are all really good tools. And for network, which doesn't happen so much for those of us here at this conference, you can look at iftop and get an idea of how quickly data is moving in and out over the network. It might also be waiting on another process.

B

So possibly you might need to check for other slow processes that your process depends on. When it comes to finding the culprit, the best tools that you can use are flame graphs, as we showed earlier, and another really good one is perf top, which I'll show really quick. So we'll run this again, and then if I run perf top, you can see the same kind of information. It's a bit harder to see where the stack traces are.

B

But this can give you a really good idea of what functions are really hot on your system and causing the most problems.

B

In addition, in terms of memory bottlenecks, a lot of the time this will indicate some kind of memory leak. If it does, what I suggest is that you compile ZFS with the kmem debugging options. Debug kmem enable and SPL kmem debugging, I believe, are the two flags. They're kind of confusing, but they're documented in the configure script. Basically, when you turn them on, ZFS will enable some leak detection, which should tell you when memory has been allocated but not freed, or when it has been double freed.

B

So these can help you out pretty enormously when trying to find memory leaks. Other than that, you can look at some of the proc files and try to figure out where all of this memory is being allocated. Some of them, especially on Linux, like /proc/slabinfo, can be very helpful when finding these problems.
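One way to read /proc/slabinfo for leak hunting is to rank caches by approximate size and compare two snapshots taken some time apart. The Python sketch below assumes the usual column layout (name, active_objs, num_objs, objsize, ...) and that the file is read as root; `top_slabs` is an illustrative name, and the size is only an estimate.

```python
def top_slabs(slabinfo_text: str, n: int = 5) -> list:
    """Return the n largest slab caches as (name, bytes) pairs.

    Data lines in /proc/slabinfo start with:
        name active_objs num_objs objsize ...
    so num_objs * objsize approximates the memory held by each cache.
    """
    sizes = []
    for line in slabinfo_text.splitlines():
        if line.startswith(("slabinfo", "#")):
            continue  # skip the version line and the column header
        parts = line.split()
        if len(parts) < 4:
            continue
        name, num_objs, objsize = parts[0], int(parts[2]), int(parts[3])
        sizes.append((name, num_objs * objsize))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:n]
```

A cache whose size keeps growing between snapshots while the workload is steady is a reasonable first suspect.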

B

Finally, for disk bottlenecks, we have a ton of tools in ZFS, being a file system, including zpool iostat, and you can add any number of flags. These were already shown by Brian, to show what kinds of things they're capable of displaying, but basically they can give you the latency, the size, and the queuing statistics of pretty much all the I/Os in the system.

B

You can add a -v if you're really crazy and want to see everything for every disk. The arcstats can give you a good idea of what's going on with the ARC, how much memory it's using and those kinds of things. And again, iotop can be really helpful in determining how busy disks are, because it just gives you a nice percentage number. For network bottlenecks, I don't have a whole lot to say here, because I don't deal with these too much, but basically,

B

However much data you send over the network, the only things you can do are either not send it or compress it, so you can try to address your problems like that.

B

As for new tools that are coming out, there's bpftrace, which was mentioned before. bpftrace can help you print information about kernel function calls, and the same goes for funcgraph. It's a similar utility, but it gives you the call hierarchy of everything. And finally, zfs_dbgmsg. For resources, these are all the things that I talked about, including this nice, really long bash one-liner that my coworker provided, and you can find all the rest of the stuff,

B

All these other tools, on GitHub, and, you know, use them as you see fit. Any questions?