Description
From the 2019 OpenZFS Developer Summit. Slides: https://drive.google.com/open?id=1oho9X5bkW-I-yJ-pVD8VqkaloxhGepzT
Just to give a quick background: as I said, I work at Delphix, and our product is basically an appliance powered by ZFS. Our customers run it either in the public cloud or on-premises on a hypervisor. We recently made a switch where we had been using illumos and we transitioned our operating system to Linux, and one of the main questions brought up during this transition was: can we, as an engineering organization, maintain the existing debugging processes that we have from illumos for issues that we encounter in production with Linux?
One of our main goals in that respect is to root-cause on first failure. Whenever we hit an issue in production, we want to debug it there and then. For performance pathologies, what this looks like most of the time is hopping on the VM, gathering some logs and, if that's not enough, tracing the system to analyze its runtime behavior. But for more severe issues, like panics or deadlocks, what we generally do is collect the crash dump and do post-mortem debugging, basically analyzing it with the custom in-house tooling we had on illumos.
Everything was built with these use cases in mind, so our procedures were created accordingly. Now, switching to Linux, many things carried over, but specifically for post-mortem debugging we found the workflows and the tools available not exactly sufficient for our needs. So at this point I'd like to take some time and talk about post-mortem debugging in general and, specifically, why it is so important for us at Delphix as an appliance company. Well, consider what would happen if we didn't have crash dumps, so we couldn't do any post-mortem debugging, and a customer VM crashes.
The only state that you have, the only information that you have related to the failure, is whatever the customer thought to mention at the time, whatever support thought to ask, and whatever logs were recorded while the VM was running that the developers thought, during development, would be useful. But most of the time this is not enough for us to root-cause issues. So what happens if we don't have crash dumps?
What would end up happening is that we would iterate with the customer, and by iterating I mean you basically add some more logging statements or change some code, give it to the customer, the customer runs it and reproduces the issue, and then they come back to you with the new data. Then maybe you can root-cause the issue, or maybe you need to go through another iteration, and you can see where this is going.
This is a very slow, very error-prone process, and if you've done this iteration four or five times, it can be a little bit embarrassing, because it looks like you don't know what you're doing. Now, I don't say that crash dumps and post-mortem debugging are a silver bullet; sometimes you still need to do all the things that I mentioned. But a crash dump is comprehensive.
Most of the time it has all the state that you would ever need to root-cause an issue, and it never lies; it's facts, not people's opinions. Another great point about crash dumps is that they decouple the activity of root-causing the failure from the process of restoring the system.
So when you look at the failure, the first time you see the message you can think, "oh, I can apply this workaround", and, having taken a crash dump, the customer can continue running their system while I root-cause the failure on my own. I mean, this is a great thing: it lowers the severity of the issue right away. So I have a real-world example here, something that George and Matt worked on two or three years ago.
So basically, what happened is that a customer hit an issue, and as the engineer handling the escalation you see something like this; it's very similar to the dmesg output that Tom showed earlier. You're handling the escalation and you want to start debugging the issue. So you look at the console and you think: okay, we're lucky, we hit an assertion failure, so we get a nice output that actually points out the file name and the line where the problem occurred.
In this specific case, we see that the problem is an assertion within zio_done(), so you start taking investigation notes. You look at the assertion, and basically what the assertion says is that the block pointer of that zio is not the same as the original block pointer, whatever that means for now; we're just looking at the code. Looking at the if condition, we see that, okay, this is a nopwrite zio.
Basically, what happens is that ZFS compares the checksums of incoming blocks to their destination blocks on disk, and if the checksums match, it means that nothing has changed, so we can actually skip doing that I/O. This is actually pretty common when you do full backups of large random-access files that have almost identical data.
So with that in mind, we basically rephrase the problem that we just saw: we're using a nopwrite, but the block pointer that we are about to write, that we would be writing, is not the same as what's on disk. So that's basically a bug in the nopwrite code. An extra note is that the block pointer is not an embedded block pointer. So straight off the bat we know that we can relieve the customer by disabling nopwrites right away, and here I have this one-liner.
There, we are basically passing a command to MDB that tells it to look up the zfs_nopwrite_enabled variable and write zero to it, basically disabling that functionality. But before doing that, we also generate a crash dump, so we can analyze it in-house. All right, so the system is running now, the customer is happy, but there's still this bug that other customers may be hitting.
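From memory, that one-liner was shaped roughly like this; this is a sketch of the illumos MDB write syntax, not a verbatim copy of the slide:

```
# echo "zfs_nopwrite_enabled/W 0" | mdb -kw
```

Here `-kw` opens the running kernel writable, and `name/W value` stores a 32-bit value at the named symbol, so this flips the nopwrite tunable off on the live system.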
So we basically have two choices. We can start reading the tens of thousands of lines of related code, the zio code, the DMU code, all the entrances to the write code paths; or we can start analyzing the crash dump that we got and ask targeted questions toward the root cause of the issue.
So I really like this example, because the zio code path is a little bit complex; George Wilson had a whole talk last year explaining all the pipeline stages of these things, and following the control flow just by reading the code is not very easy.
The other thing about this is that, looking at the stack trace, we can't tell where the zio came from. Basically, some kind of thread was spawned running zio_execute() right away, and it failed. What's even more interesting is that in the zio code, many times the thread that actually issued the zio may not be around anymore, because this may be an asynchronous write.
So even if you were to print all the stack traces in the system, there's still a possibility that you wouldn't be able to find the thread that actually issued the zio. What I'm trying to say is that you need to inspect the actual data on the system, specifically that zio, to see what it looks like, and there's no better place to do that than actually analyzing the crash dump.
So here's what we would do in illumos to start inspecting the zio. We would use MDB, the kernel debugger in illumos that works on live systems and crash dumps. We would print the stacks with function arguments, and we would look at the first function argument of zio_done(): it's a zio_t pointer, and we would print what the actual structure looks like. So in this case, what we're basically saying is: take that pointer, cast it to a zio_t, and print it.
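A rough sketch of what that MDB session might look like; the address is invented, and the specific members printed are my assumption about what the slide showed:

```
> ::stacks -m zfs
> ffffff01d2e48c68::print zio_t io_done io_bp io_bp_orig
```

The first command groups kernel stacks for the zfs module; the second takes the zio_t pointer from the failing frame and prints selected members of the structure.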
One interesting thing about this is that io_done is set to dbuf_write_override_done, which means that this is a write-override zio; this is going to be helpful later, I just wanted to point it out. And we can do cool things like examining what the block pointers look like. Now, I'm not going to go through what these commands exactly mean, but I'm basically pointing out that the zio that we were about to write looks like the first image, and the second image shows what the block pointer on disk looks like.
So these are obviously different; therefore it makes sense why the assertion tripped. But what's even more interesting is that the block pointer on disk is a hole, which means that the block pointer was freed. And when was it freed? We look at the birth time, and we see that it was freed at txg 3891. Okay, that's kind of interesting.
What's the current txg? We can actually walk our way from the zio pointer all the way to the spa structure, which basically describes a pool, and ask things like: what's the currently syncing txg? We see that the currently syncing txg is 3892, so basically that block was freed one txg before the current one.
All right, so that's kind of interesting. So we go back to io_done being set to dbuf_write_override_done, which is basically a callback, and looking at the code, there is only one place where io_done is set to that, and that's in dbuf_write(). All right, so we're going somewhere. Looking at the code around this code path, we find a new clue.
Basically, it's a comment telling us that this block was provided by open context, and we got here either via dmu_sync() or via dmu_buf_write_embedded(). So we have our two suspects for where the zio could potentially have come from. But if you remember, we made that note earlier that the block pointer is not an embedded block pointer, so it must be dmu_sync().
So we look at the dmu_sync() code and we stumble upon this interesting part, where there's this huge comment and an if case that potentially disables nopwrites. Reading the comment, what it basically says is that there's a possibility that our block may have been freed or its data may have been modified, so we should disable the nopwrite, because the context has changed.
What's even more interesting, though, is that the if statement checks whether the record has been dirtied, which means that it only checks whether the data was modified; it doesn't check whether the block has been freed. So, lo and behold, that's our problem. So the actual fix would be something like this, and this is what ended up happening: Matt and George reproduced this in-house, and they came up with this fix.
They basically added this extra case here in the if statement, and they restructured the comment to be explicit and say: hey, this could have been caught by the record being dirtied, the first condition, or by the block being freed, the second condition. So, case closed.
The problem was that we were using a nopwrite, but the block pointer underneath us had actually changed. The root cause was that dmu_sync() wasn't doing a good enough job of deciding whether it should disable nopwrites or not, and the fix was adding the extra check.
So just to do a recap, and I hope that was motivating enough: post-mortem debugging is very important, especially for us at Delphix. It allows you to examine the processor and in-memory state at the time of the crash, and what we've seen is that if you bundle it with zdb output, which is basically the corresponding on-disk state, most of the time that is all the state that you need to debug a ZFS issue.
It also decouples system recovery from the process of root-cause analysis. All these things are great, and crash dumps are very helpful, but there is no use for a crash dump if you don't have a tool that can analyze it efficiently. We did some research on post-mortem debuggers in Linux, and we found that there was some work there, but we couldn't find anything that was sufficient for our needs, so we leveraged some existing projects and we created sdb.
The Slick Debugger, sdb, is a post-mortem and live debugger. When I say live debugger, it means that you can attach it to a live system and introspect memory without actually halting the system; the system is still running, because you don't want to just halt a system in production. Its user experience is similar to MDB.
So, I could go on for many hours talking about sdb, but this is the ZFS conference, so we're going to look at some examples of debugging ZFS with sdb. Very similarly to what Tom showed earlier about the deduplicated stacks, most of the time you want to see what's going on in the system; and keep in mind that all the command output that I'm going to show in this presentation works both for live systems and for crash dumps.
You can see, for example, that in this system we have 64 threads that are NFS servers waiting for something and, all the way down (all these entries are sorted by count), some networking threads doing something. But we want to examine ZFS issues, so we can ask what's going on specifically in ZFS. So we say `stacks -m zfs`; basically, this says: show me all the stacks related to the ZFS module. Here we can see that, okay, 17 threads are waiting for something, and we have four quiescing threads.
We also have, at the very end, the dbuf evict thread doing something. And you can be even more specific than that; you can ask questions like: are there any threads issuing any ZFS ioctls? You can do that by writing something like `stacks -c zfsdev_ioctl`: basically, you tell stacks to filter for all the threads that have this function in their stack trace.
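As a sketch, the two sdb invocations look roughly like this; the output is omitted, and `zfsdev_ioctl` is my reconstruction of the ZFS-on-Linux ioctl entry point named on the slide:

```
sdb> stacks -m zfs
sdb> stacks -c zfsdev_ioctl
```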
So in this example we can see that, okay, there's one ioctl in flight for one operation, there's another ioctl getting the pool history, and there's another ioctl exporting a pool. But the real power of sdb is examining actual data structures. So in this example, let's say that we want to introspect a pool's state. All the pools are part of this AVL tree called spa_namespace_avl, and what I have here is basically the command that says: okay, I want you to examine whatever is at the address of spa_namespace_avl. You can see the fields of the AVL struct, and what you would do in general with another debugger, if you wanted to, say, traverse this AVL tree and see all the data in it, is basically start from the root: you can pipe whatever you found into the member command, which dereferences a member out of a structure, starting from the root.
Then you can print that root, and then you can go to the left child and then the right child, one by one. But this is not very scalable if you have, say, a million nodes. So, since we have sdb and we have pipes, we can just make a command that walks it for us. What's great about this (and it definitely drew from our experience with MDB, where you would do something very similar and pipe to the avl walker) is that in sdb we don't just pipe integers and pointers around; we pass whole objects with their own state, like type information, addresses, values, symbol names and things like that. So the walker can actually figure out: hey, I'm being passed an AVL tree, and I'm going to walk it appropriately. So in this case we see that we have four nodes in our spa_namespace_avl, and we can cast these nodes to spa_t structures and, lo and behold (the output was humongous, so I just printed the first one), we have a pool named "application".
We can get a pointer to its config, and so on and so forth. We can actually continue our pipeline and print all the names of the pools by just walking the AVL tree, casting everything to a spa_t structure, and dereferencing the spa_name member. And this is something that we've done so many times that we actually created a shorthand for it that also does some slightly nicer printing, so you have the addresses of the spa_t structures on the left and their names on the right.
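The pipeline and its shorthand look roughly like this; the exact command spellings are from memory of sdb and may differ slightly:

```
sdb> addr spa_namespace_avl | avl | cast spa_t * | member spa_name
sdb> spa
```

The first line walks the AVL tree of pools, casts each node, and dereferences the name; the second is the shorthand that prints spa_t addresses on the left and pool names on the right.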
So you can continue introspecting spa_t structures just like that. In this example, I'm just printing the syncing txgs of all these structures. But you might say: okay, we have all these syncing txgs, but that output is not that great; what if I wanted to know, specifically, what is the currently syncing txg of the rpool? These are just numbers here. Well, what you could do is actually use filter, and filter is a pretty powerful command. Let me show what filter does in this example.
First of all, we take all the spa_t structures, and we filter all the objects that are passed through the pipe, saying: okay, I just want the objects whose spa_name equals "rpool"; and then you can print the spa_syncing_txg. Again, we do this so often that we actually added argument parsing to the spa command, so we can do this right away; but filter by itself is just a powerful command.
I have another example, actually, of how good filter is, drawing from what we learned from Paul earlier about metaslabs being loaded. Let's say that we wanted to know how many metaslabs are currently loaded in the rpool. This is the pipeline that we would construct, and we can walk through it together; these are the relevant C structures in the code. So what happens is that we go through all the spa_t structures and look specifically for the one whose spa_name is "rpool".
We dereference the spa_metaslabs_by_flushed member, which is an AVL tree, and then we walk that AVL tree, which is basically the AVL tree that contains all the metaslabs in the pool. We cast its nodes to metaslab_t pointers, and then we filter all of these structures by the ms_loaded field, which basically indicates whether a metaslab is loaded or not. And then we can pipe all of that to count, which basically prints: okay, how many objects were passed through this pipe? So that's pretty cool.
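Put together, the pipeline reads roughly like this; this is a sketch, and the filter expression syntax and the `count` spelling are from memory and may differ in the real sdb:

```
sdb> spa | filter 'obj.spa_name == "rpool"' | member spa_metaslabs_by_flushed | avl | cast metaslab_t * | filter 'obj.ms_loaded' | count
```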
We can basically ask almost any question that we want, and we can create a pipeline to answer that question. Another thing, though, besides the pipelines, is pretty-printing.
So let's say that you wanted to print all the unflushed allocation segments in the rpool. You construct a pipeline that's similar: you walk all the metaslabs, you cast them, and you get the ms_unflushed_allocs member, which is a range tree; and we have a command where you pass in a range tree and it pretty-prints things. For example: okay, the first range tree didn't have any entries, with zero bytes in it, but the second range tree has 151 entries, and you can see them all printed nicely with their offsets and their lengths. And what's again cool about this is that the range tree command is not just a pretty-printer.
You can actually continue the pipe from there and say: okay, out of these segments, how many are above this offset? So you use the filter command at the end, saying that I want rs_start, the range-segment start, to be bigger than this value, and then you can count.

So, focusing a little bit more on this concept of pretty-printing: most debuggers, for something like a block pointer (here I dereference the block pointer of the uberblock), would print something like this. This is pretty standard: it's a C structure, so I'm going to show you each field and its value. That's good, that's helpful sometimes, but most of the time you have things like blk_prop, which is basically an integer where each bit range is a flag, or you have DVAs, data virtual addresses, which here are represented as two uint64_ts but actually mean so much more.
A
So
we
have
the
block
by
turn
command,
which
can
basically
teach
GDB
GDB
stb
to
make
things
more
readable,
decode
these
values
for
you
and
actually
present
useful
information.
That
answers
your
question
so
in
this
example,
we
decoded
the
DBA
words
to
the
axle
DVS
here
and
you
can
see.
Okay,
the
va0
is
an
addressing
vid
of
0
at
this
offset
with
that
length
continue
in
this
example,
you
know
blk
prop,
as
I
said,
it's
like
a
bit
field.
Where
now
we
can
pretty
print
and
say,
okay,
what
Oh?
A
This
big
number
means
is
that
this
this
is
a
level
0
block
we're
using
flat,
sir
for
for
checksumming
we're
using
LG
for
for
compression
and
so
far
and
so
forth.
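The idea behind such a pretty-printer can be sketched in plain Python: extract named bit ranges from a packed 64-bit word and map the enum values to names. The field layout and enum values below are illustrative stand-ins, not the exact on-disk blk_prop definition.

```python
def bits(word, lo, hi):
    """Extract bits [lo, hi] (inclusive) from a 64-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

# Hypothetical field layout for the packed property word.
FIELDS = {
    "level":    (56, 60),
    "checksum": (40, 47),
    "compress": (32, 39),
}

# Illustrative enum-value-to-name tables.
CKSUM_NAMES = {7: "fletcher4"}
COMP_NAMES = {15: "lz4"}

def decode_prop(word):
    """Turn the raw word into a human-readable dict, like a pretty-printer."""
    raw = {name: bits(word, lo, hi) for name, (lo, hi) in FIELDS.items()}
    return {
        "level": raw["level"],
        "checksum": CKSUM_NAMES.get(raw["checksum"], str(raw["checksum"])),
        "compress": COMP_NAMES.get(raw["compress"], str(raw["compress"])),
    }

# Pack a word with level 0, checksum 7, compression 15, then decode it back.
word = (0 << 56) | (7 << 40) | (15 << 32)
print(decode_prop(word))  # {'level': 0, 'checksum': 'fletcher4', 'compress': 'lz4'}
```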
Another very useful command, which is actually still work in progress (this is the only sample that I could get working for now), can actually print all the zios on the system; and again, this can be done either on a running system or on a crash dump. This is pretty powerful, because you can see the relations of all the zios, parent to child; you can see what type of zio each one is; you can see its current state; and you can see whether there is any waiter waiting on that zio. And with that, let me continue this small tour of commands.
There's a command that shows which caches hold the most active memory, and what's even more interesting is that you get a mapping of where each cache actually resides. In ZFS on Linux, depending on the size of the allocation, we do different things: for smaller allocations we use an underlying Linux slab cache, and for bigger ones we use caches maintained in the SPL layer. So that's pretty useful for knowing where to look for different things, especially since the Linux slab allocator does merging, basically putting two different caches together, sometimes just to save space. You can be even more specific and say: okay, arc_buf_hdr_t_full is a cache that's backed by the Linux slab, and it's actually merged into one of the generic Linux caches. Then, just adding on top of this command, you can see that, okay, it's using 316 kilobytes of memory, it's backed by a Linux slab cache named kmalloc-4096 whose total memory is 3.7 megabytes, and it's utilizing basically 8% of that cache. You can then break out of that and ask more questions, like: what does the actual cache look like? We have the slabs command, which does for Linux kernel slab caches what our other command does for SPL kmem caches, and you can look up that cache and see that the actual cache utilization is 65%.
A
All
right,
I
have
even
more
examples
and
I
can
we
can
talk
about
them
during
questions
so
and
if
we
have
time
but
I
just
want
to
quickly
go
through
how
SDB
actually
works.
You
can't
talk
about
s
DB
without
talking
about
dragon.
Dragon
was
a
is
a
small
C
library
in
Python
developed
by
Omar
Sandoval
at
Facebook.
It
basically
enables
the
Python
interpreter
to
introspect
live
systems
and
grass
tons.
It
comes
with
a
nice
Python
API
and
object
model.
It's
pretty
fast.
To
start
up.
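For a taste of what that Python-level experience looks like, a drgn session is roughly like this; this is a sketch, and the printed value is invented:

```
$ sudo drgn
>>> prog["spa_namespace_avl"]       # look up a kernel global by name
(avl_tree_t){ ... }                 # a typed object, not just an address
```

drgn's REPL pre-defines `prog`, a handle to the live kernel or a crash dump.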
A
It's
still
like
some
fixture
features
like
function
arguments,
but
we're
working
on
that
the
community
is
still
small,
but
they've
been
they've,
started
marketing
pretty
aggressively
and
it's
growing
and
their
overall
open
to
patches.
So
you
could
actually
do
a
lot
of
these
things
that
I
just
saw
just
with
dragon
as
long
as
you
are
willing
to
just
type
on
a
Python
or
apple
just
Python
code,
and
this
can
be
very
cumbersome
because
you
know
imagine
that
you're
debugging
an
issue
in
production
and
you
have
the
Python
or
Apple
in
front
of
you
like
now.
Your focus should be debugging, but you're actually distracted by trying to write a program, getting your spaces right and making sure that you don't get syntax errors. I mean, there is a reason why you use the shell for everyday things instead of something like a Python REPL. So, okay, that's drgn. What is sdb?
sdb is basically a layer that leverages the drgn API to provide the debugging experience that I just showed. It can be extended, as I said, in Python with new commands. These commands generally use the drgn API to query whatever they need from the kernel, and the code that you write to query the kernel becomes reusable through the sdb constructs, so a command can receive and pass objects through a pipe. And again, I want to point out that the pipe is a pretty powerful concept, because we're passing drgn objects that carry a whole context with them; they're not just pointers that we pass along.
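The object-passing pipe can be sketched in a few lines of Python; the class names here are invented for illustration and are not sdb's real API:

```python
class Obj:
    """Stand-in for a drgn object: a value plus its type context."""
    def __init__(self, type_name, value):
        self.type_name = type_name
        self.value = value

class Filter:
    """Pass through only the objects matching a predicate."""
    def __init__(self, pred):
        self.pred = pred
    def run(self, stream):
        return (o for o in stream if self.pred(o))

class Count:
    """Emit a single integer object: how many objects flowed through."""
    def run(self, stream):
        yield Obj("int", sum(1 for _ in stream))

def pipeline(stream, *cmds):
    """Chain commands left to right, like `a | b | c` in the sdb shell."""
    for cmd in cmds:
        stream = cmd.run(stream)
    return list(stream)

# Nine fake metaslabs, every third one loaded; count the loaded ones.
metaslabs = [Obj("metaslab_t *", {"ms_loaded": i % 3 == 0}) for i in range(9)]
out = pipeline(metaslabs, Filter(lambda o: o.value["ms_loaded"]), Count())
print(out[0].value)  # 3
```

Because each stage receives whole objects rather than bare addresses, a stage can inspect `type_name` and decide how to handle its input, which is how a generic walker can recognize an AVL tree.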
So just to recap: sdb is a debugger for live systems and crash dumps. It leverages the drgn library to introspect its targets. It can be extended in Python to walk the complex data structures that developers write, and to provide things like filters, aggregators, and pretty-printers of data. Basically, what we're going for here is allowing the user to ask almost any question that can be answered given the available state that they have.
No, it's not; it's actually out. So the question was: where is this tool located, and is it a work in progress? Yes, it is a work in progress, but we've already started using it, and you can use it too. I have a resources page with a GitHub repo; it's currently out there under the Delphix GitHub organization, but you should be able to clone it and work on it, and I'd be more than happy to accept patches too.
One discussion that I would actually like to bring up at some point is potentially having some sdb scripts decoupled from the sdb repo and living in ZFS. Ideally, when a developer introduces new code and adds new structures or makes some changes, they would also change the corresponding sdb Python scripts together with it. This is kind of like the model that we had at Delphix before, with illumos and MDB, and it's something that has worked pretty well for us. It also decreases the burden of having to maintain two different things in two different repos. And when you submit a PR, at least for me as a reviewer, it makes me a little bit more confident to see that you've actually added some scripts together with your code: if something fails in production and I have no idea about your feature, I can at least see what's wrong using the commands that you provided.
Yes, so during the transition we actually looked at a lot of different things. You could ask: why didn't we just use GDB or crash or something else existing in Linux, and why did we not port MDB? There are multiple trade-offs to all of these things. Specifically for MDB, the porting effort is pretty big, meaning that MDB, even if we just ported it as it is out of the box, currently works only with CTF.
A
It
has
some
dwarf
support,
basically
kind
of
like
translating
dwarf
to
CTF
on
the
fly,
while
reading
it
and
I
actually
work
with
Robert
Misaki
to
do
that
initially,
but
we
had
so
many
problems,
because
city
of
convert
couldn't
handle
new
dwarf
constructs
after
Dorf
version,
2
and
other
weird
things
like
that.
So
basically
porting
wouldn't
be
that
easy.
Now
you
could
say:
okay,
like
you,
already
have
this
infrastructure
right
built
on
this,
like
don't
you
care
about
that
too?
But it turns out that once we found something like drgn, which does most of the stuff that we currently need, implementing a new interface on top of it is not a lot of engineering effort, and it already works with the Linux ecosystem: the DWARF symbol information that's everywhere, the different kinds of crash dumps. We wouldn't have to deal with writing that code anymore, and drgn was something that was already in use.
All right, so I just have a future-work slide; these are some of the things that we're thinking of doing in the future, and I just wanted to point them out, along with the GitHub repo. As for the actual community, you can start by checking out the GitHub repo, and you may actually find some references to something that we've started doing.
Basically, we've created a small organization through which we're trying to attract people working across the Linux debugging landscape, from debugger authors to actual kernel developers, for example people changing the crash dump format and things like that.
Unfortunately, I don't have a slide with all this information, but I'd be more than happy to share it with you. We have monthly meetings, and we're basically trying to sync up all together to make sure that there's no duplicate work, that no two people are working on the same things, and to basically make decisions on how we want things to look in the future.

All right; oh yeah, the first thing over there in future work is that we need more commands, for ZFS or even outside of it if that's your jam. I'll be at the hackathon tomorrow, and I'd be more than happy to help everyone write new commands, or even just set up sdb and use it to introspect the system.