From YouTube: Device Removal by Matt Ahrens
Description
From the OpenZFS Developer Summit 2018
Slides: https://docs.google.com/presentation/d/1u4gIGHJCbKAUxpFjU6VpUJaTX_pHAyT_HVpQJxUOFR8/edit?usp=sharing
A
So our next presentation is going to be about device removal, by Matt Ahrens, and probably most of you have already heard about this. It's a new feature which has been in the works for several years, and we've been using it at Delphix for several years as well. It made it to Linux relatively recently, I believe. So Matt is going to give a good deep dive into device removal and its internal functioning. So please welcome Matt.
B
Thanks, everyone. So yeah, as people mentioned, some of you might know what device removal is, but I'm going to start by explaining: what is the point of this? What I'm talking about here with device removal is, for example, in this case, we have a pool with three top-level vdevs, three mirrors; each one is a mirror of two disks. What I want to do is remove this whole mirror, reducing the total amount of space in the storage pool.
B
So this is in contrast to, for example, detach, where I have a mirror and I want to remove one side of the mirror. You've been able to do that since the beginning, but that doesn't actually reduce the total amount of space in your pool, and it's specific to mirrors. So what is the point of this? Why would you want to do this? Oh, sorry, before I continue on: I notice that this is not being rendered correctly.
B
All right; ah well, it renders differently there, fine. What is the point of this? Why would you want to do this? Well, when we were designing this feature at Delphix, back in 2012 maybe, one of the main use cases, which was coming from our field people as well, was: what if customers over-provision? They add too much storage to the storage pool, or they have a temporary project where they think, oh, I need a bunch more storage, just for the next month.
B
While we do this project, and then after that I want to, you know, remove that space and use it for something else: for a different storage pool, or something else entirely. Has anybody had this problem? No? Wow. Oh, one person has over-provisioned their pool, yeah. This turned out to not really be the real use case.
B
Another big use case is if you add the wrong disk, or add the vdev as the wrong type: like you meant to add it as a log device, but you added it as a regular device, and it's kind of stuck in there forever. Don't worry, I won't ask if anybody has experienced this use case. But the main use case that we've actually seen in practice is storage migration. So the idea here is: you have a storage pool.
B
Basically, you want to remove all of the disks and migrate the storage to all-new disks that might be a different size or a different number. So in this example, say I have ten 1-terabyte drives and I want to replace them with four 6-terabyte drives. There are some special cases where you could do this in a super hacky way before this project, but this makes it a really first-class operation. So, right: how do we do this?
B
My talk is about how we do device removal. All that we need to do is find all the allocated space in this storage pool, which is represented by these blue and purple squares, and then allocate new space for it on the remaining devices. In this case, I want to remove the device on the left here.
B
So, as I mentioned, we need to keep track of the mappings from the old locations to the new locations, and that uses memory. So in this case, we want to remove this device, c2t1d0. We run zpool remove with the -n flag for a no-op dry run, and it tells us: great, after you remove this, we think it'll use 37 megabytes of memory. Then you just run it without the -n and it kicks it off.
B
It runs in the background, and while it's running you can check the status; it shows up in zpool status, so it tells you, oh, I'm in the middle of removing this device. We also call it device evacuation, to be really explicit about the fact that, hey, something has to happen to remove this: it's not like you run zpool remove and then yank the device out. You run zpool remove, and then we evacuate the device by getting all of the allocated data off of it and onto the other devices.
B
If you want to cancel the removal, you can cancel it, and we just set everything back the way it was. You might want to do this if, say, you started a removal on Friday night, and then it's Monday morning and you realize: oh whoops, this is taking longer than I thought, people are going to be coming in to work and they need really good performance. So let's cancel it and rethink what we're doing here.
B
If you lose power or you reboot during the removal, it'll pick up where it left off; it remembers all of its progress. And you can do everything else while you're in the middle of the removal: you can add new devices, you can take snapshots, everything. Then, after you complete the removal, zpool status tells you about that as well, and it tells you how much memory is being used by the mapping.
B
Okay. So now, if you didn't buy my three-line explanation of how it works, this is how it really works, in detail, if you want to follow along later. The first thing I'm going to talk about is the removal process, and after that I'll talk about what we need to do, after removal has completed, to handle the new state of things. This is mostly happening in vdev_removal.c; there's a bunch of really huge comments in there explaining this stuff in more detail.
B
You can check that out later. So first I want to mention that we start by checking the removal type. You've actually always been able to remove devices from a pool, so this ioctl is not new; it's just that most removals that you might want to do would result in EINVAL. So first we check which type it is: you've always been able to remove inactive hot spares, remove cache devices, and remove log devices, but this talk is about removing top-level devices.
B
First of all, you need to have enough free space because, like I said, we need to allocate new locations for everything that's on that disk. And one little caveat that we ran into when doing some testing, accidental testing: we require that you have enough space plus a little bit. The little bit that we decided on is based on the slop space in the pool. The slop space is just three percent of the total pool size, but it's the total current pool size.
B
So what that means is that if you want to remove a device which, on its own, is more than 97% of the size of the pool, you can't do it, because of this check. We could relax that, but hopefully you don't hit it accidentally; hopefully you aren't testing and meaning to add a one-terabyte device but accidentally add a one-petabyte device. Yeah, that's how we discovered that.
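(As a rough illustration of that check, here is a minimal sketch in C; the names and the exact slop fraction are illustrative, not the actual OpenZFS code.)

    /*
     * Sketch of the free-space check: removal needs enough free space
     * on the *other* devices to hold everything allocated on the
     * removing vdev, plus the slop space.
     */
    #include <stdint.h>
    #include <stdbool.h>

    /* Slop is roughly 3% of the current total pool size. */
    static uint64_t
    slop_space(uint64_t pool_size)
    {
        return (pool_size * 3 / 100);
    }

    bool
    removal_space_ok(uint64_t pool_size, uint64_t pool_alloc,
        uint64_t vdev_size, uint64_t vdev_alloc)
    {
        uint64_t free_elsewhere =
            (pool_size - vdev_size) - (pool_alloc - vdev_alloc);

        /*
         * A vdev that is itself more than ~97% of the pool can never
         * pass: everything outside it is smaller than the slop.
         */
        return (free_elsewhere >= vdev_alloc + slop_space(pool_size));
    }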
B
Next, there can't be any known damage. This is just a kind of sanity check: if you're trying to remove the device and we know that the device is missing some data, we're not going to let you do that. And then, lastly, the blocks have to have the same on-disk layout, which means that all the devices have to have the same ashift; and, unfortunately, you cannot have any RAID-Z. I'll get to some possible future work there later, but for right now it works with plain disks and it works with mirrors.
B
It does not work with RAID-Z. Okay, cool. So now that we've decided we're actually doing this device removal, we start by disabling allocations to the device, so we won't have any new writes to that device while we're in the middle of the removal. And then, to hopefully make sure that we really, really don't have any writes to it, we do this spa_reset_logs(). Does anybody know what that means? What does this function do? No? Excellent. Oh, people know. So what spa_reset_logs() does is basically clear all the ZIL logs and then reallocate them. The reason we need to do this is that the ZIL is a singly linked list of blocks: we've always already allocated the next block that we're going to write to, but we don't know when we're going to need to write to it.
B
So we've already allocated it; it might be allocated on the device that we want to remove, and we don't want to have to handle, in the middle of the removal, a write to the device that we're trying to remove. So by resetting the logs, we make sure that they all get reallocated, and since we're doing it after we've disabled allocations to this device, they'll all get allocated on the remaining devices, not the one that we're trying to remove. And then we kick off a sync task.
B
A sync task is like a callback that runs from spa_sync while we're syncing out a txg, and that's going to initiate the removal. What that does is initialize the on-disk state that says we're in the middle of a removal and have made zero progress, and then it kicks off this new thread. This is pretty different from a lot of the other background operations. If you think about what kinds of background operations you might normally see in zpool status, like scrub and resilver...
B
Those do not work by creating another thread; they work by doing all of their work in syncing context. So when you're doing a scrub, spa_sync does all the normal writes, and then it's like, great, now there's some time reserved for doing scrub. And this is not really great for performance, because it means that you're basically taking some overhead out of your overall possible write throughput.
B
The MOS is the meta-object set, and that's where we store all the pool-wide metadata: for example, how much progress we've made on a removal, or what the mapping is between the old and the new locations. We can't actually modify that from this thread, so I'll get into how we do that later on.
B
Okay, so we kicked off this new thread. What does the thread need to do? First, we need to start by finding the allocated space to copy. The interesting thing here is that we're doing this by looking at the space maps, not at the block pointers. If you're familiar with zpool scrub or resilver, those go through and find all the block pointers, which means they have to traverse the whole tree of indirect blocks and everything in the storage pool.
B
I don't know if anybody's noticed, but in some circumstances the scrub and resilver don't have the best possible performance, and removal is something that you might want to do when you're not in a disaster scenario, so we wanted to make sure it performs well; it's already slow enough as it is, so I'm glad that we did it this way. We find the space by looking at the space map, and what that means is that we can find the allocated space in order by offset on disk.
B
So we're able to do the reads from the target device starting from offset 0 and then increasing, while skipping over the parts that aren't actually allocated. So we get fast discovery of the data to copy, and we get sequential reads; the caveat is that there's no checksum verification. The checksums are stored in the block pointers, and we aren't finding all the block pointers; we're just finding what is actually allocated. So we aren't able to verify the checksums during this, and what that means is that, for the most part, everything works great, and I'll explain how this works with mirrors and data integrity a little bit later on. But it does mean that transient errors can become permanent errors. So if I read from the device and it says, here's the data, we trust it; we write that to the new location. If it actually gave us the wrong data, then, well, most of the time, if it gave us the wrong data once, it's going to keep giving it to us wrong forever.
B
Okay, all right. So we found the space, we allocate a new place for it, and we keep track of the mapping from the old to the new locations; I'll get into the mapping in a little bit more detail later. In order to find the allocated space, we're iterating over the metaslabs in the device that we're trying to remove, loading each metaslab's space map into a new range tree, this svr_allocd_segs.
B
That tells us what we're working on copying right now, and then we find the next chunk to copy. The simplest way, and the first way that we did this, was to just say: find the next allocated region, and however big that allocated region is, that's what we're copying. But this could result in a large number of mappings, and thus a large amount of memory used, especially on very fragmented pools.
B
So in this example, the red blocks are allocated, the green blocks are free, and each one of these is one sector, one ashift-sized unit, four kilobytes in this example. Here, each of the runs of free blocks, or free sectors, is less than or equal to 32K, so I can actually allocate almost a whole 16 megabytes for this and copy this whole almost-16-megabyte region in one go, as you'll notice.
B
So we're going to read this whole almost-16 megabytes, we're going to allocate a new almost-16 megabytes for it, and then write it all to the new location, including the free space. So we're actually allocating space for those unused sectors, reading those unused sectors, and then writing them again. And again, the reason to do this is to reduce the number of mappings.
B
We could say, well, we know what's allocated and freed, so allocate that big chunk but just do the reads for what's actually allocated and just do the writes for what's actually allocated. But that actually tends to result in worse performance, because making the disk skip over little bits results in worse performance. And that's one reason; the other reason is that just having a lot more I/Os to deal with at the software level has overheads. And actually, at the vdev queue layer, we're already aggregating reads across spans of up to 32K. So I could have issued a whole bunch of reads for each little allocated bit, but then the layer below me, the vdev queue layer, would have just said: oh, I noticed that you read things with a little gap there; let me just do one big read and then copy out the parts that you want. Which would have just been wasted effort.
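(To make the chunking concrete, here is a minimal sketch of picking the next extent to copy: start at the next allocated sector, extend across free gaps of at most 32K, and stop at 16MB. The real logic operates on ZFS range trees in vdev_removal.c; the array representation and names here are illustrative.)

    #include <stdint.h>
    #include <stddef.h>

    #define MAX_CHUNK (16ULL << 20)  /* largest block ZFS can allocate */
    #define MAX_GAP   (32ULL << 10)  /* span free gaps up to 32K */

    typedef struct seg {
        uint64_t s_start;  /* inclusive */
        uint64_t s_end;    /* exclusive */
    } seg_t;

    /*
     * Given allocated segments sorted by offset, compute the extent
     * [*startp, *endp) to copy next, beginning at segment index i.
     * Returns the index of the first segment not included.  (In the
     * real code, the remainder of a segment clamped at 16MB would be
     * picked up by the next call.)
     */
    size_t
    next_chunk(const seg_t *segs, size_t nsegs, size_t i,
        uint64_t *startp, uint64_t *endp)
    {
        uint64_t start = segs[i].s_start;
        uint64_t end = segs[i].s_end;

        while (++i < nsegs &&
            segs[i].s_start - end <= MAX_GAP &&   /* small enough gap */
            segs[i].s_end - start <= MAX_CHUNK)   /* stays under 16MB */
            end = segs[i].s_end;

        if (end - start > MAX_CHUNK)
            end = start + MAX_CHUNK;
        *startp = start;
        *endp = end;
        return (i);
    }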
B
So, right: I said that we could have copied that whole 16 megabytes, but instead we're going to do a little bit less. Why do we want to do that? Well, the reason is that we want to minimize the number of split blocks. A split block is... so, I showed what's allocated and what's freed, but we don't actually know where the blocks are. For example, we might have this whole 16-megabyte region all allocated, but it's actually a bunch of 20-kilobyte blocks next to each other. So, from the logical point of view, I have a whole bunch of 20-kilobyte blocks, and they all happen to be allocated right next to each other, nice and contiguous; that's just what we want. But from the space maps we don't know that; we don't know where the block boundaries are. All that we know is: hey, this whole 16-plus megabytes is all allocated, and I need to copy all of it. Great. So here I've actually labeled the blocks.
B
So these split blocks: we want to avoid them when we can, but they're unavoidable in some cases, like the second case here, where 16-plus megabytes are all allocated. We have this constraint that says you can't copy more than 16 megabytes, so I guess I've got to do exactly 16 megabytes and just hope for the best. The reason that we have the 16-megabyte constraint is that that's the biggest block that ZFS can allocate.
B
So in theory we probably could have taught the allocator to be able to allocate bigger things, but in practice we felt that 16 megabytes is big enough, and even if we could extend it arbitrarily large, we know that we still have to deal with the case where we can't allocate it. Because, as we'll see later on, we might say, great, 16 megabytes, let me go allocate 16 megabytes, but there isn't 16 megabytes of contiguous free space, and so we have to chunk it up smaller anyway.
B
Okay, right, so that's kind of what I said: we may have to chunk it up into smaller pieces. If the allocation fails, we have to split it into two smaller allocations, and then we learn from that. So for the rest of this transaction group, we won't go back and try 16 megabytes every time: oh great, 16 megabytes; oh, that didn't work, let me go to half of that; oh, next allocation, 16 megabytes; oh, that still didn't work, surprise, surprise.
B
Let me go back to the smaller one. That would have been very wasteful, and we also wanted to make the back-off go to the exact size available rather than doing an exponential back-off. So it'll try to allocate 16 megabytes; if that doesn't work, it'll split it in half, so 8 megabytes and 8 megabytes, but next time we'll try 16 megabytes minus one sector. So we'll eventually find, okay, this is the actual largest segment that we can allocate in the pool in this txg, and then we'll be able to go really quickly allocating those. This kind of algorithm is very important when your pool is actually fragmented, when all the space is fragmented. If you just added a whole bunch of empty disks, you don't have this problem, but we wanted this to work in all cases, not just the ones that are easy.
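(A rough sketch of that back-off, assuming a hypothetical try_allocate() primitive; the real code remembers, per txg, the largest allocation size that has worked.)

    #include <stdint.h>
    #include <stdbool.h>

    #define SECTOR    (4ULL << 10)
    #define MAX_CHUNK (16ULL << 20)

    /* Hypothetical: true if 'size' contiguous bytes were allocated. */
    extern bool try_allocate(uint64_t size);

    /* Largest allocation size known to still work in this txg. */
    static uint64_t max_alloc = MAX_CHUNK;

    uint64_t
    alloc_chunk(uint64_t want)
    {
        uint64_t size = (want < max_alloc) ? want : max_alloc;

        while (size >= SECTOR) {
            if (try_allocate(size))
                return (size);
            /*
             * This size failed: start future chunks in this txg just
             * below it (exact back-off, not exponential), and satisfy
             * this chunk with a half-sized piece.
             */
            max_alloc = size - SECTOR;
            size /= 2;
        }
        return (0);  /* no free segment left */
    }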
B
Okay, cool. So then we read from the old location and write to the new, and then we need to free the unused parts of the new location. Because, like in this first example, I'm allocating that whole almost-16 megabytes, which includes all those free bits, but nobody's using those free bits; there aren't any block pointers referring to them. That's the definition of being free. So we need to free them from the new locations after we've done this.
B
Cool. So let me add one more wrinkle of complication to this before I explain the mapping structure. What if you have mirrors? Say you have this example, where I have a pool with three mirrors of two disks each, and I want to remove the one on the left.
B
The mirror is healthy, right, because we said we don't do removal while the mirror is unhealthy, but there might be unknown damage: there might have been some silent damage to one of the sides of the mirror. We want to be able to handle that in ZFS, and we can in all the other cases, so I figured we should be able to handle it here too.
B
The way that we do that is, rather than reading from the mirror normally, where the mirror would say, great, you want to read? Let me choose one side at random, and there's the data, because the two sides of the mirror are the same, right? They're supposed to be, but they might not be, because we don't trust the hardware. So instead, what we do is we read from both.
B
We read from both sides of the mirror and write each side to the corresponding side of the new location's mirror. So in this example, the dark blue and purple blocks we're reading from the left side of the removing device and writing to the left side of the new locations, and the light blue and purple we're reading from the right side of the removing device and writing to the right side of the new locations.
B
Okay, so how do we do that? Keeping with the example of the two-way mirror, we create a zio tree. This harks back to George's talk; hopefully you were all paying attention, and you know by looking at this diagram exactly what will happen. This tree is showing the zio dependencies, and remember, from George's talk, the children have to complete before their parents.
B
So here we have one side of this for each side of the mirror that we're accessing. The one on the left here is reading from the left child, child zero, and then writing to child zero of the new location; on the right here, we're reading from child one and then writing to child one of the new location. And then the null zio is the root of this whole thing.
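(In rough outline, the copy of one segment of a two-way mirror builds that tree as below. Every helper name here is a hypothetical stand-in for the real zio machinery; the key points are that children complete before their parent, and that each side's write is issued from that side's read-completion callback.)

    #include <stdint.h>

    typedef struct zio zio_t;

    extern zio_t *null_zio(void);                 /* the root */
    extern void read_side(zio_t *parent, int side, uint64_t offset,
        uint64_t size, void (*done)(zio_t *, int));
    extern void write_side(zio_t *parent, int side, uint64_t offset,
        uint64_t size, zio_t *data_from);
    extern zio_t *parent_of(zio_t *zio);
    extern uint64_t size_of(zio_t *zio);
    extern uint64_t new_offset_for(zio_t *zio);

    /* Called when the read of one mirror side completes. */
    static void
    copy_read_done(zio_t *rd, int side)
    {
        /* Write this side's data to the same side of the new
         * location, as another child of the null root. */
        write_side(parent_of(rd), side, new_offset_for(rd),
            size_of(rd), rd);
    }

    void
    copy_segment(uint64_t old_offset, uint64_t size)
    {
        zio_t *root = null_zio();

        for (int side = 0; side < 2; side++)
            read_side(root, side, old_offset, size, copy_read_done);
        /* The root completes only after both reads and both writes. */
    }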
B
The sync task is basically saying: I just need you to call this function from syncing context and pass these arguments to it, and I'm going to be sitting here waiting for you until you do it. Basically, it's just a context encapsulation mechanism, but you could almost think of it as: if we had a more powerful programming language than C, you could just say, here's a block of code, run it in this other context, great. That's essentially what we're doing with the callback and the arguments and whatnot. But critically, the arguments are normally static through this whole process. What we're doing here is kicking off this sync task, but we aren't waiting for it: we kick it off and we give it the head of a linked list that we're then continuing to add to, even though we've already dispatched the sync task. So there's some trickiness here, where we keep track of which transaction group the copy is going to complete in, and we make sure that we're adding to that one's list. We know that that txg is open, that it hasn't started syncing, so we know that we haven't started processing the sync task yet while we're modifying its linked list.
B
So this is a little bit tricky, but it actually worked out really well with the infrastructure that we had, and we found this paradigm to be really useful in a bunch of other scenarios, where you want to be doing something in the background from an open-context thread and yet have it modify things that are stored in the MOS in syncing context. So we've repeated this: I think the redacted send/receive stuff does this when you're creating the redaction list, and a couple of other things do as well.
B
So when we get a zio read, it's going to say: zio read, device ID 0. Oh, that's gone; what do we do? Well, we need to know where we should read from instead, so we keep this mapping. It maps from the old offset and length on the removed device to the new device and offset. I'm showing it here as a table, and this is how it's represented on disk and in memory.
B
It's just an array of structs but, critically, it's sorted by the old offsets. What that means is that when we want to do a lookup in it, we can do it with binary search, so we can look up in O(log n) time. And because we're generating the mapping by going through the space maps in offset order, we generate it in this sorted order naturally, so there's no post-processing step that we need to do in order to sort it.
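(A minimal sketch of that lookup; the field names are illustrative, the real structure being the indirect mapping in the OpenZFS sources.)

    #include <stdint.h>
    #include <stddef.h>

    typedef struct mapping_entry {
        uint64_t me_old_offset;  /* offset on the removed vdev */
        uint64_t me_size;
        uint64_t me_new_vdev;    /* where the data lives now */
        uint64_t me_new_offset;
    } mapping_entry_t;

    /*
     * Return the entry containing 'offset' on the removed device, or
     * NULL if that offset was never mapped (it was free).  The array
     * is sorted by old offset, so plain binary search works.
     */
    const mapping_entry_t *
    mapping_lookup(const mapping_entry_t *map, size_t n, uint64_t offset)
    {
        size_t lo = 0, hi = n;

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (offset < map[mid].me_old_offset)
                hi = mid;
            else if (offset >=
                map[mid].me_old_offset + map[mid].me_size)
                lo = mid + 1;
            else
                return (&map[mid]);
        }
        return (NULL);
    }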
B
So this sync task, all that it's doing is: I have a linked list of structs, where each row here is one of these structs, and it just says, great, take that struct and plop it into the object in the MOS; append it to that object in the MOS. Great, we're done copying.
B
We have everything in its new location; we know what the mapping is between the old and new locations. Now we just finish up. We free the space maps of the device we're trying to remove, because we don't need them anymore; there are some other little bits associated with the removing vdev that we get rid of; and then we replace the vdev, which in this example is a mirror.
B
Going back to the first example, the one where the gaps are less than or equal to 32K: maybe surprisingly, the first example here is 0% fragmentation, and the second example here is high fragmentation, like more than 70 percent fragmentation. Both of those work really well.
B
But the worst case possible would be where we have a run of free space that's just more than 32 kilobytes, and then we only have one sector allocated in between. In this case it could be really bad: 600 megabytes of mapping per terabyte of disk, and that's per terabyte of space on the disk, not of allocated space; the allocated space would be much less, some small fraction of that, because only about one out of every sixteen sectors is actually allocated. But the worst that we've actually seen in practice is about 100 megabytes per terabyte, and this is with fragmentation between those two examples, so less than 70 percent but, I think, more than 20 percent. We found that this is pretty tolerable. It had actually been much worse: we originally didn't implement this gap-spanning mechanism, and then it was a lot worse.
B
So, all right, we've talked about the main removal process, what that thread needs to do to do the removal. But what can happen while we're in the middle of the removal? Because remember, at the beginning, when I explained how you use this, I said everything just works: it runs in the background, and you can do whatever you want while you're in the middle of it. So what do we need to do to support that?
B
Remember, we've allocated new space for some of this data, but not all of it. So what if we need to do a free in the middle of a removal? Well, first we can free it from the old location; we know that's not needed anymore. And then there are a couple of cases that we need to think about; we might be in one of these three cases, or more than one, as it turns out. If we've already fully copied this region of space, we've already written it, so we free it from the new location as well.
B
If we haven't started copying it yet, then we don't have a new location for it, but we want to make sure that we don't copy it. Remember, I loaded the space map into this svr_allocd_segs range tree, so I might need to remove it from there, or I might not. It might be that I'm in the middle of copying this metaslab and I'm freeing something from over here, where it's not relevant to svr_allocd_segs; in that case we just ignore it. But it might be in svr_allocd_segs, which is telling us what we are going to be working on copying: we haven't started copying it yet, but we've figured out what we want to copy from this metaslab, so we need to remove it from there. Or it might be in flight, meaning that we've allocated a new place for it and we've issued the read for it, but the mapping hasn't yet been synced to disk, and the write might or might not have completed.
B
So we remember the range that needs to be freed in this svr_frees, which is a range tree that we index by transaction group, and then, when that transaction group syncs, as part of that sync task I mentioned, we're also going to free everything that is in that range tree. And you might be in multiple, or all, of these categories, because a free is a range.
B
It's like: free this one megabyte. And the thing that's going through and copying stuff is operating on whatever increments it chooses, so it might be that, of that megabyte, a little bit of it we haven't started copying, a little bit of it is in flight, a little bit of it is in flight in a different txg, and a little bit of it has already been fully copied. The routine that does all this is extensively commented.
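(A sketch of that dispatch; the helper names are hypothetical stand-ins for the real range-tree operations in vdev_removal.c.)

    #include <stdint.h>

    extern void free_old_location(uint64_t offset, uint64_t size);
    extern void free_mapped_new_locations(uint64_t offset, uint64_t size);
    extern void remove_from_allocd_segs(uint64_t offset, uint64_t size);
    extern void add_to_svr_frees(uint64_t offset, uint64_t size,
        uint64_t txg);

    /*
     * Free a range that lives on the removing vdev.  One freed range
     * can overlap several cases at once; each helper is assumed to
     * act only on the overlapping portion.
     */
    void
    free_during_removal(uint64_t offset, uint64_t size, uint64_t copy_txg)
    {
        /* Always safe: nothing will read the old location again. */
        free_old_location(offset, size);

        /* Already copied: free the corresponding new locations,
         * found via the mapping entries synced so far. */
        free_mapped_new_locations(offset, size);

        /* Queued but not yet copied: pull it out of svr_allocd_segs
         * so the copy thread never touches it. */
        remove_from_allocd_segs(offset, size);

        /* In flight: the mapping isn't on disk yet, so record the
         * range in svr_frees, indexed by the txg in which the copy
         * completes, and free the new location when that txg syncs. */
        add_to_svr_frees(offset, size, copy_txg);
    }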
B
So we cannot just go through the mapping and say, whatever the mapping points to, free that, for two reasons: one is that the mapping could span freed chunks, and two is that we could have had those concurrent frees happen, so something that was originally in use isn't allocated anymore, and if we tried to free it, we might be freeing somebody else's stuff. So instead we go through the device that we're trying to remove.
B
Okay, that's what happens while we're removing a device, and I still have some time, so that's good, because now I want to talk about what we need to do after the removal has completed. How do we deal with it? The first thing that you might need to deal with is opening your pool. When you open the pool, the first thing that we do, after we read some stuff off of the labels, is go into the MOS and read the MOS config object.
B
This tells us what devices are there and what the device IDs are: oh, vdev ID 0 is a mirror and it has two children, and the two children are device IDs x and y. Everything needed to get to that cannot be on indirect vdevs. I should say, all this stuff that I'm talking about here is in vdev_indirect.c. As I mentioned, the device that we remove we call an indirect vdev afterwards. It doesn't show up in zpool list or anywhere in the CLI interface, but under the covers there's still a vdev with that ID: you remove vdev ID 0, but afterwards you still have a vdev with ID 0.
B
The representation of it on disk is the same as in memory, which is really nice: it's just this array sorted by offset. But the tricky thing here is: what if I have removed multiple devices? Then the first device that I removed, its indirection table might be on an indirect vdev. So I need to load these in the right order, so that by the time I get to that first one, I already have all the mappings that I need in order to load its mapping. So we have to load them in reverse chronological order, because older mapping objects may be on more recently removed indirect vdevs.
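(The loop shape is just this, with a hypothetical loader:)

    /* Load indirect mappings newest-removal-first, so that a mapping
     * object living on a later-removed vdev can itself be remapped by
     * the time we read it. */
    extern int nremovals;
    extern void load_mapping(int removal);  /* hypothetical */

    void
    load_indirect_mappings(void)
    {
        for (int r = nremovals - 1; r >= 0; r--)
            load_mapping(r);  /* reverse chronological order */
    }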
Okay, cool. So we've opened the pool, everything's cool. Now, what operations might we need to do that would interact with this indirect vdev? Well, we might need to read from it; that's the most common one. When we read from an indirect vdev, we go through the indirection table.
B
That table tells us where we need to actually read from, and this is pretty straightforward: there's this function vdev_indirect_remap(), where basically you give it a callback, and it calls your callback telling it what the new location is, or what the new locations are. Because, remember, we might have split blocks, meaning that for one logical block, part of it we moved to one location and part of it we moved to another location. And it might not be just two locations, like in my examples;
B
it might be a thousand locations; in practice probably not a thousand, but it could be more than two. And so we did some work to make sure that we handle that. So you have this split, but then also, in addition to being a split block, it might be multiply indirected. So I had to come up with all these scenarios to test this.
B
One of the hardest scenarios is: I have two devices, I remove this one, and that copies everything over to here; then I add it back as an empty device and I remove this one, so it has to copy everything over here; and then I add it back as an empty device and I remove this one. So I'm just migrating the data back and forth, and back and forth, and back and forth.
B
So every single thing in the whole pool is indirected through as many indirect vdevs as I've done removals. The test suite has cases that do this, where it just does remove, add, remove, add, remove, add, and you can get hundreds of layers of indirection. So we needed to implement this function non-recursively. The obvious way to do it is: okay, great, I go to the indirect vdev; it tells me, here's the new location, and the new location is on some vdev, and it doesn't matter what kind of vdev it is; then I just do the read on that. But that one may also be indirect, and that could result in recursion. So we made this vdev_indirect_remap() aware that it might point to another indirect vdev. Oh geez, all right, I only have five minutes left, so we'll see how this goes.
B
So we made this actually work non-recursively, with an explicitly allocated stack of things to do. Okay. In the common case, where it's not a split block, we can just do a child I/O to the one new location and pass the checksum in, and then that child I/O handles the data integrity. So it might be a read from a mirror, just like a normal read from a mirror: it knows the checksum, it can try both sides.
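(Here is a sketch of that non-recursive walk with an explicit stack; helper names are hypothetical, the real code being vdev_indirect_remap() in vdev_indirect.c.)

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct seg {
        int         vdev_id;
        uint64_t    offset;
        uint64_t    size;
        struct seg *next;  /* stack link */
    } seg_t;

    extern bool vdev_is_indirect(int vdev_id);
    /* Split the segment via the vdev's mapping table and push one
     * seg_t per resulting piece; returns the new stack head. */
    extern seg_t *mapping_split_push(seg_t *stack, seg_t *seg);
    extern void issue_child_io(seg_t *seg);

    void
    indirect_remap(seg_t *first)
    {
        first->next = NULL;
        seg_t *stack = first;

        while (stack != NULL) {
            seg_t *seg = stack;
            stack = seg->next;

            if (vdev_is_indirect(seg->vdev_id)) {
                /* May push several pieces (split blocks), each of
                 * which may itself be indirect again; no recursion,
                 * just more stack entries. */
                stack = mapping_split_push(stack, seg);
            } else {
                /* Concrete vdev: actually read here. */
                issue_child_io(seg);
            }
        }
    }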
B
But what if we're reading a split block? Then we don't, because the issue is that we don't have sub-block checksums. We have the checksum of the whole block, but half of it is here and half of it is there. When I issue the read for this half, I don't know what the checksum of that half is, so I can't pass it down to the child I/O. So instead we need to handle the data integrity at the indirect vdev level, and this is kind of similar to what happens with RAID-Z.
B
In that case, we issue a child I/O for each segment of the split, targeting the top-level vdev that it points to. So if that's a mirror, it's going to read from some random healthy leaf, and then, when we get them all, we stitch it back together and we verify the checksum.
B
The checksum is good, and we say: great, now we have your data; we know we have your data. But what if you're reading a split block from the indirect vdev, and you have mirrors, and you have silent damage? In this example I said, okay, we read from a healthy leaf, we get the correct data, and the checksum is correct. But what if the checksum is not correct?
B
Again, we still want to be able to handle the cases where there's silent damage and the disk says, here's the data, and it's not actually the data. In that case, we don't know which part of it is damaged. In this case I'm just showing a two-way split, but there could be more splits: I could have five different parts of the split in five different locations, and I don't know which of those five actually has the damage.
B
So let's say first I try both left children, and the actual damage, which I happen to magically know, is on both of the left children, which I've marked here with those red do-not-enter symbols. We check the checksum, and nope, that's not the right data. So then I try: okay, what if the first part comes from the left and, for the second part, I try the right side of the mirror, the light purple there? Well, that's still not the right data. Okay, well then, I'm kind of doing binary counting here, right? I'm adding one, so I flip the next bit: now I'm doing the right part of the first mirror and the left part of the second mirror. Still not the right data. Geez, where is that data? So then I try the last combination.
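(That counting is exactly an odometer over the children of each split segment. A minimal sketch, with hypothetical assemble/checksum helpers:)

    #include <stdbool.h>
    #include <stddef.h>

    extern size_t nsplits;                  /* number of split segments */
    extern int nchildren(size_t split);     /* mirror children per segment */
    extern void assemble(const int *choice);/* build candidate from copies */
    extern bool checksum_ok(void);

    bool
    reconstruct(void)
    {
        int choice[16] = { 0 };  /* child used per segment; assumes at
                                    most 16 segments for brevity */
        for (;;) {
            assemble(choice);
            if (checksum_ok())
                return (true);   /* found the right combination */

            /* "Add one": binary counting across the segments. */
            size_t i = 0;
            while (i < nsplits && ++choice[i] == nchildren(i)) {
                choice[i] = 0;
                i++;             /* carry into the next segment */
            }
            if (i == nsplits)
                return (false);  /* all combinations exhausted */
        }
    }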
B
This is also kind of unique, and I think better than what we've done in a bunch of other cases like RAID-Z, because we actually compare the blocks. You might have a three-way mirror, for example, and we want to figure out which copy is right and which is wrong, and we might not have gone through every possible combination, but we can check and see: okay, here's the right data.
B
I know this is the right data, and I have every other copy; let me just see which of them differ from the correct data, and if they're different, then I'm going to issue the repair write. In this example I only have four combinations to try, but every additional split is exponentially more.
B
So if we'd split three ways, then instead of four it would be eight combinations, and if you have too many combinations, which rarely if ever happens in practice, but does happen when you're running ztest, then it might take until the heat death of the universe to try them all, because two to the hundredth is a big number: counting from one up to two to the hundredth takes forever, and then doing SHA-256 that many times is also not practical.
B
I guess that's what you get to do if it's your conference. So what if you get a write to an indirect vdev? Wait a minute: didn't we start out by saying there are no writes? We made sure that the ZIL wasn't going to write to it and whatnot. But it turns out you can get a self-heal write. The most common case where you might see this is that we discover there's some bad data via a ditto block.
B
So you have a block pointer, and it has two DVAs; let's say one of them is concrete and one of them is indirect. We read from the indirect one, and we say: I've tried all the combinations, and the data is just not here. But then the layer above us in the zio chain is going to try reading it from the other DVA, and maybe it finds out: all right, this copy is still good.
B
Okay, frees from the indirect vdev: pretty simple, we just free it from the new location. But the interesting thing is that now some parts of the mapping are no longer needed. I freed that, and a free, by definition, means nobody is ever going to read it again, so that part of the mapping is no longer needed; we call that obsolete. So maybe we could reduce the memory used by the mapping table, now that we know that some part of it is no longer relevant.
B
Maybe. All right, I'm going to fast-forward. The caveat to all this is that we implemented all of it before we did the large-mappings work, and when you have the large mappings, this managing of obsolescence to reduce the size of the mapping is much, much less necessary, because the mapping is much, much smaller to begin with, and because each mapping entry tends to cover many, many blocks: you have to wait until a lot of things get freed before an entry becomes obsolete.
B
Because I'm running out of time, I'm only going to show you the cool pictures that I did, because I spent a lot of time on them. And really, this is not so much to convince you that this is super cool; it's more so that when you're going and looking at this code and wondering what the heck all this is, you can understand, at a high level, where it's coming from.
B
Right. So what happens is, when you do a free, we append that space to the obsolete space map, which tells us everything that's been freed. Then, in the background, we do this condense operation that basically takes that obsolete space map and incorporates it into this obsolete count (which is not rendered correctly on the slide, but it sits essentially off to the side of the main mapping structure). Alongside the mapping, we also store this piece of information that tells us how much of each entry is obsolete. If that gets to be the entry's entire size, then when we rewrite the mapping we can omit that entry. And then we can also say: whenever we write an indirect block, and, kind of irrelevant to why I'm actually writing it, there's this other block pointer in it which points to an indirect vdev, maybe I can rewrite that block pointer to point to the new concrete location, and then maybe I could mark that part of the mapping obsolete.
B
But you can't if there are snapshots, because the snapshots might still reference it via the indirect block pointers. So we have this new on-disk structure called the remap deadlist, which keeps track of DVAs that are referenced in a snapshot but have since been remapped. Then, when you delete a snapshot, you can find everything that has been remapped and is no longer part of any snapshot. This algorithm is almost exactly the same as the regular deadlist one.
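(Pulling the obsolete-counting idea together, here is a minimal sketch of the condense step, reusing the mapping_entry_t shape from the earlier sketch: entries whose obsolete count has reached their full size are simply not carried over when the mapping is rewritten.)

    #include <stdint.h>
    #include <stddef.h>

    typedef struct mapping_entry {
        uint64_t me_old_offset;
        uint64_t me_size;
        uint64_t me_new_vdev;
        uint64_t me_new_offset;
    } mapping_entry_t;

    size_t
    condense_mapping(const mapping_entry_t *old, const uint64_t *obsolete,
        size_t n, mapping_entry_t *out)
    {
        size_t kept = 0;

        for (size_t i = 0; i < n; i++) {
            if (obsolete[i] == old[i].me_size)
                continue;          /* fully obsolete: drop it */
            out[kept++] = old[i];  /* still (at least partly) live */
        }
        return (kept);
    }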
B
We've been working on this for many, many years, and a lot of the other developers listed here have helped us with it. We've been using this in production at Delphix, in our product, since 2015, and it's been upstream in all the repos since early this year. So, future work: two really cool things that we'd like to do. One is being able to queue up multiple devices to be removed; the main point of this would be to mark them as ineligible for allocations and to do the space checking up front.
B
The big win there is that, when I'm removing a device, rather than moving that space to all of the remaining devices, I move it to the ones which will still be remaining after I've completed all the removals that I want to do right now. I have a prototype, but it needs some work. The other really cool thing that we would like to do is to be able to remove a RAID-Z group.
B
Well, you can ask one question and then we can talk later. No, the indirect vdev remains there forever. I mean, in theory we could say, once that mapping table gets to zero size we could remove it, but there's no cost to it. So, whatever; that's what you used your one question on. All right.
B
Yes, so the question was about spanning those free segments and the impact on SSDs. When we're doing the copy, it's good for SSDs, because we're doing big chunks. But then, when we do those frees, it's going to do the trims, if frees do trims: if you're running on a platform or a config where frees do trims, it will trim them, because it's just a normal free.
B
Well, that reminds me that there's a trade-off to be made here. Before, when we were not spanning the free chunks, it meant that if you have a fragmented device and you remove it, then all the free space gets compacted; everything gets compacted, and your fragmentation goes away. Versus, with spanning the free chunks, it basically preserves your fragmentation, which is not great, but the memory and performance benefits are really, really huge. So, if you need to, you can change that tunable.
B
Yeah, oh, I see. So the question was basically: what if I just have a regular, unsplit block, just a regular block; could I do this combination thing? You totally could, if, basically, what you're trying to protect against is: I have a mirror, I have a block, it's just a normal mirror, a normal block, the same thing on both sides, and the failure mode that I'm concerned with is that the beginning of the block got messed up on this side and the end of the block got messed up on that side. Then I'd want to be able to stitch it all back together on a sector-by-sector or byte-by-byte basis. Yeah, I mean, you could totally do that.
B
Yep, yeah. So the question was: what if my pool is not homogeneous and I have some three-way mirrors and some two-way mirrors? In that case, we basically do the best that we can in that scenario. If you're removing a three-way and you only have two-ways, we're just going to read from two randomly selected children and then write those over here. And if you're removing a two-way and I have a three-way, well, one of those children is going to get the same thing copied twice, yeah. But it does handle that, and it does about the best it can, given the situation. Great question. All right, I'm already way, way over, so thank you all for indulging me. This is my first full-length OpenZFS presentation in six years: I've been organizing this conference for six years, and this is my first full-length technical presentation. So thank you.