From YouTube: Fast Clone Deletion by Sara Hartse
Description
From the 2017 OpenZFS Developer Summit:
http://www.open-zfs.org/wiki/OpenZFS_Developer_Summit_2017
So, yeah, I'm Sara, I'm working at Delphix, and I'm going to talk about a ZFS performance optimization with regard to how we delete clones. Through my talk I'm going to cover how we delete clones now and why it can be really, really slow; the fast solution algorithm; the changes we've made to that algorithm to basically make it more scalable within the ZFS ecosystem; and then I'm going to talk about the performance gains and basically why this is an awesome, awesome thing.
So: clone deletion. As you know, clones are mutable copies of existing datasets. You take a snapshot, which is a frozen version of the dataset, and then you can make a clone, which means you can change things based on that snapshot. Copy-on-write means that creating a clone is as simple as pointing to the root of a given snapshot, so creating a clone is super fast and super inexpensive, and the clone then diverges from the snapshot throughout the course of its lifetime.
So here's an example: the green are the blocks that pertain to the clone, and the blue is the underlying dataset. In some cases this isn't too bad. One example is a contiguous write: down at the bottom are the blocks written by the user, and these are all contiguous. They take up one contiguous chunk of memory, so not that many indirect blocks have to be modified and traversed to find the things you want to delete.
On the other hand, there's no rule that your writes have to be contiguous, and if you have sparse writes all over the place, then you end up basically having to touch every part of the tree when you want to go and find what you have to delete. This can be really expensive, because you have to read in your indirect blocks and ask: okay, where are the other blocks, when were they modified, should I actually care about them?
So the takeaway from this is that, in the worst case, deleting a clone is proportional to the size of the original underlying dataset. It doesn't matter that you only made four tiny changes to the clone: if you made them in scattered enough places, your deletion cost can be proportional to whatever that clone was based on, and that can get really costly. This isn't a super dramatic example, but you can see these are all underlying datasets of varying sizes, while the amount of data written to the clone is fixed.
It's 500 MB, but you can see that 500 MB of writes leads to very different deletion times depending on how big the dataset is. The red bars are for random writes and the green are for contiguous writes, so you can see that random is worse than contiguous, although contiguous can also get pretty bad.
We've seen users run into this situation where you may have terabytes of data: you make a clone, you make a few small changes, you go and delete it, and it takes you 45 minutes. So it's a problem. All right: our algorithm.
The basic premise is to change this underlying issue that deleting a clone means caring about what the original dataset looks like and how big it is. What we want to do is keep track of the clone-specific writes and deletes as they occur, so that when it comes time to delete them, we only have to care about the actual changes that happened on the clone. The idea is to store them in a livelist: to delete a clone, we just have to process each element in this livelist, and the work is proportional to the number of writes to the clone, not to the size of the dataset it was based on. So here's some more detail on this livelist algorithm.
We need this data structure, and we are going to enqueue the block pointers that were both allocated and freed on the clone as the writes occur. We need both the ALLOCs and the FREEs because, if we only kept track of the blocks that were allocated over the course of the clone's lifetime, some of those blocks might have since been freed naturally by stuff that happened on the clone, and we don't want to double-free them in the end.
So when we want to delete, we need to determine which are the not-yet-freed blocks, the ones that are circled in red. The way we do this is: we step backwards through the livelist, insert FREEs into an AVL tree, check the ALLOCs for membership in the AVL tree, and based on that we know whether we have to delete them.
So, a little more of a step-through of how exactly that works. We start at the end of the livelist. We have this handy invariant that all frees of a block will happen after it's allocated; that's why we can go backwards. We see a FREE, and we keep track of it in our AVL tree. We see an ALLOC and we check if it's been freed; it hasn't, so we keep track of it as something we actually do need to free. We see another ALLOC: oh, it's already been freed, so we ignore them both. We store our FREEs and then check the ALLOCs for membership: this ALLOC has already been freed, so we ignore it; this ALLOC has not, so we free it; and this ALLOC has, so we ignore it. That's basically the core of the correctness of this algorithm.
We have this data structure, we process it, and we know what we have to delete; significantly, this is proportional to the things that have happened to the clone, not to its dataset.
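To make that concrete, here's a minimal sketch of the resolution step in Python. This is not the actual ZFS C implementation (which keys an AVL tree on the block pointer); a set stands in for the tree, and the entry format is made up for illustration.

```python
ALLOC, FREE = "alloc", "free"

def blocks_to_free(entries):
    """Walk the livelist backwards; return the ALLOCs with no matching FREE.

    entries is a list of (op, blkptr) pairs in write order, so every FREE
    of a block appears after its ALLOC -- the invariant that makes the
    backwards walk correct.
    """
    freed = set()    # FREEs seen so far (stands in for the AVL tree)
    to_free = []     # live ALLOCs: what we must free when deleting the clone
    for op, blkptr in reversed(entries):
        if op == FREE:
            freed.add(blkptr)        # remember the free
        elif blkptr in freed:
            freed.remove(blkptr)     # the ALLOC/FREE pair cancels out
        else:
            to_free.append(blkptr)   # never freed: delete it at destroy time
    return to_free

# Block 2 was allocated and later freed by the clone itself, so only
# blocks 1 and 3 must be freed when the clone is destroyed.
assert blocks_to_free([(ALLOC, 1), (ALLOC, 2), (FREE, 2), (ALLOC, 3)]) == [3, 1]
```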
So where are we now? This algorithm was implemented this summer by our Delphix intern, Serrano Gopal, and she saw really good performance improvements, and signs are looking good. There are still some things that we want to tweak in this algorithm to make it more scalable. One, as you might have noticed, is that the livelist can grow arbitrarily large.
This is also something that's tricky to incrementally destroy; I'll talk about that a little bit later. But the pattern of incrementally destroying something, and not doing all of your work in a single transaction group, is an important pattern within ZFS, to keep things flowing and to not spend so much time deleting that we prevent users from actually doing stuff.
So the next step, to try and address these issues, is to break the livelist into smaller, fixed-size sublists. The idea is that if we can make the guarantee that each sublist contains both the ALLOCs and the FREEs for a given block pointer, then we can process these sublists independently. In the first sublist, if we want to delete this, we don't have to care about anything that happens in the other sublists, because we only want to load one into memory at a time, instead of loading an arbitrarily large amount of stuff into memory.
So one question is: how do we guarantee that matching blocks are grouped together? The solution is to group them based on birth time. Basically, this sublist will be all the blocks that were born in this transaction group, and then the next one can be the blocks that were born between these transaction groups. One of the key ideas is that we create these extra sublists dynamically: as we run out of space in one, we create another, and that way we don't have to pre-allocate some number of initial sublists; see the sketch below.
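As a sketch of how that bucketing might look (illustrative Python, not the real on-disk structure; the names and the txg-boundary handling are my assumptions):

```python
class Sublist:
    def __init__(self, min_birth_txg):
        self.min_birth_txg = min_birth_txg  # covers blocks born >= this txg
        self.entries = []

class Livelist:
    MAX_ENTRIES = 500_000  # roughly the "1% of RAM" tunable from the talk

    def __init__(self, clone_created_txg):
        # The livelist only tracks blocks born after the clone was created.
        # Start with one sublist; more are created dynamically as needed.
        self.sublists = [Sublist(clone_created_txg)]

    def enqueue(self, op, blkptr, birth_txg):
        # Route by birth txg: a FREE carries the birth txg of the block it
        # frees, so it always lands in the same sublist as its ALLOC.
        target = next(sl for sl in reversed(self.sublists)
                      if birth_txg >= sl.min_birth_txg)
        target.entries.append((op, blkptr))
        # Grow on demand: when the newest sublist fills up, open another
        # for blocks born in later txgs -- nothing is pre-allocated.
        newest = self.sublists[-1]
        if target is newest and len(newest.entries) >= self.MAX_ENTRIES:
            self.sublists.append(Sublist(birth_txg + 1))
```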
Another question is: how big should an individual sublist be? Because of the transaction-group grouping, we know that the smallest it can be is the number of writes that happen within a given transaction group, so that's a lower bound, and obviously not a fixed one; the upper bound is the real question here.
This is still something that hasn't been fully decided on; it's one of the main tunables for this performance feature. But basically, what we're asking right now is: what's a smallish fraction of RAM to use? We're doing about 1% of RAM right now, and a sublist can have about half a million entries. So we've solved that problem of reading arbitrarily large amounts of memory in when we delete.
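(As a back-of-the-envelope check, and my arithmetic rather than the talk's: an on-disk ZFS block pointer is 128 bytes, so half a million entries works out to roughly 500,000 × 128 B ≈ 64 MB per loaded sublist, which is about 1% of RAM on a machine with ~6.4 GB of memory.)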
The other nice thing is that this presents a handy interface for incrementally destroying something: if we can process these sublists independently, then we can destroy them in different transaction groups.
The other handy thing to consider is the concept of asynchronous destroy. Mark Maybee mentioned this in his talk, but the idea of doing a lot of your destroy work in the background is pretty appealing. So what we wanted to figure out is: what part of this delete algorithm actually needs to be done in syncing context, and which part can be done by a worker thread in the background?
So when we call zfs destroy on a clone, it will basically enqueue the clone's ID as something to be deleted, and then a worker thread will do the work of loading in the livelist and running the algorithm with the AVL tree to figure out what we actually need to free; then it can call a sync task which will actually do those frees. This has the benefit of pushing the loading work to the background, making sure that we're only doing critical work in syncing context, and it lets us do it incrementally as well.
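Here's a rough sketch of that open-context/syncing-context split, reusing blocks_to_free from the earlier sketch; the queue, the worker loop, and both callbacks are hypothetical stand-ins for the real ZFS machinery.

```python
import queue

clones_to_delete = queue.Queue()  # clone IDs waiting to be destroyed

def zfs_destroy_clone(clone_id):
    # Open context, cheap: just record that this clone should be deleted.
    clones_to_delete.put(clone_id)

def delete_worker(load_livelist, dispatch_sync_task):
    # Background worker: all the expensive reading and resolving happens
    # here, outside of syncing context.
    while True:
        clone_id = clones_to_delete.get()
        entries = load_livelist(clone_id)   # disk I/O, done in open context
        to_free = blocks_to_free(entries)   # the AVL-tree resolution step
        # Only the actual frees run as a sync task, and they can be split
        # across transaction groups (e.g., one sublist per txg).
        dispatch_sync_task(to_free)
```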
So where are we now? Pros: we've limited how much memory is loaded in at once with the sublists, and we can delete things incrementally and more quickly. A con, as you might have noticed, is that the number of sublists is unbounded. We still have this problem of using arbitrarily large amounts of disk space throughout this algorithm: if a clone lives a really long time, we can have a ton of sublists. This gets expensive in two ways.
Every time you write to a clone, or delete a file, or move something around, we're going to have to interact with this livelist, and we don't want to make that interaction so costly that we're actually degrading the performance of writes to the clone. The more sublists we have, the greater the number of I/Os we have to do to propagate these changes. In the original case, back when we had one giant list, you just appended to the end of it, and that was it: there was only one place to put things. But when you have a bunch of sublists, the cost of inserting a bunch of different block pointers is greater, because they're spread over different areas. Cool.
So the next idea is condensing. After a block is freed, the livelist contains irrelevant information. For example, in this one we have block pointer 1: we know that we freed that block, but we're still keeping this information around, we're still keeping this history, even though we're not actually going to do anything with it when we actually delete the clone.
So the idea is that, before we actually delete, we can condense. We basically use the exact same algorithm we use for deletion to determine which block pointers are still alive, which here are 2 and 3, and we just get rid of the ones that are no longer alive. That reduces the amount of disk space we're using: we're no longer storing block pointers that we don't care about. Can we also reduce the number of sublists we're using? Well, since we've compressed these things into smaller sublists, if they're under half of the maximum size of a sublist, then we can merge them together and reduce the overall number of sublists.
So here we've deleted the unnecessary entries and then merged them together into one sublist that stores the information more compactly, and it's information we actually care about when we're deleting.
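A sketch of condensing and merging, again reusing ALLOC and blocks_to_free from the first sketch (the half-the-maximum merge threshold is the one mentioned in the talk; everything else is illustrative):

```python
def condense(entries):
    # Same resolution we use at delete time: every ALLOC/FREE pair is
    # dropped, leaving only the still-live ALLOCs.
    return [(ALLOC, bp) for bp in blocks_to_free(entries)]

def merge_small(sublists, max_entries):
    # The talk's heuristic: if two neighboring condensed sublists are each
    # under half the maximum size, their merge still fits under the cap.
    out = []
    for entries in map(condense, sublists):
        if (out and len(out[-1]) <= max_entries // 2
                and len(entries) <= max_entries // 2):
            out[-1].extend(entries)   # fold into the previous sublist
        else:
            out.append(entries)       # keep as its own sublist
    return out
```

Merging adjacent sublists is safe here because condensed sublists contain only live ALLOCs, so the ALLOC-before-FREE ordering invariant no longer constrains them.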
So, in general, with all of this we've made the work of deleting a clone proportional to the number of writes to that clone, as opposed to the size of the dataset. We've limited the memory that we load at once: we now have tight control over what happens when we delete, so we're not loading tons and tons of memory when we delete something, which was an issue with the original livelist algorithm, and we did that using sublists.
The sublist paradigm also gave us a strategy for incrementally deleting sublists and asynchronously destroying them, and then this condensing and merging strategy gives us the ability to limit the number of sublists by periodically condensing them. I should have mentioned this on an earlier slide, but the condensing strategy raises a lot of questions: in an ideal case we would never store redundant information, but loading everything in to get rid of redundant ALLOCs would be pretty costly to do forever.
So we probably want to have a kind of background worker thread that's periodically condensing when it can. How often should we condense? When should we condense? These are all questions that we don't know the answers to yet, and it will be interesting to figure out how the system acts under different workloads.
So here are some performance numbers. These numbers are from Serrano's internship, her implementation of the livelist, so these are from the vanilla one-big-livelist version. This is the contiguous write case, which is where we saw the least concerning behavior from the original delete strategy. This is basically just a small example, but it turns out we have actually reduced the number of I/Os with this livelist strategy, so deletion is a little bit faster even in this contiguous write case.
As I said, we've reduced the number of I/Os for the deletion case, but we've also added some I/Os that happen while we're writing to the clone. We've kind of spread out some of this work: instead of doing all the processing and all the deleting work when it's time to delete, we're paying it forward by keeping track of what we're going to delete, which is handy.
The improvement in the worst case is really dramatic. You can see in this case we have 500 GB, which is the underlying dataset, with 500 MB of sparse writes written to it, and it took 46 minutes to delete with the old method; with this method it takes 30 seconds. That's about a 100x improvement, which is pretty good. So this, again, is the sparse case, and that's where we see the biggest gains.
like.
It's
also
really
interesting
to
kind
of
look
at
this.
A
A
This
worst
case
scenario
can
show
up
when
people
have
really
big
really
big
data
sets
they're,
creating
clones
they're
deleting
clones
or
trying
to
make
more
clones
and
people
just
like
they
think
they've
deleted
something,
but
they
can't
actually
create
a
new
thing,
because
the
blocks
have
not
been
reclaimed
So speeding this up is really good for high-usage systems with really big datasets. And then we saw the tweaks that were made to make the algorithm more scalable; these are really interesting, because we're balancing things like how much memory we want to sacrifice to this livelist system, and how much work we should spend optimizing it throughout the course of the livelist's, or the clone's, lifespan.
So it's underway, and hopefully it will be coming soon to OpenZFS. I just wanted to say a quick thank you to Serrano, who did a lot of the original work on this project, and to Matt and Serapheim, who helped me out a lot at Delphix. So: questions?
[Audience question]
Yeah, so the question is: in the case of a clone promote, how do we handle this? Because you're basically changing the roles of these datasets and the relationship between them. Currently we are not supporting this feature for promoted clones.
So the question was: how do we resolve the fact that sublists need to be independent of each other, namely that all the ALLOCs and all the FREEs for a given block pointer should end up in the same sublist? The way we do that is we group the sublists by birth time: a block pointer that was created will have a birth time, and then when it's freed, the FREE will have the same birth time as the created copy, because they're the same block pointer.
[Audience question]
Yeah, so the question was: have we looked at the case where a clone has almost completely diverged from the parent, and whether or not this strategy is still beneficial? We haven't looked at that performance in detail, but we did see that the contiguous write case still gave us performance gains.