From YouTube: Block Cloning by Pawel Jakub Dawidek
From the 2022 OpenZFS Developer Summit: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2022
Slides: https://drive.google.com/file/d/1eyvv_5madwBwlianA-Rb049gCmJ4GtMW/view?usp=sharing
I hope you guys can hear me. Okay, my name is Pawel and I want to talk about block cloning for ZFS. It's a project I've been working on for quite a while. It's not a huge project, but basically I work on it whenever I can. I think Matt knows that, so he invites me to talk about it, because he knows this will motivate me to work on it a bit more and push it forward.
Most of the stuff I will be talking about is actually written up in the PR for this feature, but nobody reads text these days, so I will record a TikTok video later. Okay, so the talk is mostly about the design, but I will say a few words first.
What are we actually doing here? Block cloning is like on-demand deduplication. Deduplication in ZFS is automatic; here I have to specifically say that I want to clone, for example, a specific file. The idea is that we don't want to copy any data; we just want to reference the same data blocks from two different files.
I'm connecting this to the copy_file_range system call. If you are not familiar with it: normally, if you copy a file, cp will read the data from one file into userland and then send it back to the kernel with the write system call. copy_file_range instead tells the kernel: here you have two file descriptors, an offset into the source file, an offset into the destination file, and a length; just do the copy in the kernel.
So don't bother sending the data to userland and back. It has some other nice properties too. Let's say you have an NFS mount: by using copy_file_range you can tell the NFS server to do a server-side copy, so you are not sending any data over the wire; you just tell the NFS server to copy the data at the server. I want to reuse this system call for this purpose.
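The semantics can be sketched in a few lines of Python. This is a toy illustration, not ZFS code: `os.copy_file_range` is Python's wrapper for the syscall and may be unavailable or unsupported on a given platform, hence the userland fallback that mirrors what plain cp does.

```python
import os

def clone_or_copy(src_fd, dst_fd, src_off, dst_off, length):
    """Copy `length` bytes between two file descriptors, preferring an
    in-kernel copy (no round trip through userland) when available."""
    if hasattr(os, "copy_file_range"):
        try:
            copied = 0
            while copied < length:
                n = os.copy_file_range(src_fd, dst_fd, length - copied,
                                       src_off + copied, dst_off + copied)
                if n == 0:          # nothing more to copy (e.g. past EOF)
                    break
                copied += n
            return copied
        except OSError:
            pass  # syscall unsupported here; fall back below
    # Fallback: read into userland, write back to the kernel.
    data = os.pread(src_fd, length, src_off)
    return os.pwrite(dst_fd, data, dst_off)
```

On a file system that supports cloning, the kernel is free to satisfy the in-kernel path by referencing blocks instead of copying them, which is exactly the hook being described here.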
You will say that you want to copy the data, but actually you will be cloning the data between two files. It's also possible to do this for zvols, but that is not implemented. Okay, so we cannot modify the block pointer, which means we cannot keep any reference counter in the block pointer or anything like that. We need an additional table.
You will notice that this is pretty similar to deduplication, but it's different; there are quite a few differences, actually. We keep this additional table with reference counts per top-level vdev, so each top-level vdev has its own block reference table (BRT), and I will talk about that a bit more later. When we clone, we don't read any data.
We just need the block pointers, and when we write we also don't write any data, so it's faster than even copying the blocks. We save space, and we also save time when you move files between datasets, because that is supported too.
When you move files between datasets, you don't really grow the table, because we just need to bump the refcounts for a short while, write the indirect blocks in the destination dataset, and then free the source ones. So you don't really grow the table. The feature is also always on, which has some consequences, but there is no additional cost when it's unused.
The BRT entry is extremely small. We need three things to tell which block it is: the vdev ID, the offset into this vdev, and the reference counter. It's easy to imagine that you have only a few vdevs but many blocks, so in most of those BRT entries the vdev ID would simply repeat. That's why we decided to keep a table per top-level vdev: we don't have to store the vdev ID.
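As a toy model (my own naming, not the actual on-disk layout), the shape just described, one table per top-level vdev, keyed by offset, storing only the extra references, could look like:

```python
class BlockRefTable:
    """Toy model of the per-vdev Block Reference Table: one dict per
    top-level vdev, keyed by offset, holding the *extra* reference
    count. Blocks with a single reference have no entry at all."""

    def __init__(self):
        self.vdevs = {}  # vdev_id -> {offset: extra_refcount}

    def clone(self, vdev_id, offset):
        table = self.vdevs.setdefault(vdev_id, {})
        table[offset] = table.get(offset, 0) + 1

    def free(self, vdev_id, offset):
        """Drop one reference; return True when the caller should really
        free the block on disk (no cloned references remain)."""
        table = self.vdevs.get(vdev_id, {})
        if offset not in table:
            return True          # never cloned: free as usual
        table[offset] -= 1
        if table[offset] == 0:
            del table[offset]    # last extra reference is gone
        return False             # block is still referenced elsewhere
```

Note how the entry disappears as soon as the extra-reference count reaches zero, matching the "no entries with a single reference" property discussed below.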
As I mentioned, there is no cost on write. With deduplication, we have to have the data: we are actually writing the data, ZFS calculates the checksum, and only then decides that it doesn't need to write the data to disk. You have to have the data so you can calculate the checksum. With block cloning, you need no data.
You just provide the source and destination, and it works with any checksum algorithm. With deduplication you have to use a cryptographically strong checksum; block cloning works even with checksumming disabled.
With deduplication the key is the hash, so, especially with a cryptographically strong hash, blocks that reside close to each other on disk can land in totally different parts of the dedup table. That's why it's really hard to cache the dedup table: the entries are just scattered across the entire table. With the BRT the key is the offset, so blocks that are close to each other will be close in the table.
There are no entries with a single reference. If you have a data block and you clone it, only then do we create an entry in the BRT, because there is an additional reference to the block; but once you free one of those copies, the entry is removed from the table. With deduplication you have to store all the blocks in the table, so on each write, on each block creation, an entry has to be created in the dedup table.
That's the huge problem with deduplication: the table grows even if you have a lot of blocks with just one reference. With the BRT there are no such entries in the table. And as I mentioned, it's on demand, so you specifically clone a file, a block, or a given range of bytes, and it's always available. Use cases: there are a few. Of course the big one is space savings when cloning files.
Like deduplication, you can have many thousands of references to a single block, and of course the savings will be huge. Another one is recovering files from snapshots. Let's say you took a snapshot, you removed a file by accident, and you want to recover it. Currently you have to copy the file from the snapshot into your dataset, so you actually duplicate the storage you need for this file. With block cloning,
you could easily just clone the file from the snapshot into your dataset, so you don't pay this extra cost of additional storage. Cloning is also extremely fast, because we don't read any data and we don't write any data; we just need the block pointers. So it's super fast, and it can also be used for moving files between datasets. And I was thinking about this when listening to the AWS talk:
it's really interesting to imagine that you have an NFS mount on ZFS, let's say across the ocean, and you need to copy a huge file. NFS will use copy_file_range, so you don't really transfer the data over the wire, and on the server side it will use block cloning to actually copy the file. So even though the server is somewhere else and the link is very slow,
copying extremely large files will be just super quick. That would be cool. Okay, so cloning is divided into two parts. When we do the clone, we call zfs_clone_range(). It reads the level-zero block pointers from the source file and remembers which blocks need a reference bump on a pending list.
Once we get to syncing context, we apply all the changes we made, so we increase the reference counts for some of the blocks and so on, and in syncing context we just sync everything to disk.
So there are two parts to consider when we free a block. If the block was cloned in the same transaction group and we are freeing it, we just remove it from the pending list and we are done. If the block was cloned and it's already on disk, so we have a BRT entry, then we have to decrement the BRT entry. There is an additional stage in the ZIO pipeline for this, which comes just before the dedup-free stage.
Okay, but we don't want to pay this cost on every free. We don't want to go to the disk, read the table, and try to find out whether we have an entry for this specific block, so we had to optimize this specific case.
What I do is divide each vdev into regions, let's say one gigabyte in size, and in this structure I keep track of how many additional references I have in the whole region. This is extremely small; we can hold it in memory. For one terabyte of storage with one-gigabyte regions,
you only need eight kilobytes of memory. Matt actually suggested that we could use much smaller regions, like one megabyte, and require eight megabytes of memory per terabyte of storage. This structure can tell us, when we are doing a free, through a function called brt_maybe_exists(), that for sure there is no BRT entry for this region at all,
meaning there is not even one block in this region that was cloned; or it will tell us that there might be a block in this region, and only then do we go to the disk and try to find the entry corresponding to this block. If it's fine-grained enough, I think that would be a huge optimization: we would almost never go to the disk at all to find out
whether the block was cloned or not. We will be able to just free as usual, and it will be cheap; there will be no additional work to free a block. This is what we want, because we definitely don't want to slow down freeing blocks for somebody who doesn't use the BRT, or who uses it for only a very small amount of data. And this array of reference counters is stored on disk as well.
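A sketch of the in-memory structure just described (sizes and names are mine; `maybe_exists` only mimics the no/maybe answer the talk attributes to brt_maybe_exists()):

```python
REGION_SIZE = 1 << 30  # one-gigabyte regions, as in the talk

class RegionRefcounts:
    """Toy model of the per-vdev array of per-region clone counts used
    to answer 'might this block have a BRT entry?' without disk I/O."""

    def __init__(self, vdev_size):
        nregions = (vdev_size + REGION_SIZE - 1) // REGION_SIZE
        self.counts = [0] * nregions

    def on_clone(self, offset):
        self.counts[offset // REGION_SIZE] += 1

    def on_entry_removed(self, offset):
        self.counts[offset // REGION_SIZE] -= 1

    def maybe_exists(self, offset):
        """False: definitely no BRT entry in this region, so a free
        needs no table lookup. True: the on-disk table must be read."""
        return self.counts[offset // REGION_SIZE] > 0
```

One terabyte of storage with one-gigabyte regions gives 1024 counters; at 8 bytes each, that is the eight kilobytes of memory mentioned above.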
So we keep track of this all the time. There is also an additional bitmap, because with large regions, like one gigabyte, the whole array takes eight kilobytes, so we can just sync the whole array every time. But say we wanted to switch to one-megabyte regions: then we would have eight megabytes, and we don't want to flush eight megabytes every time
somebody clones a block. So I keep an additional bitmap which tells us which part of this array is actually dirty and should be flushed to disk. I even have a picture; maybe it will be easier to understand. At the top we have the BRT table: we have one block at offset one gigabyte with five references, then another block with three references, and another one with seven references, but these are all within one one-gigabyte region.
In the middle we have this array of region refcounts (I didn't come up with any clever name for that, sorry), so we have the sum of all the refcounts in each region. As you can see, in regions 0 and 2 there are no cloned blocks, so if we are freeing a block within those regions, we can immediately tell there is no need to go to the disk and look for entries. But for regions 1 and 3,
if we are freeing some block there, we have to check whether the block is in the table or not. The dirty bitmap is only stored in memory, and it tells us which parts of this array are dirty, so we would write to disk only a small part of the region refcount array. Currently this is not yet implemented; there is some code, but with one-gigabyte regions it's probably not yet required. It definitely can be done, though.
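The dirty-bitmap idea, which as noted is not yet implemented, can be sketched like this (the chunk size and the 8-byte counter size are assumptions for illustration):

```python
CHUNK = 4096  # bytes of the on-disk refcount array covered by one dirty bit

def mark_dirty(dirty_bits, entry_index, entry_size=8):
    """A region refcount changed: mark the chunk of the on-disk array
    that holds this 8-byte counter as needing a write."""
    dirty_bits[(entry_index * entry_size) // CHUNK] = True

def dirty_chunks(dirty_bits):
    """Yield indices of array chunks that must be flushed to disk."""
    for i, bit in enumerate(dirty_bits):
        if bit:
            yield i
```

With an eight-megabyte array split into 4 KiB chunks, a single clone dirties one chunk, so a sync flushes 4 KiB instead of the whole eight megabytes.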
We also have to consider how block cloning interacts with deduplication. You can easily imagine a block that is in the dedup table, stored multiple times, so we have a few references in the table, and the block was also cloned using copy_file_range, so the block is in both tables. When we free the block, we have to choose in which of the tables we decrease the refcount.
We could do this in the dedup table, and we could also do this in the BRT. Alan came up with the idea that if somebody is already using the dedup table, we can just use the dedup table and not use the BRT in this case. So if somebody clones a block which already has the dedup flag set, we just increase the refcount in the dedup table, and there will be no additional entry in the BRT.
That way we have no such problem at all; we never have to decide in which table to decrement the counter. Okay, so crossing the dataset boundary is a bit challenging. It's perfectly doable, but
each dataset has its own separate intent log, and we have to clone blocks that are actually part of a file from a different dataset. So when we replay the ZIL, the blocks may already be freed.
Yesterday Matt actually mentioned an idea: we could make use of ZIL claiming to bump the references for the blocks before we actually replay the ZIL, because the ZIL is replayed when we mount the file system. We can import the pool without mounting all the file systems, then mount some of them, remove a file that had blocks we wanted to clone, and only then mount the file system that we cloned into; at that point those blocks are already freed.
In our case that wasn't possible, because I don't just want to reference some object from another file system; what I do is copy the block pointers into the log record. Maybe not something I'm super happy about, but I don't see any other choice for now.
And as I mentioned, there is a problem when the blocks reside on different file systems: the blocks might not be valid anymore. I mention some solutions here; for example, we could simply not use the ZIL at all in those cases, but maybe with the ZIL claim idea we could have the ZIL supported for this kind of use case as well.
There are three new pool properties, so we can see how the BRT is being used. The first value, brt used, is how much data was actually cloned; the second is how much space would be used without block cloning; and there is the brt ratio, which you can basically calculate by dividing those two values.
There are some special cases, because initially I thought this was very similar to deduplication, but actually it's not: not having the data changes a lot. The first one: the block we want to clone might have been created in the same transaction group.
Somebody writes the data and wants to immediately clone it, so the block pointer is not yet allocated and we cannot really clone it yet. In this case I simply wait for the transaction group to sync, and then we can continue and clone the block. Another one is pretty similar: the block might have been modified in the same transaction group in which we are cloning.
A block can also be cloned multiple times during one transaction group, so the pending list I was talking about is actually a tree, so we can quickly look up the entry and just bump the counter. With the pending list we can tell that, okay, this block was actually cloned five times in this transaction group, not only once.
Another case we have to consider is cloning a block and freeing that block in the same transaction group. This has to be handled as well.
And of course we have to make sure that we properly handle holes in the file, and also BPs with embedded data. There is another interesting case, which might even be useful for handling temporary files: I create a file, I delete the file, but I keep the file descriptor open and can just use it as a temporary file. If I crash, the data will be freed, and there is no file on the disk. I'm not sure if this will be useful, but it also needed some special handling. And some random notes.
Vdev growing is supported; shrinking is not. If the vdev grows, we automatically extend, not the table actually, but the region refcounts array. But we cannot shrink; shrinking is not supported.
If the BRT is not used or no longer used, for example when we free the last reference in the BRT, all the structures and all the objects in the MOS will be freed, so the feature will change from active to enabled in the zpool properties. Offsets and lengths in copy_file_range have to be recordsize-aligned, so for now I'm just returning an error
if there is no alignment, so the operating system can still do the regular copy. This could be fixed by copying the data which is misaligned and using cloning for everything which is recordsize-aligned, but for now I'm just returning an error, and there will be a regular copy in this case. Most of the time you want to clone the entire file, so it's not a huge deal, I guess.
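The alignment rule amounts to a check like the following sketch (the recordsize value in the test is illustrative; as noted above, the current code simply returns an error when this fails, and the OS falls back to a regular copy):

```python
def can_clone(src_off, dst_off, length, recordsize):
    """Offsets and length must all be recordsize-aligned for the clone
    path; otherwise the caller falls back to a regular copy."""
    return (src_off % recordsize == 0 and
            dst_off % recordsize == 0 and
            length % recordsize == 0)
```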
Also, copy_file_range on FreeBSD operates only within a single mount point; there is a check in VFS for that, but this can be easily fixed, and I actually have a patch for it. On Linux there was a recent change that allows copy_file_range to work even across different mount points, as long as it's the same file system type. So if it is the same file system type, let's say we are copying between ZFS and ZFS, then the ZFS method will be called and we will try to clone.
Of course it might still be different pools, but if it's the same pool and all the conditions are met, the datasets are not encrypted, etc., we can do the cloning. And a big one, which people cannot actually give up on: when we send a snapshot, when we send the data using zfs send, we lose all the savings. On the receiving end we are not able to rebuild the BRT.
We just send the data, and I have no solution for that. The only one that comes to my mind is to turn this into dedup: tell the receiver that, okay, those blocks were cloned on the source, so maybe add those blocks to the dedup table, because this data seems to be deduplicable. But that's the only idea I have, so that's unfortunate; this is how it works.
Some plumbing is required. Different record sizes: this one was easy; I actually did that yesterday. I just return an error if the record sizes don't match. Handling unaligned requests: again, for now I'm just returning an error, but we could partially do the copy and do the cloning for the majority of the request. And the ZIL, when we are cloning between datasets, is still unresolved.
I'm not sure if I will be able to repeat all of that, but the idea is to try to transfer the information that there are multiple references to a block within one zfs send stream, correct? Yes, but this will only work within a single zfs send stream, right? If we have duplicated data in one stream, then we could deduplicate it using the BRT.
Yes, we would have to bring back the deduplication option for zfs send for that. That would be possible. But again, this is not a full solution, right? It will only work within a single stream. If you do another zfs send, then you won't be able to figure out that you already have those blocks, actually the same data, right? Yeah.
So this is again only a partial solution to the problem; we could only use the BRT within a single zfs send stream. Of course it would be great if we could figure out how to rebuild the BRT somehow on the receiving side, but I don't know.
So the question is why the record sizes have to match. When ZFS is building a file, the block size grows until it reaches the recordsize, and then it just adds another recordsize block; only the last block can be smaller. So a block in the middle cannot have a smaller size.
All the blocks up to the very last one have to have the maximum size, the recordsize of the dataset, or at least the same block size. We cannot just punch a block with a different block size into the middle of a file; it simply wouldn't work. How does block cloning work with device removal? I think there is actually no special handling needed, because all the block pointers will simply keep working.
If, let's say, I free a block, I just use the old block pointer; if I clone the block, I can still use the old block pointer. So there is no interaction between block cloning and device removal. Actually, I think I tested that at some point, and there is nothing special we need to do there.
Okay, so the question is whether I considered a different data structure, like Bloom filters, for determining whether I need to consult the BRT when I free a block. Bloom filters are better for something like deduplication, I guess, but for this we don't really need anything fancy, because the structure is extremely small and it can already give us a very precise answer.
Why can't this be used for dedup? Because for dedup you have all those random keys; you would quickly fill the entire table, so every region would have some reference count, because it's just so random. So it won't be possible to make it work efficiently for dedup; for dedup you would need something like Bloom filters. But for this, it's much simpler.
Okay, so how does it work that we only need the offset and refcount in a BRT entry, and we don't need the length? Because the only thing we need during the clone is the block pointer, which already has the length; you can determine the size of the block from it. So we don't really need to store that in the entry. That's also why we have just one offset.
We don't really need the entire block pointer, as in the deduplication case, because a single offset plus the vdev ID is unique for each block pointer; there cannot be any overlap or anything like that.
Matt actually suggested that we could make the BRT entry even smaller. Right now there are two 64-bit values, offset and refcount, and maybe we could squeeze both of them into 64 bits. That would be possible, especially when you have, let's say, a 4K ashift or 4K sector size: then the offset cannot occupy those low bits, and we could use them for the refcount.
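That suggestion can be illustrated with a small sketch. This is purely illustrative: the real entry currently keeps two separate 64-bit values, and an actual implementation would need to handle refcounts that overflow the spare bits.

```python
ASHIFT = 12  # 4K sectors: the low 12 bits of any offset are always zero

def pack(offset, refcount):
    """Squeeze offset and refcount into one 64-bit word by storing the
    refcount in the offset's always-zero low (sector-alignment) bits."""
    assert offset % (1 << ASHIFT) == 0   # offset must be sector-aligned
    assert refcount < (1 << ASHIFT)      # refcount must fit in the spare bits
    return offset | refcount

def unpack(word):
    mask = (1 << ASHIFT) - 1
    return word & ~mask, word & mask
```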
Yes, so the question is what happens when we clone blocks which are not stored on disk yet, so we have to wait for the transaction to sync, and what if there is a problem during the sync and we cannot clone the entire range. It will just work as a partial write: if you do the clone and the cloning fails at some point, we just return partial success.
And I decided to wait for the transaction to sync, because the only other solution would be to just return an error, but that would tell the upper layers to just copy the data. I think it's better to wait for the transaction to sync than to do the copy, because somebody wanted to clone within the same transaction group.
But we definitely had this discussion about privileges, like when somebody does zpool sync, whether unprivileged users should be able to do that or not. So maybe that's not a problem, but this gives the user a way to force transactions to sync, just by creating a block and cloning it; this will force the transaction to sync every time. I'm not sure, I'm maybe not convinced yet, whether this is safe and whether we want to do that or not.
Okay, so the question is that we only read level-zero block pointers, and whether it would be possible to also clone higher-level block pointers. Maybe it would be possible, but for now we recreate the entire tree of indirect blocks. I'm not sure if the savings you would get from not doing this, if it's possible, would be worth the complication it brings. I didn't really try to do that.
My idea was to follow the experience from deduplication: with dedup you also only deduplicate data blocks. I wanted to copy the parts of the dedup experience that are good and not copy the parts that are bad, so I just focused on the data blocks.
Quota accounting works exactly the same way as with deduplication: each dataset is accounted for the data, so it will be accounted twice if you clone into different datasets, because we cannot really determine who the owner of the data is. We just have to do that. I'm not sure how deduplication accounts for it within a single dataset.
Does it support the FICLONE and FICLONERANGE ioctls from Linux? I implemented no interface for Linux to use this, but it definitely should be supported. You just have to teach the SPL layer for Linux how to use zfs_clone_range(): there is a zfs_clone_range() function in the operating-system-independent code which does the whole work, and there is a FreeBSD-specific method to call it. It has to be implemented for Linux as well, but I'm pretty sure it will be straightforward.
So the question is whether I did any testing to see what the practical savings look like. I don't really see how you can determine that, because for dedup you could use zdb to determine, for your live data,
how much is deduplicable and how much you could save by deduplication, but this one is on demand. Of course, some tools, for example cp on FreeBSD, already support copy_file_range, so cp will automatically be able to use this, but you still have to use the system call in order to do the clone.
So the only thing you could test is how many duplicates you have in your pool; maybe if I start to use this, I will be able to eliminate those duplicates, probably not all, but some, I guess. The question is about the --reflink option in the cp utility. Yes, on Linux there is an option like that, and it means to use this kind of feature; I'm not sure about the details.
There were some problems with how it works, because there is also --reflink=always, which won't work, interestingly. But on FreeBSD you don't need any options; it will always use copy_file_range, so once this is done, it will always clone, basically.
Okay, so the question is whether I did anything special for scrub. No. In dedup there is code which basically allows not scrubbing the same blocks multiple times. Unfortunately, with this it's not possible to the extent it is with dedup, so you will need to scrub the same data multiple times.
Probably something can be done to remember what we scrubbed, I don't know; we would need to maintain another table in memory during the scrub.
So the question is to try to explain once again why cloning between different record sizes doesn't work. Once you have a file in ZFS, it only uses one block size within the entire file; the only block that can be smaller is the last one. We won't be able to do it because the block size is a property of the znode, of the dnode; it's just one value for the entire file, so we cannot use different block sizes for different regions within the file. We just have to fall back to a full copy in this case.