OpenZFS 2020 OpenZFS Developer Summit, 12 Oct 2020

From YouTube: File Cloning with Block Reference Table by Pawel Dawidek

Description

From the 2020 OpenZFS Developer Summit
slides: https://drive.google.com/file/d/1csE8OuPotfhaFi9KvTGKMGy86KxrBu2W/view?usp=sharing
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2020

A

Block reference table: so what's the general idea? The general idea behind this feature is the ability to reference one data block from two separate files. A lot of people, when they hear that ZFS is copy-on-write, wonder why this feature is not already implemented. It must be so easy.

A

You can think of this feature as, let's say, file cloning. You have a file, you want to clone the file, and both copies of the file now have this copy-on-write property. So if you modify one file, it doesn't modify the other file. It's not a hard link, where you have two separate file names but it is basically the same file, the same data.

A

With a hard link, you modify one file and the other one is also updated, because it is exactly the same file; it's just referenced by two different file names.

A

So this is different: two separate files. They have their own properties, their own permissions, ownership, etc., but they can share data blocks.
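
The difference between a hard link and a clone can be sketched with a toy model. This is only an illustration: `ToyFile`, `hard_link`, `clone` and `write_block` are hypothetical names, and the "blocks" are plain Python objects, not ZFS block pointers.

```python
# Toy model only: "blocks" are list slots, not real ZFS block pointers.
class ToyFile:
    def __init__(self, blocks, owner):
        self.blocks = blocks      # list of references to immutable data blocks
        self.owner = owner        # per-file metadata


def hard_link(f):
    # A hard link is a second name for the very same file object.
    return f


def clone(f, owner):
    # A clone is a new file with its own metadata; only the block
    # references are copied, so the data itself is shared.
    return ToyFile(list(f.blocks), owner)


def write_block(f, index, data):
    # Copy-on-write: the file now points at a new block; the old block
    # is left untouched for anyone else who still references it.
    f.blocks[index] = data


original = ToyFile([b"aaaa", b"bbbb"], owner="alice")
link = hard_link(original)
copy = clone(original, owner="bob")

write_block(original, 0, b"zzzz")
# The hard link sees the change, the clone keeps the old block,
# and the untouched second block is still shared.
```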

A

It turns out that Linux already has this feature with Btrfs. There is a special ioctl, and there is an option for cp called reflink that you can use to clone a file. And there is a dedicated system call on macOS, clonefile, which allows you to do exactly the same thing.
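
As a rough sketch of what the Linux ioctl looks like from user space: the example below issues the `FICLONE` ioctl and falls back to a regular copy when the filesystem or platform does not support it. The helper name `clone_or_copy` is made up for this example, and the request number is hard-coded from `<linux/fs.h>` as an assumption. It is Unix-only and says nothing about how the ZFS prototype in the talk exposes cloning.

```python
import fcntl
import shutil

# FICLONE from <linux/fs.h>: _IOW(0x94, 9, int). Hard-coded here as an
# assumption so the sketch does not depend on the Python version.
FICLONE = 0x40049409


def clone_or_copy(src_path, dst_path):
    """Try a reflink clone; fall back to a regular copy.

    Returns "clone" or "copy" to say which path was taken.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # Ask the filesystem to share src's blocks with dst.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return "clone"
        except OSError:
            # Filesystem or platform without reflink support.
            shutil.copyfileobj(src, dst)
            return "copy"
```

On a filesystem with reflink support this should behave like `cp --reflink=always`; elsewhere it degrades to an ordinary byte copy.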

A

So what are the general benefits of a feature like that? Definitely the main benefit is space savings when cloning files. If you have large files and you just want to modify them slightly, you can just copy the file and it takes no additional space.

A

Another huge benefit is when you try to recover a file from a snapshot. You accidentally deleted a file and you want to bring the file back. Now, if you just copy the file back, it will consume additional space, because the data blocks will be copied; it has to allocate additional space for all those data blocks.

A

Some people complain about this, so it would be nice not to pay that cost just to recover files from a snapshot.

A

Also, such a copy will be super fast. If you clone a file, we don't read data blocks. We just read the parent blocks; we just need the block pointers that we will clone. So we read and write only a fraction of what is read and written when you do a regular copy, so it should be extremely fast. And another benefit is that you can move files between datasets.

A

This has some limitations. I will mention them a bit later.

A

And some may think: why do we need this if we already have deduplication? It serves pretty much the same purpose.

A

And you can think about this block reference table as a manual dedup, where you decide what you deduplicate.

A

But with a block reference table, there is no cost on write. With deduplication, you have to go through the dedup table.

A

So when you write and you have a large dedup table, there are a lot of problems with that. If it doesn't fit in RAM, you have all those performance problems.

A

It can be quicker, it can be slower. With block reference table, if you do a regular write, we don't touch the block reference table, so there is no additional cost at all.

A

If you do file cloning, of course, there is a cost on write, but it's much smaller. As I mentioned, the copy will be super fast, and it works with any checksum algorithm. You don't need a cryptographically strong checksum in order to use block reference table, because we don't care about the checksum itself. Another benefit is that with the dedup table, because we use a cryptographically strong checksum, the blocks are scattered throughout the entire dedup table.

A

So if you have one file with a few blocks, you may end up reading from totally different places in the dedup table. With block reference table, there is a high chance that all the entries that you need are very close to each other.
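
The locality argument can be illustrated with a small experiment. Everything here is made up for illustration: the block addresses, the assumed 128 KiB record size, and the use of SHA-256 over the address as a stand-in for a dedup checksum (the toy has no real data to hash). It only shows that keys ordered by (vdev, offset) keep one file's entries adjacent, while hash-ordered keys spread them out.

```python
import hashlib

BLOCK_SIZE = 128 * 1024  # assumed record size, for illustration

# Eight consecutive blocks of one file, all on vdev 0 (made-up addresses).
file_blocks = [(0, (1_000_000 + i) * BLOCK_SIZE) for i in range(8)]
# A thousand unrelated blocks that also have table entries.
other_blocks = [(0, i * 7919 * BLOCK_SIZE) for i in range(1, 1001)]


def spread(keys, wanted):
    """Distance between the first and last wanted entry in a key-sorted table."""
    order = sorted(keys)
    positions = [order.index(k) for k in wanted]
    return max(positions) - min(positions)


def checksum_key(block):
    # Stand-in for a dedup key: a strong hash, so neighbours are unrelated.
    return hashlib.sha256(repr(block).encode()).digest()


all_blocks = file_blocks + other_blocks

# Table keyed by (vdev, offset): the file's entries sit next to each other.
brt_spread = spread(all_blocks, file_blocks)

# Table keyed by checksum: the same entries land all over the table.
ddt_spread = spread([checksum_key(b) for b in all_blocks],
                    [checksum_key(b) for b in file_blocks])
```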

A

So, as I mentioned, dedup tables can grow very big. A dedup table entry is pretty large: this is the in-memory size of the dedup entry, so it's almost 400 bytes. It's one-fifth of this for block reference table, which also means that you can fit a much bigger block reference table in RAM.
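
A quick back-of-the-envelope calculation with the figures from the talk, taking "almost 400 bytes" as exactly 400 and "one-fifth" as 80. Both are rounded assumptions, not measured sizes.

```python
DDT_ENTRY_BYTES = 400                    # "almost 400 bytes" per the talk
BRT_ENTRY_BYTES = DDT_ENTRY_BYTES // 5   # "one-fifth of this"


def entries_per_gib(entry_bytes):
    # How many in-memory entries fit into 1 GiB of RAM.
    return (1 << 30) // entry_bytes


ddt_entries = entries_per_gib(DDT_ENTRY_BYTES)   # 2,684,354
brt_entries = entries_per_gib(BRT_ENTRY_BYTES)   # 13,421,772
```

With these rounded numbers, a gigabyte of RAM holds roughly 2.7 million dedup entries versus roughly 13.4 million block reference table entries.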

A

So compared with dedup, the big difference is that in the dedup table you most likely have a lot of entries with only a single reference. This is not possible with block reference table: there are no entries with a single reference. If there is a single reference, you basically just remove the entry from the block reference table, so the table only contains references that are actually meaningful, for blocks that are referenced more than once. And, as I mentioned already, in the dedup table the blocks can be scattered throughout the table.

A

So, costs and no costs. As I mentioned, there is no penalty for regular writes. There is no penalty for regular reads either, of course, but this is similar to dedup.

A

If you don't use this feature, there is no cost.

A

So if you're afraid that block reference table can also grow big and there can also be performance problems, you simply don't have to use this file cloning facility, and basically it will just work.

A

There is a cost when you free a block: on every free, we have to consult the block reference table. That's the difference between this feature and dedup. With dedup, we have a special flag in the block pointer, the D flag, that says that this block is in the dedup table. So if there is no D flag, we simply don't consult the dedup table. With block reference table, there is no special flag in the block pointer, the reason being that we want to clone existing blocks.

A

When we wrote the block, we simply didn't know if this block was going to be cloned or not, so there is no special flag, and of course we cannot modify the block pointer. So on every free, we have to go and check if this block is referenced in the block reference table, and if it is, we have to decrease the counter.

A

If it's not, then we did some extra work.
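
The free path described here can be sketched as a small in-memory model. `ToyBRT` is a hypothetical class, not the prototype's code. As an assumption, it stores the number of extra references beyond the first, so an entry disappears as soon as the block is back to a single reference.

```python
class ToyBRT:
    """Maps (vdev, offset) -> number of extra references beyond the first."""

    def __init__(self):
        self.refs = {}
        self.lookups = 0

    def clone(self, vdev, offset):
        # Cloning an existing block only touches the table, never the
        # block pointer that was written earlier.
        key = (vdev, offset)
        self.refs[key] = self.refs.get(key, 0) + 1

    def free(self, vdev, offset):
        """Return True if the block's space can really be released."""
        key = (vdev, offset)
        self.lookups += 1            # every free pays for one lookup
        extra = self.refs.get(key)
        if extra is None:
            return True              # never cloned: the lookup was extra work
        if extra == 1:
            del self.refs[key]       # back to a single reference: drop entry
        else:
            self.refs[key] = extra - 1
        return False                 # another file still references the block


brt = ToyBRT()
brt.clone(0, 4096)                   # one file cloned: two references in total
first_free = brt.free(0, 4096)       # False, the other file keeps the block
second_free = brt.free(0, 4096)      # True, last reference is gone
never_cloned = brt.free(0, 8192)     # True, but still cost one lookup
```

Note that the table is empty again after the first free, which matches the point that moving a file only grows the table temporarily.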

A

As I said before, if you don't use it, there is no cost, because the table will be empty, so you don't have to worry about that. And when you just want to use this feature to move files between datasets, let's say, the table won't grow: it will grow only for a short while when we put the references into the table, but when you remove the original copy, we remove the entries from the table. So the table will grow only for a very short amount of time.

A

So when it comes to design, as I mentioned, there is no BP rewrite, so we cannot modify the BP: we cannot put a special flag or a reference counter into a block pointer.

A

That would of course be the obvious idea, to just increase some counter within the block pointer, but block pointers cannot just be modified, so we need an additional table. When we clone, we don't read data blocks, just the parent blocks, just the block pointers, so it must be super fast. We just add a reference to the table.

A

And I hope that this feature can be always on. I would prefer not to have a special property just for that because, as I mentioned, if you don't use this feature, there is no additional cost.

A

And a block reference table entry is extremely simple. We just need to say which vdev and the offset of the block and, of course, a reference counter, so the entry itself is extremely small.
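
To give a feel for how small such an entry can be, here is one possible encoding of the three fields as fixed-width integers. The layout (three little-endian 64-bit values) is purely an assumption for illustration; the talk does not specify the on-disk or in-memory format, and an in-memory entry would carry extra bookkeeping on top of this.

```python
import struct

# Hypothetical layout: three little-endian unsigned 64-bit integers.
BRT_ENTRY = struct.Struct("<QQQ")   # vdev id, offset, reference count


def pack_entry(vdev, offset, refcount):
    return BRT_ENTRY.pack(vdev, offset, refcount)


def unpack_entry(raw):
    return BRT_ENTRY.unpack(raw)


raw = pack_entry(vdev=1, offset=0x4000, refcount=2)
# len(raw) is 24 bytes: no checksum is stored, unlike a dedup entry.
```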

A

There are some limitations that are worth mentioning.

A

For blocks that are cloned, this information cannot be sent over. We cannot preserve this information when we do zfs send.

A

This might be a bit, let's say, discouraging. There is a similar limitation with dedup, but because of the checksum, when we send and receive the block, dedup can figure out based on the checksum that this block has more references and simply update the dedup table.

A

Here, we have no idea whether the entry is on the target system, and the target system has no idea whether the data that is coming has more references. So if you use block reference table a lot and you send a dataset like that over, unfortunately it will be much bigger than the original dataset.
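
Why the received dataset ends up bigger can be shown with a toy send and receive. This is a deliberately simplified model with made-up names (`send`, `receive`, integer block addresses); it is not how the real send stream is structured. The point is only that a stream of file data without block addresses gives the receiver no way to know that two files shared a block.

```python
# Toy pool: each file is a list of block addresses; data lives per address.
source_data = {100: b"AAAA", 200: b"BBBB"}
source_files = {
    "original": [100, 200],
    "clone":    [100, 200],   # shares both blocks with the original
}


def send(files, data):
    # The stream carries file contents only, not block addresses.
    for name, addrs in files.items():
        for addr in addrs:
            yield name, data[addr]


def receive(stream):
    data, files, next_addr = {}, {}, 0
    for name, payload in stream:
        data[next_addr] = payload            # always a fresh allocation
        files.setdefault(name, []).append(next_addr)
        next_addr += 1
    return files, data


dest_files, dest_data = receive(send(source_files, source_data))
# The source stores 2 blocks, the destination stores 4.
```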

A

So that's a bit disappointing, in my opinion.

A

I would like this to work across datasets. I would like to be able to move or clone files between datasets, but of course I don't want this to work when there are different encryption keys. This is a similar limitation to dedup. If the encryption key is inherited, like through zfs clone, then that should be fine. But if it's a totally independent encryption key, then we won't be able to clone files between datasets with different keys, and of course not between a dataset that is not encrypted and a dataset that is encrypted, or the opposite.

A

I would still like this to work between datasets with different compression algorithms, because you still don't pay the additional price.

A

So even if you clone a compressed block to a dataset with compression off, or the other way around, I would like this to simply work, because when you start to overwrite the blocks, we will simply use the algorithm configured on the specific dataset.
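
The intended cross-dataset rules can be written down as a small predicate. This is a sketch of the policy as stated in the talk, not of any implementation: `Dataset`, `key_id` and `can_clone_between` are hypothetical names, and `key_id=None` is used here to mean "not encrypted".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dataset:
    name: str
    key_id: Optional[str]   # None means unencrypted; hypothetical field
    compression: str


def can_clone_between(src: Dataset, dst: Dataset) -> bool:
    # Allowed when both sides use the same key (for example inherited
    # through zfs clone) or when both are unencrypted.
    # Compression settings are deliberately ignored: cloned blocks keep
    # their existing form, and later overwrites use the destination's setting.
    return src.key_id == dst.key_id


plain_lz4 = Dataset("pool/a", None, "lz4")
plain_off = Dataset("pool/b", None, "off")
enc_k1 = Dataset("pool/c", "key1", "lz4")
enc_k1_child = Dataset("pool/c/clone", "key1", "off")
enc_k2 = Dataset("pool/d", "key2", "lz4")
```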

A

Okay, I was trying to be brief, and it really went quickly, so I guess there will be more time for questions. I will just mention the status of the project. I don't want to raise your hopes too high: this is just my hobby project, so the progress is very slow.

A

I was able to hack together a very early prototype.

A

I'm using an additional system call, but I'm not sure if this is the best idea. On FreeBSD, you can register additional system calls from kernel modules. I'm not sure about Linux, probably too, although Btrfs is using an additional ioctl, so maybe this is a better way to go. But for now I'm just using an additional system call.

A

I am able to read block pointers. To some extent, I can update the table on write.

A

There are some things that already work, but I'm sure there are a lot of corner cases we have to consider that are not considered at this point at all.

A

So it's a very early prototype.

A

So if there are any questions, I'm happy to answer them.

B

It looks like there are a few questions, if you click the Q&A. Do you want to read them, or do you want me to read them to you?

A

Okay, sure, one sec.

B

You can probably stop screen sharing. And then you'll see in the Zoom there's a Q&A with those four questions queued there.

A

So, a question from Crest first: could this be used to implement rebaseable thin jails?

A

Definitely you could. I think the story behind that is that if you would like to implement jails on top of ZFS, a common practice is to use zfs clone.

A

With zfs clone, you have an initial free copy of your base system, but when you update the jail, you simply have to overwrite the files, and you start to lose all the savings, because zfs clone only gives you the initial savings.

A

So I think yes, definitely: if you install the updated system files from some template and you use file cloning for that, and not just regular copy or install tools (and of course we can extend cp and install to use file cloning if feasible), then you can keep the savings even after updating your jails. I hope that answers the question.

A

Christian is asking: what is the reason why BRT files would not be sendable?

A

The problem here is that when we send the file, we don't really transfer any information about the block pointer. Matt can correct me if I'm wrong here, but at the DMU level we lose that information. So we simply send data blocks, but at the destination ZFS they will have totally different block pointers: a different vdev, a different offset, and this is what we use to reference the blocks.

A

So we lose that information when we send; it's totally different on the destination dataset. Especially since I would like this to work across datasets, but even within a single dataset, we cannot assume that there is a copy of the block already, because zfs send and receive is one-directional.

A

So we discussed that initially. If anyone has ideas on how to address that, I'm happy to discuss it. But at this point, I think that's unfortunately a limitation of this feature.

B

Just to clarify, there may be some confusion about whether you are allowed to zfs send something that has BRT stuff in it.

A

Okay. So yes, definitely, because when you read the file, you simply don't touch the BRT. These are just regular files.

B

Yeah, so you can send it; it's just that the block sharing of BRT doesn't come across with the zfs send. Is that correct?

A

Yes, although that's an interesting question. If I clone a block that is much older, it will be included in a specific snapshot range on the destination dataset. So maybe you could detect...

B

But in any case, the discussion is really about whether we could preserve the block sharing of BRT, not about whether send is going to work at all with this.

A

Yeah, so preservation, I think, is not possible, but...

A

Cool. Okay, a question from Johnny: why do we need another table? Can't we simply use dedup tables? Initially, I was thinking about that.

A

But I believe that an additional table is a better option, because it's much lighter. The entries are much smaller because there is no checksum; it's simply vdev, offset and counter. Also, the dedup table is a bit different: like I mentioned, we put every single block into the table, even blocks with a single reference.

A

With BRT, there are no such blocks with a single reference. If the block reference count drops to one, it's removed from the table, so there are different dynamics. It's also a different way to clone files, because we have to do a special read, since we don't want to read the data, and we have to do a special write. With dedup, you actually write the data itself, and then within the ZIO pipeline, after calculating the checksum, you decide if this entry should go to the table. Well, you decide earlier, because the dedup property is set on the dataset, but basically you have to provide the data that should go through the ZIO pipeline, so we can calculate the checksum and either put the block into the table or just write the block to disk. With BRT, we don't want this last step. There are no ZIOs with the data itself. We finish at the block pointers and we have no data.

B

Yeah, I think a big reason that you wouldn't want to combine it with the dedup tables is in case you're also using dedup. Whenever you're doing a free, you have to look in the block reference table, and you don't want that block reference table to be larger than necessary. So you wouldn't want to have to go look in the dedup table when you're doing a free that doesn't involve dedup but might involve the block reference table.

A

Yeah, so my hope is that the BRT will be much smaller. Of course you can always go to extremes, but you need to do much more work to actually make BRT a performance problem because, as I said, it's much more compact and specialized. Okay, a question from Crest: could this accelerate copy_file_range for the NFS server? To be honest, I'm not familiar with the system call, so I'm not sure how it works exactly. Maybe somebody else knows how it works.

B

I think that it would be able to take advantage of this. It's basically like copying a range of a file to another file.

A

So that's interesting, because there are two approaches I'm considering. One is a clone-file system call that basically clones an entire file, but this can also be implemented as cloning individual blocks.

A

Then you could punch a block in the middle of a file that is basically a cloned block. But of course you have to preserve alignment, so you cannot punch the block anywhere into the file; it has to be at block-size boundaries. So it's a bit limited. If copy_file_range allows you to move the data into any place within a file, then it won't work for the general case, but it can work for some specific cases where you preserve the boundaries.
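
The alignment constraint can be made concrete with a helper that splits a requested range into the part that could be cloned and the parts that would need a regular copy. `split_for_clone` is a hypothetical helper written for this illustration; it assumes both files use the same block size and ignores special cases such as a partial last block at end of file.

```python
def split_for_clone(src_off, dst_off, length, block_size):
    """Split a copy request into (head, middle, tail) byte counts.

    Only the middle part, block-aligned in both files, could be cloned;
    head and tail would need a regular copy.
    """
    if src_off % block_size != dst_off % block_size:
        # Source and destination are shifted relative to each other,
        # so no whole block lines up: everything must be copied.
        return (length, 0, 0)
    head = min((-src_off) % block_size, length)   # bytes up to the next boundary
    middle = ((length - head) // block_size) * block_size
    tail = length - head - middle
    return (head, middle, tail)


BS = 128 * 1024  # assumed block size

aligned = split_for_clone(0, 0, 3 * BS, BS)            # (0, 393216, 0)
partial = split_for_clone(4096, 4096, 300_000, BS)     # (126976, 131072, 41952)
shifted = split_for_clone(0, 4096, BS, BS)             # (131072, 0, 0)
```

In the fully aligned case the whole range is cloneable; in the shifted case nothing is, which is the "won't work for the general case" situation.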

B

We're running a little bit short on time, so I'm just trying to answer some of the questions in text as you're typing.

B

But we probably have time for one or two more, if you want to answer them live.

A

So, a question from Alan: what type of data structure does the BRT use? Is it stored at the top level of the pool? Yes. Basically, I simply tried to reuse as much dedup code as possible, or at least use it more as an inspiration, so it is very similar to dedup in that regard. It uses similar data structures, in the exact same place in the pool, so there's a lot of code that is reused from dedup.

A

Okay, and a question from Tenzin: if the source snapshot is removed, what happens to the cloned file? Does the file get copied up to the live system?

A

That's a very good question and, to be honest, I haven't figured out yet the interaction between BRT and snapshots. There are some open questions: how this fits into the intuition of the user, and what the life cycle of the block is, because the block pointer is not updated. So this is something I still need to...

A

Okay, so the last question, from Christian: could you have the ZIO hint whether it used the BRT or not?

A

So yes, we do send this information through the ZIO pipeline: that we are using BRT, that there is no data, and that it just has to update the BRT. So, yes, it is handled in a special way.

B

Cool. Well, thanks, Pawel, for telling us about this. This is a really cool idea. And thanks to everyone who asked all those great questions.