From YouTube: Dedup Send Receive by Paul Dagnelie
Description
From the OpenZFS Developer Summit 2018
slides: https://docs.google.com/presentation/d/1Ecwg3L52FQ7B86QPleVqXDJe9g5Jv_S4TJsNyK5uJLI/edit?usp=sharing
ZoL issue: https://github.com/zfsonlinux/zfs/issues/7887
All right, thanks, Matt. Hello, my name is Paul Dagnelie, and I'm going to be talking to you today about ZFS dedup send: what it is, what trade-offs it has, and what its future might be. I know probably a lot of you already know what it is and how it works, but I'm sure there are some here who don't, and some people online who don't, so I want to go over an introduction of terms before we get started.
ZFS send serializes a file system: it finds all the blocks and writes them out over the network or onto a tape for later use as backups. It's very efficient, and you can do incrementals between one snapshot and another. There's a variety of very useful features that make it a very high-quality data replication and backup tool.
ZFS dedup is a feature of ZFS that allows you to deduplicate the data in your pool. It will find data with the same checksums, and then, rather than storing multiple copies of what is essentially the same data, it will store it once and keep references to the others. This is a very powerful tool; it can dramatically reduce your storage costs for certain workloads. But in order to keep performance acceptable it needs to store a lot of data in memory, so it can be a very expensive feature to use in practice.
Dedup send, despite the name, has very little to do with dedup; it does have something to do with ZFS send, so that's good. Basically, ZFS dedup send is a post-processing pass that goes over the send stream, finds records that contain the same data, removes the later ones, and points back at the earlier references in the send stream. It uses a special kind of record called a WRITE_BYREF record, which points at a dataset that has already been received as part of the same package. It has a couple of problems.
On the send side of dedup send, we create a hash table in userland that is very similar to the DDT (dedup table) stored in the kernel when you're using dedup. It's a mapping from a checksum to a combination of a snapshot's GUID, the object in that dataset, and the offset within that object. That's how it does the back-referencing to previously sent data. Every time it sees a new record, it looks at the checksum: if it's a new checksum, it stores it in the hash table; if it's an old one, it finds the value already in the hash table and replaces the entire record with a WRITE_BYREF record pointing at the other data.
All of the processing for this happens in a single thread in userland, and it has absolutely no connection to ZFS dedup: whether you have ZFS dedup enabled or not, whether your dedup table is stored in memory or stored on disk, none of it matters.
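As a rough illustration, the send-side pass described above can be sketched in Python. The record layout, field names, and use of SHA-256 here are illustrative assumptions, not the actual send stream format:

```python
import hashlib

def dedup_send_pass(records):
    """Sketch of the userland post-processing pass behind `zfs send -D`
    (simplified; record fields are illustrative).

    Each record carries a dataset GUID, object number, offset, and block
    payload. The first time a checksum is seen, the WRITE record passes
    through unchanged and its location is remembered; later records with
    the same checksum are replaced by a small WRITE_BYREF record pointing
    back at the earlier copy.
    """
    table = {}  # checksum -> (guid, object, offset) of first occurrence
    out = []
    for rec in records:
        # ZFS would use the block checksum already in the stream;
        # here we just hash the payload.
        csum = hashlib.sha256(rec["data"]).digest()
        if csum in table:
            guid, obj, off = table[csum]
            out.append({"type": "WRITE_BYREF",
                        "object": rec["object"], "offset": rec["offset"],
                        "ref_guid": guid, "ref_object": obj,
                        "ref_offset": off})
        else:
            table[csum] = (rec["guid"], rec["object"], rec["offset"])
            out.append(rec)
    return out
```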
You might expect it to make use of the DDT, but it does not, and this is one of its many misleading features. On the receive side, when you're receiving one of these WRITE_BYREF records, you have the GUID of the snapshot that the data is stored in, the object in that snapshot, and the offset within that object. So your next step is: you have to go and open that snapshot, find that object in that snapshot, and hold the appropriate dbuf with the data, so that you can write it into the correct location.
That all happens within one command, so we build a map from GUID to snapshot over the course of a command, and keep track of all the receives that are part of the same command by passing state back and forth to the kernel. Once we have found the right snapshot, we hold the right dbuf and copy the data from the existing snapshot into the new snapshot. And again, whether you have dedup enabled or not on the target, it doesn't matter.
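The receive-side lookup can be sketched the same way. Here snapshots are modeled as nested dictionaries, and the function name and record fields are illustrative assumptions rather than the real libzfs/kernel interface:

```python
def receive_byref(rec, guid_to_snapshot):
    """Resolve a WRITE_BYREF record on the receive side (illustrative).

    guid_to_snapshot is the GUID-to-snapshot map built up over the course
    of one receive command; each snapshot is modeled here as
    {object: {offset: data}}. Returns the block to write at the record's
    own (object, offset) in the new dataset.
    """
    snap = guid_to_snapshot[rec["ref_guid"]]  # find the source snapshot
    # In ZFS this step opens the snapshot, finds the object, and holds the
    # appropriate dbuf: an on-demand read, which is where the extra read
    # latency comes from.
    return snap[rec["ref_object"]][rec["ref_offset"]]
```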
You'll have to read the data from the other dataset when you're receiving a WRITE_BYREF record, and that involves doing an on-demand read before you can do your write, which can potentially increase latency and decrease your bandwidth. There is prefetching to help make it better, but prefetching does not always do what you want it to do, especially in high-I/O situations.
Overall, the big drawback to this feature is that when you're doing a ZFS send, the dedup feature only works within one send command. So if you're doing a normal send or an incremental, that's one snapshot, and the data in that one snapshot is all you can dedup against. If you're doing a send with capital -R or capital -I, you have a larger space that you can dedup over, but you also spend a lot more memory storing this dedup table in userland. And so for a lot of use cases, dedup send actually isn't going to get you very much, because you don't store a lot of copies of the same data in the same dataset. And of course, much like normal dedup, it only works on whole blocks.
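That whole-block limitation is easy to demonstrate: data only dedups when identical content lands on identical block boundaries. A toy sketch with an 8-byte block size (real ZFS record sizes are much larger, e.g. 128K):

```python
import hashlib

BLOCK = 8  # toy block size, purely for illustration

def block_checksums(data):
    """Checksum each fixed-size block: the granularity at which both pool
    dedup and dedup send can match data."""
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

# Four identical blocks all dedup down to one unique checksum.
a = block_checksums(b"ABCDEFGH" * 4)

# The same content shifted by one byte no longer lines up on block
# boundaries, so none of its blocks match any block of `a`.
b = block_checksums(b"X" + b"ABCDEFGH" * 4)
```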
It multiplies the possibilities you have to cover when you're writing tests, and so it adds maintenance cost, it adds development cost, it adds code bloat. It has a bunch of development-focused downsides in addition to the practical downsides of using it. So the question that this talk is intended to ask is: what happens if we deprecated it? This is an interesting question, because we've never really deprecated a feature like this before. ZFS has always had a very strong philosophy, a very good philosophy, that your data should remain accessible.
You should always be able to keep your data. You should always be able to look at old data: with new bits you can load old pools and still get all the functionality, and with new receive bits you can receive your old streams and be able to restore your backups. But as the file system grows, as we add more features and time goes on, if we never remove anything, the codebase will grow without bound, complexity will grow, and it will become harder and harder to maintain and understand all of this logic.
So even if dedup send isn't the thing to remove, it's a good template for thinking about what deprecation should look like for ZFS: how do we want to do something like this, if we decide that we do need to remove features to make our lives easier as developers? Here is a proposal for how we would do this for dedup send. We would start in much the same way that you deprecate basically any feature from any programming language or utility: you start throwing warnings when people use it.
Eventually you remove that, and then, after a while, when people have gotten used to it, you remove the logic from libzfs on the sending side. Then, before you actually remove the receive logic, you could create some sort of utility that would re-duplicate the data: basically, if you have your stream stored on disk, it can iterate over the stream, find the WRITE_BYREF records, and replace them with the appropriate WRITE records.
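Such a utility (OpenZFS later shipped one as `zstream redup`) would conceptually look like the following sketch; the record layout is again an illustrative assumption:

```python
def redup_stream(records):
    """Sketch of a stream re-duplication pass: walk a deduplicated send
    stream stored on disk and replace each WRITE_BYREF record with an
    ordinary WRITE record carrying the data it referenced.
    """
    seen = {}  # (guid, object, offset) -> payload from earlier WRITE records
    out = []
    for rec in records:
        if rec["type"] == "WRITE":
            seen[(rec["guid"], rec["object"], rec["offset"])] = rec["data"]
            out.append(rec)
        elif rec["type"] == "WRITE_BYREF":
            # Resolve the back-reference against data already in the stream.
            data = seen[(rec["ref_guid"], rec["ref_object"],
                         rec["ref_offset"])]
            out.append({"type": "WRITE", "guid": rec["guid"],
                        "object": rec["object"], "offset": rec["offset"],
                        "data": data})
        else:
            out.append(rec)  # pass all other record types through untouched
    return out
```

This works because dedup send only ever points backward within the same stream, so a single forward pass sees every referenced block before any reference to it.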
You do have to maintain this utility, but it doesn't need to consider new features that get added to ZFS; its test matrix is fixed in time, because once dedup send is gone, you no longer have to worry about adding new test cases for it or making it deal with new features and complexities. Then, finally, once everything is ready and we feel confident about it, you remove the logic from the receiving side and remove all traces of the deprecated code, and at that point the feature would be gone.
Obviously, this is not something that we could, should, or would want to do as a declaration, or as a patch that we would even propose, without getting a lot of feedback first. This is kind of a first for the community, and we want to hear from everybody, not only about potentially doing this to this feature, but also about what this process should look like in the future: people who use the feature, other developers of ZFS, other members of the community who have feelings about how this process should go.
We want to talk about this, have a dialogue, and figure out what the right thing to do is going forward, because it seems like at some point we are going to need to think about ways to stop the file system from just growing endlessly. At some point we do need to think about what things should be in and out of scope, and how we deal with making those changes and those decisions. So I really look forward to people asking questions about this talk and emailing me about it.
This affects the whole community, so I feel like it should probably be a discussion that happens in the open, on some sort of public forum. Maybe the right place is an issue on GitHub or a message on the mailing lists; probably the right thing to do is to look into something like that, where we can track the discussion and store the feedback in a way that is actually persistent, and not just receive emails that I will not be able to keep.
We don't have a good way to gather metrics on something like this. The limited evidence that I have comes mostly from looking at how bugs get discovered when we do release a feature that has bugs in it, and from looking at tools like zrepl and a couple of the other automated data backup solutions. None of them really take great advantage of this feature, and almost every tool I've seen that does use it doesn't use capital -R or capital -I.
[Audience] That kind of goes to what I was going to suggest, whether it's deprecated or not. It's been a while since I looked at the man page for it, but I think there is actually a subtle implication that it is in some way related to actual dedup. Cleaning that up, and adding explicit commentary that it in fact has nothing to do with the dedup feature and that there's a very limited use case for it, might be worth doing.