A: So next I wanted to dive a little bit into ZFS send and receive. I'm going to talk about what this thing is, why you would want to use it, and how it compares to other tools that solve similar problems. Then I'll talk a little bit about how it works and how we designed it to be better than those other tools, then about new features since 2010, and then some upcoming, almost-integrated features like resumable send and receive.
A: So what do you use this thing for? ZFS send is a command that serializes the contents of a snapshot and basically just dumps the contents of that snapshot to standard out. The key thing about it is that you can create incremental send streams between two different snapshots, so you can periodically say: okay, send all the changes since the previous snapshot, up to the current time, to a remote system, and then use zfs receive to recreate that snapshot.
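A minimal sketch of that basic workflow, assuming a hypothetical local pool "tank", dataset "home", remote host "backuphost", and remote pool "backuppool" (none of these names come from the talk):

    # Take a snapshot, then serialize it and recreate it on the remote system.
    zfs snapshot tank/home@monday
    zfs send tank/home@monday | ssh backuphost zfs receive backuppool/home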
A: So the primary use case for this is remote replication. You can use this replication for things like disaster recovery or failover; you might have the other machine in a different data center. You can also use it for data distribution: I have one master node where I'm putting my new content, and then I need to distribute that content to a bunch of different machines that are consuming it or serving it out to other customers or clients. It can also be used for backup as well.
A: That creates the new file system on the receiving side. Now, when the next day comes around, we can do an incremental by doing zfs send -i with the first snapshot. We're basically saying: the other machine already has the Monday snapshot, and now I want to send them the Tuesday snapshot based on knowing that they already have what's from Monday, so just the differences, and then SSH that to zfs receive. In terms of terminology, sometimes I'll call this the from snap, because we're sending from that snapshot to this to snap.
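A sketch of the incremental step he describes, with the same hypothetical names as the earlier sketch:

    # The receiver already has @monday; send only the blocks changed since then.
    zfs snapshot tank/home@tuesday
    zfs send -i tank/home@monday tank/home@tuesday | ssh backuphost zfs receive backuppool/home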
A: So, as the end result of that, we're able to use the full IOPS and bandwidth of the storage, and we're able to use the full bandwidth of the network. And this, I think, is the key thing: the latency of the storage and the latency of the network have no impact on the performance of send and receive. So you can be going over a really slow link.
A: Another key thing about ZFS send/receive is that it maintains the block sharing between ZFS snapshots and clones. So if you're already using snapshots and clones as part of your product or as part of your administrative practices, using ZFS send and receive for replication allows you to have those same snapshots and clones on the remote system, with the same block sharing, versus with tools like rsync.
A: With rsync, if you wanted to have both the data from yesterday and the data from today, you'd have to send two whole copies to the remote system, which is really impractical. And lastly, ZFS send/receive is very complete, in that it captures all of the semantic state of your file systems. Even kind of weird, ZPL-specific features like Windows SID owners and NFSv4 ACLs are all captured, without needing any special code to be updated each time one of these things is added.
A: OK, so that's kind of a big, big promise. So next I'll talk a little bit about how we accomplish that: what were the design decisions that allowed us to get that great performance and those other features. We locate the changed blocks using the birth time, which is part of the ZFS on-disk format, and I'll talk about that in a little bit.
A: We use prefetching to issue lots of I/Os in parallel. When you're doing a zfs send, we're actually issuing lots and lots of I/Os in parallel, so that if you have lots of disks, we're able to take advantage of the full bandwidth of all of those disks, rather than just reading one thing at a time. And ZFS send and receive is unidirectional: we're only sending data from the sender to the receiver, and the receiver doesn't have to send any information back to the sender.
A: So basically we can go at the full bandwidth of whatever TCP connection, or whatever your underlying transport, is. And then, lastly, it's built on top of the DMU, which means that any complexity that's implemented in the ZPL or ZVOL layers is just naturally abstracted away, and we don't have to worry about it at all in ZFS send/receive. I think locating the incremental changes is one of the more interesting aspects of ZFS send and receive, and I already talked about this a little bit.
A: Let me show you the diagram. The key thing here is that to find the incremental changes, we only have to look at the to snap, the snapshot that we're actually sending; all that we need to know about the from snap is what time it was created.
A: In this example the from snap was born at time 5, so we need to find all the changes that have happened since time 5. To do that, we look at this tree of blocks, which is how all the data is represented on disk: the user data is in these leaf blocks, and the interior nodes, called indirect blocks, point to other blocks. As part of the pointer, we also record what time the block that it's pointing to was written. And because ZFS is copy-on-write, we know that whenever we modify a block, we have to modify all of its ancestor blocks, all the way up the tree.
So that means that if we modified this block, and its birth time is 6, we know that all of its ancestors have to be 6 or later; they can't be earlier than 6, because we couldn't have modified this block without modifying its parents. So, in order to find the changed blocks, which are the ones that we need to send, we look at the birth time and compare it to the birth time of the from snap, which is 5. Here we look at this one: its birth time is 3, so we know nothing below it could have been born after time 3. Therefore we don't have to look at any of those blocks, because they're all born before time 5.
This one was born after time 5, so we have to read the block that it points to, look at the birth times inside of it, and so on down the tree. And we find that this block was modified after time 5, this one was not, and these two were modified after time 5. So in the end we'll end up sending just these three data blocks. In this simplistic case we had to read four metadata blocks to get there, but in reality this tends to be...
A: So it's this layer here, basically: the virtual file system layer talks to the ZPL, and the ZPL is responsible for things like file ownership, file size, permissions, directories, symbolic links, all those kind of file-level things. But it's not responsible for the layout below that. Basically, this interface allows it to make atomic transactions on objects, and objects are kind of like a file without any attributes.
A: This layer is complicated, because there are a lot of nitty-gritty details in implementing permissions and things like that, but we've isolated that complexity from the complexity of how you actually structure a file: how you get a block pointer that points to other blocks, and how you manipulate that tree of blocks that I showed you earlier. That tree of blocks is all managed by the DMU, so everything that ZFS send and receive deals with is interacting at this layer.
[Audience question not transcribed.]

A: So, like, file attributes: that includes things like extended attributes, owners and permissions, the file mode, all that kind of stuff. Other questions? Oh, cool. So, I mentioned that we issue lots of I/Os at once in order to take advantage of the full IOPS of all of the disks that are attached to the system. The way that we do that is that when you run zfs send, it actually creates an additional thread: the prefetch thread.
A: That allows it to keep ahead, and how far ahead it keeps is set by this tunable, zfs_pd_bytes_max; the default is 50 megabytes. So this basically lets you choose how much memory you want to use to get as many I/Os going in parallel as you can.
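A sketch of adjusting that tunable, assuming a Linux system where OpenZFS exposes it as the zfs_pd_bytes_max module parameter (on illumos it is a kernel tunable instead):

    # Let the send prefetch thread run up to 100 MB ahead (the default is about 50 MB).
    echo 104857600 > /sys/module/zfs/parameters/zfs_pd_bytes_max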
A: And the advantage of making this bigger is that this prefetch is not just a guess. Predictive prefetch says: oh, you read blocks 1, 2, 3, 4, so you're probably going to read 5, 6, and then 7. This is kind of the opposite of predictive prefetch: we know exactly what's going to happen, we know that we need these blocks, and we're just issuing the reads for them a little bit sooner. So there's no possibility of wasted work with this type of prefetching, in contrast to a predictive prefetch. So, I mentioned that ZFS send/receive is unidirectional.
A: Setting up the connection is generally the responsibility of the system administrator, or maybe another layer on top of this. In our product, we have remote replication that's driven by a very complicated Java-based application stack that's actually orchestrating the connections, and we aren't actually using SSH there; we have a private communication protocol.
A: I imagine most products have something similar to that. That upper layer would be responsible for figuring out what to send; in other words, telling it what the most recent common snapshot is, which is what you're going to use as the from snap. Also, the send stream over the years has gotten some new over-the-wire features that both systems need to be able to support in order to communicate, like enabling large blocks and embedded blocks.
A: You need to know if those are supported on the receiving system, and if so, then enable them in the zfs send. But the consequence of this design is that it's insensitive to network latency. It also means that you can use it for backups, because the sending process isn't talking to anyone; it's just spitting the stream out. So you can just spit it out and put it onto a tape, or onto a hard drive that you're using for archival, and then receive it again later on.
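A sketch of that backup use, with the same hypothetical names; the stream is just bytes on stdout, so it can land in a file or on tape instead of going to a live receiver:

    # Archive a full stream to removable media; no receiver is involved yet.
    zfs send tank/home@monday > /mnt/usb/home-monday.zstream
    # Later, restore it into whatever pool you like.
    zfs receive backuppool/home < /mnt/usb/home-monday.zstream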
D: My question is: if you're receiving into a system with a dedup pool, a system that supports that, then I guess you're obviously only sending the dedup differential, the deduped data stream, without rehydration. But what if you output to a file instead of doing the send and receive in one go? Is the byte stream that gets output also in a deduped form, or would that be different?
A: Yeah, that's a good question. The data that's output from zfs send is going to be the same whether you put it into a file or a network stream or whatever, because zfs send doesn't even know what you're doing with the output. So semantically it's the same, but performance-wise it could obviously be very different, specifically in terms of dedup. The receiving system is what's really implementing the dedup, because dedup is pool-based, right?
A: Each pool has a giant hash table of all of the checksums of all the blocks that are deduped within that pool, so when you receive, it's deduped against what's already in that pool. And you might think: oh, if I do a zfs send and put the output into a file that's on the pool that I care about, then maybe that's going to dedup the same as if I did the zfs send and put the output into zfs receive on that pool.
A: But if you take the zfs send stream and put it into a file, then the logical blocks within that send stream are not going to line up with the blocks in that file, because there are little headers in between each of them in the send stream. So in general it's not going to be able to dedup: if you just take the zfs send stream and put it into a file on a pool, it's probably not going to dedup against other stuff in the pool.
D: Let me rephrase my question. Let's say on Monday I add ten video files of one gigabyte each, and they're all identical, so on my dedup pool only one gigabyte gets written. OK, I snapshot. On Tuesday I make a differential between Monday and Tuesday; that differential, deduped, is only one gig, and if it were hydrated it would be ten gigs. Now I do a ZFS send between Monday and...
A: Ten gigs is what's going to go over the wire by default. So it depends: when you run zfs send, there's a dedup option, zfs send -D (capital D), and then it will dedup the send stream. But the way that works is kind of different from the normal on-disk dedup: it's deduping only within the send stream.
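A sketch of the option being described, with hypothetical snapshot names; -D deduplicates only within this one stream:

    # Identical blocks that appear more than once in the stream are sent only once.
    zfs send -D -i tank/media@monday tank/media@tuesday | ssh backuphost zfs receive backuppool/media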
A: So if there is duplicated data within whatever send operation you're doing, then it would be condensed away by the zfs send -D dedup. In your case, the differences between Monday and Tuesday are ten copies of the same file, so it would detect that those are all the same and only send one copy of the file over the send stream. And then when you receive, you could receive that either into a deduped pool,
in which case you would just take up the one gigabyte, or you could receive it into a non-dedup pool, and then you would rehydrate it as it's being received. The downside of the way this is done is that, because it's unidirectional, we don't know what's on the other side. So when we do the send, it could be, instead of your example: say on Monday we put one file in, then on Tuesday
we put another copy of the same file, then on Wednesday we put another copy of the same file, so by Friday you have, you know, seven copies of that file, but each of them is in a different snapshot. Then if you just send Monday to Tuesday, it would say: great, here's the one-gigabyte file. If you send just from Tuesday to Wednesday, it would again say: oh, here's this one-gigabyte file, because we don't know that the other side already has it deduped.
A: So it's definitely not a perfect "let me figure out what you have and I'll only send you the things that you don't already have"; the unidirectionality kind of implies that. Something like that certainly could be implemented, but it would be very different, right, because you'd be having a conversation between the two machines; there'd be a lot more involved.
A: Yeah, so zfs send capital D, you can think of it as dedup-lite: in some cases it's going to be able to reduce the stream a lot. If you're using dedup, then it's probably a good idea to use zfs send -D, because it'll probably be able to reduce the send stream somewhat, but it's not going to be perfect.
D: This is great, because I didn't know of that option. We do a lot of putting differentials into files that we then rsync, in order to benefit from gzip -9 compression over slow links. Is there in ZFS send, like there's a capital D for dedup, a way to compress the stream between the send and the receive?
A: Another great hackathon project idea. This is actually something that we're hoping to get done: we're getting some interns this summer, some college students, and we're hoping that one of them will implement this idea of compressed ZFS send and receive. It has recently been made much, much easier to do by some work that George did, which he'll be talking about, called compressed ARC, which basically means that we store the data compressed in memory for a longer period of time.
A: The idea of the compressed send and receive that we're talking about is that we would take the data as it is on disk. So if the data is compressed on disk already, we would just grab that compressed data and send it over the wire in the zfs send, and then when it's received, we'd take the compressed data and write it directly to disk. So if your data was already LZ4-compressed, it would go over the wire LZ4-compressed, with essentially zero additional CPU overhead.
A: What I recommend there, rather than putting it into a file, gzip -9'ing it, and then rsyncing it, is to just use a pipeline when you do the zfs send: zfs send, pipe to gzip, pipe to ssh. That basically solves the whole thing, and on the other side it's ssh to the host, gunzip, pipe to zfs receive.
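A sketch of that pipeline, with the same hypothetical names:

    # Compress the stream in transit; decompress on the far side before zfs receive.
    zfs send -i tank/home@monday tank/home@tuesday | gzip | ssh backuphost "gunzip | zfs receive backuppool/home"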
[Audience question not transcribed.]

A: That would have a similar effect to mbuffer. There's also, upcoming, yet another layer of buffering that we're adding, so I think I need to figure out why this is not sufficient. It's not exactly equivalent to mbuffer, I know, but I don't know exactly why; I think that this new thing that we're adding is basically identical to mbuffer, and I think it'll let you solve that problem without an external tool.
[Audience question not transcribed.]

A: When you're doing a zfs send, it uses the ID of the particular snapshot to identify it. Mainly that's used when you do an incremental: when you do the incremental and you do the receive, we need to be really, really sure that the snapshot the receiving system has is actually the same as this Monday one that we're sending from; otherwise the receive is just going to end up giving you garbage.
A: So it doesn't actually matter what the name of the file system or the name of the snapshot is. Internally there's a GUID, a globally unique ID, that we associate with the snapshot. When we sent the Monday one first, and it was received over there, it was assigned that GUID, and then when we do the incremental, it's going to check that the from-snap GUID in the send stream matches the GUID of the snapshot that you already have on the target system. Is that what you were asking about?
[Audience follow-up not transcribed.]

A: So this is an example showing, when you do a send, that there's basically going to be a BEGIN record. This is getting to the to-GUID and from-GUID that I was just mentioning: the from-GUID says we're sending from the snapshot with this identifier, and the receiver is going to check and make sure that's the same; if it doesn't match, then it's going to give you an error. And then this shows you that there are a bunch of other record types in the stream, and how many of them there are.
A: I'll give you an example of what they look like. If you do a zstreamdump -v, it'll actually show you every single record, and you can see that most of the records are WRITE records, but there are also these OBJECT records. An OBJECT record says: there's this object, it has number seven, it's this type (this is probably a plain file), it has a block size of 512, it has this big of a bonus buffer, and then after that it'll actually have the contents of the bonus buffer.
The bonus buffer has some ZPL-specific metadata. And then the WRITE records constitute the bulk of the data. A WRITE record just says: I'm about to give you some data, you're going to write that data to object 12 at this offset, and I'm giving you 8 kilobytes. So, as you can see here, there's nothing about file ownership or directories; there's no difference between a directory and a file, because they're all just objects.
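A sketch of inspecting a stream that way, with hypothetical snapshot names; zstreamdump is the tool named in the talk (newer OpenZFS releases spell it "zstream dump"):

    # Summarize the records in a send stream without receiving it.
    zfs send -i tank/home@monday tank/home@tuesday | zstreamdump
    # -v prints every individual BEGIN, OBJECT, and WRITE record.
    zfs send tank/home@tuesday | zstreamdump -v | less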
A: This next one has been in for quite a while: send stream size estimation and progress monitoring. This is useful if you're thinking: I'm doing an incremental send, it's been going all night, now it's morning, and it isn't done yet. Is it almost done or not? Because if it's not almost done, then I should probably kill it off, because people are about to come into work and it's going to slow everything down, or whatever.
A
So
if
you
do
ZFS
Envy
it'll
tell
you.
The
estimated
size
is
two
point:
seven,
eight
gigabytes.
If
you're
doing
like
a
sin,
capital
R
then
it'll
give
you
the
estimate
for
each
individual
snapshot
and
then
also
the
total
and
then
it'll
print
out.
Every
I
think
it's
every
second.
This
show
tells
you
like
how
this
is
how
much
I've
already
sent.
This
is
what
snapshot
I'm
working
on
right
now,
then
you
can.
Do
you
like
what
we've
done
with
our
product?
You
can
use
the
capital
P
parsable
option.
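A sketch of those options, with hypothetical snapshot names; -n is a dry run, -v prints the estimate and per-second progress, and -P makes that output machine-parsable:

    # Dry-run estimate of the incremental stream size.
    zfs send -nv -i tank/home@monday tank/home@tuesday
    # Real send with parsable progress reporting (the stream itself goes down the pipe).
    zfs send -Pv -i tank/home@monday tank/home@tuesday | ssh backuphost zfs receive backuppool/home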
[Audience question not transcribed.]

A: That enables embedded blocks to be sent. Embedded blocks are another relatively new feature of the on-disk format in general, which allows a much better compression ratio for highly compressible data. Basically, if a whole block of data can be compressed down to fit in just the block pointer, which is about a hundred bytes, then it'll take that compressed data and stuff it into the block pointer itself, and the block pointer doesn't actually point to anything; it just has the contents embedded right in it.
A: So this gives you a better compression ratio, and the big benefit is less I/O. In our use case at Delphix, we store a lot of databases. In most databases, but especially Oracle databases, when you create a new data file, it initializes the data file by writing a little header and footer on every single block, and those little headers and footers compress down really well, because the middle of the block is all zeroes. So basically we're able to compress
each block down into less than a hundred bytes, and then when they do this initialization operation, we don't have to write all those data blocks. We get something like a hundred times fewer I/Os, because instead of writing every data block, we just write the indirect block, which has all the data embedded inside of it. And this initialization is kind of like a quick and dirty benchmark that a lot of database admins like to do: how fast is my storage?
[Audience question not transcribed.]

A: So with that option it's kind of like a compressed send, like you were asking about. The implementation of this is much easier than doing a full-blown compressed send and receive, so this is kind of a little bit of compressed send/receive, just for these embedded blocks.
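A sketch of enabling those stream features, with hypothetical names; -e sends embedded blocks in their compact form and -L allows large blocks, and both ends need to support the corresponding pool features:

    # Send embedded (and large) blocks as they are stored on disk.
    zfs send -Le -i tank/db@monday tank/db@tuesday | ssh backuphost zfs receive backuppool/db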
A: Other questions? I'm going to go through these next ones a little bit quickly so I can get to some more interesting stuff. We improved the performance of zfs send for files with holes, so sparse objects like zvols and VMDKs, by avoiding sending a bunch of stuff, improving it to essentially constant time. We also did this ZFS bookmarks thing, which lets you not have to keep around yesterday's snapshot on the sending system: instead of keeping that snapshot on the sending system, you keep a bookmark instead.
A: Without bookmarks, I delete yesterday's snapshot but hold on to today's snapshot, and then I go again the next day and do the same thing. So during that day, yesterday's snapshot was still there: I always have one snapshot on the sending system that's only there because of send and receive, all the time, regardless of whether I'm actually running a send at that moment.
A: So you can imagine that if I keep doing this every day, and then the target system goes offline, then a month later I'm like: oh well, it's a month later and I'm still holding on to this snapshot from a month ago, and it's actually taking up gigabytes and gigabytes of space. This solves that problem. The new procedure would be: I send my snapshot, then I create a bookmark from the snapshot, and then I delete the snapshot. So now I just have a bookmark and I
don't have any snapshot. Then tomorrow I send from that bookmark to today's snapshot, and then again I create a bookmark of today's snapshot and delete today's snapshot. So generally you don't have any snapshots on your source system that are dedicated to this replication; you just have a bookmark, and the bookmark doesn't take up really any space, but it still
lets you do the incremental. Remember, this diagram is showing you the blocks in the to snap, the snapshot that we're sending; we aren't reading any data from the snapshot that we are sending from, the incremental source, like yesterday's snapshot. Although we looked at it, all we said was: oh, yesterday's snapshot is at time 5, and that's all that we needed to know. So the bookmark just remembers: yesterday's snapshot was time 5.
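A sketch of that bookmark-based cycle, with the same hypothetical names; a bookmark is named with # rather than @:

    # Day 1: replicate, then keep only a bookmark on the source.
    zfs send tank/home@monday | ssh backuphost zfs receive backuppool/home
    zfs bookmark tank/home@monday tank/home#monday
    zfs destroy tank/home@monday
    # Day 2: use the bookmark as the incremental source.
    zfs snapshot tank/home@tuesday
    zfs send -i tank/home#monday tank/home@tuesday | ssh backuphost zfs receive backuppool/home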
A: Yes, cool, this is all great stuff. So do I still have like 15 minutes, or... as long as you like? Okay, cool. So now we're talking about some upcoming features. All of these are things that we implemented at Delphix; they're open source, and we published them on our GitHub page a couple of months ago, but they're not upstream yet, so they're not really in OpenZFS.
A: So the problem that this is solving is that when a receive fails, we have to restart the whole send and receive process from the very beginning. It could fail because your network dies, or the sending system reboots, or the receiving system reboots, or you hit Ctrl-C by accident; in any case, the result is that all of the progress on that snapshot is lost. On the receiving system,
we just say: oh well, that receive didn't work, so I'm going to destroy all the stuff that I've already got, and you just have to restart from the very beginning. This is a real customer problem for us; it's why we implemented this. We have a customer for whom it took ten days to do a send and receive, and the MTBF of their network was about a week, so it took them something like three attempts, you know.
A: Basically, we remember how far we got by recording the object and offset. This works because, I don't know if you noticed in the previous slides, but the records are sorted by object and then by offset, so the stream always goes forwards. Therefore we can just remember the last received object and offset, and that tells us exactly what we have received and what is yet to be received. So the sender is able to resume by picking up from that object and offset.
A: It basically seeks directly to that object and offset; it doesn't have to go through and read all the previous stuff or anything, so it resumes in basically constant time. In terms of how you use it: the send is still unidirectional, so we still need to manually give zfs send all the parameters describing what is on the receiving system.
A: The way this works with resuming is that when the receive fails, it will still create the filesystem, and it'll set this new property called receive_resume_token. This is basically an opaque string with lots of random letters and numbers, but it essentially encodes the object and offset that we need to restart from, as well as some other information. And then either the sysadmin or the application that's driving this is going to pass that token back to zfs send, with the -t option, to resume.
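A sketch of the resumable flow, with the same hypothetical names; -s on the receive keeps partial state, and -t takes the saved token in place of snapshot names:

    # Receive with resume support; if this is interrupted, the partial state is kept.
    zfs send -i tank/home@monday tank/home@tuesday | ssh backuphost zfs receive -s backuppool/home
    # After a failure, read the opaque token on the receiving side...
    ssh backuphost zfs get -H -o value receive_resume_token backuppool/home
    # ...and hand it to the sender to pick up where the stream left off.
    zfs send -t <token> | ssh backuphost zfs receive -s backuppool/home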
A: The only other new interface is that, if you do the receive -s and we save the state, and then you realize, oh, that sending system just burst into flames and I'm never going to be able to resume it, or whatever, then you can abort the resumable receive state with zfs receive -A, which discards the partially received state. Under the hood, basically, I don't know if you guys are aware of this, but when you do a receive it's actually receiving the data into an invisible clone.
A: Yes, so the question was about whether, when you do the receive -s, there's a performance impact. The answer is no: basically the performance is just the same as when you're doing it normally, except that when you do the send -t it restarts from that point. So there's really no performance impact. The only impact is kind of the obvious one: if you do the receive -s and then it dies, then we're saving all that state,
so it's using that space on disk so that you can resume at a later point in time, and then obviously you use the receive -A if you don't want it to use that space and you're not going to resume it. But yeah, there are really no changes to the receive; the receive -s is just saying: this is a resumable receive, in other words, you will be able to resume it if it fails, so keep track of that state.
A: It's writing out the object and offset, but it's doing that once every TXG, once every transaction group. So it's not like every record that we get has to wait for an additional write: it's basically keeping track, within this transaction group, of the most recent object and offset, and then when we happen to sync out that transaction group, it writes it out. So there's a window where, if you're doing your receive... actually, well, there isn't really a window.
A: The fact of the matter is just that, when you are doing your receive, if you pull the plug on the receiving system, then some of the data that was received won't have made it to disk, and the most recent object and offset will reflect that, because it's updated as part of the transaction group syncing: when we write the new data to disk, we're also writing the latest object and offset.
A: So performance is great, right. This is just showing the token and the data that's in it. Oh, one other thing I forgot to mention: it also includes the amount of data already received, and that's used to update the estimate. So when it gives you the new estimate, as the send is resuming, the estimate is going to be smaller, because it computes that the estimate for doing the whole thing was, whatever, 20 megs, but you've already received 9 megs of it, so there are 11 megs left.
[Audience question not transcribed.]

A: The begin record of a resuming send stream is going to tell it: I'm resuming from this object and offset, and then the first OBJECT record and the first WRITE record are going to be for that object and offset. And of course we have to include this information so that when you do the receive, it knows that you're receiving a valid stream.
A: Okay, let me fast-forward a little bit through this so that we don't take forever. As part of this work, we added a checksum to every record in the send stream, so that if the receive dies in the middle, we know that everything we've gotten is correct data. This is an additional layer of checksumming on top of whatever is already in your transport, maybe TCP or whatever; TCP has a really tiny checksum, like 16 bits or something, whereas this is a full ZFS checksum, you know, 128 bits.
A: Okay, so there was a question... oh yes, you were mentioning, and kind of reinforcing, that when you do a ZFS receive it can get really bursty, and mbuffer on the receive side can really help with this. We also noticed this, and we actually diagnosed the problem, and it turns out it can happen really easily, especially if you're doing an incremental receive of record-structured data: you have a big file,
you've modified only some of the blocks within that file, and you're receiving that. This happens a lot if you have a database, or a zvol, or a VMDK file, something like that. The problem is, well, okay: when we process each record, we have to write it into the file, and writing it requires that we read the indirect block. That isn't necessarily a problem in itself, it's just a fact: we have this tree of blocks.
We
have
a
tree
of
blocks
if
we
want
to
write
an
update
this
block.
At
some
point
we
have
to
read
this
block
in
so
that
we
can
modify
it
and
point.
The
new
pointer
here
in
the
old
pointers
so
points
there,
but
the
problem
is
the
way
that
we're
doing
it,
which
is
that
this
read
is
happening.
Synchronously,
so
what's
happening
in
the
ZFS
receive
process.
A
So
we
keep
doing
this,
and
if
many
of
these
read,
if
many
of
these
rights
have
to
wait
for
a
read
of
the
indirect
block,
then
you
end
up
not
getting
what
I
promised
to
you,
which
is
you
go
at
the
full
bandwidth
of
your
storage?
Instead,
you'll
be
going
at
the
I
ops
in
of
your
storage
and
it'll
just
be
doing
one
REIT
it'll
be
like.
Let
me
do
one
read
it
wait
for
that
agreed
to
complete
great
now
you
can
do
one
write.
A
Do
another
read
great
tyrita
is
done
now
you
can
do
it
one
right
so
be
very.
You
can
get
very,
very
slow
solution.
Is
we
added
another
thread
to
the
receive
process?
So
there's
now
two
threads.
The
main
thread
is
getting
the
data
from
the
network
and
then
in
queueing
the
record
on
a
queue
in
memory.
So
what
we
do
is
we
get
the
data
from.
We
get
the
record
from
the
network.
Now
we
issue
the
read
IO
for
the
indirect
block,
but
we're
not
going
to
wait
for
that.
A: We just issue it as a prefetch, we don't wait for that read to complete, and then we enqueue the record onto a queue with a bounded length. Now the worker thread dequeues the record, pops it off the queue, waits for the read of the indirect block to complete, and then copies the data into the DMU. So this queue allows us a certain amount of time between those two operations.
[Audience question not transcribed.]

A: I mean, I claim that that is probably the case. I would encourage you to test it and let me know how that turns out. We were not previously using mbuffer; we did have some buffering in our network stream code, but not that much. So mbuffer, well, mbuffer wouldn't really solve this problem 100%.
A: It would decouple the network from the disk performance, but the problem that I described here is that the disk performance itself is really bad. This solution both decouples the network from the disk and actually solves the disk performance problem. I'm not going to claim to have conclusively looked at every problem that mbuffer solves; I think that this will address those issues, but I would gather data first before, you know, relying on that.
A: In terms of the time it takes, it'll go almost immediately into FreeBSD-CURRENT, which is kind of the leading edge, and then to get into a FreeBSD release version, depending on whether you're waiting for a full release or a patch release, it's probably going to be several months, like six months or something, from there.
A: Yeah, OmniOS also has an OmniOS bloody release, which is very, very current; I think they usually update it every two weeks. And then they have a... I don't know how often they do major releases. Okay, so they do a supported release every six months. Thank you.
[Audience question not transcribed.]

A: So if you want to defragment the pool by essentially rewriting all the data, well, ZFS send and receive is going to be a much better choice, because it preserves your snapshots, right? If you are using snapshots or clones, then zfs send and receive is really the only way to replicate that into the new pool, or to the same pool. If you don't care about that, because for some reason you're not using snapshots, then send and receive versus rsync should be essentially the same.
A
At
least
for
like
the
initial
you
know,
the
first
full
send
or
full
arcing
is
gonna,
be
pretty
much
the
same
if
you're
needing
to
do
like
incrementals.
After
that,
then
obviously
senator
sieve
is
going
to
be
faster
than
the
are
sync.
But
if
you're
just
saying
this
to
take
down
time
and
do
the
whole
hour
sync,
then
it
would
be
the
same
to
use,
send
to
receive
or
arcing
and
and
they
both
kind
of
work.
D: Preserving the snapshots might be necessary in terms of preserving the feature that snapshots bring, but by preserving the snapshots, sometimes we also preserve the fragmentation. Because if I have a three-hundred-meg Photoshop file and it's been changed every day for 40 days, and I have 40 snapshots, then when I want to read that Photoshop file I have to read parts that are scattered all over the vdev or pool, because some of that file is from day one, some of that file that was modified on day two has to be read from the place on disk where it landed in the day-two snapshot, and likewise for day three and four and five. So of course, if we want to preserve the feature of the snapshots, we have to do that, but compared to just copying the data without preserving the snapshots, we would end up with a 300-meg file that could be one continuous read, right?
A: Yes, that's correct. I would say that the initial problem of that fragmentation occurring to begin with is going to happen regardless of whether you have snapshots or not. If you have a file that's updated block-wise, like a database file (I don't know about Photoshop), then
because of copy-on-write, part of the file was written a long time ago and part of it was written more recently, so they were allocated at different times and end up in two different places on disk. The trade-off that you're making there, between a copy-on-write file system and a write-in-place file system, is that the writing is generally faster, because it can go wherever there's a big chunk of free space, but the reading can be more scattershot.
[Audience comment not transcribed.]

A: That's right. Okay, so this is one of the reasons that, because our product at Delphix depends really heavily on ZFS snapshots and clones, there is really no great way to lay out that data so that it will perform great regardless of which snapshot or which clone you're accessing.
So
that's
why
we've
really
put
more
of
our
effort
into
getting
good
pull
good
performance,
even
when
you
have
a
fragmented
pool
and
in
reducing
fragmentation
overall,
rather
than
like
optimizing
after
the
fact
or
like
you
know,
defrag
or
things
like
that,
so
George
is
gonna
talk
about
some
of
the
work
that
we've
done
in
that
space,
I.
Think
after
the
break.
Thank
you.
A: Obviously, flash random read/write performance is great, so it's not as big a deal there. Cool, so that's all the prepared slides that I had. I'm happy to continue answering questions for as long as you guys can stand it, about send and receive, or OpenZFS, or any other ZFS features, if there are any.
B: So I think it's a pretty good time for everyone to get caffeinated and have some coffee, a quick break, and then we'll be back for George. And for those of you on the livestream who also want to ask questions, feel free to use Twitter with the hashtag #OpenZFS; you can send those questions and I'll read them out for Matt or George or anyone else.