From YouTube: Improving ZFS send/recv by Jitendra Patidar
Description
From the 2021 OpenZFS Developer Summit
slides: https://docs.google.com/presentation/d/1DHXaBQcw3MmeZzg-Y5FEgStGEFHi4IfwN5VfgzNLKPA
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2021
A: My learning on ZFS started around two and a half years back, when I joined the Nutanix Files team. Nutanix Files is a scale-out file server solution built on top of the Nutanix core HCI platform. On top of Nutanix file servers, different types of file shares can be deployed for different use cases and workloads.
A: This product has replication solutions, and for those, ZFS send and receive is used. So today I am going to talk about a couple of optimizations we have done for the share-level replication solution: one optimization is around ZFS send block traversal, and the second optimization is for share-level replication on the ZFS receive side. I'm going to cover these two optimizations in today's talk.
A: Yeah, before jumping over to the optimizations, just a brief on ZFS send and receive. ZFS send is a replication tool. It traverses the block tree for a given snapshot: it visits all blocks for a full send, and only the changed blocks for an incremental send, and for those blocks it dumps records onto a send stream, or onto the wire, for the receiver side to process and replay on the target. ZFS receive basically reads those records from the stream, processes them, and applies them on the target.
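To make the shape of that concrete, here is a minimal, self-contained C sketch of the idea: a sender that walks blocks and dumps one record per block onto a stream, and a receiver that reads the records back and replays them. It is illustrative only; the struct below is not the real ZFS DRR record format or API.

    /*
     * Minimal sketch of the send/receive record pipeline (illustrative only;
     * this is NOT the real ZFS DRR record format or API).
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    struct record {             /* stand-in for a DRR_WRITE-style record */
        uint64_t object;        /* which object the block belongs to */
        uint64_t offset;        /* byte offset of the block within the object */
        uint32_t length;        /* payload length */
        unsigned char data[16]; /* block payload (truncated for the sketch) */
    };

    int main(void)
    {
        FILE *stream = tmpfile();    /* stands in for the send stream / pipe */
        if (stream == NULL)
            return 1;

        /* "Send" side: traverse blocks and dump one record per block. */
        for (uint64_t blk = 0; blk < 4; blk++) {
            struct record r = { .object = 7, .offset = blk * 16, .length = 16 };
            memset(r.data, (int)('a' + blk), sizeof (r.data));
            fwrite(&r, sizeof (r), 1, stream);
        }

        /* "Receive" side: read records back and replay them on the target. */
        rewind(stream);
        struct record r;
        while (fread(&r, sizeof (r), 1, stream) == 1) {
            printf("apply object %llu offset %llu len %u\n",
                (unsigned long long)r.object,
                (unsigned long long)r.offset, r.length);
        }
        fclose(stream);
        return 0;
    }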
A: So that's the brief. Now I'll talk about the first optimization. ZFS send, as I said, traverses the block tree from the root of a given snapshot: it walks through the corresponding objects, and for each object it traverses down through the indirect blocks, the non-L0 ones, and finally reaches the L0 blocks. For the L0 blocks it then prepares the corresponding DRR records and dumps them on the send stream. While traversing the indirect blocks, ZFS traversal does prefetching, basically to make the traversal faster.
A: That's a problem, because if you are doing too aggressive a prefetch, it can have a performance impact on other active workloads running on your system. It's also possible, if you are prefetching very early and aggressively, that the blocks you prefetched for the benefit of the send traversal themselves get evicted, so it's not even that beneficial for the traversal itself. So the aggressive prefetch is not really helpful.
A: So now I'll talk about the solution. The optimization is about controlling this aggressive prefetching we have in the traversal. While traversing an indirect block, we know it can have at most 1024 block pointers underneath it, so in place of doing a bulk prefetch of all those 1024 blocks, we do a controlled prefetch in smaller slots. When you are going to traverse the first block, you prefetch the next 32 blocks, starting from the second block; then, when you are midway through traversing those 32 blocks, around the 17th block, you trigger another prefetch of the next 32 blocks, and so on. That way you are doing a controlled prefetch, and the prefetch, which was brought in primarily to make the traversal faster, still remains intact and still gives that benefit.
A: On the other side, the other workloads running in parallel on the system see minimal impact. So that's the optimization. There is a tunable defined for this, a ZFS module parameter for the indirect prefetch limit; it is set to 32 by default, and it can be configured if any workload has a need to change it to a lower or higher value.
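A minimal sketch of this windowed prefetch, assuming a stand-in prefetch function and a local limit variable in place of the actual module parameter; the real logic lives in ZFS's block-tree traversal code (dmu_traverse.c), and this only illustrates the scheme described above.

    /*
     * Sketch of windowed prefetch while traversing one indirect block
     * (illustrative stand-in, not the actual ZFS implementation).
     */
    #include <stdio.h>

    #define NCHILDREN 1024                     /* max block pointers under one indirect */
    static int indirect_prefetch_limit = 32;   /* tunable; 32 by default */

    /* Stand-in for issuing async prefetches for children [start, start+n). */
    static void prefetch_children(int start, int n)
    {
        if (start >= NCHILDREN)
            return;
        if (start + n > NCHILDREN)
            n = NCHILDREN - start;
        printf("prefetch children %d..%d\n", start, start + n - 1);
    }

    int main(void)
    {
        int limit = indirect_prefetch_limit;
        int prefetched = limit;    /* number of children already requested */

        /* Before visiting child 0, request the first window (children 1..32). */
        prefetch_children(1, limit);

        for (int child = 0; child < NCHILDREN; child++) {
            /*
             * Once traversal is roughly halfway through the current window
             * (around the 17th block for a 32-block window), request the next
             * window so reads stay ahead of the traversal without issuing all
             * 1024 prefetches at once.
             */
            if (child == prefetched - limit / 2) {
                prefetch_children(prefetched + 1, limit);
                prefetched += limit;
            }
            /* visit(child): descend into or emit a record for this block pointer */
        }
        return 0;
    }

The midpoint retrigger is the key design point: the prefetch window always stays ahead of the traversal, but the number of outstanding prefetched blocks is bounded by the limit instead of by the full fan-out of the indirect block.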
A: I have already posted this optimization upstream, so it is part of the upstream master. Thanks to the reviewers — Brian, and sorry if I say your name wrongly; I'm new to the ZFS world — thanks a lot for doing the review. This is now part of the upstream code.
A: Yeah, so that was the first optimization. Now I'm going to cover the second one. Before detailing the second optimization, just a brief on the distributed share. On Nutanix file servers we support different types of file shares, and one of those types is the distributed share. A distributed share basically consists of multiple zpools and datasets, and these datasets are scattered across multiple file server VMs.
A: This was primarily built for the home directory use case, and it was later expanded to other enterprise workloads. As you can see in the diagram, the different users' home directories are distributed across different file servers, and they are backed by datasets.
A: So that's a brief on the distributed share. Now I am going to talk about the second optimization, which is about replicating these distributed shares from source to target. As I explained, a distributed share basically consists of multiple datasets which are scattered across different nodes. In this example, I have a distributed share with three datasets: dataset 1, dataset 2, and dataset 3. While replicating this share and its corresponding datasets, it's possible that the change set to be replicated is not the same for each dataset.
A: Like in this example, you have 10 MB to replicate for dataset 1, 100 MB for dataset 2, and around 1 GB for dataset 3. Now, based on the replication throughput you get on the wire — say, in this example, around 10 MB/s — the replication of these datasets would complete on different timelines: the first dataset would complete in around 1 second, the second dataset in around 10 seconds, and the last one in around 100 seconds.
A: So the point is, the replication of these datasets completes on different timelines. And if you are talking about share-level consistency: this distributed share on the source side is made out of these three datasets, so on the target side the consistent view of the share is the point-in-time image of the particular snapshot you are replicating. If the individual receives complete on different timelines, then on the target side you temporarily see an incomplete view of that snapshot. So share-level consistency is temporarily compromised for this use case.
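Putting numbers on that example: a quick back-of-the-envelope calculation, assuming each dataset is replicated as its own stream at roughly the same 10 MB/s.

    /* Back-of-the-envelope timing for the example above (10 MB/s on the wire). */
    #include <stdio.h>

    int main(void)
    {
        const char *ds[] = { "dataset1", "dataset2", "dataset3" };
        double size_mb[] = { 10.0, 100.0, 1000.0 };   /* change set per dataset */
        double throughput = 10.0;                     /* MB/s per stream */
        double first = 1e9, last = 0.0;

        for (int i = 0; i < 3; i++) {
            double t = size_mb[i] / throughput;       /* seconds to replicate */
            if (t < first) first = t;
            if (t > last)  last = t;
            printf("%s: %6.1f MB -> finishes at ~%5.1f s\n", ds[i], size_mb[i], t);
        }
        /*
         * Until the slowest dataset lands, the share's view on the target mixes
         * old and new snapshots: roughly 99 seconds of share-level
         * inconsistency in this example.
         */
        printf("window of inconsistency: ~%.1f s\n", last - first);
        return 0;
    }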
A: So now I'm going to talk about this optimization. Before I jump into the details, I will just cover a brief overview of ZFS receive, which will help in understanding the optimization. In its begin part, ZFS receive creates a temporary clone — from the base snapshot of the existing dataset for an incremental receive — and for a full receive it creates a new dataset.
A: It then receives onto that temp clone or the newly created dataset. The receiving part is basically reading from the stream, processing those records, and applying them on the target. Once that processing is done and the whole change set is available on the target, in the temp clone or the newly created dataset, then the last part — the end part of the receive — switches in that temporary clone.
A: It makes the received snapshot's changes available on the live dataset, or, in the case of a newly created dataset, it marks the dataset consistent and makes it available. So the point is, with respect to receiving and applying on the live side: as soon as the receive completes, your changes become available on the live dataset.
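As a rough sketch of that three-phase flow (stand-in functions only; in the OpenZFS code these phases correspond roughly to dmu_recv_begin, dmu_recv_stream, and dmu_recv_end):

    /*
     * Sketch of the three phases of a ZFS receive (stand-in code; the kernel
     * entry points are roughly dmu_recv_begin / dmu_recv_stream / dmu_recv_end).
     */
    #include <stdio.h>

    static void recv_begin(void)
    {
        /* Create a temporary "%recv" clone of the base snapshot (incremental),
         * or a new, not-yet-visible dataset (full receive). */
        printf("begin: temporary clone created\n");
    }

    static void recv_stream(void)
    {
        /* Read records from the stream and apply them onto the temp clone. */
        printf("stream: change set applied to temp clone\n");
    }

    static void recv_end(void)
    {
        /* Swap the temp clone with the live dataset (or mark the new dataset
         * consistent); only now do the received changes become visible. */
        printf("end: received snapshot is live on the target\n");
    }

    int main(void)
    {
        recv_begin();
        recv_stream();
        recv_end();   /* today this runs immediately after the stream ends */
        return 0;
    }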
A: Primarily, we wanted the receives of the different datasets corresponding to a share to complete on the same timeline, and we wanted the corresponding change sets to become available on the live side in a controlled manner. So, as a solution: as explained, the receive has three parts — begin, then receive stream processing, and then the end part. After the stream receive completes, the whole change set is available on the target, and just the end part is left to do at that stage.
A: In the solution, I am basically proposing to break the receive, and breaking the receive is built on top of an existing feature we have, the resumable receive token (receive_resume_token). We use that functionality and build on top of it: we generate a token at the stage where we were just about to end the receive. The token is generated such that, along with the existing contents, it has additional fields to indicate that this is an activate token — basically, that all the contents have been received on the target.
A: Just the activation part is left to do, so you can activate this snapshot that you have received on the target whenever needed, at a later point in time. There are a few other additional fields in the token as well: because we are going to use this token for activation directly on the target, you may need certain information that comes from the source, so we keep those flags and the necessary info in the token itself, to keep it handy when we go to activate the snapshot later.
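A hypothetical sketch of what such an activate token might carry, loosely modeled on the existing receive_resume_token (which is a packed nvlist); the struct and field names here are illustrative assumptions, not the actual implementation.

    /*
     * Hypothetical sketch of a "receive activate" token, loosely modeled on the
     * existing receive_resume_token. Field names are illustrative only.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    struct recv_activate_token {
        uint64_t to_guid;       /* GUID of the snapshot that was received */
        char     to_name[256];  /* snapshot name coming from the source */
        bool     activate;      /* marks this as an activate (not resume) token */
        bool     raw;           /* flags from the source stream needed to */
        bool     embedok;       /*   finish the receive correctly later on */
    };

    /* Deferred "end" step: everything is already on the temp clone, so
     * activation only has to run the end-of-receive bookkeeping and swap it live. */
    static void recv_activate(const struct recv_activate_token *tok)
    {
        printf("activating %s (guid %llu): temp clone swapped live\n",
            tok->to_name, (unsigned long long)tok->to_guid);
    }

    int main(void)
    {
        struct recv_activate_token tok = {
            .to_guid = 0xdeadbeef, .to_name = "pool/ds@snap1",
            .activate = true, .raw = false, .embedok = false,
        };
        recv_activate(&tok);   /* run whenever the orchestrator decides */
        return 0;
    }

Carrying the source-side flags in the token is what makes the deferred activation self-contained: the target can finish the receive later without having to reach back to the source.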
A: This basically gives a controlled way of doing the activation. With respect to implementation, I have two CLI options defined. One is in line with breaking the receive at the end part: a -p option on the zfs receive side, used along with -s, which is the receive resume token functionality this optimization is built on top of. When you give -p along with -s, you break at the end. The workflow is: the DMU receive begin is done, you receive the stream, and then at the end the receive breaks without activating — without switching the changes onto the live dataset — and it generates the token, which I call a receive activate token rather than a resume receive token, because it is used to resume and activate the snapshot that we have received on the temporary clone.
A: Finally, once you have those activate tokens available, you can use them: not the token itself, but the target itself can be given with the -a option, and internally ZFS fetches the token from the target, prepares the begin record, and goes through just the receive-begin and receive-end workflow — the part that was left over when we gave the -p option.
A: In our use case, as I said, we have multiple datasets underneath the share, and those are scattered across different zpools. We can't activate them within the same transaction group, because they are scattered across zpools, so we needed a controlled way of activation on top of this infrastructure. We have an infrastructure layer which does the controlled activations and, if necessary, also drops them.
A: This builds and functionally works as well, but I still have to clean it up and do more testing before making it available as an OpenZFS master pull request. It basically helps in doing a controlled activation for our distributed share use case, and it could be helpful in generic use cases where shares are created from multiple sets of zpools and datasets.
A: You receive onto the temporary clones, and once the receive for all the corresponding datasets completes, we have the control to activate them and make them available on the live side. That gives a consistent view of the share on the target side for the end consumers. It also gives control to drop things: if anything happens on the wire and you are not able to receive one of the datasets underneath the share, then you can drop the remaining ones and keep it consistent on the target side.
B: Yeah, so I thought that the first optimization was definitely useful, especially on systems with a lower memory count; having access to, sort of, the capacity to limit the depth of that prefetch is a really good idea.
B: I had some questions about the second portion of the talk. As I understand it, the different datasets that are being used as part of your sort of notion of a share are potentially striped across different pools, which is why you can't have them all as, like, one dataset and then just send them as a single unit. And so my question is: why add this functionality to ZFS send, rather than doing something at the orchestration layer, where you receive the datasets as clones and then swap them into the active position once all of your datasets are ready? It should be, I think, the same level of atomicity either way — you need to sync out a transaction group across multiple pools to get them all to be exposed.
A: Yeah, so you could have the workflow driven outside, as you're saying. But while we are receiving, we have the mappings created from source to target, and we already have the existing source shares deployed. So when we are doing a full receive, and then later on incrementals, I built on top of the existing receive framework we have in the kernel — it already has the temporary clone infrastructure — so I'm just building on top of the existing infrastructure.
A: But similarly, you could do the cloning outside, then receive onto that, and then switch over to it. But then, I'm not sure — it could become complicated with respect to the existing setup.
C: Just to answer Paul's question: that's possible, but we had a couple of other underpinnings there. For example, the fsid is something that we construct internally. So there are a lot of other things — for example, the dataset names that we pick are based on the share UUID and things like that — so a lot of other things would have had to be reworked.
C: If we went down the clone-and-promote sort of route — so yeah, it is possible, but the amount of work we would have needed is much, much larger compared to what we have here.
D: Yeah, I think it's an interesting idea, Paul. I don't think it's possible with the functionality in ZFS today, because you could do the receive as a clone like you said, but then, you know, if you have an existing share on the original name, there isn't a primitive that would let you, like —
D: Yeah, but I mean, maybe it would be better to have that be a first-class thing, where it's like: I have a filesystem, it has a clone, neither of them has any snapshots — just swap the contents of them, which is exactly what happens, you know, when the receive completes.
D: Yeah, and the more general thing — it might have so many restrictions that it ends up feeling kind of forced, and you have all these error modes, right, where it's like: oh, you tried to do the promote, you tried to do this swap, but there was a snapshot. Versus with the receive it's like:
D: Well, you already have to decide, you know, whether you are blowing away snapshots after it or not, and then you either get an error or not. But if there are other use cases, I think that would be really interesting.
A: Doing it inside also gives control over not allowing snapshots in between the receives. Because if you are receiving one snapshot and another one is created, then you have to decide which one to keep on top of the base. So doing this internally also gives control to, like, not allow new snapshots on the target while you are receiving.