From YouTube: Improving “zfs diff” performance with reverse-name lookup by Sanjeev Bagewadi & David Chen
Description
From the 2020 OpenZFS Developer Summit
slides: https://drive.google.com/file/d/1t9O_KXa5aXxUwfXG6DvIMIetCOHzdH0Q/view?usp=sharing
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2020
David: Hi everyone, my name is David, and today my coworker Sanjeev Bagewadi and I are going to talk about a zfs diff optimization and also some problems that we fixed.
So what is zfs diff for? If you have two snapshots, you can run zfs diff on those two snapshots and it will show all the changes made between them.
For example, there's a sample output below. It shows that some files were added, some files were modified, and some files were removed or renamed.
We use zfs diff to allow third-party software to do incremental backups: we send this change information to the third-party software, and it copies the modified files for backup. It works pretty well for a small set of changes, but we found a problem: if you have a large number of changed files between snapshots, zfs diff takes a very long time. It's very slow.
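As an illustration of that pipeline, here is a minimal sketch of how a third-party backup tool might consume zfs diff output; the tool itself is hypothetical, but the one-change-per-line format with a leading +, -, M, or R is what zfs diff actually prints.

```c
/*
 * Hypothetical consumer of `zfs diff` output, fed via stdin, e.g.:
 *   zfs diff pool/fs@snap1 pool/fs@snap2 | ./backup-feeder
 * Each line starts with a change type (+ added, - removed, M modified,
 * R renamed) followed by whitespace and the path(s).
 */
#include <stdio.h>
#include <string.h>

int
main(void)
{
	char line[4096];

	while (fgets(line, sizeof (line), stdin) != NULL) {
		line[strcspn(line, "\n")] = '\0';
		const char *path = line + 1;
		while (*path == '\t' || *path == ' ')
			path++;
		switch (line[0]) {
		case '+': printf("copy new file:    %s\n", path); break;
		case 'M': printf("copy modified:    %s\n", path); break;
		case '-': printf("drop from backup: %s\n", path); break;
		case 'R': printf("rename in backup: %s\n", path); break;
		default:  fprintf(stderr, "unrecognized: %s\n", line);
		}
	}
	return (0);
}
```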
For example, we had a case with 1 million changed files in a directory, and it took about an hour. So we tried to dig into where the performance bottleneck is, and we found that it is mainly in the dnode-to-path translation.
That's where most of the time is taken. Another issue we found is that when you have hard-link changes mixed in, it will actually break the diff.
For example, say you have a file called file.6, you create a hard link to it in another directory, and then you remove the new link you just created. If you then run zfs diff, it will show that file.6 was removed. But you didn't actually remove that file; the file is still there, yet zfs diff reports it as removed. That's not the correct result we're expecting.
This issue has been reported on GitHub before; some people may have seen a message like "Unable to determine path or stats for object 107". This is something you might have run into. Next slide, please.
So, scanning the parent directory, it finds that there's an entry with the id 118 and the name of the entry is bob; that's how it figures out that the file name is bob. This operation has to be repeated all the way up to the root directory: you figure out the file name from the parent, then the directory name from the grandparent, and so forth.
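To make the cost concrete, here is a self-contained toy model of that walk; the structs and the sample namespace are invented stand-ins for the real ZAP and dnodes, but the shape of the work is the same: one linear scan of the parent directory per path component.

```c
#include <stdio.h>
#include <string.h>

struct entry { const char *name; int obj; };
struct dir { int obj; int parent; const struct entry *entries; int n; };

static const struct entry root_e[] = { { "home", 2 } };
static const struct entry home_e[] = { { "alice", 3 }, { "bob", 118 } };
static const struct dir dirs[] = {
	{ 1, 1, root_e, 1 },	/* object 1 is the root, its own parent */
	{ 2, 1, home_e, 2 },
};

static const struct dir *
find_dir(int obj)
{
	for (size_t i = 0; i < sizeof (dirs) / sizeof (dirs[0]); i++)
		if (dirs[i].obj == obj)
			return (&dirs[i]);
	return (NULL);
}

/* The expensive part: O(entries in the directory) per path component. */
static const char *
name_of(const struct dir *parent, int obj)
{
	for (int i = 0; i < parent->n; i++)
		if (parent->entries[i].obj == obj)
			return (parent->entries[i].name);
	return (NULL);
}

int
main(void)
{
	int obj = 118, parent = 2;	/* start from the changed file */
	char path[256] = "";

	while (obj != 1) {		/* repeat all the way to the root */
		const struct dir *d = find_dir(parent);
		char tmp[256];

		snprintf(tmp, sizeof (tmp), "/%s%s", name_of(d, obj), path);
		strcpy(path, tmp);
		obj = d->obj;
		parent = d->parent;
	}
	printf("%s\n", path);		/* prints /home/bob */
	return (0);
}
```

With a million changed files under a handful of deep directories, those scans repeat for every file at every level, which is exactly the cost described next.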
Next slide, please. Because of the linear search, each path component costs a big scan, multiplied by however many levels deep your file is, so it adds up a lot. And if a lot of the changed files are inside common directories, you're also repeatedly doing the translation for those common directories; basically you're doing the same thing multiple times. So if we can cache the results for the common directories, that will also speed things up significantly.
Also, with the linear search, when you search through a large directory you pull a lot of blocks into the ARC, so you're causing unwanted churn. Basically, with a large number of changed files, these problems significantly increase the total runtime of zfs diff. Next slide, please.
The second problem is the hard-link problem. Because we only store one parent id inside the dnode, consider what happens when we create a hard link. What the current code does: if you look at the graph, we have a file that's in directory id 5.
If you create a hard link to it in directory id 6, you now have two directory entries pointing to the same file. What ZFS's current code does is always change the parent id stored in the dnode to the new parent,
that is, to the directory where the new hard link was created. So now, when you remove that hard link in directory 6, you'll find that because the parent id stored in the dnode is still 6, it basically can't figure out the real parent anymore.
So that's why it broke the path translation. That's the issue with hard links. Next slide, please.
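A toy illustration of that failure mode, with an invented single-parent struct standing in for the real dnode layout: link() overwrites the stored parent, and after the new link is removed the stored parent no longer contains an entry for the file.

```c
#include <stdio.h>

/* Toy dnode: like ZFS, it remembers only ONE parent directory id. */
struct dnode { int obj; int parent; int nlink; };

int
main(void)
{
	struct dnode file = { .obj = 7, .parent = 5, .nlink = 1 };

	/* link() into directory 6: the stored parent is overwritten. */
	file.parent = 6;
	file.nlink++;
	printf("after link:   parent=%d nlink=%d\n", file.parent, file.nlink);

	/*
	 * unlink() of the NEW name in directory 6: the link count drops,
	 * but the stored parent still says 6, which no longer has an
	 * entry for object 7.
	 */
	file.nlink--;
	printf("after unlink: parent=%d nlink=%d\n", file.parent, file.nlink);
	printf("reverse lookup scans directory %d, finds nothing ->\n"
	    "zfs diff wrongly reports the file as removed\n", file.parent);
	return (0);
}
```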
For the rest, I'll hand over to Sanjeev.
Sanjeev: Thanks, David. As David explained, we have this issue of sequential search. One key thing I'd like to add to what David said is that the ZAP entries themselves are indexed using the name of the file or directory: the name is the key, and the value is the inode number. So although it's an indexed search, it's indexed on the name, not on the inode or dnode that we are looking for. That's where reverse-name lookup becomes a problem, because you're doing a sequential search.
The way we tried to address this: when you add an entry to the ZAP, we actually compute the hash, so we already know the hash value when we're adding it. What we thought was: pick up that hash and add it to the dnode.
The hash itself is 64 bits, so we could easily add it to the dnode, and we felt that the system attributes lend themselves pretty neatly to this usage. So we added a new system attribute, the link-name hash.
On a rename you would need to update the link-name hash, because the hash value itself could change; in addition, the parent directory could also change. But that's all right; the cost of that is not much, because you would anyway need to update the mtime and ctime of the inode.
So that was one change we made, and it addressed the problem of sequential search: it now reduces to a constant-time search, because once you have the hash value, you can directly do a lookup in the ZAP. You have the parent id and you have the hash, so it's a constant-time cost to look up into the ZAP. That was one solution.
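A minimal sketch of the idea; the bucketed table and the FNV-1a hash are arbitrary stand-ins for the real ZAP and its internal hash function, but they show the payoff: once the 64-bit name hash is stored with the file, the reverse lookup becomes a direct probe instead of a scan.

```c
#include <stdio.h>
#include <stdint.h>

/* Stand-in 64-bit name hash (FNV-1a); ZFS uses its own ZAP hash. */
static uint64_t
name_hash(const char *s)
{
	uint64_t h = 14695981039346656037ULL;

	for (; *s != '\0'; s++) {
		h ^= (uint8_t)*s;
		h *= 1099511628211ULL;
	}
	return (h);
}

#define	NBUCKETS	16
struct zent { const char *name; int obj; uint64_t hash; };
static struct zent buckets[NBUCKETS];	/* toy one-entry buckets */

static void
zap_add(const char *name, int obj)
{
	uint64_t h = name_hash(name);

	buckets[h % NBUCKETS] = (struct zent){ name, obj, h };
}

int
main(void)
{
	zap_add("alice", 3);
	zap_add("bob", 118);

	/* Object 118's dnode now carries: parent id + link-name hash. */
	uint64_t stored = name_hash("bob");

	/* Reverse lookup: jump straight to the bucket, no scan. */
	struct zent *e = &buckets[stored % NBUCKETS];
	if (e->hash == stored && e->obj == 118)
		printf("object 118 is named \"%s\"\n", e->name);
	return (0);
}
```

On rename, the stored hash is simply recomputed for the new name, which is the cheap update mentioned above.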
Second, here is a sample that we have from the zdb output. We have object 128, whose parent is 2, and we do have the link-name hash stored for it. What you could now do is, in the ZAP of dnode 2, quickly look for this hash, and the code itself lends itself to this pretty neatly:
you can pass in a hint. So we added that part, and we can now do a constant-time lookup. For the hard links it's a slightly longer solution, because what we technically need is a way to store multiple pairs of parent dnode and link-name hash. The storage needed depends on how many hard links are created, so you'd need a larger space. Initially we explored the option of storing it in the dnode itself,
but in our case we're using the standard 512-byte dnode, and there's not enough space there to store it. So what we decided to do was add a new ZAP object, and this is allocated only if the file is hard-linked, that is, if its link count is greater than one. Only in such cases do you have this links-ZAP dnode allocated, and in that ZAP you keep pairs of parent dnode and link-name hash.
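Again a toy sketch rather than the on-disk format: a per-file side table, allocated only when the link count exceeds one, holding one (parent object, link-name hash) pair per link so that every directory entry for the file stays findable in constant time. The hash values here are placeholders.

```c
#include <stdio.h>
#include <stdint.h>

/* One record per hard link: which directory, and the name hash there. */
struct link_rec { int parent_obj; uint64_t name_hash; };

/*
 * Toy "links ZAP" for one file; the real object is allocated lazily,
 * only once the file's link count exceeds one.
 */
struct links_zap { int nlinks; struct link_rec recs[8]; };

static void
add_link(struct links_zap *lz, int parent, uint64_t hash)
{
	lz->recs[lz->nlinks++] = (struct link_rec){ parent, hash };
}

int
main(void)
{
	struct links_zap lz = { 0 };

	add_link(&lz, 256, 0xdeadbeefULL);	/* original entry */
	add_link(&lz, 259, 0xfeedfaceULL);	/* second hard link */

	/*
	 * Enumerate every directory entry referring to this file; each
	 * (parent, hash) pair allows a direct probe, never a scan.
	 */
	for (int i = 0; i < lz.nlinks; i++)
		printf("link in dir %d, name hash 0x%llx\n",
		    lz.recs[i].parent_obj,
		    (unsigned long long)lz.recs[i].name_hash);
	return (0);
}
```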
So it's essentially a list of those, and we'll quickly see what that looks like. Once you do this, it's easy, because you have the entire list of parent directories that are pointing to this file, along with the corresponding link-name hash values.
You can easily list out the various directory entries that are pointing to this particular dnode. zfs diff itself right now lists the first entry that's found, but you can play around with it; we actually added a utility which can list the various entries. We'll get to that in a bit. If you see here, we have an example where object 5 is actually hard-linked and has multiple links on it.
In this case there are two links pointing to it. It has a parent id, 256, that's pointing to it, and we have a links ZAP, object 384; in the links ZAP there's the second entry, 259, which is also pointing to it. The other output you see here is the links-ZAP output.
It lists both the one that's present in the dnode itself, 256, and the second one, 259, so you can list all the hard links that are referring to this dnode. Those are the two changes we made with respect to the on-disk part. Moving on: we did some additional optimizations there, along the lines mentioned earlier.
The top-level directories are common to many paths as you get closer to the root of the dataset, and those would be dereferenced a large number of times, depending on how many entries have changed. For example, if you look at the change below: if there were 200 changed files, we would be dereferencing directories a and b multiple times, and c itself would be dereferenced about 100 times.
So the benefit of caching c is high, the benefit of caching b is even higher, and the benefit of caching a is really high, because it's at the top of the tree; the benefits go up as you move toward the root. So what we did was build a cache in userland. Just a note to remember here: we are not using the standard zfs diff.
Instead, we hooked into libzfs, and that's where we're using this, but the same can be applied to the standard libzfs as well; we could probably build a cache there. The other change we made, in order to utilize this, was to modify the ioctl that does the dnode-to-path translation: we pass in a flag saying don't translate the entire path, just fetch us the name, and that's sufficient.
Once you fetch just the name (and by the way, because of the earlier link-name-hash and links-ZAP changes, the fetch is quick, it's constant time), if you know it's a directory, you stash it into a cache. For files you don't need to, because files will just be looked up once, but directories you stash into the cache, and the next time around, if you want to walk up the path, you can just take it from the cache and use it. So that's the additional change.
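A sketch of that userland cache under the same toy assumptions as the earlier snippets; fetch_name() is an invented stand-in for the modified ioctl, and the point is that each common ancestor directory costs one kernel trip while every later path walk is pure memory lookups.

```c
#include <stdio.h>
#include <string.h>

struct cached_dir { int obj; int parent; char name[64]; };
static struct cached_dir cache[256];
static int ncached;
static int ioctl_calls;			/* counts simulated kernel trips */

/*
 * Invented stand-in for the modified ioctl: fetch just the name and
 * parent of one object (constant time thanks to the link-name hash).
 */
static void
fetch_name(int obj, int *parent, char *name)
{
	ioctl_calls++;
	switch (obj) {			/* canned toy namespace: /a/b/c */
	case 2: *parent = 1; strcpy(name, "a"); break;
	case 3: *parent = 2; strcpy(name, "b"); break;
	case 4: *parent = 3; strcpy(name, "c"); break;
	}
}

static struct cached_dir *
lookup(int obj)
{
	for (int i = 0; i < ncached; i++)
		if (cache[i].obj == obj)
			return (&cache[i]);
	return (NULL);
}

/* Build a directory's path, consulting the cache before the "ioctl". */
static void
dir_path(int obj, char *out)
{
	struct cached_dir *c;
	char prefix[512];

	if (obj == 1) {			/* object 1: dataset root */
		out[0] = '\0';
		return;
	}
	if ((c = lookup(obj)) == NULL) {
		c = &cache[ncached++];
		c->obj = obj;
		fetch_name(obj, &c->parent, c->name);
	}
	dir_path(c->parent, prefix);
	sprintf(out, "%s/%s", prefix, c->name);
}

int
main(void)
{
	char path[512];

	for (int i = 0; i < 100; i++)	/* 100 changed files under /a/b/c */
		dir_path(4, path);
	printf("%s resolved 100 times with %d ioctls\n", path, ioctl_calls);
	return (0);
}
```

Only directories are cached, matching the talk: a changed file's own name is needed once, but its ancestors are needed for every changed file beneath them.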
So what did this get us? Or rather, before we get to that: the summary of changes is shown here. But what did it really get us?
The initial implementation was in Python, which was fairly simple to get going, and it reduced the time from 60 minutes to about 10 or 15 minutes. Then we started hitting Python issues, so we stripped that away and switched to C++, and it has come down to about 15 to 20 seconds.
Even if we call it 30 seconds, going from 60 minutes to 30 seconds is a huge reduction. This has shipped right now: it's part of the backup APIs that we provide to HYCU, NetBackup, and other vendors, and it's running on customer machines, so it's fairly well tested.
What effort would it take to move this into the community? Of the changes we've made, there are two system attributes that we added, which I think are easily portable to the community code.
There's an extra dnode that we consume, but again only if the file is hard-linked, so that's the additional cost we pay. Then there's some change in the ZIL, where we had to handle the additional properties that we're passing in. So that's the summary of changes; I think we can open up for questions.