From YouTube: August 2023 OpenZFS Leadership Meeting
Description
Agenda: OpenZFS Conference; ZED; RAIDZ Expansion; Fast Dedup; etc
full notes: https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM/edit#
A: All right, I guess it's time we can get started. The first reminder is that the OpenZFS Developer Summit is coming up soon. That'll be October 16th and 17th in San Francisco. The call for papers is open now; Matt would like everybody to get their abstracts in to him before September 5th, so that we can build the schedule for the conference. Okay, right now the agenda looks like it's all my stuff, but I hope other people have stuff to add.
A: The first item we have is some work we're doing on ZED, for Linux specifically, to extend the work we did a couple of months ago where you can configure ZED using per-vdev properties: how many I/O or checksum errors within a set amount of time trigger the auto-spare of the device. We're now extending that so that you can do the same thing for slow I/Os.
A: So if a particular disk has too many slow I/Os in the set amount of time configured with the vdev property, then ZFS will automatically replace it with a spare. I think we've all run into the situation before where one disk was dying, but not quite dead, and was dragging the performance of the entire pool down. So if ZED can detect that and automatically swap in a spare, if you have one, then that will get the performance of the pool back easily. Alan?
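A rough sketch of the N-events-in-T-seconds window this implies, in Python. The `SLOW_IO_N`/`SLOW_IO_T` names here are assumptions modeled on the existing `io_n`/`io_t` and `checksum_n`/`checksum_t` per-vdev properties, not confirmed from the meeting:

```python
from collections import deque
import time

SLOW_IO_N = 10   # hypothetical: fault after this many slow I/Os...
SLOW_IO_T = 30   # ...observed within this many seconds

class SlowIoWatcher:
    """Sliding window of slow-I/O timestamps for one vdev."""

    def __init__(self, n=SLOW_IO_N, t=SLOW_IO_T):
        self.n, self.t = n, t
        self.events = deque()

    def record_slow_io(self, now=None):
        now = time.monotonic() if now is None else now
        self.events.append(now)
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0] > self.t:
            self.events.popleft()
        # True means: degrade the vdev and kick in a hot spare.
        return len(self.events) >= self.n

w = SlowIoWatcher()
assert not w.record_slow_io(now=0.0)   # one slow I/O is not enough
```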
A: Spares? That's a good question.
A: Anybody else see any gotchas with that, other than dRAID?
C: I'm thinking it would be nice to have the opposite protection if, for example, something affects not a single disk but the whole SAS fabric; that happens quite regularly and makes problems for many disks at the same time. It would be good to not activate spares in that case, because an attempt to rebuild while we are not sane in general would just make things worse. I'm not exactly sure how to formalize it, but it would definitely be good, so that we would not try to go beyond the pool redundancy and beyond pool stability.
A: And then, in a related one about ZED, we're also looking at improving the logic around when you pull a disk out of the enclosure and then put a replacement disk in the same slot. On Linux, depending on how you had the device named, say if it was named by the WWN or anything like that, ZED says this replacement disk isn't the same and doesn't do anything. So we're looking at setting it to look at the enclosure path.
A: That is, the enclosure path property, if it's not null: if the new disk is in the same enclosure in the same slot, then consider that a replacement for the disk that's missing and start the replace automatically, rather than having it be triggered by a person.
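A sketch of that decision logic, assuming hypothetical record layouts standing in for what ZED would see (the field names and paths here are invented for illustration):

```python
def should_autoreplace(missing_vdev: dict, new_disk: dict) -> bool:
    """Match a newly inserted disk to a missing vdev by physical
    location rather than device identity, per the logic above."""
    enc_path = missing_vdev.get("enclosure_path")  # e.g. a sysfs SES slot
    if not enc_path:
        return False   # no enclosure info recorded; stay hands-off
    # Names like the WWN will always differ for a brand-new disk,
    # so compare the enclosure/slot path instead.
    return new_disk.get("enclosure_path") == enc_path

missing = {"guid": 0x1234,
           "enclosure_path": "/sys/class/enclosure/0:0:6:0/Slot 3"}
arrived = {"wwn": "0x5000c500aaaa0001",
           "enclosure_path": "/sys/class/enclosure/0:0:6:0/Slot 3"}
assert should_autoreplace(missing, arrived)
```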
C: It would be good to have the logic be the same, I think. On the other side, I'm just slightly worried about extra replacements. We never used that logic, and it was just normally blocked by the fact that we are partitioning disks ourselves, and FreeBSD ZFS never partitions disks, so it effectively protected us from unwanted activity. I'm just thinking how it would be controlled on Linux; I don't want things to be replaced arbitrarily.
C: It reminds me, at some point there was a question whether the path on FreeBSD and Linux is actually the same. Originally they came from illumos, and some people complained that the Solaris format is pretty weird, and there was a wish to improve it. I don't remember where it ended up; on FreeBSD I had a feeling somebody was working on changing that format.
C: A readable string, yeah. There was some activity recently to change it on FreeBSD, to make it more reasonable and readable, but it would break compatibility with previous pools. From that perspective, that makes me wonder what we are going to do about those paths. Do we already have them different between different platforms? Because originally FreeBSD mimicked Solaris/illumos, so there is compatibility on that front, but I never knew how that looks.
C: Because if you take any enclosure with SES, you could get that path, right? If you have AHCI, FreeBSD emulates a virtual enclosure for AHCI ports, yeah, so in that case any modern system should have a proper physical path; you should find it without problems.
B: Sure, yeah. I'm looking for reviews still, so if anybody out there wants to contribute and help get this across the finish line, that would be helpful. There are a few ZFS Test Suite asserts that I'm still chasing down with Fedor, but they're kind of rare, so I'm not as concerned about those. And then I had a question about the FreeBSD side of things: when I do a commit, it doesn't look like any of the test bots run against FreeBSD.
A: I think that was part of the problem, because I noticed this on, I forget which, recent pull request; it didn't end up getting run against FreeBSD either.
B: Yeah, because I did a recent change which is FreeBSD-specific, so I really want to see a bot run the test suite. I added detection for the BTX bootloader on MBR.
B: There's a header there, so it's easy to detect, but I want to make sure I have a ZFS Test Suite test for that, to make sure that you get the negative test on that. And then the other question, about code coverage: we don't do code coverage anymore on Linux. I know that FreeBSD has code coverage capabilities, so I don't know if there's a way to opt in on the FreeBSD side to do code coverage. It would be nice to get code coverage numbers, but...
F: For the buildbot stuff, you may check with Tony Hutter; he's been doing some work on that infrastructure, so maybe he accidentally broke something.
C: Oh yeah, recently we had FreeBSD 12 and 13 there, to test at least two versions, but I haven't checked lately. Speaking about RAIDZ expansion, I wonder about timelines, like whether it's approaching; it's been implemented. I guess it's already too late for 2.2, and it probably won't be merged since it really introduces new feature flags. It's just that my sales team asks me regularly: when, when, when will we see it, yeah.
F: That seems likely. Brian, who isn't here, could probably comment more definitively, but I would probably recommend that we not try to stick it in at the last minute into 2.2. But I will take a look at the code review, probably later this week.
B: Okay. I noticed Pablo's on the call; I don't know if you would be available to look at it, or look at portions of it.
B: Okay. It's fairly straightforward, actually, and if you watch one of Matt's talks, it sort of outlines the high-level approach.
A: So I have an update on the fast dedup work we've been doing; maybe we'll spend a few minutes on that. The kind of results we have so far are actually pretty interesting, so I'll just quickly share that.
A: So this is all just normal dedup in master. Notice that with block shift equal to 12, you can see the blue line there: as we wrote more and more data to the pool with dedup, the amount of inflation, the amount of writes we had to the dedicated vdev, went up as we got more objects. But obviously you see that with the higher block size of 32K or 64K, we see much higher inflation, almost double.
A: So that change, while making the DDT itself quite a bit smaller and more compressible, actually has the effect of making the write inflation, especially with small blocks and dedup, much worse. But that also made it easier to compare with log dedup and see what the mitigations for that were. The other main difference that explains some of the reduction in amplification you see with the work we're doing is the change to the actual on-disk structure of the DDT. It uses a new set of ZAPs with a feature flag, and will only activate if you didn't have traditional dedup on. In traditional dedup, each entry in the DDT had four ddt_phys_t's, so it could store the list of DVAs for a normal single-copy dedup, for copies=2, and for copies=3, and then it had the deprecated slot for the ditto block.
A: So that means it actually had room for 12 full DVAs, plus refcounts and birth times, making each entry quite a bit bigger and mostly full of zeros; that's why each entry is separately compressed with ZLE before it's actually put into the DDT. We've replaced this with some new code that has just the one set of three DVAs, but has the ability to update the existing dedup entry.
A: If the new block coming in has a higher number of copies, then we'll allocate just the one extra DVA and update the entry to have the existing single DVA and the new second DVA. I don't think people really change copies= around that much for it to be a big impact, and if you ask for two copies at some point, then you asked for this and that's just how it's going to be. But this way we save three quarters of the biggest part of each DDT entry.
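To make the "three quarters" concrete, a back-of-the-envelope sketch; the field sizes are simplified approximations of the structures described here, not the actual OpenZFS definitions:

```python
DVA_SIZE = 16     # a DVA is two 64-bit words
REFCOUNT = 8
BIRTH_TIME = 8

def phys_size(ndvas: int) -> int:
    """One ddt_phys_t-like slot: DVAs plus refcount plus birth time."""
    return ndvas * DVA_SIZE + REFCOUNT + BIRTH_TIME

# Traditional dedup: four slots (single-copy, copies=2, copies=3, ditto),
# each with room for three DVAs -- 12 full DVAs, mostly zeros.
old_entry = 4 * phys_size(3)

# Fast dedup: a single slot with up to three DVAs, grown on demand.
new_entry = phys_size(3)

print(old_entry, new_entry, 1 - new_entry / old_entry)
# 256 64 0.75  -> three quarters of the biggest part of each entry saved
```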
A: And then we implemented the first version of the dedup log. As new changes come in, we write them to an append-only log and maintain the list of what those changes are in a separate, in-memory AVL tree. Then, once that log reaches some criteria (we're expecting a maximum size or age), it will condense the log: basically truncate it after writing out all of those changes to the ZAP. The idea is that this would hopefully amortize the cost of using a larger leaf and indirect block size, like we saw in the first slide there.
A: Right now the prototype isn't that smart; all it does is condense the log every N transaction groups, but it was enough to be able to show what the performance impact of that is.
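A minimal model of that flow, with plain Python containers standing in for the AVL tree, the append-only log, and the on-disk ZAP; the interfaces are assumptions for illustration:

```python
class DedupLog:
    """Changes accumulate cheaply in a sequential log plus an in-memory
    map, then get folded into the (expensive-to-update) ZAP every N txgs."""

    def __init__(self, condense_every_n_txgs=32):
        self.pending = {}   # stand-in for the AVL tree, keyed by hash
        self.log = []       # stand-in for the on-disk append-only log
        self.n = condense_every_n_txgs

    def record_change(self, block_hash, entry):
        self.log.append((block_hash, entry))  # sequential, cheap write
        self.pending[block_hash] = entry      # lookups never touch disk

    def lookup(self, block_hash, zap):
        # Anything still in the log is guaranteed to be in memory.
        if block_hash in self.pending:
            return self.pending[block_hash]
        return zap.get(block_hash)            # only here do we read the ZAP

    def sync(self, txg, zap):
        if txg % self.n != 0 or not self.pending:
            return
        # Condense: fold all batched changes into the ZAP at once, so a
        # leaf block rewritten here carries many updates, then truncate.
        for h in sorted(self.pending):        # hash order groups leaf blocks
            zap[h] = self.pending[h]
        self.pending.clear()
        self.log.clear()
```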
A: To test this, we ran fio on a dataset with a record size of 8K, chosen mostly to allow us to get a large number of dedup entries without having to write a huge amount of data, but with better performance than we got when we tried to do it with 4K, so we get better throughput. All the blocks are completely randomized, so they fill up the unique DDT; there's no actual dedup happening here. This is just the worst case of 100% unique data being written. So we make a new dataset and write out four 2-gigabyte files to it, using a couple of threads to write basically eight gigabytes of data, which creates about 1 million DDT entries. Then we stop and measure how much data has been written to the dedicated vdev in the pool, export and import to reset that to zero, create another dataset, and write eight more gigabytes. We did this a number of iterations to show how it affects performance over time as the dedup table gets bigger and bigger.
A: So here we see in blue and red the first two lines: those are the existing traditional dedup code, red being the previous case where each leaf is 4K, and blue being where each leaf is 32K, and you see, like in the first graph, that big difference in the amount of write inflation. Then at the bottom, the green line is the fast dedup code, where it's writing into a log and then updating the ZAP; both the log and the ZAP live on the dedicated dedup device.
A: So all the log writes are included in these numbers. Just batching up 32 transaction groups' worth of writes and updating the ZAPs only once every 32 transaction groups, you can see, greatly reduces the amount of write amplification. Doing it every 256 transaction groups, you can see, has this kind of spiky effect: every time we do it, there is a bunch of extra data to be written out, and something smoother might make more sense.
A: This is a small one, yeah.
F: I don't understand why we see it, though. I know you have more work in mind, but I would think that, even aggregating 32 transaction groups at a time, given that you're doing random writes, you would still have only one update to every leaf block of the ZAP, right? Because it's fully randomized in terms of where it's being updated, unless the DDT is very small.
F: So do you think that what's happening is that the DDT is so small that you're actually fully rewriting the DDT, and you're getting multiple entry updates in each leaf block of the DDT each time? Or is it one entry updated per DDT leaf block, and it's only in the indirect blocks that we're seeing a benefit here?
A: We're definitely seeing the benefit in the leaf blocks as well, because there aren't that many of them in the end when you're at the smaller size. When we looked with zdb at the DDT ZAP at the end, after the 10 million, there was a decent skew of birth times, where not every block in the ZAP had to be updated each time, yeah.
F: I guess maybe it would be interesting to see how this scales. 80 gigabytes is probably not that big compared to the challenging use cases, so it'd be interesting to see this go up to, like, terabytes, for sure, and see if there are some discontinuities in those lines.
A: Yeah, this was just the first cut, to tell if our prototype was doing what we thought it was doing. Because, yeah, we've got some data from a couple of real-world dedup customers that have, you know, 320 million objects in their dedup table, and at that scale things are going to be quite a bit different.
A: One of the things this work is hoping to do, by using the sharding of the ZAPs, is to further concentrate updates and make them slightly less random. So part of that is sharding on the physical block size, doing that in 4K increments or something, because even just doing every 4K increment from 0 to the 16-megabyte max record size would be, you know, 4,000 shards. But it means that if you have a block that's 1.4 megabytes, then you only have to search the ZAP of blocks of that size to find if the hash is already there. And also, if you're concentrating your updates, then when you're flushing one of these logs that contains only the updates to 16K blocks, you're writing a much smaller ZAP, and hopefully it's less random; you get more concentration, so that, again, you get more of this amortization effect.
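A sketch of the deterministic shard choice being described; the 4K granularity comes from the discussion, while the helper name and exact rounding are assumptions:

```python
SHARD_GRANULARITY = 4096            # 4K increments, as discussed
MAX_RECORD_SIZE = 16 * 1024 * 1024  # 16M max record size

def ddt_shard(physical_size: int) -> int:
    """Pick which per-size ZAP shard a DDT entry lives in.

    Because the shard is a deterministic function of the block's
    physical size, a lookup or log flush only ever touches one
    (much smaller) ZAP instead of one giant fully-random one.
    """
    assert 0 < physical_size <= MAX_RECORD_SIZE
    return (physical_size - 1) // SHARD_GRANULARITY

# A 1.4 MB block only needs its own shard searched:
print(ddt_shard(int(1.4 * 1024 * 1024)))   # -> shard 358 of 4096
```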
F: I think the goal here is, when you're flushing out the changes, to have multiple updates go into each of these blocks, right? Yeah. So I think the optimal way to do that would be to have all of your changes sorted by hash value, keep the most that you can, and then write them out into the DDT by hash value. That's great.
F: And since you can do this over multiple TXGs, you can kind of do it continuously. So you're like: okay, my memory limit is, I can use a gigabyte of RAM for these pending changes, and then, once I get close to that, I just start eating through it, writing out by hash value and keeping a cursor of where I am along the way.
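A sketch of that cursor idea under the same assumptions as above (plain Python containers standing in for the in-memory tree and the DDT ZAP):

```python
import bisect

class CursorFlush:
    """Keep pending DDT changes sorted by hash, and each txg walk a
    cursor forward through hash space, writing out one slice.
    Neighboring hashes share DDT leaf blocks, so each rewritten
    block absorbs several updates."""

    def __init__(self):
        self.pending = {}   # hash -> entry (the in-memory tree)
        self.cursor = 0     # last hash value flushed

    def flush_some(self, zap, budget=1000):
        keys = sorted(self.pending)
        # Resume just past the cursor, wrapping around hash space.
        start = bisect.bisect_right(keys, self.cursor)
        batch = (keys[start:] + keys[:start])[:budget]
        for h in batch:
            zap[h] = self.pending.pop(h)
        if batch:
            self.cursor = batch[-1]

zap = {}
cf = CursorFlush()
cf.pending = {0x30: "e1", 0x10: "e2", 0x20: "e3"}
cf.flush_some(zap, budget=2)   # writes 0x10 and 0x20; cursor is now 0x20
```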
A: Exactly, and that's kind of what I'm saying: kind of like we do with scrub, we want to spend, you know, this much of each transaction group flushing these out, or whatever, yeah.
F: I think if you do that, that's going to give you kind of the optimal behavior, and you don't really need any additional sharding on top of that. Because, you know, you mentioned that we separated it out so we don't have to search; but since it's a hash table, you're not searching at all, right?
A: So yeah, we'll look at that. This is basically the same data, but looking at the write throughput, and we see that, as the ZAP gets bigger, that does have more and more effect. Part of this is that we were purposely exporting and importing the pool between each of these runs, so we're starting with an empty ARC each time. Also, this was a relatively small VM whose ARC was not going to be able to cache the whole DDT anyway, but in seeing the effects of the different approaches, we still see quite a big improvement. I think, like Matt said, some of the ways we can amortize most of these costs can end up making a big difference and allow us to take advantage of being able to have larger leaf blocks, so that it's less fragmented and we get better prefetch and so on for dedup as well.
A: The other thing we noticed is that right now we store all the DDT ZAPs as copies=3, which makes sense, especially for the duplicates ZAP: you could never free a block ever again if it was damaged, without knowing whether you could decrement it properly and so on. But the uniques ZAP is different. With the prune concept, we'll be able to go through the unique ZAP and delete some entries to make room, to keep the size of the DDT from getting too big, especially to make it fit in memory, or at least constrain it to the size of your dedicated vdev.
A: So with that, it means that if we happen to damage the unique ZAP table, it wouldn't be catastrophic to the pool. If we lost the whole ZAP, it would mean that new writes wouldn't have a good chance of deduping against the old blocks, but the pool wouldn't lose any data or anything.
A: Because, yeah, just looking at real-world use cases we've seen: even with customers getting dedup ratios of like 3.5 to 1, they still had like 240 of their 320 million records unique, and that's a lot of stuff to store three times. It just takes space, but also, every time you're writing one, you're writing it out three times.
F: When you're pruning entries, or presumably could lose them here: have you changed the way that scrub works so that it can still scrub these blocks that have the dedup bit set? Because normally, when scrub is traversing the block pointers, it ignores ones that have the dedup bit set, and it scrubs them by finding them in the dedup table.
F: So I think that even to handle the freeing, or evicting things from the unique ZAP table, you need to change how scrub works: presumably have scrub actually look at the blocks with the dedup bit set, and then go look them up in the dedup table.
A: Yeah, because that would also have a big effect on the dedicated vdev: you can fit more DDT if you're not writing three copies of all the blocks to your dedicated vdev. And three copies on the same vdev is not as helpful as the way copies normally works for regular blocks, where it would try to spread those copies across different vdevs; when you're using an allocation class, it's going to put all three copies on the special vdev.
G: Question: I like the idea of not doing copies=3 for the unique ZAP, but also, the biggest problem from my tests was how much we have to read during transaction group sync, because the hash is random.
G: So we have to read all the indirect blocks. I was actually wondering if it wouldn't be better to have a single DDT, one large one, and just read once, instead of looking into two different ones for the entry, because that multiplies the number of reads, and reads are, of course, synchronous during transaction group sync, so that slows down the transaction group.
G: Because to look up each entry in a large DDT we need like four or five reads per entry, right? And we are doing this during transaction group sync. We cannot prefetch; we can only prefetch for frees, because for a free we do have the hash, but for writes we cannot do prefetching because we don't have the hash yet. So during transaction group sync, if we have two ZAPs, we will have to do the reading twice, to try to find the entry in each of them, yeah.
F: I mean, you might consider maybe reducing copies to two, just for all DDTs, because I feel like it's not super necessary, given how people configure their pools. The copies stuff was neat when you're thinking about a non-redundant pool and adding a little bit of redundancy for very little cost, but in this case the cost is high, and, you know, probably people are using this with RAIDZ or some type of redundant pool.
F: They care about their data and they've configured it to be redundant enough that they aren't going to lose it. In the single-block case, you'd only worry about this one scenario: I've lost the maximum number of disks, I haven't reconstructed yet, and then I also lost one random block, and that happened to be a dedup block which was important, and now the additional copy really saves me; which is hopefully not that common. So I would think that reducing it to copies=2, just for now, would be fine, and then maybe think about: if you reduced it to copies=1 for everyone, all the time, what would the implications be?
F: Could you not lose data, but just leak stuff, right? Because we're only really worried about losing access to one, or a small number, of blocks of the dedup table. So if we could write the code such that, if you lose one block of the dedup table, then what's going to happen is that whatever is referenced there is leaked forever.
A: Yeah, that conflicts with the concept of dealing with "we couldn't find this hash in the dedup table, so it must have been a copies=1, or a refcount of one, that we purged."
F: You'd have to know. If we can't read this, if we can't determine whether it's in the dedup table or not, then we have to assume that there are still more copies of it and not forget it, right? But if we can read the block that it should be in, and we find it is definitely not in the dedup table, then you can still do the write.
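A sketch of that decision rule under the hypothetical reduced-copies DDT; the interfaces here are invented for illustration, not the actual code paths:

```python
class DDTUnreadable(Exception):
    """Raised when the DDT block that should hold an entry is lost."""

def on_free(block_hash, ddt):
    try:
        entry = ddt.lookup(block_hash)   # may raise DDTUnreadable
    except DDTUnreadable:
        # Can't prove the block is unreferenced: leak it forever
        # rather than free something another reference still needs.
        return "leak"
    if entry is None:
        # Definitely absent: either never deduped, or a refcount-1
        # entry that pruning purged, so the free is safe.
        return "free"
    entry.refcount -= 1
    return "free" if entry.refcount == 0 else "keep"
```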
F: Anyway, I think it would be nice to preserve only needing to look in one ZAP, for performance reasons, because that's like double your performance right there, versus having to look in two ZAPs.
A: Yeah. The only other use case we've done before, for a different customer, because DDT prune didn't exist at the time, was to drop the entire unique ZAP at import with a flag. If you set a flag while importing the pool, we just dropped that entire ZAP, truncated it and made a fresh one, so you'd only have your blocks that were already deduped to dedup more against.
A: If we're going to have real DDT prune, where we can walk through the tree and clean stuff up, then I don't think we need that separation anymore, but we'll get familiar with it and see if we find out why it was originally split up. Because, yeah, the advantage with the sharding is that, because it's deterministic, you only ever have to look in one specific shard, so you're never having to look at more.
A: I think, yeah, especially if we're going to shard, not having to have two of every shard would help on our side as well, and make the code a little bit easier to deal with.
G: Do you have any numbers on how this dedup log impacts reads during transaction group sync?
A: I'm not concerned; there are no reads, yeah. While entries exist in the log on disk, they also exist in the AVL tree in memory, so you would never have to read from disk: if the hash you're looking for is in the log, it will always be in the AVL tree in memory, yeah.
G: But I'm more concerned that, because you have the log, you are not really writing the updates to the DDT every transaction group. So every 32 transaction groups, when you try to sync the AVL tree, you will need a large amount of reads; you will have spikes in the reads.
F: I think if you take the approach that I outlined, where you have some limit for the size of the AVL tree, and then you have a cursor of the last hash value that you wrote out, then every txg you walk forward a little bit, writing some of the entries; and those are the entries that all have the most similar hash values.
F: That way, you're the most likely to have consolidation: multiple entries being modified in the same block of the DDT, right?
G: That's interesting, because we could also limit how much AVL tree we keep by monitoring how many reads we did, and how much we slowed down the transaction group sync.
A: That's why I was thinking, in particular, like scrub, of limiting the flushing of the log to some number of milliseconds per transaction group.
G: All in all, it's great to see DDT, dedup, being worked on, so I think everyone is happy, yeah.
C: Yeah, okay. Just from the experience of replaying the metaslab spacemap logs during pool import, I'm thinking it would be good, if load is low, to somehow flush more of the dedup log; if we could flush it more often, we would not accumulate too much in the log to replay during pool import, yeah. So, just a wish.
A: Yeah, that's definitely one of the things on my radar: making sure, especially for failover and so on, that the import time can't get too out of hand, because that'll not be useful to anyone. And so, yeah, again, that's why I was thinking we'd spend up to this much time, or at least this much time, each transaction group flushing things out.
F: I don't know if you already covered this, but the Developer Summit is coming up soon, and we would love to hear about all the great work that folks are doing. The deadline to submit a talk is in three weeks, September 5th.
C: Yeah, I would like to bring attention once more to my second attempt to refactor ZIL writing and locking. In the previous attempt, which is currently committed and merged to 2.2, George found a deadlock caused by the fact that there is no way the ZIL can sleep or block or anything like that after it has allocated the next block, or it may deadlock in the end.
C: It also, in a way, makes the code cleaner, because I previously tried to avoid those scenarios explicitly, one by one, until I found it's impossible to do it for all of the scenarios.
C: So it's a slightly bigger refactoring and a slightly more complicated state machine, but I believe it's much cleaner in the end, and unlike the previous implementation there are no deadlocks. I'd like this to be, if possible, reviewed and merged sooner, and hopefully get into 2.2. Even so, George so far is the only one who reproduced that deadlock on his workloads; still, I am not very comfortable about that.
C: Compared to the previous one, this also moves a few more operations out of the ZIL issuer lock, so it should be even more scalable than the previous one. It does a couple more atomics, a couple more lock/unlocks, but I haven't seen any more contention, more or less the same, so it looks good to me, much better than the previous one, but it needs a look. George promised to turn around some testing this week.
A: The one other one I had on my list here was: we already have per-user quotas, but there's some interest in per-user reservations, so that you can make sure that, if you have a dataset for a class at a university or whatever, while the students can use up the whole quota for the dataset, the professor has a reservation so that they will always be able to write.
C: It just comes to my mind that somebody also asked for the possibility to do reservations and quotas not on a physical level but on a logical level, because if you are giving somebody a promise that they can write 100 gigabytes, they probably need 100 gigabytes of logical size, not physical.
A: Yeah, I've definitely seen people asking about logical quotas instead of physical quotas, so that if the data happens to compress, that's to the advantage of the owner of the pool, not the user; you can only write 100 gigs of logical data or whatever. But yeah, I was trying to think about the interactions between a user quota, a user reservation, and then maybe a project quota and the dataset quota and reservation.
E: A lot of questions. Not currently, but historically, when I ran a number of multi-user systems, we ended up working around not having that by basically just having a dataset per user for their home directories, and then giving each of those a reservation. So I think there's certainly probably demand for it; I don't know if there's enough demand to warrant the complexity, but...
E: ...so far, on what should be done there, I was curious what other people's thoughts might be, because I'm going to go collect a bunch of data on when this would be useful, but the data I'm going to get will only be as good as how constructive the thoughts are that I go in with.
E: I figured that just starting with that much of an upper bound would be a reasonable start, because that way you don't have to, say, save two megs to compress a 16-meg block. Yes.
A: Oh, that reminds me: I have to clean it up and upstream it, but we've developed compression=slack, which just trims off trailing zeros.
A: When you have record size equal to 16 megs and your file is 20 megabytes, and you have compression off, because, say, the file's already encrypted or whatever, currently you end up wasting: you're writing a whole bunch of zeros at the end of that block, because compression is off. Compression=slack just takes the trailing zeros off the end of the last block, and can really improve the write throughput of files...
A: ...that are not divisible by the record size. Especially if you're writing a bunch of 17-megabyte files to a dataset with record size equal to 16 megs, you end up writing practically 40-something percent zeros to your disk, and it really hurts your throughput.
A: Well, it doesn't even do ZLE; it just sets the physical size to end where the logical data does, and then it doesn't have to inflate the zeros back, because the logical data is only that long, right.
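A byte-level sketch of the idea; the real feature would sector-align the physical size and track the logical size in the block pointer, so this is only the shape of it:

```python
def slack_compress(block: bytes) -> bytes:
    """Trim trailing zeros; the physical size ends where the data does."""
    return block.rstrip(b"\x00")

def slack_decompress(payload: bytes, logical_size: int) -> bytes:
    """Re-inflate by zero-padding back to the logical size (even this
    can be skipped if the consumer only wants the logical bytes)."""
    return payload.ljust(logical_size, b"\x00")

RECORD_SIZE = 16 * 2**20                  # recordsize=16M
tail = b"x" * (1 * 2**20)                 # last 1 MiB of a 17 MiB file
block = tail.ljust(RECORD_SIZE, b"\x00")  # the padded final record

stored = slack_compress(block)
assert len(stored) == 2**20               # ~15 MiB of zeros never hit disk
assert slack_decompress(stored, RECORD_SIZE) == block
```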
G: What about embedded blocks? Because we currently also try to compress embedded blocks.
G: Yeah, yeah, but because of the metadata of the compression algorithm, you can actually store much less, well, much less than you could otherwise.
E: If you're going to do that, you probably also want to use the various options for, like, LZ4 and zstd to skip a bunch of the header data, if you know you don't care about it, because it's a tiny frame anyway. But I don't know how much that would save you in practice, when you're only getting 112 bytes or something.
A: We tried, yeah. With LZ4, even outside of the compression header, we have a ZFS header with the original logical size; that's like the first eight bytes, or four bytes, of the data, and so on. So, yeah, there could be an opportunity to get more data into embedded block pointers, especially if people had, you know, 100-byte files.
A: Basically, I don't remember all the details now, because it was just a couple of months ago when we wrote this, but I think part of it is avoiding some of that inflating to the full size, because why copy a bunch of zeros around in memory if we're not going to write them down? In particular, we're trying to avoid doing that as well when recreating it on the other side.
A: So when we decompress, we don't need to fill memory with a bunch of zeros just to throw them away when we return the data to user space, and it's only going to be the original logical size.
A: Right now it's just a special slack setting, because the customers said: we want compression=off, because we don't want to waste CPU time on LZ4 on encrypted data that's not going to be compressible. But when we switched our record size from 128K to 16M, suddenly we were losing a lot of space, and we were seeing worse performance because we were spending all this time writing out zeros on all of our randomly sized objects.
A: I think that threshold definitely dates from when 128K was the max record size, yep.
A: I think we're at time here. So, thanks, everybody. When's the next one, Matt?
F: Four weeks; it should be four weeks from now, September 12th, yeah.