From YouTube: Second January 2023 OpenZFS Leadership Meeting
Description
Agenda: 2.2 branch; chacha encryption; raidz deflation; arc eviction
full notes: https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM/edit#
A
Welcome to the OpenZFS leadership meeting. It looks like we have a few items on the agenda today, so let's jump right in. The first one is about branching 2.2. Someone was asking about this because, it sounds like, they want to get overlayfs support into a release. I do not have the answer to that. Looks like maybe Brian just joined; did you hear that question?
C
Yeah, I did. Branching for 2.2 is something we want to look at doing fairly soon.
C
We do have a lot of stuff in the master branch that has been there for a while, including overlay support and some other features. Really, we just need a plan to wrap up that work, make sure it's in a good state, make sure what's in there is what we really want, and then branch and move forward with that. But I don't have an exact date for that, unfortunately.
D
Are we waiting for some specific features to be released, or what's the plan? Why not start it at any arbitrary point? Nobody is really screaming. Personally, no, but iXsystems would like to see it sometime in spring, when we need to create new branches for TrueNAS, both SCALE and CORE, on Linux and FreeBSD, and it would be good if we could use 2.2 for that.
C
I know that we were really close on a bunch of them, but they didn't quite get there. Block cloning was one of them; it was pretty much ready to go in, and maybe forced pool exports was pretty close as well, and I wouldn't want to make a branch until... well, if we made a branch now, those features probably wouldn't end up in that branch.
C
If we were to do it now, that is. So I guess we should circle back and figure out whether we think we can realistically get those done before branching, or whether there's other stuff that people feel is absolutely critical and needs to be in there.
F
So basically, for me, the missing item was ZIL claim, because without ZIL claim we cannot really allow block cloning across datasets, and I think that would be bad. But I will be able to get back to it, probably in a month, and finish it, because I'm currently working on a different project with Alan, so I need to finish that one first.
D
I guess, would it require a separate feature, or could it be committed as it is now and claiming added later with a smaller change, hopefully?
F
Well, one of the problems is that when I sit down to do ZIL claim, it might turn out that we can simplify the ZIL records. My initial idea, the easiest way, was to simply record in the ZIL the IDs of the objects we are cloning and the ranges we want to clone, but because of cross-dataset cloning I decided to record in the ZIL the BPs of the blocks we are cloning. But then...
F
Mostly because of your feedback, it turned out that it's not really... it can still be racy.
F
And because this is implemented in a way where we record BPs, the replay code is different from the cloning code, and that breaks a precedent, because all the other ZIL replay operations are basically the same operations that are used during normal operation, and block cloning was different in this regard. But if we can do that from ZIL claim, then maybe we could simplify the ZIL records, and maybe this would allow us to use the exact same operation for cloning and for cloning replays.
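For context, a rough sketch of the two record layouts being weighed here; the field and type names are hypothetical, not the actual OpenZFS ZIL structures:

    #include <stdint.h>

    typedef struct blkptr { uint64_t bp_words[16]; } blkptr_t; /* stand-in */

    /* Alternative 1: log "clone this range of that source object". */
    typedef struct lr_clone_by_range {
        uint64_t lr_foid;        /* destination object */
        uint64_t lr_offset;      /* destination offset */
        uint64_t lr_length;      /* bytes cloned */
        uint64_t lr_src_obj;     /* source object id */
        uint64_t lr_src_offset;  /* source offset */
    } lr_clone_by_range_t;

    /* Alternative 2 (what the PR currently does, per the discussion): log
     * the block pointers themselves, so cross-dataset clones replay without
     * resolving the source dataset, at the cost of a replay path that
     * differs from the normal clone path. */
    typedef struct lr_clone_by_bps {
        uint64_t lr_foid;        /* destination object */
        uint64_t lr_offset;      /* destination offset */
        uint64_t lr_length;      /* bytes cloned */
        uint64_t lr_nbps;        /* number of BPs that follow */
        blkptr_t lr_bps[];       /* the cloned block pointers */
    } lr_clone_by_bps_t;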
F
When we replay cloning records, so... and once that's committed, it will probably be hard to change what the ZIL records look like, or we'd probably need some compatibility handling.
F
I'm not sure, sorry, I wasn't prepared for that. But I can talk to Alan and see if maybe we could make room in the current project for this, so maybe I would be able to sit down and work on it earlier, if this is what 2.2 depends on and it's something you would really want to have. But with my current obligations I cannot say at this point whether that will be possible.
A
I mean, that's fine. It's just good to get as accurate an estimate as we can when we're trying to plan the next release. I'm fine with you taking the time to get it right rather than having to redo things after it gets integrated. Brian, I don't know what your thoughts are on that.
C
It's all just a lot of trouble down the road otherwise. I do know that there's a lot of interest in this particular feature, though, and in getting at least a branch started for the 2.2 release. So I would personally be interested in getting this merged in the next month or two and then cutting a branch after that, with whatever's in master between now and then. I think that would be great, if that's viable.
C
If it's not, or if you want 2.2 sooner, we could look into cutting something sooner without this feature.
A
I kind of feel like we're better off trying to cut a release every year and not worrying too much about holding off to get more features in, especially for features like these that take a long time to develop. They've been in the works for years, and yeah, they might not make it into 2.2, but there's a lot of other stuff that is in 2.2, and it would be nice to get that out in a release.
C
Agreed. A lot of people are running the master branch and it's in good shape now, so there's nothing preventing us from branching at any time, and yeah, we're a little overdue for our yearly cadence. So I could see cutting a 2.2 now with what's in master today, which is well tested and works well, and then we get a 2.3 at the end of the year or whatever, with this.
A
Yeah, I think that, assuming we have the resources to be able to do one release a year, it makes sense to try to stick close to that.
C
Well, there's other interest in making a 2.2 just from a maintenance standpoint, right? Backports are getting more and more difficult as the code base moves further forward. So there are good reasons to at least start stabilizing a 2.2 branch off master for people to really test, yeah.
A
I mean, it'll probably take a month or a couple of months to go from cutting that first branch to having the final release, right? Yeah, I would think so.
A
All right. Other thoughts on 2.2?
A
All right, the next item on the agenda, it looks like somebody crossed it out, but it was the spurious EIO bug that was filed. I'm guessing it was crossed out because I opened a PR that will hopefully address it.
C
I don't think so. So this was a bug in 2.1.8 due to a backport, just to be clear, and we immediately rolled a 2.1.9 with that patch reverted. So unless you're running the master branch at the moment, which still has this code in it until we merge your pull request, Matt, this should not be an issue anymore for people running stable releases, and I think we only had a couple of reports of it.
A
Yeah, and the reports did not have a lot of data, unfortunately, so it's hard. I can't say 100% that the problem I'm fixing is this problem, but I'm fixing a problem where you can get a spurious EIO when you're using encryption and you have ganging.
A
All right, the next item was one that we talked about briefly last time, but now the submitter is here: Rob wanted to talk about his PR to add ChaCha20 as an encryption algorithm.
G
Hi, yeah, that's me. I don't have a lot to say about it. It kind of doesn't really touch the guts of ZFS much, other than to pipe through the cipher name for the encryption property. On FreeBSD it...
G
It just goes straight to the kernel crypto framework, and on Linux I've implemented a module for the ICP to do the work, because illumos doesn't have those algorithms in their tree, so the actual cipher code I've taken off the shelf. So it's kind of all there. It's all written up in the PR, so I mostly just need guidance on where I take it from here.
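As a rough illustration of the kind of plumbing being described, the cipher name on the encryption property selecting a suite that each platform backend carries out; the names here are hypothetical, not the identifiers used in the PR:

    #include <stdio.h>

    /* Hypothetical cipher-suite selector: the dataset's encryption property
     * picks a suite, and each platform backend (FreeBSD kernel crypto, or
     * the ICP module on Linux) is handed that selection to do the work. */
    typedef enum example_crypt {
        EX_CRYPT_AES_256_GCM,         /* existing, fast with AES instructions */
        EX_CRYPT_CHACHA20_POLY1305,   /* new, fast in plain software */
    } example_crypt_t;

    static const char *example_crypt_name(example_crypt_t c)
    {
        switch (c) {
        case EX_CRYPT_AES_256_GCM:       return ("aes-256-gcm");
        case EX_CRYPT_CHACHA20_POLY1305: return ("chacha20-poly1305");
        }
        return ("unknown");
    }

    int main(void)
    {
        printf("%s\n", example_crypt_name(EX_CRYPT_CHACHA20_POLY1305));
        return (0);
    }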
G
Deciding... you know, how do we decide if the crypto is good? It can have all the right words on the packet, but if it doesn't actually encrypt things correctly, there's not a lot of point. The actual reason for wanting this is just for devices that do not have hardware-accelerated AES, where it's very slow in software. So that's your Raspberry Pi, but it's also the RISC-V board that I'm working with, which does not have it either. So I would like at least a decent encryption option for that. So yeah, I'm just looking for guidance and assistance on however we play this.
A
Yeah, I was just going to say, I think we would probably want feedback from somebody else who wants this. I think I understand the use case that you're talking about, and it makes sense theoretically, but it would be good to have other examples of people kind of upvoting this, like, "yeah, I would totally use this if it goes in." And then, you know, I don't think we're going to review the cryptographic code.
A
I,
don't
think
we
really
have
the
skills
or
time
to
do
that.
You
know
it
sounds
like
you
basically
copied
it
from
like
the
reference
implementation,
so
I
think
we'll
kind
of
trust
them.
Unless
anybody
knows
otherwise,
and
so
then
we
need
somebody
to
review.
You
know
the
kind
of
ZFS
specific
bits
of
this
which
it
looks
like
are
not
a
very
substantial.
G
I've done only the lightest of testing, which is making sure I can create a pool on one platform and import it on the other. I have not gone further than that. I should do a little more extensive testing; there has been some testing of bits and pieces, but I don't know how deep that needs to go to feel sort of confident with it.
A
Oops, sorry. I guess we'll leave it to you to work with other folks: find other people who are interested, find someone to review it. I think, didn't I already assign Mark? I think Mark would be the assignee, maybe, so he can also help work with you to find reviewers for the code. It looks like Jorgen took at least somewhat of a look at it. So, Jorgen, how thoroughly did you look at this?
B
I went through it. I mean, I was pleased to see that it was written for the ICP, so it will just compile for both Windows and macOS; I was mostly concerned about that. But the implementations look pretty straightforward and simple, and if you run OpenSSL you can be given ChaCha, so I wasn't too worried about whether or not it's, you know, considered real. So yeah.
A
Cool. Well, it sounds like you've got one review through there, so that's a big step. And it looks like there are other folks that commented, Richard and Loup Vaillant.
G
That's the author of the library I use. He's been helping with some false positives that came up through a static analysis tool.
A
Cool, yeah. Are those conversations resolved now, or are you still discussing them with him?
G
The one part we're looking at in there, it's not resolved from the OpenZFS side. There's a particularly gnarly piece of math that a static analysis checker tripped on.
G
It's a "this could overflow", but it can't overflow, because there are earlier checks in the code that ensure that can't happen. So there was some discussion about how we actually prove that this thing works. It's one of those things where, if the off-the-shelf implementation of the algorithm is good, then it's not a problem for us.
G
The conversation is about whether or not it's good, and that sort of speaks to how much we have to trust the thing before it's fine. So there's nothing for us to do on that, but I'm not really qualified to say, "yeah, this isn't a problem."
A
But it looks like you got the original author of that code to weigh in and say, "I think that this is correct, and it doesn't actually overflow in practice." Yeah.
G
He's written, yeah, he's written sort of an English proof, and he's tried to write an automated proof of it, because he's very interested in making sure it's fine, which is part of why I like this implementation: he's very responsive and communicates very clearly about it.
G
All right, no, that's great. I've got plenty to move on with now. Okay, and yeah, thank you, thank you again for your review. I hadn't realized you'd done a more complete review, so I really appreciate that, and I appreciate your time. I will come back to you soon.
H
Oh yeah, sure. I know that your primary, well, one of your primary goals in this is to provide something that's significantly more performant on things that don't get to play with the AES instructions in hardware.
H
That might be useful for getting even faster performance on systems that do have those instructions anyway. Are there accelerated versions?
G
Yeah, there are. I haven't looked very closely at them, mostly because I don't read AVX off the top of my head, but I thought once the structure is in there and there's a generic implementation...
G
I would like to look more at accelerated versions, particularly for the small ARM processors, where it's still, it's still crap even in software, just less crap than AES. So if I could get an accelerated version there... but I thought, yeah, have the structure in place first and then extend it further. But they definitely exist.
G
Google has a lot of interest in these particular algorithms, because this is what they're using on mid- to low-range phones that don't have hardware support for these things. So the code's there.
A
Cool. Well, thanks, Rob, for putting this together and pushing it forward.
A
I didn't know if there were any additional questions. I know that, based on the discussion so far, there were some other questions stemming from it. I know Alan had some, although, is Alan on the call? I know there was also a mention of possibly prototyping it and shaking it out and seeing, okay, what actually happens: does anything go sideways, or break, or give some unexpected results from any potential change?
I
My original take on this was, I initially was trying to do some pathfinding on my own use cases here, seeing what sort of compression I was getting on files, and I was seeing some numbers that really just didn't make any sense, and it ultimately traced all the way back to the root issue that I raised.
I
Just because of my particular configuration, I was storing things way more efficiently than would be implied by a 128k-based deflate ratio. So I would imagine that might extend to confusing other folks in the future, or anybody that has pools where the majority of the records or blocks being stored are more efficient than the 128k-based math, which would then lead to some confusion, because you end up with really weird-looking things, like the files appearing smaller than they actually are.
A
Yeah, would you mind if I take a stab at summarizing the issue for folks that aren't familiar with it? (Please.) So my understanding is that the concern is: when you use RAID-Z...
A
We have to allocate space for the parity, and the way that we display that... so ZFS internally is keeping track of, okay, you asked for 128k and I had to allocate, say, 1.2 times that amount of sectors to have space for the parity. But the way we want to display it to users kind of, sort of, ignores the parity, at least in a lot of the default uses.
A
You see, if you add, say, four disks in a RAID-Z1, then one of them is being used for parity; you can kind of think of it that way. So if that gives you three terabytes on these disks, then we show three terabytes available, and then, as you use up space, we're allocating space on all four disks, but we're displaying a space-used number that assumes you had one disk of parity and three disks of data.
A
That's sort of the overall idea that we're talking about here. Now, the way it's implemented in ZFS: first of all, blocks, especially small blocks, can need more parity relative to the amount of data. So if you have a lot of small blocks, we're assuming this data-to-parity ratio, but that might not be true for your small blocks, and so the space used might be more than what you expect.
A
But until recently, the space used was never going to be less than what you expect. And the assumption, the assumed ratio of data to parity, is based on a record size of 128k.
A
Now you can set the record size up to 16 megabytes, and those blocks can be stored more efficiently. So you might see this weird thing where you have your four disks, one of them is parity, you're storing stuff with a record size of 16 meg, and you actually get to store more: you don't get to store three terabytes of data, but you get to store more than what it said you could, right? Because of how we calculated that ratio.
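To make the numbers concrete, here is a small worked example of that ratio, assuming a 4-wide RAID-Z1 with 4K sectors (ashift=12); the arithmetic is illustrative and simplified, not the actual OpenZFS deflate-ratio code:

    #include <stdio.h>
    #include <stdint.h>

    /* Sectors allocated for a block of 'dsectors' data sectors on RAID-Z1
     * with 'ndisks' children: one parity sector per row of (ndisks - 1)
     * data sectors, with the total rounded up to a multiple of 2. */
    static uint64_t raidz1_asize(uint64_t dsectors, uint64_t ndisks)
    {
        uint64_t parity = (dsectors + (ndisks - 2)) / (ndisks - 1);
        uint64_t total = dsectors + parity;
        return ((total + 1) & ~1ULL);
    }

    int main(void)
    {
        uint64_t small = raidz1_asize(128 * 1024 / 4096, 4);          /* 128K */
        uint64_t big = raidz1_asize(16ULL * 1024 * 1024 / 4096, 4);   /* 16M */

        /* 128K block: 32 data + 11 parity -> 44 sectors, ~72.7% efficient.
         * 16M block: 4096 data + 1366 parity -> 5462 sectors, ~75.0%.
         * Space accounting is scaled by the 128K-derived ratio, so large
         * blocks can appear a few percent smaller than they really are. */
        printf("128K: %llu sectors, 16M: %llu sectors\n",
            (unsigned long long)small, (unsigned long long)big);
        return (0);
    }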
A
I agree that that is potentially kind of weird and confusing, and I think it would be reasonable to try to correct it by changing the best-case ratio of data to parity to reflect the actual best case now, with 16-meg blocks, or maybe a theoretical best case, like, it is exactly three to one regardless of anything else; that's the absolute best case possible.
A
I think that's kind of an implementation detail, especially since a 16-meg record size is probably the biggest we're ever going to have, because we don't have more bits to store bigger sizes, and, practically speaking, the difference is going to be negligible.
A
So the question, it's kind of two questions. One is: do people think this is a good idea? Does that semantic make sense, even though the default record size is still 128k?
A
Obviously we already have that issue when you have compression, or when you have zvols, which have a default block size of 16k. So in the default case this is going to be a little bit more noticeable, which is not great; it's a downside. But I think it's a minor downside compared to having that ratio reflect kind of a best-case scenario, rather than having people get confused about...
I
I do have a note on that, and that is, for what you were just describing, where the user might notice some additional difference: they would only notice that change if they were using a sub-optimal geometry before. In other words, if they had a RAID-Z that only had seven data disks, as an example, right?
I
So that's obviously sub-optimal for storing 128k stripes, right? So that configuration is one of those where you would see a few percent additional taken off beyond just the data-versus-parity split, whatever that percentage would be purely. Whereas, same use case, if that user had eight data disks, then before and after this change they'd see no difference in that pool, because they didn't have any additional amount tacked on due to the padding for 128k, even in the former case, right?
I
So this really only fixes it for those that end up in the weird case where the 128k blocks don't fit properly, so we were taking some extra off, right? And then the biggest observation, the thing that sort of smacks you in the head from being on the wrong side of that, is that, yes, those larger records all appear smaller than they're supposed to be, smaller than they actually are, yeah. So whenever that number goes negative, that's where it just gets really confusing.
I
I spent a good week trying to even get to the bottom of why these files... Initially I thought a 16-meg record size was leading to amazing compression. That's what originally got me here; it was like, "wow, look at how much better this is, I'm going to do this on everything." But then, in reality, I started checking zdb and realizing, no, that file only has like two compressed blocks out of a million, so this file is really not compressed, but it actually shows as if it were eight percent smaller.
A
Yeah, that makes sense. As you probably know, I'm not a fan of trying to game the width of your RAID-Z vdevs to hyper-optimize for these certain cases, right?
I
Right, of just not worrying about the RAID-Z width and just setting, you know... I mean, I have a personal pool here that's 88 wide; trust me, I am absolutely not following the stripe-width game, right?
A
Yeah, so I think the first question for the group is: does that seem like a better semantic than what we have now? And then the second question is going to be who is going to implement this, and how, which might have to wait for a later meeting. But for the first one, let's see what thoughts other folks have about changing that semantic, kind of as a principle going forward. We don't know exactly how that's going to take effect; maybe it won't affect existing pools.
A
Maybe it only affects new pools, or whatever. But as a principle, should we change the way that ratio is calculated to reflect more of an actual best-case scenario, now a record size of 16 megs, even though the default record size is going to remain at 128k for the time being?
I
I had looked through some of the code. Again, this is really my first dive through the code base, just over this issue; I've only done light dabbling in the past. But from some talking, or at least chatting, with Alan earlier, and my own looking through the code, anything beyond trying to make the revision only take place for new pools turns into major surgery very quickly, right? You have one form of the change... yeah.
I
There were two separate proposals that I had in there, and then, due to some other concerns about those, Alan led me down a thought hole of, well, what if you make the default high, so that you're covered and you don't end up with the weird negative-percentage issue, right? However, what if the user intended to make a pool primarily of zvols, for example? That pool's free space, as reported initially, is way off. Even with 128k it doesn't matter; it's way off compared to what you can actually store there, right?
I
If a technically inclined user knew, "hey, this pool is just for zvols, why can't I just set it to what would correlate with 16k?", and they know that all the zvols are 16k, then that pool will always have accurate free space reporting, unless of course you start putting 16-meg records or something on there, and then all bets are off again, right? But that would still be...
A
Let's kind of ignore that for the moment and see if there's input on changing the way it's calculated to be based on 16, like the best-case scenario with 16-meg blocks.
A
Alan, Brian, of the folks who've already kind of seen that discussion, what do you guys think about that?
J
I definitely have concerns that changing it up will just mean that the estimates for anything with small blocks will get much worse, and that that leads to people running out of space a lot sooner than they thought. You know, if zpool list says you're going to have this much space, and then you start writing it and that space goes down at three times the rate you were expecting, that'll seem weird, and I think that having...
A
I think the concern about being off by 3x is valid and not addressed by this change, but also not made substantially worse by this idea, yeah. So that's kind of why I feel like it would be okay. I'm also like, "hey, you get some extra free space, why are you complaining?" But from an understandability point of view, I think it does make sense. And that was the original intent: to give you a best-case ratio.
I
Right, like when Matt came up with that chart that was in his blog, the behavior was not actually matching the chart in the blog, right? If you change it to be closer to either infinity, effectively, so it goes away, or 16 meg, those numbers get a lot closer to the table in your blog.
A
The numbers in the blog take into account the record size, so they should be accurate. You're...
A
That's not at all true. The blog is telling you exactly how much space will be allocated; that's kind of orthogonal to the ratio. The ratio is for display purposes only; it has nothing to do with how much space is actually allocated, right?
I
No, yeah, I'm not saying that the amount of space actually allocated is incorrect in the blog; that math all checks out. What I'm saying is that if somebody were to look at that math and go, "okay, I'm going to pick this layout, because this is my use case, this is where I fit, so I'm going to make my array based on that information," then the free space that they get right after they make the pool...
A
The reporting of that free space is inaccurate, because they magically know that all the blocks they're going to allocate are a certain size, but we have assumed that they're going to be 128k, right? And so the padding and everything works out differently.
C
One more wrinkle I'd just throw out there as well: this is true for RAID-Z, but you can be quite a bit farther off with the dRAID configurations, because they do add significantly more padding. So it might not be 10 percent, it might be 20 or 30 percent that you're off, if you pick a really unfortunate geometry with really wide stripes.
A
For dRAID, I think the common or recommended use case is probably to increase the record size, right? (Oh, for sure, yeah.)
A
So I'm not sure... the dRAID math is much less complicated. It's just that when you configure dRAID, you configure the stripe width, right? Like, "I'm having seven wide." It's part of the configuration how many disks are in the stripe, and so we're allocating in stripe-width-times-sector-size increments. That's it, right?
I
Right, yeah, and it just falls off this very sharp cliff as soon as you get past the data stripe being greater than 128k, or less than... sorry, well, you know what I'm trying to say, dang it.
I
Understood. I went back and looked at my numbers: if your ashift is 13 and you do a 32-data-disk-wide dRAID, your reported free space is cut in half.
I
So maybe you have to be 256 wide for that to happen, and maybe that's a separate math issue happening somewhere, but I was able to make a dRAID with 25% of the expected space by doing 64 data disks and an ashift of 13 or 14.
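As a quick sanity check of those numbers, using the stripe-rounding behavior described a moment ago (a simplified model that pads each block's data up to a whole stripe width and ignores parity; not the actual dRAID allocator):

    #include <stdio.h>
    #include <stdint.h>

    /* Data sectors consumed by one block when its data is padded up to a
     * whole number of data-stripe widths. */
    static uint64_t draid_data_sectors(uint64_t block, uint64_t sector,
        uint64_t data_width)
    {
        uint64_t sectors = (block + sector - 1) / sector;
        uint64_t rows = (sectors + data_width - 1) / data_width;
        return (rows * data_width);
    }

    int main(void)
    {
        /* ashift=13 (8K sectors), 128K blocks:
         * 32 data disks -> 16 data sectors padded to 32 -> 50% efficiency,
         * 64 data disks -> padded to 64 -> 25% efficiency,
         * matching the "cut in half" and "25% of expected" observations. */
        printf("32-wide: %llu of 16 sectors\n", (unsigned long long)
            draid_data_sectors(128 << 10, 8 << 10, 32));
        printf("64-wide: %llu of 16 sectors\n", (unsigned long long)
            draid_data_sectors(128 << 10, 8 << 10, 64));
        return (0);
    }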
I
Oh, I'm not arguing that's common at all. I'm saying eventually it'll probably start to happen, with SSDs moving to larger physical sector sizes, and, I mean, I work at an SSD company; it's not inconceivable to get to an ashift of 15 soon, right? So you're going to start to get to the point where a reasonably wide pool, one that's not obscenely wide, is going to run into this issue, and then it would need to be addressed somehow at that point, right?
A
Yeah, I mean, it sounds like, especially because the recommended use of dRAID is with a larger record size, basing the deflate ratio assumption on a larger record size makes even more sense for dRAID, right? Does that make sense, Brian, or am I misunderstanding?
C
No, I think that makes good sense, and it's definitely the encouraged usage model, and it helps with some of these worst-case predictions. At the same time, if you're actually using dRAID and not increasing the record size, and we're making these very optimistic predictions, you could be surprised when you build your pool and it only has, you know, half the capacity you thought it would, right? So at the moment it's kind of nice in that we're under-promising and over-delivering, I think.
A
Yeah, I mean, a better place to be would be where the default record size matches the assumed record size, right?
I
And the percentages on the worst side, even currently, go far deeper than the percentages on the opposite side, those negative values when your records are larger. Usually that's just a few percent the other way; it's just that those few percent are really head-scratchers, right? It's like, "this file should not be smaller; how can this be smaller? It's not compressed, but it looks smaller," right?
A
Right. Feedback from other folks? I think we've heard at least a bunch of opinions here, but more opinions are welcome.
J
My question is: how much of the deflated numbers are we actually storing on disk, like in the accounting? Why would we ever store that, versus the reality of how much space it took, and then only have the free space be a calculation? That way we'd be able to change it on the fly without it messing up all the accounting on disk.
A
But I imagine it has to do with the fact that a pool can be comprised of multiple top-level vdevs, which each have their own deflate ratio. So when you have a piece of accounting that combines the space from multiple vdevs, like, say, the space used by a file or the space used by a dataset, then there's no real way to combine those.
A
All right, we only have a couple of minutes left, so let's go to the last agenda item: more adaptive ARC eviction. It looks like there's a pull request.
D
Yes, that's my pull request. In my recent deep dive into the ARC, among the other things I touched, I went after one more point, which is ARC eviction, which I found to have been obsolete for many years, because some of the algorithms there actually predate ABD and they are clearly broken; they're doing nothing. The ARC code attempts to evict unevictable blocks, but in practice the truth is that those blocks are really not evictable; they're not even visible to the eviction code. That code just does nothing, and I tried to clean that up.
D
I also went deeper and tried to look at the long-standing idea of making the ARC balance between data and metadata natively, instead of the series of half-broken dirty hacks we have now. For example, right now we have ghost state eviction which always evicts data first and only then starts evicting metadata, which gives us, by design, a broken balance between the ghosts and then a broken balance between data and metadata.
D
So I cleanly wiped out all of that code and wrote a pretty small couple of functions from scratch, implementing just a balance between four different states: data MRU, data MFU, metadata MRU and metadata MFU, practically a quadrant. So I'd like people to take a look at that code. Actually, most of the patch doesn't look really scary.
D
It's mostly a mechanical addition to separately account for data and metadata, which is required for this, so those changes are pretty mechanical. The only couple of functions which are material are arc_evict() and one function it calls from inside; it's literally two pages of code that are dramatically different, but it allows removing so much messy code from the ARC eviction that I'd really like it to be reviewed. I'm not saying the balance right now is perfect; it's a question of coefficients: what should the balance be between data and metadata?
D
What should the rates of adjustment be, with the different states moving up and down relative to each other? Some of the coefficients were just taken semi-randomly, but I tested it and the logic is working; some specific tuning may need more testing or better ideas. So I'd like to invite people to take a look at the logic I implemented there and see if there are any thoughts or opinions about it. As I said, it's like two pages of code, which is really worth a review.
A
That sounds really cool. So it sounds like you're balancing between data and metadata using a kind of adaptive algorithm?
D
Exactly. I use the ghost state mechanism that the ARC already has; practically, we always had the four ghost states, it's just that they were not very visible and, as I said, they were not really properly handled: eviction was broken between the data and metadata ghost states. So I really formalized four different states, data MFU, data MRU, metadata MRU and metadata MFU; it's really four different states, separately accounted, with everything separate.
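For anyone who hasn't opened the PR yet, here is a toy sketch of the general shape being described: four separately accounted states whose target shares shift on ghost hits, with eviction aimed at whichever state most exceeds its share. Everything here (names, the state pairing, the step size) is illustrative only, not the PR's actual code:

    #include <stdint.h>

    /* Four ARC buckets, each with a current size and a target share of the
     * cache expressed as a fixed-point fraction out of 1 << 16. */
    enum { DATA_MRU, DATA_MFU, META_MRU, META_MFU, NSTATES };

    struct toy_arc {
        uint64_t size[NSTATES];    /* bytes currently cached per state */
        uint32_t target[NSTATES];  /* desired share of arc_c, sums to 1<<16 */
    };

    /* A ghost hit in one state suggests it deserves a bigger share, so move
     * a small amount of share to it from its opposite-type counterpart. */
    static void toy_ghost_hit(struct toy_arc *arc, int state, uint32_t step)
    {
        int victim = (state + 2) % NSTATES;    /* data <-> metadata pairing */
        if (arc->target[victim] < step)
            step = arc->target[victim];
        arc->target[victim] -= step;
        arc->target[state] += step;
    }

    /* Eviction then targets whichever state most exceeds its share. */
    static int toy_pick_eviction_state(const struct toy_arc *arc, uint64_t arc_c)
    {
        int worst = 0;
        int64_t worst_over = INT64_MIN;
        for (int s = 0; s < NSTATES; s++) {
            int64_t over = (int64_t)arc->size[s] -
                (int64_t)((arc_c * arc->target[s]) >> 16);
            if (over > worst_over) {
                worst_over = over;
                worst = s;
            }
        }
        return (worst);
    }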
D
With four states it's slightly more tricky, since from one direction I'd like the total size of the ghost states to be the size of the ARC, but from the other perspective I'd like it to be double the size of the ARC, because I don't know which way eviction will go. So right now I'm taking the approach of doing something in the middle.
D
So there's the question: if one of the four states wants to use the whole ARC, do we want to give it all of that, or do we want to go slowly? The algorithm does something in between. Like, if you have a hit at 90 percent of the ARC's distance from the current point, it won't be a ghost hit...
D
...it could be a miss, because those hits may be irrelevant if other states also require some memory. To properly handle that would require many more states, like, "okay, this hit was very close", "this was recent", "that was far", which is far more complicated math. So I tried to take something in between: maybe a hit 75 percent of the ARC ahead will still be a hit, but if you go out to 80 or beyond, it will already be a miss.
D
So that's one potential point of adjustment, but, as I said, the math is pretty simple and clear, and that allowed me to remove a few messy parts of the code.
J
Yeah, I had a question on this. I was looking at a customer workload recently, and they wanted the ARC to mostly be metadata, and currently, with the tunables they have, all they can do is limit what percent can be metadata; they can't actually limit what percent can be data. With this change, can you control the maximum size of, like, the data MRU?
D
With this patch I removed most of the hard limits, just completely removed all of that. In practice I left only one knob, a tunable which allows you to specify how much more important metadata ghost hits are than data ghost hits. By default it's five, mostly a random number, but if you set it to, say, 1000 (there's no upper limit), then it practically means metadata will never be evicted.
D
That means that, as much ARC as you have, most of it will be metadata, and data will be kept at a very minimal scale. Also, one of my changes there is that I'm no longer measuring the balances in bytes, between MRU and MFU, as before; now it practically measures them in percents, well, fractional numbers, 32-bit fractional numbers. So it should be trivial to introduce a limit to say, okay, we will balance between data and metadata, but never less than x and never higher than y.
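A tiny sketch of the kind of knob and clamp being described, with hypothetical names and a deliberately simplified update rule; the PR's real tunable and math may differ:

    #include <stdint.h>

    /*
     * Hypothetical illustration: metadata ghost hits count 'meta_weight'
     * times as much as data ghost hits when nudging the data/metadata
     * balance, and the resulting metadata share (a 0..65536 fraction)
     * can be clamped to [lo, hi].
     */
    static uint32_t
    toy_rebalance(uint32_t meta_share, uint64_t data_ghost_hits,
        uint64_t meta_ghost_hits, uint32_t meta_weight,
        uint32_t lo, uint32_t hi)
    {
        int64_t bias = (int64_t)(meta_ghost_hits * meta_weight) -
            (int64_t)data_ghost_hits;
        int64_t next = (int64_t)meta_share;

        if (bias > 0)
            next++;          /* metadata is missing more: grow its share */
        else if (bias < 0)
            next--;          /* data is missing more: shrink metadata share */

        if (next < (int64_t)lo)
            next = lo;
        if (next > (int64_t)hi)
            next = hi;
        return ((uint32_t)next);
    }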
A
This is an area that's long needed some work, so thanks a lot for putting something together, Alexander. Now we need to get some review of that code of yours.
A
Yeah, I think the tricky part is understanding the dynamic behavior of the system, given the code, right? Like, how do things actually end up balancing, and making sure that we aren't introducing other bad behaviors where you get stuck with barely any of one of those four types of cache and then aren't able to recover it.
D
Well, generally, if one state is huge, it means the other states will have huge ghosts, and through those ghosts they will recover. I even introduced something which slows eviction down if a state is getting too low: the eviction slows down, so it's very difficult to get any state down to zero. Below, say, 20 or 15 percent, it will get really slow.
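Purely as an illustration of that kind of damping (a toy function, not the PR's actual logic), eviction pressure on a state could be scaled down as the state shrinks toward a floor:

    #include <stdint.h>

    /* Toy illustration: scale a requested eviction down as a state's size
     * approaches a floor fraction of the ARC, so that no single state can
     * easily be driven to zero. The 20% floor is an arbitrary example. */
    static uint64_t
    toy_damped_evict(uint64_t want, uint64_t state_size, uint64_t arc_c)
    {
        uint64_t fl = arc_c / 5;          /* ~20% of the ARC */
        uint64_t scaled;

        if (state_size <= fl)
            return (want / 64);           /* crawl: very hard to hit zero */
        if (state_size >= 2 * fl)
            return (want);                /* full pressure well above it */
        scaled = want * (state_size - fl) / fl;   /* linear ramp in between */
        return (scaled > want / 64 ? scaled : want / 64);
    }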
A
But yeah, that sounds great. I saw you tagged me and George; I think we're good candidates to take a look at it, and maybe we can find some other folks as well.
A
We're over time, so let's call it a meeting. The next meeting will be in four weeks, and it'll be at the earlier time, 9 a.m. Pacific. All right, thanks, everyone. Thank you. Thanks.