From YouTube: November 2022 OpenZFS Leadership Meeting
Description
Agenda: quota performance; ARC MRU/MFU; BRT details
full notes: https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM/edit#
A
I have a quick update from the conference, although I think most folks here were at the conference. I had a great conference; I thought there were a lot of great talks and a lot of great socializing with other developers that I haven't seen in several years. We're posting the videos now, so I'm getting about one a day uploaded to YouTube, so that content will be available to folks who couldn't attend in person.
B
Speaking about topics, I'd just like to bring attention to my pull request for review, for some ARC refactoring. It started unexpectedly for me: I was going to work on uncached prefetch during the hackathon, but ended up first digging into the ARC, got stuck there, and ended up refactoring a few things that got ugly over the years. So there's a number of refactorings, and I'd like some people to have an eye on it, because at this point I'm just fixing unrelated test failures which are not even mine.
A
Yeah, I took a quick look at your description, and you know, it all sounds good. I didn't look in detail at the code yet.
B
While looking at it, I found a few more bugs around that logic, separate from it, but still. For example, right now we first evict data from both lists, and only then start evicting metadata. I bet on a system with a lot of metadata our balance between MRU and MFU is heavily biased toward metadata. I'm not sure whether that was intentional or it's another bug.
A
Yeah, I've seen that kind of imbalance in the past on real systems. I suspect that there are some issues with the logic there.
A
Cool. By the way, Muhammad, I was just responding to your request for a Slack invite. For some reason your email address didn't come through on the mailing list, so if you could send me your email address, I'll get you your invite to Slack. And that applies to anybody else who's watching or listening later on: if you want to join the OpenZFS Slack, just send me your email address and I'll give you an invite.
B
So, lastly, I saw that by default quotas are still enforced, so I can see that that's good. At least it's not bad.
E
I think it depends how much you're writing, but yeah. It was mostly just to make sure we never slow down unnecessarily on a zvol or...
E
The quota is purposely kept relatively small, and you know, the amount of outstanding writes could be larger, resulting in you still waiting, but that's why it's not on by default.
B
No
I
still
hope
that
what
I
I
don't
remember
what
was
the
amount
of
over
quota?
But
if
it's
in
percent
I
think
before
any
long
too
small
right,
it
should
be
sufficient
or
well
and
enough
to
not
slow
down
too
much.
But
it
just
looks
much
less
invasive
to
me
yeah
and
be
good
about
that.
E
On vdev properties on the root vdev and how you access them: Rob will push an update to that pull request today or tomorrow, and then I think that one should be good to go. Then Marius should be able to finish the last update to the fail fast one, to make the Linux-specific tunable, or module parameter, for setting the mask; and we've changed the dataset property back to being simple, so that it'll be compatible with something if we add a similar mechanism to FreeBSD.
E
I think that's all of the progress we have open. We just started working on the force export one again, so it's not really ready to get approved or merged or anything just yet, but we're working on that one actively again. I think, Matt, that will include, based on your feedback, the small semantic change to failmode=continue where we will actually fail out rather than wait on an fsync that started before we suspended; but then we suspended, and you know, we should fail there.
E
Instead of waiting forever. And likely add a new failmode=export that will, you know, force export a pool that gets suspended, so that it can't jam up any other pools that happen to be on the system.
A
Cool, thanks, Allan. I see a couple other folks have joined. There wasn't anything on the agenda that I saw in the agenda doc, so it's just open discussion of any questions or topics that folks have. So what would folks like to talk about?
E
I guess we don't have Ryan today, but there have been more questions around when we would be doing a 2.2 release.
E
One thing we were looking at is the per-dataset I/O throttling stuff that I talked a bit about at the conference. Based on that, we've got some interest from folks in actually getting that done, and we've looked at a couple of different places where you might implement it; in particular, whether it makes sense to do it near the top, on the VFS side, or lower down, more towards the vdev side.
E
The
pros
and
cons
to
both
of
those
a
bit
obviously
doing
it
closer
to
the
vdev
side
means
that
really
large
rights
are
already
broken
down
into
more
reasonable
chunks,
and
we
have
it
accounts
for
the
extra
things
that
happen
like
the
metadata
updates
that
are
caused
by
what
the
user
does,
so
that
those
count
against
their
quota,
whereas
if
you're,
just
at
the
VFS
level,
maybe
doesn't
foreign
things
like
that,
but
also
you
know
once
we're
too
far
down
towards
the
video
level.
E
how much do we still know about which dataset this write is associated with, so we can charge them for it? We're kind of looking at all of those; just interested if anybody has opinions or experiences to share.
F
If you look at it, and I don't recall the specific places since I didn't write it, but on SmartOS this was never upstreamed; there it's per zone, so like a container.
E
So there's a pull request where someone ported it, mostly the Linux side; it's a pull request somewhere in the 9000s. I've looked at it, and it looks kind of interesting, although it's mostly about ensuring fairness between the different zones, not so much strictly limiting a specific zone or dataset, yeah.
F
I
mean
it's
not
about
I,
don't
think
you
can
set
like
a
specific.
I
o
rate
as
much
as
you
can,
whatever
the
IRA
rate
the
pool
can
sustain,
you
can
divide,
you
know
proportionally
divide
it
between
things,
because
it's
kind
of
it's
kind
of
in
this
on
a
lumos
or
something
called
the
fair
share.
F
which can kind of work the same way for scheduling between zones, or things called projects, which have nothing to do with ZFS projects but are essentially groups of processes. So I believe it was kind of modeled after that; I'd have to ask Jerry for sure, but that's kind of how it works.
F
A
similar
idea,
where
it's
kind
of
pro
you
know
a
relative
proportion
to
the
other
stuff,
but
at
least
some
of
the
techniques
in
terms
of
you
know,
slowing
you
know
essentially
I
believe
it
basically
adds
latency
to
certain
iOS
to
kind
of
achieve
the
fairness
with
the
others.
I
don't
know
if
at
least
some
of
those
techniques
or
bits
there
might
be
useful.
A
Yeah,
depending
on
like
the
semantics
that
you're
trying
to
achieve
that,
might
it
might
be
better
to
do
it
at
the
you
know
the
like
the
SPL
layer
versus
at
the
over,
I
o
layer
like
if
something
maybe
what
you
want
to.
Maybe
what
you
want
to
expose
is
like
a
quota
of
you
know,
this
data
set
can't
do
more
than
x
megabytes
per
second.
A
If that's the case, then I think you could do it at the ZPL layer pretty straightforwardly, because you know how many bytes are being written, and also read; and also, you know, you're exposing the megabytes per second to the user.
A
So, you know, if there's an overhead of, like, reading metadata, then maybe they shouldn't get charged for that, because they don't really have as much control over that. Similarly, they don't have as much control over compression, like the compression ratio or whatever. So maybe they should be charged for the, you know, pre-compression or uncompressed size, which is what's easily available at the ZPL layer.
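
(A minimal sketch of the kind of ZPL-level accounting described above, assuming a hypothetical per-dataset structure and limit; none of these names exist in OpenZFS today, and the back-pressure here is deliberately crude.)

```c
/*
 * Hypothetical per-dataset bandwidth quota, charged with logical
 * (pre-compression) bytes at the ZPL layer.  All names are made up
 * for illustration; this is not existing OpenZFS code.
 */
typedef struct ds_bw_quota {
	kmutex_t	q_lock;
	uint64_t	q_limit;	/* bytes per second, 0 = unlimited */
	uint64_t	q_used;		/* bytes charged in this interval */
	hrtime_t	q_start;	/* start of the current interval */
} ds_bw_quota_t;

static void
ds_bw_charge(ds_bw_quota_t *q, uint64_t nbytes)
{
	boolean_t over;

	mutex_enter(&q->q_lock);
	if (gethrtime() - q->q_start >= NANOSEC) {
		/* Start a new one-second accounting interval. */
		q->q_start = gethrtime();
		q->q_used = 0;
	}
	q->q_used += nbytes;
	over = (q->q_limit != 0 && q->q_used > q->q_limit);
	mutex_exit(&q->q_lock);

	/* Crude back-pressure: stall the caller briefly when over budget. */
	if (over)
		delay(hz / 10);
}
```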
E
Yeah, we've even gone back and forth on, you know, if it's an ARC hit, should we charge them for it or not? Because there are advantages to not: they're not actually using up a precious resource in that case. But does that then feel weird to the user, where sometimes it's fast and sometimes it's not? But do we want to limit them to a lower speed for no reason? And yeah.
E
Yeah,
especially,
you
know
if
it's
partly
like
artificial
scarcity,
trying
to
do
an
upsell
or
something
to
charge
more
for
apps.
Maybe
you
want
to
limit
everything,
whereas
you
know,
if
you're
just
trying
to
make
sure
that
noisy
neighbors
don't
steal
all
of
the
the
disc
bandwidth,
then
it
makes
sense
to
to
charge
for
what's,
you
know
actually
consumed
rather
than
it's.
A
Well, you know, the one that's issuing a million at once is going to saturate your storage back end, and the one issuing one at a time is going to experience much higher latencies because of the higher queue depths. That's the kind of worst-case noisy neighbor, so maybe you're just interested in preventing that, yeah.
F
Don't
know
the
other
part
too,
if
you're
trying
to
specify
certain
you
know
bandwidth,
you
know
the
the
difficulty
is
I
guess
you
know
in
terms
of
you're
dealing
with
over
subscription
because
even
with
a
given
pool
you
know,
I
haven't
really
done
much
testing
with
this,
but
I'm
guessing
that
the
actual
quote,
unquote,
Max
bandwidth
of
the
pool
is
probably
gonna
vary
depending
on
the
I
o
pattern.
You
know
if
you're
talking,
you
know
physical
disks,
you
know
random
versus
sequentials
gonna.
You
know
impact
the
I
o
rate.
F
You
know
if
you're
talking,
you
know
SS,
you
know
D
or
if
you
have
log
devices
in
there,
that
all
kind
of
makes
it
hard
to
say.
Oh
well,
this
pool
can
sustain,
you
know,
can
do
you
know
xio.
You
know
megabytes
or
gigabytes
a
second
of
I
o
and
I'll,
even
the
IR
rate.
So
that's
the
other
thing
with
I.
Guess,
waiting
I,
guess
is
kind
of
you
know.
I
think
was
probably
the
other
rationale
behind
at
least
the
approach
that
was
taken.
I,
don't
know
if
that
helps
any,
but.
A
Yeah
I
think
that's
that's
definitely
true
and
the
like.
If,
if
the
quota
is,
if,
if
you're
leading
a
quota
of
megabytes
per
second,
then
the
issues
that
you
raised
fortunately,
are
not
relevant
to
that
right,
because
you're,
just
saying
like
well
like
I'm,
going
to
prevent
you
from
doing
more
than
this
other
things
might
prevent
you
from
doing
other
amounts
of
I
o,
which
might
be
less
than
the
quota
just
like
with
the
disk
usage
quota.
E
But I think we are leaning towards basically I/O quotas of IOPS and megabytes, and probably doing that closer to the physical level, just to...
B
And there I had my idea of how we could improve arc_p behavior in the case where the same data should be, at the same time, MRU and MFU, because it is recent and it is frequently accessed. Right now it behaves slightly weird: we move it to MFU, but arc_p is not decreased, so we can accumulate more MRU data. As a result, some of the data currently left in MRU becomes very, very old, surely much older than it should be, and I think we could improve that.
B
If,
if
we
explicitly
track
buffers
which
are
more
promoted
from
mreu
to
mfu
for
proper
tracking
and
Mario
depths
like
you,
we
would
know
that
X
megabytes
of
recent
data
should
be
kept
no
matter.
How
much
do
we
promote
Let's
issue,
14
120?
If
somebody
wanna
go
on
command
there,
I
haven't
tried
to
implement
it
yet,
but
it
seems
not
so
complicated.
It
makes
sense,
but
maybe
I'm
wrong
in
understanding
the
concept.
A
Oh
yeah,
I
haven't
looked
at
that
one,
but
I
think
like
I
agree.
There
probably
is
a
lot
of
weirdness
about
how
like
mru,
mfu,
split,
Works,
the
way
I
always
think
about
it
is
it
feels
like
those
are
kind
of
misnomers
and
at
least
from
how
it's
implemented,
because
it's
really
like
one
list
is
things
that
have
been
accessed
exactly
once
and
the
other
list
is
things
have
been
accessed
more
than
once
right,
so
it
kind
of
makes
sense
how
that
accomplishes
the
arc's
goal
of
being
scan
resistant
right.
A
So
if
something
is
scanned,
it's
accessed
only
once
and
then
it
can
like
fall
out
of
the
cache
faster
than
things
that
are
accessed
have
been
accessed
multiple
times
where
we're
like.
Well,
it's
been
accessed
twice
at
least
twice,
so
it's
like
more
likely
to
be
accessed
again
versus
something
that's
only
been
accessed
exactly
once,
and
maybe
that
kind
of
thought
process
will
help
with
the
when,
like
trying
to
figure
out
what
it
should
really
be
doing.
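
(A toy sketch of the rule described above, not the real arc_access() code: one access lands a block on the recently-used list, a second access promotes it to the frequently-used list, which is what gives the ARC its scan resistance.)

```c
/* Toy model only; the real ARC tracks this via states on its buffer headers. */
enum cache_list { ANON, MRU, MFU };

static void
on_access(enum cache_list *state)
{
	switch (*state) {
	case ANON:
		*state = MRU;	/* first access: "recently used" */
		break;
	case MRU:
		*state = MFU;	/* second access: "used more than once" */
		break;
	case MFU:
		break;		/* further hits: stays on MFU */
	}
}
```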
B
Well,
I
actually
read
original
paper
about
our
cash,
but
I
haven't
found
a
that
part
in
there
like
it
looks
like
it's
based
on
assumptions
that
mru
and
them
a
few
lists
shouldn't
overlap
too
much.
But
the
fact
is,
if
we
are
doing
a
lot
of
mru
hits
like
we
end
up
with
buffers
that
are
pretty
recent,
but
they
are
in
a
few
lists,
but
while
potentially
there
should
be
territory
in
Boss
because
they
are
still
recent
and
problems
that
RFP
practice
is
practically
measured
in
bytes
lengths
of
time.
B
How
long
shall
we
kept
recent
data
in
cash
with
hope
for
them
to
be
reused,
and
from
that
perspective,
when
we
are
promoting
something
to
a
few
RFP
should
remain
the
same,
because
distance
haven't
changed,
obviously
from
the
fact
of
promotion.
But
on
the
other
side
a
science
we
promoted
Arc,
but
there
are
just
less
data
ended
up
in
in
mru,
RP
should
be
reduced
and
there
is
controversy-
and
this
is
like
I
described
mechanism.
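
(A sketch of the second option Alexander weighs: reduce the MRU target when a buffer is promoted, so that arc_p keeps meaning "bytes of recent data to hold". This is the proposal under discussion, not existing code, and the target is passed in loosely here.)

```c
/*
 * Proposal sketch: when a buffer of 'size' bytes moves from MRU to MFU,
 * shrink the MRU target by that amount, since that much "recent" data
 * just left the MRU list.  Not existing OpenZFS code.
 */
static void
on_promote_mru_to_mfu(uint64_t *arc_p_target, uint64_t size)
{
	*arc_p_target = (*arc_p_target > size) ? *arc_p_target - size : 0;
}
```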
A
Yeah, I read the original paper a long time ago, and I totally forgot that that's how it was originally proposed, but yeah, that's definitely not how it was implemented, in the sense that a block can only be on one list at a time in the actual implementation.
B
That's
just
my
assumption,
because
idea
of
mru
is
just.
We
have
the
X
amount
of
data
which
are
recent,
we
that
we
may
reuse,
but
when
we
promote
from
data
from
our
mru
to
mfu,
we
remove
them
from
mreu
and
just
the
distance
if
we
still
try
to
measure
it
in
bytes
of
of
cache
size
just
by
itself.
Sequentially
read
from
disk
are
not
the
same
as
byte
as
currently
in
cash,
because
we've
removed
things
out
of
mru
list
and
after
that,
if
it
started,
do
more
reads
and
add
more
data
to
mru.
A
Yeah, it does seem like the MRU/MFU split can become very far out of balance, and it can get stuck there, and then the weird behaviors you're talking about are even more accentuated. Mm-hmm.
G
So, Matt, you mentioned what I'll call sensitivity to a scan. If data is passed over n number of times to be potentially cached, is that adjustable in any way, or is it just hard-coded?
A
I
mean
it's
not
about
it's,
it
only
cares
about.
Is
it
scanned
over
once
versus
multiple
times?
Okay,
so
there's
no.
Like
n,
you
know,
there's
no
hit
count
in
the
arc,
like
per
block
hit
count
in
the
in
the
implementation.
B
No, it's just a basic, simple fact: okay, we are now in MRU, we got another hit, so we are promoted. It doesn't care about counters. But we do have counters for every state; we keep a counter for the number of hits for every state, for L2ARC because they are L2 hits, and for the other states they are L1 hits. So we are spending some memory on that, but it's reported only through some stats interfaces, not used for any math, yeah.
A
Yeah, I think there's probably a lot of improvement that could be done here. I think it's tricky because we don't have some reference workload that we can feed into it and then measure the hit rate, and then try a different algorithm, feed in the reference workload again and measure the hit rate.
B
Well, for media data you should never get ghost hits; as a result, you should never grow MRU much, so your MRU should stay pretty small and not care about wiping out the ARC, while MFU should be sufficient to keep all your frequently used data. That's the idea, but I'm not sure. I think, as I mentioned, there could be a mess between the distribution of data and metadata that we have in addition to MRU and MFU; that's where it could be messy.
B
Think
it's
heartbroken
now
I'm
outside
I,
used
to
think
in
which
way
to
better
solve
it
either
I
think
it
could
be
tracked.
The
distribution
data
metadata
could
be
handled
through
the
same
mechanism
of
ghost
ghost
caches,
because
here
there's
or
do
we
need
to
at
least
unify
and
don't
track
separately
data
metadata
distribution
for
those
States
at
all
for
proper
eviction,
because,
right
now
it's
is
broken.
There
was
PR.
Actually
somebody
created
to
completely
remove
Distribution
on
data
metadata
in
Arc,
but
somebody
was
screaming
loud.
No,
it's
bad!
E
Looking at it quickly, the L1 buf header might actually have a bunch more stuff in it than maybe we need.
B
I'm not sure; do you ever wonder why the distance between hits which are counted as MFU is so small? It's now something like 60 milliseconds, 16... yeah, 64 milliseconds, something like that. Why so small? Because I think on a sequential read you may end up reading indirect blocks multiple times within that window. Considering some typical slower disks and a thousand block pointers per indirect block, it works out to something like more than 100 megabytes per second.
B
Yeah, if you access it quickly, it's still in MRU, and it should be fine to stay there. So I was thinking about whether a bigger value could be better. What's actually confusing is that the comment talks about 128 milliseconds, while the actual value, if I read it correctly, means 64, like 1/16 of a second, so it's not even self-consistent. So I was thinking a bigger value might be better.
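
(For reference, the arithmetic behind the inconsistency being pointed out, assuming the window is expressed as a power-of-two fraction of the clock rate; the exact define in arc.c may differ.)

```c
/* With hz = 1000 ticks per second: */
/*   hz >> 4 = 62 ticks  ->  ~62.5 ms (the "64 ms" actual value)        */
/*   hz >> 3 = 125 ticks ->  ~125 ms  (closer to the comment's 128 ms)  */
```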
E
Right, and it depends how it's measured. I know with the old scrub timing stuff, the illumos clock ticked at 100 hertz and the FreeBSD one at a thousand hertz, and it meant the tunables did very different things with the same value.
A
I think we fixed all those several years ago, but I remember that.
B
As one possible solution for evicting data that will never be demand-accessed again, we could add one more ARC state, something like "uncached" or so, not sure about the name, but the idea is just to drop into it all the buffers that should be freed after, like, one second, and create a separate reclamation thread that just runs a couple of times per second and frees everything that remains there. That's one of my ideas. An alternative idea I had is to just put them into the dbuf cache, since
B
I
expect
them
to
be
physically
contiguous
and
there
should
be
no
duplicate
memory
and
then
on
eviction
from
debuff
Cache
that
could
be
evicted
also,
but
from
the
professional
perspective,
science
doesn't
use
debuff
cache
right
now.
That
would
be
a
bit
awkward,
but
I
think
it
would
be
good
to
allow
uncachable
data
to
still
to
stay
in
debuff
Cache.
That
would
allow
like
sub
block,
reads
to
be
still
cachable,
because
right
now
it
happens.
Weird
like
if
you
have
block
128k
but
I,
read
64k
at
the
time.
A
That might be reasonable. Do you think it would be sufficient to say, if you did a sub-block read, then we'll keep it in the dbuf cache, but if you read the whole thing, then we'll kind of assume that you got everything you could need and we would not keep it in the dbuf cache?
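
(A minimal sketch of the policy Matt is asking about, with a hypothetical helper name: keep a buffer in the dbuf cache only when the read covered part of the block, on the theory that the rest may still be wanted.)

```c
/* Hypothetical policy helper; not existing OpenZFS code. */
static boolean_t
keep_in_dbuf_cache(uint64_t read_off, uint64_t read_len, uint64_t blksz)
{
	/* Sub-block read: the caller did not consume the whole block. */
	return (read_off != 0 || read_len < blksz);
}
```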
A
Cool, that's good brainstorming. Pawel, I saw you unmuted there for a sec. Did you have another topic to discuss?
C
Yes. I would love to get some people to do the final review of block cloning. I think everything is addressed. Not everything is addressed in an optimal way, but I think we can definitely move forward and basically work on some optimizations later.
C
This
is
mostly
stuff
like
cloning
across
multiple
and
across
two
different
data
sets.
When
we
have
sync
such
cloning
operation
now,
we
will
just
wait
for
transaction
group
to
to
be
synced
instead
of
using
zeal,
but
this
can
be
added
later
exploring
your
idea,
math
to
using
Zeal
claim
yeah.
B
I
I,
don't
remember,
was
it
committed
comment
that
I
was
going
to
but
I
think
full
transaction
commit
on
every
copy
request.
So
we
think
is
is
too
much
what
the
files
are
small
and
we're
just
for
file
of
100,
kilobytes
or
even
a
few
megabytes
will
commit
transaction
Group
which
will
meet
many
megabytes
of
rides
and
cash
flushes
and
quite
expensive.
Maybe
there
could
be
some
threshold
before
which
we
just
report.
We
can
do
it
or
something.
C
I
think
the
best
idea
is
to
just
do
zero
claim,
but
but
it
needs
some
work.
Oh.
B
Yeah
yeah,
obviously
I'm
just
at
this
point
like
I
understand
that
transaction
committee
is
easy
to
do,
but
it
may
be
unacceptably
slow
like
you.
Wouldn't
it
be
better.
Just
return
error
and
let
fallback
code
to
do
manual
copy
I
think
it
would
be
much
faster
than
waiting
for
commits.
C
Yeah
well
in
in
that
case,
you
can
just
simply
don't
change
the
code
in
the
VFS
to
not
allow
to
not
call
the
the
file
system,
copy
file
range
and
just
use
generic
file
range
generally
copy
file
range,
so
this
can
be
disabled
for
now,
until
the
Zeal
claim
is,
is
used
for
that.
Oh.
C
But for now the VFS prevents that, at least on FreeBSD; on Linux I think it will allow it and will call into ZFS. We could return an error then.
A
You
could
argue
like
me,
maybe
the
first
thing
that
gets
integrated
is
like
there's
a
flag
that
says
that,
let's,
like
a
tunable
that
lets
you
choose
between,
you
cannot
you
know
zero
copy
between
data
sets
or
you
can
zero
copy
between
data
sets,
but
it's,
but
you
get
a
txua
synced.
If
you
have
sync
it,
and
maybe
the
default
is
just
you
can't
zero
copy
between
data
sets
in
the
first
implementation.
A
That
way
like
it
was
less
of
a
sharp
edge
for
these
initial
users
and
then
like
once
once
we
do
the
claiming
or
some
other
solution
to
for
the
f-sync,
then
we
can
make
that
be
enabled
by
the
you
know.
Zero
copy
across
file
systems
be
enabled
by
default.
C
I
would
actually
even
opt
for
a
third
option
where
you
simply
ignore
fsyncs
for
for
cloning,
because
I
think
that
the
the
the
most
common
use
case
would
be
to
clone
large
files
and
not
small
ones.
So
in
this
case,
I
I
personally
I
would
just
do
that
on
my
tools.
B
Well, it's okay: it's a double write, it's extra space usage, but it should be so incomparably faster that it should be better. I see. I just had a thought: what about the case of cloning from a snapshot, which was mentioned as a good use case? Would this be counted as separate datasets or not? Because it would probably be safe to copy data out of a snapshot, because snapshots are stable.
B
The
only
way
for
the
space
to
be
freed
is
actually
the
Legion
of
the
snapshot
which,
which
means
a
transaction
can
meet
by
definition,
I
think
and
then
we
should
be
safe.
A
Yeah... unfortunately, I don't think that's the case, because of what he mentioned: you could do the copy_file_range from the snapshot and then lose power, then come back up and not mount the filesystem and so not replay the ZIL, then delete the snapshot, and then later on mount the filesystem, and then it replays the ZIL. The ZIL says, oh, refer to this block, which happens to be from the snapshot that was deleted. That would be bad.
C
Yeah,
thank
you.
I
can
definitely
do
some
tunable
I
can
also
look
into
Zeal
claim.
There
was
I
think
one
complication
with
Zeal
claim
it's
that
you
that
maybe
it's
not
a
huge
complication.
I
I
didn't
write
into
the
code
too
carefully,
but
there
is
some
handling
of
rewinding
the
pull.
If
you
import
the
pull-
and
you
want
to
rewind
to
some
other
transaction
group,
you
don't
want
to
replay
or
claim
some
of
the
Records.
A
Should
be
fine
right
like
if
we're
not
in
in
those
cases,
we're
we're
discarding
the
cell
so
but
like
when
you
re
rewind
to
an
old
txg,
we're
also
discarding
the
Zill
and
and
not
using
the
Zill,
and
so
the
fact
that
we
don't
claim
or
we
don't
claim
it
and
we
don't
replay
it.
So
the
fact
that
there's
entries
in
those
the
records
that
refer
to
blocks
that
we're
not
sure
if
they
really
exist
doesn't
matter
because
we're
never
going
to
play
them
right.
C
But for me the rewind operation was mostly a way to recover the pool when the pool is in some kind of corrupted state, right? So I would think that you don't want to replay any data beyond some point, because you don't really trust any data beyond this transaction group, or something like that.
A
The behavior out of the box should be correct; we shouldn't just be ignoring an fsync. And it would be nice if the behavior out of the box didn't have extreme performance sharp edges where, oh, you know, if you fsync it, most fsyncs take on the order of milliseconds but some take on the order of seconds, because you happened to have done a copy_file_range, right?
A
So
the
thing
I
was
suggesting
where
it's
like
you
know
normally
the
behavior
is,
is
that,
like
you,
you
can't
copy
file
range
across
data
sets?
Then
you
don't
have
to
worry
about.
You
know
fsyncs
of
copy
five
ranges
across
data
sets
because
they
don't
exist.
B
Think
if
we
would
set
a
tunable
to
like
one
gigabyte
and
copies
above
one
gigabyte,
we
always
do
transaction
commit
that
I
think
should
be
acceptable
and
for
anything
below
this
crosses
it
should
return
error
and
do
manual
copy.
That
would
be
okay
for
me
and
should
be
trivial
to
implement.
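
(A sketch of the threshold Alexander suggests; the tunable name and helper are made up for illustration. Small clone requests get an error so the caller falls back to an ordinary copy, while large ones accept the txg commit.)

```c
/* Hypothetical tunable and check; not existing OpenZFS code. */
static uint64_t zfs_clone_sync_min_bytes = 1ULL << 30;	/* 1 GiB */

static int
clone_check_size(uint64_t len)
{
	if (len >= zfs_clone_sync_min_bytes)
		return (0);		/* large: worth waiting for the txg */
	return (SET_ERROR(EXDEV));	/* small: let the VFS copy instead */
}
```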
C
Semantically it's still easy enough, yeah, okay. But just so you know, there is one more case where we wait for a transaction sync: it's when we try to clone blocks that were created in the same transaction group.
C
So
I
just
wait
for
a
transaction
group
to
sync
with
this
operation
we
could
also
fall
back
to
the
copy,
but
well
again
it
depends
on
the
perspective.
I
know
that
IX
have
a
lot
of
storage
available,
but
I
would
guess
there
are
some
use
cases
where
you
want
to
save
as
much
storage
as
possible.
So
no.
B
Saving
is
good,
but
but
again
what
happened
if
we
have
a
large
Z
wall
with
multiple
concurrent
operations,
some
of
one
VM
doing
a
lot
of
rides
as
a
VM.
Doing,
who
knows
what?
But
some
other
like
VMware
itself
will
try
to
do
copy.
The
employing
and
science
like
granularity
of
the
decision
is
a
full
the
wall.
It
may
end
up
sinking
them
for
every
few
megabytes,
that's
also
quite
expensive.
B
It
would
be
nice,
I
think
if
the
code
could
differentiate
it
more
fine-grained
like
whether
this
data
is
copies
actually
overlapped,
because
having
offset
dirty
is
quite
big.
One.
C
But
if
you
copy
stuff
in
parallel,
I
think
that
you
are
okay,
only
one
one
stream
would
be
frozen
for
a
bit.
B
Like, what if one process is doing a lot of writes while another process is trying to do block cloning on a completely unrelated part of the object, so they're not overlapping, not conflicting, but since the first one always creates dirty blocks... or maybe I misremember how it's implemented; I think it was that if we have a dirty record for anything in the object, practically for the whole zvol itself... Right.
C
So,
actually
for
civil,
it's
not
implemented
yet.
So
it's
not.
A
It's probably unusual, and if you hit that unusual workload, let's fall back on copying the data rather than fall back on txg_wait_synced, because the performance implications of txg_wait_synced can be pretty extreme. In my mind, at least, I'm thinking that copy_file_range is kind of advisory: there's no guarantee that it's going to save you space. Most of the time, in most circumstances, if you're using it kind of the way that we expect you to, then you're going to save a lot of space.
A
But
if
you
know
if
the
stars
didn't
align
the
dirt,
the
data
that
you're
coming
from
is
dirty
or
what
is
different
data
set
or
whatever,
then
we're
going
to
fall
back
on
some
layer
here
is
going
to
fall
back
on
just
copying.
The
data
that
that's
kind
of
my
mindset
does.
Is
that
reasonable
for
the
use
cases
that
you're
thinking
of.
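
(A userspace illustration of that advisory mindset: try copy_file_range(2) and quietly fall back to a plain read/write when the filesystem declines, e.g. EXDEV across datasets. This is generic userland code, not OpenZFS internals.)

```c
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>

/* Copy up to 'len' bytes from src to dst; returns bytes copied or -1. */
static ssize_t
copy_chunk(int src, int dst, size_t len)
{
	char buf[64 * 1024];
	ssize_t n = copy_file_range(src, NULL, dst, NULL, len, 0);

	if (n >= 0)
		return (n);
	if (errno != EXDEV && errno != EOPNOTSUPP && errno != ENOSYS)
		return (-1);

	/* Fall back to an ordinary copy, one buffer at a time. */
	n = read(src, buf, len < sizeof (buf) ? len : sizeof (buf));
	if (n <= 0)
		return (n);
	return (write(dst, buf, (size_t)n));
}
```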
C
Who
me
yeah
yeah?
It
is
reasonable,
of
course,
actually
copy
file
range
does
have
a
place.
You
can
provide
some
Flags.
Currently
the
standard
actually
doesn't
Define
any
Flags
I
think
there
are
some
ways
on
Linux
to
or
I'm
not
sure
if,
if
there
is
but
but
I
could
imagine
a
flag
which
which
says
that
just
clone
at
all
cost,
so
don't
care
about
performance
and
the
default
Behavior
would
be
to
to
be
just
performant
and
do
whatever
is
quicker
right.
C
Yeah,
that's
that's
reasonable,
like
I
was
actually
coming
from
the
perspective
of,
but
this
was
also
different
that
my
initial
implementation
was
my
perspective,
was
to
always
clone
the
data
and
save
the
space,
but
then
I
was
also
implementing
dedicated
system
calls,
so
you
could
so
that
was
the
purpose
of
the
system
calls
right
to
to
save
the
space,
but
now
copy
file
range.
As
you
mentioned,
it
is
advisory
right.
C
So,
if
it's
this
system
call,
is
it's
not
there
for
to
guarantee
space
savings
right
it
it's
there
to
actually
speed
up
copying.
That's
the
purpose
of
this
system
call.
A
Yeah
so
then,
in
that
case,
I
mean
I
I
feel
like
getting
this
integrated
where
it
you
know
getting
this
integrated
sooner
rather
than
later,
with
the
perspective
of
like
it's
going
to
speed
up
a
lot
of
file
copies,
but
not
all
of
them
would
be
reason
would
be
a
reasonable
approach
and
then,
like
later
on,
we
can
add
more
circumstances
where
the
copies
will
be
accelerated
by
this
depending
on
you
know,
customer
needs.
C
Yeah
definitely
there
is
also
another
case
that
should
be
optimized,
where
you
do
unalign
copy
file
range.
So
for
now,
I
just
returned
an
error,
but
we
can
definitely
Implement
copying,
just
the
the
underlying
fragments
and
cloning.
The
rest.
C
The implementation is a bit more complex than that, because we probably want to do that under one vnode lock, and the generic copy_file_range, at least the one we have in FreeBSD, assumes it's not called as part of a different operation. So it will lock the vnodes on its own, so I cannot really, atomically, copy the fragments using the generic copy_file_range and clone the rest without dropping the locks.
A
Yeah, it looks like even indirectly mentioning the possibility of an early end has caused us to use the full time, but I'm glad we had this good discussion. Thanks, everyone, and we'll see you in four weeks.