From YouTube: November 2020 OpenZFS Leadership Meeting
Description
At this month's meeting we discussed: OpenZFS 2.0 release schedule; non-interactive io scheduling.
meeting notes: https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM/edit
A
All right, welcome to the November 2020 OpenZFS leadership meeting. Thanks everyone for being here. We have just a few items on the agenda. Why don't we start with some updates from Brian on the OpenZFS 2.0 release schedule, and then dRAID.
B
Sure. So we're pretty much all the way there. We've had five release candidates now, and it's good that we had them; we had some good soak testing going on. We caught a couple of small issues, nothing too serious, but it looks like everything's come together pretty nicely, so I'm getting ready to tag the final release.
B
So hopefully that will come out this month, hopefully early this month. It's pretty much all set, so we just need to wrap it up.
A
Cool. Well, early this month might be… you may have to stretch the definition of early there, but it would be great to see that land this month.
B
Yeah, I think that's still an open issue: if the pool gets suspended while you're writing to or rebuilding the L2ARC. I don't think that's handled at the moment, just exporting the pool, but I don't think we've seen that in any testing yet. I think it probably is a real issue, although if your pool gets suspended, maybe you're already in a bad spot.
D
Brian, about the release: I was going to ask, do we have any list of things that are going in? Are we not planning any more merges there at all unless it's absolutely critical, or what's planned at this point? For example, the one fix of mine you merged today: will it go in, or should I create a PR, or is it not going in?
B
I think anything that's in the master branch currently we might cherry-pick at the last minute for the 2.0 release. There's some stuff there, little FreeBSD fixes, some documentation cleanup, that kind of stuff that we'd like to have but that's super low risk, right? So I think that stuff can go in before the final release. Bigger changes, probably not, although I…
B
It's already in the master branch. I've tried to pick most things back that were small and important, so that stuff I'll try to pull in again before the final release. If you've got something in mind, definitely let me know and we'll talk about it and figure out whether it makes sense or not. I think also we can totally talk about doing more frequent point releases after 2.0, just to get in the habit of doing that.
A
Cool. Then should we just mention, I think there were a few PRs that are nearing completion or need some eyeballs, one being the dRAID one, yeah?
B
That one pretty much just needs to be merged as well. Thank you, Matt, for your time reviewing that; that was really valuable and got us pretty much over the finish line. So I expect that'll merge this week to the master branch. Me and Mark were just…
A
That's great. That's a huge amount of work coming to fruition, after I think more than five years of work by, you know, at least half a dozen people at several different companies.
A
I think it's a great example of the collaboration across companies that we've facilitated with the OpenZFS project.
C
We've got a version of forced export that works and actually passes the tests now, which is nice. So yeah, we'd really like people to look at that. It works very well on FreeBSD and does actually work on Linux as well. The main use case, as I mentioned before, is if you have more than one pool and one of them goes away for one reason or another.
C
Then you want to be able to force-export that one, even though you know it's never going to unsuspend.
C
Basically, the use cases are, you know, remote devices or something, where you have a local pool that you're using normally and then a remote pool that you're, say, just using for send/receive or something. If your connection to the remote devices goes away, that pool gets suspended, but sometimes you want to be able to just force-export it rather than trying to get it back.
A
All right. And are you expecting any more work on that, or is it ready to go, as far as you know?
C
It's ready to go. You know, it depends what everybody else thinks of it, but in general I think it's good to go.
B
Let me try to give you some feedback on that. That's one of the features we've seen requested the most over the years, so I'm pretty excited about it.
A
Cool. Well, speaking of highly requested features, I am back to working on RAIDZ expansion. It's been kind of coming in fits and starts, but we have a renewed push to make some serious progress on it by the end of the year.
A
So we're hoping to get a final reviewable PR by the end of the year. We might not quite make that, but hopefully you'll be seeing a bunch of updates, and I'd love for folks who are interested to help test it. We'll have a bunch of new features coming to it, probably later this month, like writing with the new data-to-parity ratio after the expansion, and performance improvements when you're reading and writing to the RAIDZ after the expansion.
D
Oh yes, thanks. For the last several weeks I've been trying to investigate how to prioritize what I call interactive I/O, which at this point is both sync and async, compared to scrub, initialize, resilver, whatever we're doing in the background, because the timescale of the first is half-seconds or milliseconds, and the second part is minutes, hours, forever. I tried several approaches: I experimented with SCSI and ATA priorities, and unfortunately I haven't found anything usable there.
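[A rough illustration of the interactive/background split being described, using a locally defined priority enum; the classes mirror ZFS's per-queue I/O priorities, but the names here are illustrative, not the actual OpenZFS identifiers:]

    #include <stdbool.h>

    /*
     * Classify I/O into "interactive" (latency-sensitive: the live
     * workload's reads and writes, sync or async) versus "background"
     * (scrub/initialize/resilver, which run for minutes to hours).
     * Local, illustrative definition only.
     */
    typedef enum io_priority {
            PRI_SYNC_READ,
            PRI_SYNC_WRITE,
            PRI_ASYNC_READ,
            PRI_ASYNC_WRITE,        /* last interactive class */
            PRI_SCRUB,
            PRI_INITIALIZING,
            PRI_RESILVER,
    } io_priority_t;

    static bool
    io_is_interactive(io_priority_t p)
    {
            /* Sync and async reads/writes serve the live workload. */
            return (p <= PRI_ASYNC_WRITE);
    }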
D
Yes. SCSI priorities seem to be not supported, and ATA priorities are just not doing what I want from them: they orient more toward high priority, not low priority, and marking everything high priority just kills IOPS, practically handling all requests in order of arrival, which is not fast, obviously. So I went higher, to ZFS, and for several days now there has been a pending PR which throttles non-interactive requests.
D
It throttles them if any interactive requests are active, or if they have not been completing within the last several completed requests. Practically, what I am fighting is that if a hard drive sees a large sequence of sequential requests, it quite often gives up on the random ones, handling some of them sometimes once in four seconds, which is obviously disastrous for any interactive workload. With my patch I see a dramatic improvement in worst-case latency, from four seconds down to one-eighth, or at least one-quarter, of a second, so like 16 times lower latency, which I think is great.

D
The one question I have left: the part with scrub works as well as it can, I believe; I don't have any questions about that part. But I'm now looking at the higher-level part, differentiating sync versus async requests, and that's not such a straightforward science. Obviously async requests are not as high priority, but they also must be done within a reasonable time, within seconds.
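[A minimal sketch of the throttling heuristic just described: background I/O is held to queue depth 1 while interactive I/O is in flight or recent, and only opened up once the queue has been calm for several completions. All names and constants are hypothetical, not the PR's actual identifiers:]

    #include <stdbool.h>
    #include <stdint.h>

    #define BG_CALM_COMPLETIONS 5   /* completions with no interactive I/O */
    #define BG_MAX_ACTIVE       3   /* background depth once the vdev is idle */

    typedef struct queue_state {
            uint32_t ia_active;     /* interactive I/Os in flight */
            uint32_t calm;          /* completions since last interactive I/O */
    } queue_state_t;

    static void
    io_issued(queue_state_t *q, bool interactive)
    {
            if (interactive) {
                    q->ia_active++;
                    q->calm = 0;
            }
    }

    static void
    io_done(queue_state_t *q, bool interactive)
    {
            if (interactive) {
                    q->ia_active--;
                    q->calm = 0;    /* interactive traffic seen: reset */
            } else if (q->calm < BG_CALM_COMPLETIONS) {
                    q->calm++;
            }
    }

    /*
     * Background I/O (scrub/initialize/resilver) is never blocked
     * outright, so its completions keep incrementing the calm counter
     * and the gate cannot deadlock.
     */
    static uint32_t
    background_depth_limit(const queue_state_t *q)
    {
            if (q->ia_active > 0 || q->calm < BG_CALM_COMPLETIONS)
                    return (1);
            return (BG_MAX_ACTIVE);
    }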
D
We cannot give up on them for a long time. And one more complication: for writes, if the drive has a write cache and we have already pushed tons of data into it, I'm not sure the drive will react to us quickly anyway, no matter what we do. So my question was about the motivation of the code we have now, in particular for writes. What was the…
D
Did it really show benefits, increasing the write queue depth following the dirty data amount in the ARC? Because if we have the write cache enabled, then I'm not sure there's any throughput difference between a queue depth of one and a queue depth of ten, minimum or maximum. It just pollutes the drive queue.
A
Yeah, so at a high level, it does make sense to me to address these truly background activities separately from reads and writes, even the async reads and writes, which is what you've done, and that seems…
A
…okay to me; I was just a little confused by the terminology, which I think we can clear up. For the async writes and increasing the queue depth: yeah, we did see that it matters, and it helps a lot when you have workloads where you have a lot of reads going on, and I'm pretty sure that we tested this with just plain hard disks.
A
Ultimately, I think that even with a write cache there's only so much stuff that you can throw into it. Before we made that change, the defaults were, I think, a write queue depth of 10 and a read queue depth of 10. And what you would typically see, especially for a hard disk, is that 10 is a lot, right, because it can only actually do one thing at a time.
A
So you have some sync reads coming in, but you don't necessarily have a very high queue depth of sync reads: you have a couple of threads that are coming in hitting it with reads, so maybe the sync read queue depth is typically one or two.
A
If you always have 10 writes outstanding to the drive, then it's only seeing one read at a time, and even if it's processing them in random order, 90 percent of the time the drive is going to be doing the writes and 10 percent of the time…
A
…it's going to be doing the reads. So what you would see, if the write workload is not totally hammering the disk, is that essentially while you're syncing you hammer the disk, and then while you're not syncing you're not doing any writes.
A
So the point of that scaling-up is to smooth out the write workload, so that when the write workload is not at the extreme we are only sending out a few writes to the disk at a time, so that the latency of synchronous reads remains okay.
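[The scaling-up being discussed works roughly like the sketch below: the allowed async-write queue depth ramps linearly with the amount of dirty data, so a light write workload trickles out at low depth instead of alternating between hammering the disk and going idle. The 30%/60% thresholds and the 1 to 10 range echo the stock zfs_vdev_async_write_* tunables, but the function itself is illustrative:]

    #include <stdint.h>

    /* Illustrative async-write queue-depth scaling, not the ZFS source. */
    static int
    max_async_writes(uint64_t dirty, uint64_t dirty_max)
    {
            const int min_active = 1;   /* cf. zfs_vdev_async_write_min_active */
            const int max_active = 10;  /* cf. zfs_vdev_async_write_max_active */
            const uint64_t lo = dirty_max * 30 / 100;  /* start ramping at 30% */
            const uint64_t hi = dirty_max * 60 / 100;  /* fully ramped at 60% */

            if (dirty < lo)
                    return (min_active);
            if (dirty >= hi)
                    return (max_active);

            /* Linear interpolation between the two dirty-data thresholds. */
            return (min_active +
                (int)((dirty - lo) * (uint64_t)(max_active - min_active) /
                (hi - lo)));
    }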
D
The problem with those queue depths: if you have a write cache, then while the cache is empty you put data there quickly, the write latency is very low, and you barely ever reach 10 requests of depth; they complete faster. On the other side…
A
Yeah, that may be; I haven't looked in detail at how it behaves with the write cache on that kind of micro scale. So I guess what you're saying is that if you have the write cache enabled, then when the read comes in, the write cache is already full, the drive is already busy doing write-back from the write cache, and so the read is already going to have high latency regardless of whether we have a bunch of writes outstanding.
D
Yeah, and the additional writes outstanding don't give us more bandwidth, because the cache is already full, but they increase latency and pollute the queue, reducing the queue depth available for other traffic, because for SATA we have only 32 tags. And on FreeBSD, where the maximum I/O size is now increased to one megabyte, each one-megabyte request means eight tags, because of splitting into 128-kilobyte chunks.
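[To spell out the arithmetic: 1 MB / 128 KB = 8 tags per request, so just four such one-megabyte writes are enough to occupy the entire 32-tag SATA NCQ queue.]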
A
Yeah, so are you saying that having the scaling-down is good because it will help to mitigate that, or are you saying scaling down isn't really going to help that much because of the write cache?
D
No, I just don't see the point in scaling up so high; that's why. Staying low is the least bad thing we can do with the write cache enabled. With the write cache disabled, I can believe that we may benefit from pushing more there, since there will be latency in the drive.
A
Okay, yeah. I mean, it's been 10 for 20 years, so I don't think we've changed that, but yeah, that's definitely a good question. And for hard disks, yeah, there's really no reason for any of the queues to send…
A
…you know, more than one or two at a time. I think the issue is that other types of hardware storage devices can do more at once, like SSDs, right? And at least at the time we were doing this, which is maybe five years ago, there still was not a good way to detect…
A
…what that curve of queue depth versus throughput actually looks like. For a hard disk it's basically flat: you get to queue depth 1 and the throughput doesn't really increase as you increase the queue depth. For SSDs it's closer to linear, at least for a little while, and for other things, like arrays that are underneath ZFS, it's even weirder. But yeah.
D
Well, I think it significantly depends on whether caching is enabled or disabled. So another related question I have on this topic is what people are typically using, because in many sources I see mentioned that Linux may disable it by default, that Solaris or illumos in some cases may disable it by default; FreeBSD doesn't. So what was…
A
Solaris or illumos did, pre-ZFS, disable it by default, and then with ZFS, when you give it the whole disk, it would enable it by default and then, you know, rely on the cache-flush commands.
D
That's what I heard. What about Linux? Because I'm not that versed in that area: what are people using? I know there are mechanisms to control it, but what's our default?
D
Okay. It's just, Brian, I looked through the history, and I see that on Linux, on OpenZFS, we now have a minimum, I think, depth of two. You increased it several years ago, when somebody complained that on his disks, with the write cache disabled, there was lower write performance, and there was a kind of consensus that two is somewhere in the middle.
B
Yeah, we'd have to go back and look at the issue linked from the pull request; there's probably something there about why.
D
No, it makes sense when the write cache is disabled, because without a queue depth of at least two you just cannot write sequentially without waiting a full rotation every time. It's just physics.
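[For example, at 7,200 RPM one rotation takes 60 / 7200 ≈ 8.3 ms, so with the cache off and only one write in flight, a drive that waits a full rotation between sequential writes tops out around 120 writes per second regardless of write size.]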
A
Yeah, I've always wanted to do something where we could detect this dynamically, because you never know how the hardware is going to lie to you, and even if we knew HDD versus SSD, it's hard to know, okay, so what's the right value for SSDs? So it would be really sweet if we had something that would…
A
…you know, actually do at least reads, or maybe reserve some regions where it can do writes, when you open the pool or when you create the pool the first time. It would measure the performance at queue depth one, two, three, four, five, see at what point we reach the plateau of throughput, and then set the max queue depth based on that.
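[A minimal sketch of that calibration idea: probe throughput at increasing queue depths and stop at the plateau. Here measure_mbps() stands in for a benchmark pass against the device (reads, or writes to a reserved region); it is an assumed callback, not an existing OpenZFS interface:]

    /* Benchmark callback: sustained MB/s at a fixed queue depth. */
    typedef double (*measure_fn_t)(void *dev, int queue_depth);

    /*
     * Probe throughput at increasing queue depths and stop once the
     * marginal gain falls below 10%; the resulting depth is the knee
     * of the curve (about 1 for HDDs, higher for SSDs and arrays).
     */
    static int
    calibrate_max_queue_depth(void *dev, measure_fn_t measure_mbps,
        int depth_limit)
    {
            double best = measure_mbps(dev, 1);
            int depth = 1;

            for (int d = 2; d <= depth_limit; d++) {
                    double mbps = measure_mbps(dev, d);
                    /* Less than 10% better than the last plateau: stop. */
                    if (mbps < best * 1.10)
                            break;
                    best = mbps;
                    depth = d;
            }
            return (depth);
    }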
D
They have a very heavily differentiated read-versus-write workload; yeah, that's why they have quality of service for it, but they can do writes as fast as it allows. I'm not sure anybody else has the same characteristics, but right, yeah. Now, in the scrub code I implemented, I was thinking about latencies, but I ended up with a different approach: if we send five scrub requests and during that time receive none of the interactive responses, then obviously they are starving, and we must stop the scrub, let those complete, and then continue.
D
There's nothing much more to it. We can let it go as long as we want, as long as it doesn't interfere with the workload, but…
A
Yeah, I mean, there is the way that it used to be. I think this is maybe pre-sequential-scrub: it was very aggressive about having the scrub back off whenever there was any other I/O; basically, the scrubs would pause for like a tenth of a second.
D
I looked at that code, and we actually had those delays even after sequential scrub was implemented.
D
But I found that it just cannot work reasonably, because it delays how I/Os are put into the queue, not how they are handled, and the delay is like five megabytes in flight, and five megabytes of random I/Os is a lot of delay. So to make it work just by delaying, it has to be very slow.
A
Yeah, I don't know. I mean, even the behavior that you've implemented: as long as there are some other I/Os completing, is the behavior the same?
A
Even if other I/Os are completing… so I'm thinking about the non-HDD case, like you have an SSD or something, and we're completing lots of all kinds of I/Os, and the disk can actually do lots of things at once.
D
Yeah, I experimented with some SSDs. Even on a high-end SAS SSD, not NVMe but still pretty fast, I saw a latency difference between queue depths of one and two, and especially three. So there is nothing free: the latency grows. Of course it's maybe not as dramatic as with hard disks, but the latency is growing.
D
Yeah, but it's tunable, like somebody doing SSD-only could always increase the minimum. I haven't hardcoded the value of one; the minimum could be increased to two, and it will follow.
A
I mean, to me all this stuff makes sense to have as options. Changing the default behavior to say, essentially, when we've detected that we're starving somebody else, let's back off: that makes sense for all hardware storage types. Changing the default queue depth when there's other I/O going on seems like something that we could discuss orthogonally, and evaluate the impact on SSDs and whatnot.
A
Right, to me the starvation fix seems like a more clear win in all scenarios. And I guess, for the change of the scrub queue depth…
A
I guess I would just like to see more data on how it really impacts scrub performance and read latency on different types of hardware, to justify a change in the default, if that makes sense.
D
Well, because part of that is hardcoded in the code that drops to the minimum, because previously the minimum value was literally never used; ZFS always…
D
Yeah, so, kind of: if somebody wants to, he can set the minimum value higher and sort of restore the previous behavior. For writes, I think, we do use the range, but again, it's tunable.
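[The knobs in question appear to be the per-class vdev queue module parameters, e.g. zfs_vdev_scrub_min_active and zfs_vdev_scrub_max_active for scrub, and zfs_vdev_async_write_min_active / zfs_vdev_async_write_max_active for async writes.]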
A
Yeah, this is Richard. We did a bunch of work on this a while back, and it gets really difficult, because for writes you're writing to a volatile write-back cache, and it's damn near impossible to ever predict the impact of writes versus latency versus queue depth. So it's extraordinarily difficult, but there are some other techniques we can do, and it's maybe a good chance for a conversation at the bar someday.
A
About other things we can do in that whole ZIO pipeline to improve our estimation of the performance of the…
A
Okay, I think this is the last item that we had on the agenda, so any other thoughts on this, or other topics folks want to cover?
A
All right, in that case, our next meeting is going to be in four weeks, on December 8th, and it will be, I think, at the same time as today, because the last one was in the morning. Yes, so the next meeting will be four weeks from today, December 8th, same time as today. Enjoy your extra 20 minutes back from this meeting. Bye, everyone.