From YouTube: December 2020 OpenZFS Leadership Meeting
Description
At this month's meeting we discussed: Forced export; RAIDZ expansion performance; visibility of .zfs/snapshot.
Meeting notes: https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM/edit
A: Looks like we don't have too much on the agenda for this month's December 2020 OpenZFS Leadership meeting. I'll start with just a few announcements and updates on projects we've been discussing. So, OpenZFS 2.0 was released. Congratulations to everyone who worked on it, and thanks.
A: I haven't seen any major issues or problems come up with it, so that's great. And then we also saw dRAID integrated to master.
A: So that's great. Brian, if you're on, I'd be curious to get your thoughts on the next release. I don't see him. All right, well, we can check with him next time. I think we had discussed trying to have a 2.1 release relatively soon.
A: That would include dRAID, which would be kind of an exception to our usual policy of not including major features in minor releases, but I think he had some motivations for doing that, I think.
A: All right. I saw some emails fly by on that, but I haven't...
C: Also, I don't know in what group it was discussed, but with FreeBSD branching in winter and releasing in March, it would be good if 2.1 could be released somewhere in winter. Then we could integrate it.
B: Yeah, so Brian had gone over it and pointed out a couple of minor things, and I'll update that over the weekend, and it'd be great.
B: And, importantly, if you have a second or third or other pools running, they can continue to move forward. Whereas if the unlucky pool happens to suspend while holding things like the namespace lock, then you can't even run zfs list until you deal with the suspended pool, and that, you know, degrades the user experience.
B: Yeah, so it's working. We have a bunch of test cases, but if people have ideas for other test cases, or could take a look at it or try it out, that'd be good. Cool.
A: Just a little bit since our last meeting, so it's pretty close to being fully functional.
A: There are some known issues, like if you crash at the wrong time, or if devices become degraded, like you lose a disk while you're in the middle of the expansion, that we still need to address. But aside from those corner cases, it basically totally works. It's pretty slow for some things, but those are all things that we know how to address, and we're working on them.
A: So if folks want to do more testing on that, that would be great. I was planning to get out, like, an alpha... I guess the current one is an alpha release. I was planning to put out, like, a beta by the end of the year that would draw more attention to the fact that we've made all this progress, yeah.
A: It should work. So the latest things that we've done are getting it so that, after you do the expansion, the writes are happening with the new data-to-parity ratio, which means that it takes up less space on disk, and also the performance is better: you get basically the normal performance. Whereas when you're reading or writing things that are, like, misaligned, then it has to generate a lot more ZIOs. There's actually some work that I'm...
A: What I'm working on right now is splitting out some of the performance work to integrate separately, so you'll see a review for that soon. What I'm addressing is one of the big problems after you do the expansion: when you're doing a read, and you're reading a block that was written before the expansion, it's, like, kind of going diagonally across the disks, because it was allocated originally as four wide and now you have five disks; then the parity is, every sector...
A: The parity is changing to a different disk, rather than, like, the parity all being on one disk for that given block. So the way we deal with that is by creating a bunch of ZIOs, like one for each sector, essentially, and then, like, stitching it all back together, which works great and integrates well with the existing code.
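To make the "diagonal" picture concrete, here is a toy C model (not OpenZFS code; the layout math is simplified from the description above) of what happens to a block written on a four-wide RAIDZ after the vdev is reflowed to five disks: sectors keep their allocation order but land row-major across the new width, so each row of the block starts on a different disk, and a naive read path ends up issuing one child I/O per sector.

```c
/*
 * Toy model, not OpenZFS code: where do the sectors of a block that was
 * written on a 4-wide RAIDZ land after the vdev is reflowed to 5 disks?
 */
#include <stdio.h>

int
main(void)
{
	int old_width = 4, new_width = 5;
	long block_start = 0;	/* linear sector index where the block began */
	int block_sectors = 8;	/* two old rows of 1 parity + 3 data */

	for (int i = 0; i < block_sectors; i++) {
		long lin = block_start + i;
		printf("sector %d: disk %ld row %ld  ->  disk %ld row %ld\n",
		    i, lin % old_width, lin / old_width,
		    lin % new_width, lin / new_width);
	}
	/*
	 * Because old_width != new_width, consecutive rows of the block
	 * no longer start on the same disk, which is why the current
	 * read path creates one ZIO per sector and stitches the
	 * results back together.
	 */
	return (0);
}
```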
A: But, just, man, allocating and de-allocating all of those ZIOs, and even all of the ABDs associated with them, is very, you know, it's very slow; it uses a lot of CPU to do that. Because you're talking about, for a 128K I/O, let's say: if the layout matches, like the logical and physical match, then you're only dividing that 128K across the five disks, so you have like five ZIOs. Versus, you know...
A: The worst case is, like, you have a 128K I/O and you have ashift nine; then you have like 256 ZIOs and ABDs that you're allocating (128 KiB split into 512-byte sectors is 256 of them), which just takes a lot more CPU. Oh...
A: Yeah, I've done a bunch of benchmarking on that, and we have a bunch of work in progress that's, like, kind of rough-and-ready quality, but it works, and it improves the performance to be pretty good. I don't remember the numbers off the top of my head, but it was like more than half the normal performance. So, you know, it wasn't all the way to the normal performance, but it was pretty close. And this is, I think, a system with like eight CPUs driving...
A: I forget how many... driving a bunch of SSDs. So, you know, throwing more CPUs at it would improve things further.
A: So the work I'm breaking out right now is basically being able to, like, pre-allocate the ABD and have it embedded in another data structure. So, when you're creating an ABD that's just, like, referencing another one, you can provide the actual abd struct that it'll initialize in place. That way, you don't have to do the kmem alloc for the abd_t itself.
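A rough sketch of that idea in C (all names here are invented for illustration and are not the actual OpenZFS ABD API): the caller embeds the abd_t in its own per-sector structure, and the ABD code fills it in as a view into an existing buffer, so no separate allocation is needed.

```c
/* Sketch only: hypothetical names, not the real OpenZFS ABD API. */
#include <stddef.h>

typedef struct abd {
	struct abd *abd_parent;	/* buffer this view references */
	size_t abd_offset;	/* offset of the view into the parent */
	size_t abd_size;	/* length of the view */
} abd_t;

/*
 * Initialize a caller-provided abd_t in place as a view into 'parent'.
 * No kmem alloc: the struct's memory belongs to the caller.
 */
static void
abd_init_view(abd_t *abd, abd_t *parent, size_t off, size_t size)
{
	abd->abd_parent = parent;
	abd->abd_offset = off;
	abd->abd_size = size;
}

/* Per-sector bookkeeping with the abd_t embedded in it. */
typedef struct sector_info {
	abd_t	si_abd;	/* embedded; initialized in place, never malloc'd */
	int	si_col;	/* which child disk the sector lives on */
} sector_info_t;

static void
setup_sectors(sector_info_t *si, int nsectors, abd_t *disk_buf,
    size_t sector_size)
{
	for (int i = 0; i < nsectors; i++) {
		abd_init_view(&si[i].si_abd, disk_buf,
		    (size_t)i * sector_size, sector_size);
		si[i].si_col = i % 5;	/* toy placement across 5 disks */
	}
}

int
main(void)
{
	abd_t disk_buf = { NULL, 0, 64 * 512 };
	sector_info_t si[64];

	setup_sectors(si, 64, &disk_buf, 512);
	return (0);
}
```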
A: Yeah, that's all handled. I mean, this type of ABD is just the one that references another one, so we aren't, like, setting up any pages; there isn't any, like, memory that's actually exclusive to that ABD. Because, like, to improve the performance that I mentioned...
A: Basically, what we do instead is we do the aggregation at the RAIDZ level. So that, instead of creating one ZIO for each 512 bytes, for each ashift-sized sector, we create one ZIO for each disk, just like we did before. But we still want to use the existing, like, RAIDZ parity generation and reconstruction logic.
A: So we need ABDs that point into each sector of that, and those are the ones that I'm talking about, like, pre-allocating. So, with the new way of doing it, you only have, like, the number-of-disks number of ZIOs, and you have a lot of ABDs, but with this change I'll be upstreaming soon, they're very, very lightweight.
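Putting the two pieces together, here is a toy C program (again with invented names, sketching the approach as described rather than the real code): one aggregated read per disk, plus cheap per-sector "views" into each disk's buffer for the parity and reconstruction logic, instead of one heavyweight ZIO per 512-byte sector.

```c
/* Toy sketch with invented names; not the real RAIDZ code. */
#include <stdio.h>
#include <stdlib.h>

#define	NDISKS	5	/* 4 data + 1 parity column (raidz1) */
#define	SECTOR	512	/* ashift = 9 */

typedef struct view {
	char	*base;		/* the per-disk buffer */
	size_t	off, len;	/* this sector's slice of it */
} view_t;

int
main(void)
{
	size_t data = 128 * 1024;		/* 128K logical block */
	size_t per_disk = data / (NDISKS - 1);	/* 32K read per column */
	size_t nsect = per_disk / SECTOR;	/* 64 sectors per column */
	char *bufs[NDISKS];
	view_t *views = calloc(NDISKS * nsect, sizeof (view_t));

	for (int d = 0; d < NDISKS; d++) {
		bufs[d] = malloc(per_disk);
		/* issue ONE read per disk here: 5 child I/Os total */
		for (size_t s = 0; s < nsect; s++) {
			/* no I/O and no buffer copy: just a pointer */
			views[d * nsect + s] = (view_t){ .base = bufs[d],
			    .off = s * SECTOR, .len = SECTOR };
		}
	}
	printf("%d child I/Os, %zu lightweight per-sector views\n",
	    NDISKS, (size_t)NDISKS * nsect);
	for (int d = 0; d < NDISKS; d++)
		free(bufs[d]);
	free(views);
	return (0);
}
```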
C: Okay, yeah, we'll see. Still, fifty percent sounds scary to me, but maybe for somebody who is less centered on performance, it...
A: ...useful. I'll look up the numbers, too, and I can give you the real numbers. Speaking of ABDs, unrelated...
A: Yeah, yeah, so that means, like, the overhead is one-eighth of what I was measuring there, since a 4K sector covers eight 512-byte sectors.
C: Well, even in production, we have right now one case we are investigating, and they have performance problems, with CPU usage going to about 50% on quite decent hardware. And there we have something like 16K... zvols on top of a five-wide, or 16K zvols on top of a five-wide RAIDZ. So even chunking at 4K or 8K is pretty significant; it's a significantly increased number of I/Os, overhead. So even that is not perfect and great, yeah.
A: But you should get... but with RAIDZ expansion, you should still get better performance than what you're talking about, as long as you have a large block size, because of the, like, integrated aggregation at the RAIDZ level. So you won't be doing ZIOs for every 4K; you'll only be doing the ZIO, you know, for the whole block.
A: Yeah, so that's the big one, because, I mean, like, ZIOs are way, way higher overhead than ABDs. Even so, without that change, you know, you're talking about like a tenth of the performance with ashift nine. So... oh.
C: Yeah, that sounds better. It's just, it's been a while since I looked at the expansion code last time; I don't remember it being there.
A: Yeah, that's new! It's in a PR against the expansion code; it's not in the main expansion code yet, so that's why you haven't seen it. Yeah, if I recall correctly, with the 4K sector size, the performance of those reads was, like, very close to without the expansion. It was like...
A: You know, you're paying for the one extra disk's data, because you have to read from all the disks, not just the ones that have the data. Right, because, you know, if you have like a five-wide RAIDZ1, normally a read would just be reading from four disks, but post-expansion you have to read from all five of the disks, because they all have data, and, and...
A: Yeah, so there's a little bit of overhead, but I think that, with the 4K sector size, it was pretty minimal, like less than 10%...
A: ...overhead, and that was of overall performance, with the eight CPUs. I'm sure that the CPU usage is higher, so there's still some more, you know, stuff to be measured there, I think, for your use case.
A: There's also work to be done on the performance of the expansion itself, which is very slow now, because it's doing a sector at a time, like creating a ZIO for every sector, and I'm not sure if we will get to addressing that before integration.
A: But since that's kind of a temporary thing... "oh, your expansion takes a long time", you know, oh well.
C: Since we don't have other things, I have one quick FreeBSD question. Does anybody know why we have a tunable for the ABD block size? Because, obviously, setting it less than the page size makes no sense, since it automatically drops to smaller blocks when even needed, and setting it higher is just a killer for the VM subsystem, for the UMA caches.
C: What I think would be interesting to do with that is some kind of ABD magic with the GEOM layer, so that we could skip some copying and write directly from the ABD.
B: Thanks, but I had a similar question, actually. I was looking at a kind of... well, I don't know if it's actually a regression, but a user reported, after upgrading to OpenZFS 2.0 on FreeBSD, that the .zfs directory doesn't work the same as it used to, specifically in a jail.
B: So when I looked in the code, in the, you know, .zfs control directory stuff: when you go into .zfs/snapshot on FreeBSD, there is a comment that "this is nasty", and then it changes the thread credentials to kcred, to basically root's credentials, and then does the mount, and then switches it back to the original credentials, so that a regular user can access the .zfs/snapshot directory. Whereas normally users can't mount, and if you enable usermount, there's a requirement that the user has to own the directory they're mounting on, and obviously the user doesn't own .zfs/snapshot/<name of the snapshot>.
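As a toy model of the hack being described (simplified; not the actual FreeBSD/OpenZFS source, whose details live in the .zfs control-directory code): the thread's credential pointer is swapped to root's credentials just long enough to perform the snapshot automount, then restored, so every other permission check still sees the real user. The catch, discussed next, is that this fakes only the thread's credentials, not the process's.

```c
/* Toy model of the credential swap; simplified, not the real code. */
struct ucred {
	int	cr_uid;
};

static struct ucred kcred_storage = { .cr_uid = 0 };	/* root */
static struct ucred *kcred = &kcred_storage;

struct thread {
	struct ucred	*td_ucred;
};

/* Stand-in for the real mount call, which checks td->td_ucred. */
static int
do_mount(struct thread *td, const char *snapname)
{
	(void) snapname;
	return (td->td_ucred->cr_uid == 0 ? 0 : 1);	/* EPERM otherwise */
}

static int
snapdir_automount(struct thread *td, const char *snapname)
{
	struct ucred *saved = td->td_ucred;
	int error;

	td->td_ucred = kcred;		/* "this is nasty": become root */
	error = do_mount(td, snapname);	/* mount the snapshot */
	td->td_ucred = saved;		/* restore the caller's creds */
	return (error);
}

int
main(void)
{
	struct ucred user = { .cr_uid = 1001 };
	struct thread td = { .td_ucred = &user };

	/* Succeeds even for uid 1001, thanks to the temporary swap. */
	return (snapdir_automount(&td, "mypool@today"));
}
```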
B: Some recent cleanup changed the in-global-zone macro to use curproc instead of curthread, so that code would need to be updated to fake the credentials of the process instead of the thread. But the question is: is it okay to do that? It's how FreeBSD has always done it, it seems, but does it make sense to allow regular users to be able to mount snapshots, and especially to do that from inside of a jail?
D: Does FreeBSD have the notion of, like, delegating datasets to a jail?
B: It does, and then they could mount it normally there. I don't actually know how the .zfs stuff responds when it's a delegated dataset. In this particular user's use case, the dataset is only mounted on the host and just happens to be inside the chroot that the jail is based on, so the jail itself doesn't have visibility of that dataset; like, you can't zfs list the dataset that it's ending up mounting.
D: Yeah, I'd have to go back and look on illumos for what it does for zones. Or, actually, the other thing... because my initial thought would be: if it's delegated, then yeah, it should work. But if it's just, you know, a subdirectory of a dataset, you know, it's maybe a little different. But I can't remember now; it's one of those things where you just never think about it. How...
B: Yeah, I never thought about how, you know, a user in a jail being able to access .zfs/snapshot worked, or the fact that nothing ever seems to unmount those. They get mounted with the ignore flag, so the regular mount command doesn't display them, but if you do mount -v, you can see the list of them.
B: It also came up because, while testing this, I broke SNMP on one of our servers, because when there were over a thousand snapshots mounted, the UDP buffer wasn't big enough to send back the list of all the file systems on the machine.
D: Does the .zfs work, though, if, for example, just part of the dataset is visible in the zone? It doesn't seem to... so, I mean, just as...
B: In this case, the whole dataset is inside the zone's chroot; like, its mount point is inside the chroot, but the dataset is not delegated to the jail.
B: And so the .zfs is visible; it's just that, because the credential hack isn't working right now, if you go in there and try to, you know... you can ls and see the list of snapshots, but if you do ls -l, you can't actually stat the directory, because you don't have permission, basically, because it fails to mount it when it tries to mount it to get the stat.
C: I'm sorry, I just want to say that, while we're not using jails, no delegation to jails, we're using previous versions of files for Samba, right.
B: I think it's only in master; like, I think it was October, maybe.
B: Yeah, but it was you guys that broke it, by fixing something else.
B: She made the changes for the in-global-zone check to be able to not go through, like, two functions and three macros to find out which jail you're in; a much more direct route. It looks much nicer. It's just that the hack only fakes the credentials for the thread, not the whole process, and now we examine the credentials of the process.
A: In terms of, like, how it should work: I think the .zfs/snapshot directories work over NFS, yes? So, like, it makes sense that it should also work locally, without having any special permissions, by default.
B
Yeah,
I
think
that
it
seems
to
make
sense
and
yeah,
like
you
said
nfs
and
the
the
samba
use
case
are
both
pretty
big
yeah.
A: By the way, Alexander, I looked up the numbers, and I was getting 85% of the normal performance, with all the performance improvements, for the reads post-expansion, and that's with the 512-byte sector size. So that's probably a worst case.
C: Yeah, even if it's 85% of performance, not of spent CPU, that's not as terrible, especially if it's for the worst case; who could tell, yeah? I... same as I always discourage people from removing vdevs unless they really need it, I guess it'll still remain: it has its shortcomings, but it's good to have it, even so.
A: Cool. What other topics do we have to discuss?