OpenZFS Leadership, 16 Sep 2020

From YouTube: September 2020 OpenZFS Leadership Meeting

Description

At this month's meeting we discussed: OpenZFS Hackathon; per-pool ARC stats; ZED; OpenZFS 2.0 RC

https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM

A

All right, it looks like it's one after the hour, so let's get started. The agenda's pretty light today. I thought I'd start by revisiting the OpenZFS hackathon.

A

I know I blabbed on about this last month a bit, but since we have time, I thought I would mention it again. Let me share my screen.

B

Let me find the right window. I was gonna ask, do you have a Google spreadsheet or something again?

C

To try to organize what.

A

I'm going to show you, if I can find the right thing to click on. Sorry.

C

I don't see why it's not showing up in the right window. There.

A

So let me put the link in here, because you all have editing access. So yes, there's a spreadsheet, and in the spreadsheet... oh, here we go, now you're showing it to me. All right, now can you see the spreadsheet on the screen as well? Cool.

D

A

Yeah, so there's a spreadsheet like we had last year. The thing that's a little bit different this year is that we'd like to try to do some more organization work before the hackathon, so that folks can find other people to work with remotely, because it's an all-online conference this year. So definitely feel free to put in your ideas. If you're just working by yourself, that's cool too.

A

But if you would like help with some idea, then what we'll try to do is set up separate Zoom meetings for each project. I've bolded the ones here of folks that have already committed to leading one. So Josh is going to be leading a hackathon project about the pool compatible features.

A

Serapheim is going to be leading a newcomer session for folks that are just getting started with development. And Muhammad, who you might remember from, I think, last year's hackathon, where he was working with drives that have multiple actuators, is working on integrating ZFS and MinIO, so he wants to lead a hackathon session on that.

A

So we need more folks who are interested in leading those hackathon sessions. We'll give you some time at the beginning of the day, just a few minutes, to pitch your idea and explain it to the whole group.

A

And then you will be able to lead that Zoom, Slack, or whatever-based online hackathon for that project, and then we'll come together again at the end of the day with everyone to do hackathon presentations.

A

Questions about this? Does anyone have an idea that they would like me to write down?

A

All right, well, I'll leave that open. You should be able to add stuff to this using the link that I pasted.

A

And what else did I want to share? Oh, this is the newcomers' ideas. These are smaller ideas, although I think some of these are pretty cool, legit, maybe not so small ideas, but ones that require relatively less understanding of the ZFS internals to get started with.

A

So these are all up for grabs too, as well as some things that are not necessarily coding things, but documentation, debugging tools, stuff like that.

A

All right, cool. Other questions about the conference, three weeks from today?

A

All right. I'm working on my slides, or at least the outline for my slides. Then the next item we have on the agenda is Richard, per-pool ARC stats. Are you here?

E

I'm here, I'm here. Can you hear me?

A

E

Go ahead. I wonder if George is here, or Adam. I see George

A

Wilson on the call, yeah.

E

And the other george.

C

There's too many Georges around here. Don't see him on.

B

Yeah, like L2ARC George or whatever.

E

Yeah, so basically there's been kind of a conversation, and I think a pull request or two, around statistics for the L2ARC. For those who may or may not know, the L2ARC was kind of bolted on and has this per-pool configuration thing, but the ARC itself and all the arcstats in it are global to the machine. So we have this problem that some of us observability geeks worry about, and that is: all right, what are the metrics, and am I actually seeing what I want to be seeing?

E

And how do we think of the ARC as a two-layer cache, where one layer is pinned to your pool and the other layer is shared amongst pools? Now, without recommending a complete redesign of the ARC, which would require significantly more tequila than I've had today, suppose that instead of changing any of the existing ARC statistics, we add some meaningful statistics into the spa stats.

E

For those of you who may or may not know, one of the things that came up through ZFS on Linux is a whole section of statistics for every pool. For example, if you do a zpool iostat -r, you'll see a distribution, and that distribution is stored in the spa stats structures. So now we have a place to put things like per-pool statistics, and that would make a logical place to put them.

E

But it is a fair amount of work. It's much more difficult than just adding an arcstat. Then again, the arcstats are hard to make per-pool, right? How do I convey pool information in the simple name for an arcstat? It doesn't work very well. So at this point, I think it's a matter of tossing out ideas.

E

My thought is to keep the existing arcstats as they are. Then, if we want more detailed per-pool information in the short term, we have plenty of tracepoints and DTrace probes in the various places where we can do that. But in the long run, maybe we do want to collect per-pool ARC statistics, specifically L2ARC statistics.

E

So I wanted to toss that out there and see if anybody had ideas or thoughts.

B

So do you have an idea of which types of statistics?

E

The one that came up from Adam was

E

Hits and misses.

B

Right, so in arc_read we have access to the spa_t of the pool we're trying to read from. So where we're currently incrementing the ARC counter for a hit or miss, we would also increment a spa stat counter saying there was a hit or miss.
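The double-increment idea described here can be sketched in a few lines. This is only an illustration in Python, not ZFS code: `global_stats`, `pool_stats`, and `record_arc_access` are hypothetical names standing in for the global arcstats and for per-pool counters that do not exist yet.

```python
from collections import Counter, defaultdict

# Stand-in for the machine-wide arcstats (hypothetical name).
global_stats = Counter()
# Stand-in for new per-pool counters, keyed by pool name (hypothetical).
pool_stats = defaultdict(Counter)


def record_arc_access(pool_name, hit):
    """Bump the global counter and the counter of the pool being read."""
    key = "hits" if hit else "misses"
    global_stats[key] += 1
    pool_stats[pool_name][key] += 1


# Three reads against two pools.
record_arc_access("tank", True)
record_arc_access("tank", False)
record_arc_access("backup", True)

# The global view stays the sum of the per-pool views.
total_pool_hits = sum(stats["hits"] for stats in pool_stats.values())
```

Only the pool that served the access gets its counter bumped, so each access costs two increments regardless of how many pools are imported.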

E

Basically, that's the easy part.

B

E

The hard part is changing the zpool command to spit it out for you, right?

A

Yeah, I think there's a bunch of different ways that you could approach this in terms of the internals. One would be that you can create kstats, which is an SPL concept, on demand, the way that we've done for the objset stats that were added.

A

So you could add something that looks like an arcstat, or is mostly the same as those kstats. You could even just take all the arcstats and break them down into the ones that are per-pool and the ones that are global, then move the per-pool ones into the per-pool thing, and maybe have some backwards compatibility that adds them up for the global one. That seems like it would be pretty straightforward, and then you'd be left with getting the information the same way that we do today, essentially through the kstats.

E

Yeah, that definitely has a nice feel to it. The spa stats are kind of painful to get, I think, is the way I would describe it, compared to the kstats. Kstats are right there; you can just read them.

B

Yeah, I think on the Linux version you would end up with a file named the same as the pool in some directory somewhere that you can cat, and you would just get all the stats. And on FreeBSD you would have a nice sysctl OID tree per pool with all the different stats. I think that would work nicely, and it would probably be the easiest of the different options, because it doesn't require any new zpool command-line interface or anything like that.
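As a rough illustration of what "just cat the file" could look like from a script, here is a minimal parser for the three-column `name type data` layout that the existing Linux arcstats kstat uses. The sample text is made up, and a per-pool file with this layout is an assumption for the sake of the sketch, not something that exists today.

```python
# Made-up sample in the "name type data" layout of a Linux kstat file.
# The first two lines are the kstat header and the column titles.
SAMPLE_KSTAT = """\
13 1 0x01 3 816 1000 2000
name                            type data
hits                            4    900
misses                          4    100
l2_hits                         4    40
"""


def parse_kstat(text):
    """Return {name: int(data)} from kstat-style text, skipping the header."""
    stats = {}
    for line in text.splitlines()[2:]:
        fields = line.split()
        if len(fields) == 3:
            name, _kind, data = fields
            stats[name] = int(data)
    return stats


stats = parse_kstat(SAMPLE_KSTAT)
hit_ratio = stats["hits"] / (stats["hits"] + stats["misses"])
```

A tool in the style of arcstat could run the same parser over each pool's file and print one row per pool, with no new zpool subcommand involved.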

E

Right, so there's no userland component whatsoever.

A

E

I mean, unless you

A

Wanted to build that, right? You could optionally do that, just like the arcstat command is built on top of the existing kstats, right? It's a Perl script.

B

Or something. Or a Python script, right?

A

Yeah, yeah. I haven't looked in detail at how the implementation of the per-pool performance stats that are consumed by zpool iostat works now, but I can't imagine that they're that much different or harder to implement. The one thing that would be nice:

A

If you're not just going to go for the minimum possible work, it might be nice to think about how to do this in a way that will minimize the performance impact, because we have seen the arcstats and other kinds of stats become performance bottlenecks when you have high rates of bumping them.

A

They're either just global variables, essentially, that get atomically added to, or they're the aggsum stuff that we added, which is pretty heavyweight and could probably use some improvement to the aggsum code, and/or looking at other options.

A

I think that, with the... was it the per-dataset ones, Serapheim? There was some stat that you were working on.

D

Yeah, so it's the per-dataset aggsums. And I think some time ago it was Alexander Motin who also had some ideas about making them better, because they're pretty heavyweight, apparently.

A

You had... oh no, it was not that, it was a different style. It was like the

D

Yeah, that's where we use the kernel's thing. I don't remember the exact name, but it's basically the per-CPU statistics that they have, which are basically the same as our aggsums.

A

Yeah, so I'm not prescribing any one of those as being the right or best answer. But if I were going to be adding a bunch of new stats, and I was going to put in more than the minimum amount of effort to get it working, I would probably look at the performance implications before making a nice whizzy GUI, if I didn't have to.

E

Yeah, that's a good point, because we've definitely seen that, especially with the I/O type stats, because

A

It's so frequently hit.

E

And in Linux they do it differently, where we don't have to maintain consistency. But yeah, that's a little bit more heavyweight for those, because there is consistency involved.

B

I think for the per-objset ones, we only do the sum when someone actually tries to read the kstat, rather than constantly doing it. Yeah.

A

And that's good. There are ones

B

I think we do once a second or something, right?

A

Yeah, and once a second is no big deal, but I think the issue is that

E

Even just collecting them and even just collecting the.

A

Arcstats into aggsums. To drill in a little bit more on that, I think what we found was that the performance of Linux's native per-CPU counters was better, primarily because they disable interrupts while adding so that they don't need a lock. It seemed like it was very lightweight to say: just keep running on this CPU, don't let me get any interrupts while I figure out which CPU I'm on and then bump that counter. You don't need any atomic operations, you don't need any mutexes. So it seemed like that was maybe at the root of why that performed a lot better than the aggsums.

B

And I think there's a similar concept on FreeBSD, and it might make sense to basically make aggsums a wrapper around those two interfaces.
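The per-CPU counter idea discussed in this exchange can be modelled in a few lines. This Python sketch only shows the data layout: one slot per CPU, updates that touch a single slot, and a sum taken only when somebody reads the value. `ShardedCounter` is a hypothetical name, and the CPU index is passed in explicitly; in a kernel, the caller would be pinned to its CPU while it bumps its own slot, which is what removes the need for atomics or a mutex.

```python
class ShardedCounter:
    """One slot per CPU; writers touch one slot, readers sum all slots."""

    def __init__(self, ncpus):
        self.slots = [0] * ncpus

    def add(self, cpu, delta=1):
        # Cheap path: only this CPU's slot is modified.
        self.slots[cpu] += delta

    def value(self):
        # Expensive path, taken only when the stat is actually read.
        return sum(self.slots)


hits = ShardedCounter(ncpus=4)
for cpu in (0, 1, 1, 3, 3, 3):
    hits.add(cpu)
hits.add(2, delta=10)
```

The trade-off is that a read is O(number of CPUs) and only approximately consistent while writers are running, which is acceptable for a statistic that is read about once a second.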

A

Yeah, in the SPL. Figure out how to either do that or improve our implementation of aggsums, or something like that.

B

I think definitely, because there's a lot on the table there. In this case we're basically duplicating all of the arcstats per spa, so if you have two pools, suddenly instead of bumping one counter we're bumping three counters to count the same thing, basically.

A

B

A

It should only be bumping the counter of the pool that the access is for, right?

B

Yes, I guess that's.

A

The strength so.

B

Like maybe you're bumping the global one and

A

B

Yeah, and the one for the pool. But if aggsums are a problem, then doing almost twice as many of them is something

D

We should create and.

A

Measuring yeah right.

B

But I would really like to have those stats, I'm a sucker for more stats.

A

Yeah, I'd love to have those stats even on a per-dataset basis. There's a little bit more nuance to what those numbers mean, because blocks can be shared between datasets, but I would still love to see something along those lines, where you could see: for this dataset, the hit rate is x; for this other dataset, the hit rate is y.

B

I think the zbookmark_phys_t that we get in arc_read has

A

The objset, right? And so it's

D

Possible we could.

B

Actually start doing some of these as an extension of the per-dataset or per-objset ones, and that would be really interesting. There would be the caveat that if it's a clone, then whichever one you're reading gets counted as the hit or the miss.

D

That's kind of what you want. That's what you.

B

Want, anyway, exactly. I think the biggest use case that I've seen for the per-dataset ones is: every customer has a dataset, and I would like to know which customer is causing all the load. And just because you have more reads than the other customer doesn't mean much. If all of yours are ARC hits and all of theirs are misses, then

A

B

You know which one is causing all the I/O to the disk.
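A tiny worked example of this point, with made-up numbers and hypothetical dataset names: the dataset with the most reads is not necessarily the one sending the most I/O to disk, because only ARC misses have to go to the pool's devices.

```python
# Made-up per-dataset ARC counters; the names and values are illustrative.
datasets = {
    "tank/customer_a": {"hits": 9500, "misses": 500},
    "tank/customer_b": {"hits": 100, "misses": 2900},
}


def total_reads(stats):
    return stats["hits"] + stats["misses"]


def hit_rate(stats):
    total = total_reads(stats)
    return stats["hits"] / total if total else 0.0


# Most reads overall versus most reads that had to go to disk.
busiest_reader = max(datasets, key=lambda name: total_reads(datasets[name]))
most_disk_io = max(datasets, key=lambda name: datasets[name]["misses"])
```

Here customer_a issues more than three times as many reads, but customer_b accounts for most of the misses, which is exactly what a plain read count would hide.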

A

Yeah, I think that's a great idea. We might have to do a little bit of re-architecting there, because I think those per-objset stats are handled at the ZPL layer. So currently

B

Those ones are literally reads, the number of blocks.

A

Yeah they just.

B

Call into dataset_kstats.c, in which I think we just have different functions to bump those counters.

B

But yeah, I think looking at the overhead of those is important, and then unlocking all those extra stats would be really juicy.

E

Yeah, I agree. I think it comes down to this: obviously, when we're in the thick of things in a performance problem, we go to DTrace or bpftrace, right? We get our answer, and we move on and forget those scripts.

E

But that's a good way to find out how things might be useful, right? Because I think that's really what we're facing here: the arcstats are useful globally, but in specifics not so useful.

E

A

Any other questions on this.

E

I appreciate the thoughts guys.

A

Cool. Don Brady, did you want to talk about this ZED script?

F

Yeah, can I go ahead and share my screen real quick? Sure. Yeah, this is some work I'm doing that kind of piggybacks off of a recent commit I did to remove duplicate events from the ZFS event stream. This one has to do with retention. On illumos we had FMA logs, so all these events were retained, I think indefinitely. But on Linux we currently just have zpool events.

F

This is an example of a checksum error I just got. It has quite a bit of information, but we don't retain it. So if you crash, or just over time, this will spill out. I think we only save a limited number of events, so you'll lose this information.

F

We do have a thing called all-syslog in ZED, which will take this event and post it into the syslog. The problem with that is it's pretty terse. It just says you had a checksum error on this pool, on this vdev. It doesn't give you any of this additional information, like what object it was, et cetera.

F

So what I was thinking was just extending all-syslog to put additional information in there, and I'll give you an example of that. I just wanted people's thoughts on whether this is the best way to retain it. For me, it's kind of a simple way to do it, but... let's see, what am I looking for?

F

B

I think in general, yes. On FreeBSD we use devd to grab those same events and log them to syslog, and we ran into the same problem of them often being too terse. Now we have a config file where you can define which of those extra fields go in there, and we've expanded them over time. But in general, erring on the side of providing a few too many of the variables rather than too few gives you a lot more.

F

Yeah, so here's an example of what I have right now. In addition to the pool and the vdev, I can give you the size and offset. I recently added the priority, which is useful because it has async versus sync, read versus write, et cetera.

F

So, is just extending all-syslog good enough? I don't know if anybody depends on the current one, but here are these examples. It's not that useful to say something happened but give no details. So would it be okay just to extend it to do more like this, where, if there's information there, within reason, it gets logged? In this case I pushed out everything related to a checksum.

E

One of the things I.

F

E

A

I was just gonna say that those history events seem pretty useless as they are.

F

Yeah. Now, there's a configuration where you can ignore events, so at least for our purposes we'll ignore those. But for things like checksums, I/O errors, and faults, I think you want to have a little more specifics.

C

What is the purpose of the history event when you have no information there?

A

Yeah, so I don't think a lot of thought necessarily went into what output should get sent to the syslog initially. It was just set up to get something logging, and then I don't think we ever went back and refined it. Basically, everything got logged, tersely, by default. So I think it's great to go back and revisit what we should be logging.

F

Yeah, this was my first stab at something we could do: basically, if there is information there, go ahead and post it. We could even get fancy if we wanted to. So I don't know if there's a problem with just making all-syslog better, more extended, versus making that a configuration option where you can have the terse one. I don't know the value of the terse one, to be honest.

C

Changing the terse one might break existing stuff, but I like the idea. I like having more information all the time.

B

I think you could likely avoid most of the chance of having problems with existing stuff if you just make sure all the new fields are on the end and you don't change the order of the existing fields. That way, if anybody's got a script that's plucking out a value from the existing syslog line, it's still in the same place, and if there's extra stuff on the end of the line, that's fine. I found that useful.
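The append-only rule suggested here is easy to check with a toy example. The field names and the `key=value` layout below are illustrative assumptions, not the actual all-syslog output, and `legacy_pool_parser` stands in for some existing script that picks a value out by position.

```python
# Illustrative field lists; not the real all-syslog format.
BASE_FIELDS = ["eid", "class", "pool"]
EXTRA_FIELDS = ["vdev", "offset", "size", "priority"]


def format_event(event, extended=True):
    """Emit base fields in their original order, extras appended at the end."""
    names = BASE_FIELDS + (EXTRA_FIELDS if extended else [])
    return " ".join(f"{name}={event[name]}" for name in names if name in event)


def legacy_pool_parser(line):
    """Stand-in for an existing script that relies on the third field."""
    return line.split()[2].split("=", 1)[1]


event = {
    "eid": 42,
    "class": "checksum",
    "pool": "tank",
    "vdev": "/dev/sda1",
    "offset": 1048576,
    "size": 4096,
    "priority": 0,
}

terse_line = format_event(event, extended=False)
extended_line = format_event(event, extended=True)
```

Because the extended line starts with the terse line unchanged, a positional parser keeps working, and an opt-out switch like `extended=False` restores the old output for anyone who needs it.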

B

I had extended the one on FreeBSD for checksum errors to have, like you showed there, the offset and so on, so I could see whether the checksum errors somebody was getting in a VM were always at the same offset or were different each time.

B

At one point I went so far as to have the expected and actual checksum logged, so I could see whether one of these was all zeroes or something else was going on. So I think having more information is almost always better.

B

Clearly, Don.

C

You need to re-implement our gun, you need to re-implement asl.

F

No, I like that, that's a good point. I think I can just append to the way it was set up, and I can make a configuration so that you can opt out of the extended mode, I suppose, if somebody wanted the old, terse behavior.

E

In a similar vein, I've got an InfluxDB interface to that, where it spits all that stuff out to InfluxDB. We use that so we can correlate performance and trends to events.

F

E

I've got that around here somewhere. I can push it up, obviously.

F

Okay, it sounds like just augmenting the current all-syslog would be the direction. It sounds like it's going to be okay: as long as I append, it should be fine.

E

And there's also a CEE standard for putting JSON into syslog that's understood by things like Splunk and Elasticsearch these days.

F

Is there a convention for people putting that in syslogs right now or.

E

But I don't know, seriously, how many people really do it in anger.
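For reference, the CEE convention mentioned here is, as far as I understand it, a syslog message body that starts with the cookie `@cee:` followed by a JSON object, which log processors can then parse as structured data. The sketch below only builds and parses such a message body; the event fields are made up and the helper names are hypothetical.

```python
import json

CEE_COOKIE = "@cee:"


def to_cee(event):
    """Render an event dict as a CEE-style syslog message body."""
    return f"{CEE_COOKIE} {json.dumps(event, sort_keys=True)}"


def from_cee(message):
    """Return the event dict, or None if the message is not CEE-tagged."""
    if not message.startswith(CEE_COOKIE):
        return None
    return json.loads(message[len(CEE_COOKIE):])


# Made-up event fields for illustration.
event = {"class": "checksum", "pool": "tank", "offset": 1048576, "size": 4096}
message = to_cee(event)
round_tripped = from_cee(message)
```

A structured body like this avoids the field-ordering concern entirely, at the cost of being harder to read by eye in a postmortem.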

F

Yeah, for my purposes, I just want to be able to go see what's going on in a postmortem.

A

I think it'd be pretty reasonable, too, to turn off some of those less-than-helpful events over here, like the history stuff, right? I don't know that anybody really depends on that kind of stuff at the moment, so maybe prune it down.

F

That one's pretty trivial. There's a config to say, exclude these, and that one would be an optimal candidate for it, because it doesn't give you any information.

A

Yeah, I think, selecting some better defaults, while you're in there would be great.

F

Okay, I'll put up a review this week on that, then. Thanks, everyone. Thanks, Don.

A

So the last thing that I put on the agenda was to corner Brian and ask him to talk a little bit about the OpenZFS 2.0 release candidate and the schedule for that. Yeah. So there was a 2.0 release candidate cut two weeks ago now, I think, something like that.

A

I'm hoping to cut another release candidate this week, so we'll have an rc2 out, and then it's just a call for testers and letting it soak. At the moment, we're just cherry-picking critical bug fixes back from master into it, plus documentation updates and little safe stuff like that. The intent is just to let it soak for a while, stabilize, make sure nothing comes up, and then cut a final release. I'm not sure how many RCs that will take, but hopefully not too many, and we shouldn't have too many changes anymore.

A

E

F

A

To do a new RC every two weeks, yeah.

F

Have you got an RC schedule?

A

There's no exact schedule, but I think two weeks would be pretty reasonable. I'm not sure how many RCs we need to go through; until we are happy with the quality of it, I would say. I was hoping to put one out, like I say, this week, so watch for that, and then by all means pick it up and test it. I haven't heard of any bad issues in the current RC.

A

Mainly we've been adding patches for, like I said, little bug fixes, and then some stuff for FreeBSD still.

B

Do you think there's still time to get in the bootonce nvlist pull request? That's

A

B

Did that get reviews? I should go back.

A

B

Paul Dagnelie gave it a review. You marked it as accepted yesterday. Okay, yeah, all right. Because that one has the libzfs API changes around that, and we would like to not have 2.0 ship something different from what we plan to go with long term.

A

Yeah, absolutely. Let me know if there are any other FreeBSD-related changes we should really pull in there, because it would be nice if that was very close to what you guys were running.

B

Yes, that's very much our goal: to have the OpenZFS 2.0 final version be what ships in FreeBSD 13, coming up. That'll be February or whatever, but

A

Okay, well, definitely, let me know if there's critical stuff there, I think, there's a few changes outstanding still, but I think most of it's there.

A

All right, any other questions about the upcoming release? I think we're still on track for our planned release before the end of the year, so I feel really good about that.

A

I think so I mean, hopefully we shouldn't go through 20 or 30 rc's. That would be bad. I'm hoping for small single digits here would be good.

A

All right, looks like we're going for a new record of early meetings. Any other questions, comments, or topics to discuss today?

A

If not, then I hope to see you all at the conference. I would ask you to register, because we'll send you a link to the Zoom when you register.

A

You'll need the Zoom link to participate, like asking questions live. Otherwise we'll be live streaming on YouTube, and anybody will be able to watch it live there, but they won't be able to ask questions of the presenters during the conference.

A

We'll also have breakout sessions during the breaks, where you'll be able to chat free-form with different folks in different virtual rooms. Those will be different Zoom meetings, and they'll be for folks that register.

A

So it's free registration and I hope to see you all there.

A