From YouTube: Ceph Performance Meeting 2021-09-16
A: There we go. So, just a reminder: sometime, possibly the day before if you're like me, try to read the article. All right, so, pull requests this week: not a whole lot of new stuff.
A: I didn't actually see anything that looked particularly relevant that was new, but we did have a pull request from Radoslaw that closed, regarding optimizing CRC handling and bufferlist c_str. Ilya did the review on that one, approved it, and merged it. That's good. We had four PRs that have been updated, at least in the ones that I was looking at.
A: There's this ongoing OSD compression bypass PR. Casey had done a review on that, and it looks like now Eric has been testing it and said that during testing he was seeing a lot of errors, so that is going to be ongoing; I think it's not quite ready.
A: Then there's this BlueFS fine-grained locking from Adam; I think that was no longer DNM.
A: There are PRs that I don't know if I've got listed here; did you happen to...? That's... that probably explains why I never added it in here. Okay, very good. Do you have somebody to review that one?
D: I think we should try to rope Igor into looking at it, or Sage, if you want.
A: Which one is this?
D: This is 42099; I'll put the link in the chat right here.
A: If I remember right, Adam, this came out of majianpeng's attempt at doing something kind of similar, right?
D: Exactly. That just made a forced release of the lock inside read, you know, to be able for others to go, and that really had nasty side effects. We decided that we really should fix the locks, because it really gives an improvement for BlueFS buffered IO when, at the same time, there are compactions and some other actual writes; you can see it in the performance.
A: Yeah, the first attempt at this was not safe. This one might be safe.
D: Okay. I mean, it's safe enough that it is now possible: if we hit a corner case where we do not have space, or runway space, for the BlueFS log, we can just stop, and, still holding the proper locks, we can now allocate and rewrite the BlueFS log from scratch, which previously was just impossible to do safely.
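To make that concrete, here is a minimal standalone sketch of the general idea being described: replace a coarse lock that readers previously had to drop mid-read with per-file reader/writer locks, so buffered reads can proceed concurrently with compactions and other writes while every path still holds a proper lock. This is not Ceph's actual BlueFS code; the types and names are all illustrative.

```cpp
// Illustrative sketch only: per-file reader/writer locks standing in for
// the single coarse lock readers used to hold (or unsafely drop) mid-read.
#include <algorithm>
#include <iostream>
#include <shared_mutex>
#include <string>
#include <thread>
#include <vector>

struct File {
    std::string data;
    mutable std::shared_mutex lock;  // fine-grained: one lock per file
};

// Readers take a shared lock for the whole read; they no longer drop and
// re-take a global lock mid-read (the earlier, unsafe approach) to let
// writers in, because writers only contend on the same file.
std::string read_file(const File& f, size_t off, size_t len) {
    std::shared_lock l(f.lock);
    return f.data.substr(std::min(off, f.data.size()), len);
}

// A writer (e.g. a compaction rewriting the log) takes the exclusive lock
// on just the file it touches; reads of other files are unaffected.
void append_file(File& f, const std::string& chunk) {
    std::unique_lock l(f.lock);
    f.data += chunk;
}

int main() {
    File f;
    append_file(f, "hello ");
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back([&] { append_file(f, "x"); });
    for (int i = 0; i < 4; ++i)
        threads.emplace_back([&] { (void)read_file(f, 0, 6); });
    for (auto& t : threads) t.join();
    std::cout << f.data << "\n";  // "hello " plus four x's, in some order
}
```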
A: We agreed that we also don't want to update RocksDB yet, but that PR so far looks like it's fairly benign. It still works with the current version we're on, and then potentially allows us to upgrade RocksDB. I've been running some perf tests on that, just to make sure that there are no regressions introduced; it looks fine so far. I did also try upgrading to RocksDB 6.23.2 based on Kefu's commit; it's having compilation issues with the io_uring stuff that it now includes.
A: I'm sure it's probably just a bug in our CMake stuff, but I didn't look too closely at it. So I think we'll just wait on the RocksDB upgrade, since I don't want to rush into that anyway, and it's not like anyone else does either.
A: More generally, related to that PR, Josh recommended talking to the Ceph maintainers. This all resulted from a maintainer wanting to use kind of a cutting-edge RocksDB to compile stuff.
A: I don't know what they're going to use it for, but given the issues we've had in the past with adopting RocksDB too fast, even releases, like with the dramatic changes that they made in their write path, or how they read from cache versus how they read from disk with readahead; I mean, that code has changed multiple times in the last couple of years. So I think we need to start getting a little bit more careful about how quickly we just jump on new RocksDB releases. But anyway, that's a different discussion.
A: So probably try that on the ceph-maintainers list. The only other one here that updated is the cache binning, my cache binning PR. Neha mentioned whether or not we want to get this in for Quincy. I talked to Adam a little bit about this about a week ago, maybe; I just got a couple of other ideas for doing something a little bit different, but you know, we're all kind of in a time crunch here.
A: This code already more or less exists; it just needs to be rebased again and then probably go through a lot more testing and analysis. So I think I'm going to try to make it happen. That might be the next thing I work on after finishing up Kefu's thing; well, also looking at the wall clock profiling stuff. But anyway, that one I don't want to...
A: ...let go yet. I think there's still value there, so I'll probably try to get back to that soon. Lots of stuff under "no movement". Adam, I know you've got a couple of different things here, but getting reviews is always tough.
A: Igor's not here, right? Oh, so Adam, I'm gonna pick on you a little bit: what do you think we should do about cache trimming? We've had both your PR and Igor's PR sitting there for a couple of months.
D: Okay, so let's make it a topic for the next performance weekly. How about that?
A: Sure, that sounds good.
A: Then we've got the TCMalloc thread cache settings PR. Do you know, was there anything that prevented us from doing that? Was there anything broken about it?
A: Let's see... oh, the sharded cache for RGW. I think Mark Kogan is taking that over; he said he was interested in doing that. There's been no more discussion on my PR, but that doesn't mean he hasn't been working on it in the background, so I'll try to find out what the status of that is; unless, Casey, you know?
B: Yeah, he did sign up to take it over. He's been making a lot of progress looking into the performance regressions in the beast frontend, though.
B: There isn't a PR for it, but he's added a ton of details to the tracker issue. If you want to track that in the Etherpad, I can find a link.
A: Cool. Well, I'll look through that later, unless people are interested in looking through it now.
A: All right. Oh, new pull requests; oh good, the MDS ones made it in. I was actually just thinking about Zheng's PRs; I probably missed those, but that's good.
A: Hey Patrick, do you have a working...?
A: Oh okay, yeah, I see that now. Cool. Okay, so yeah, I did notice on the mailing list that Zheng sent a big email talking about some of this. One of the things I was trying to understand, based on his last emails...
A: He said that this doesn't apply to situations like the IO500 mdtest-hard tests, where you have multiple clients putting many files into a single directory, and I was trying to understand how that reconciled with this PR 43125 that we've got under the new section now, where we're randomly distributing dirfrags to multiple MDSes. I thought that would be kind of exactly that case, where you have lots of files from multiple clients, with multiple dirfrags, you know, all in one directory, and then those dirfrags are distributed to many MDSes. That's actually kind of the situation I thought this would apply in, so I wanted to ask Zheng or Patrick if they could explain that more.
H: I'm here now.
A: Oh, hey Patrick.
H: Hey. I don't know; for the IO500 hard test, how large is the directory in that case?
A: So usually in the hard test you've got one directory, you've got an arbitrary number of clients, and each client can write out an arbitrary number of files. The test is timed; it is kind of up to you to define how much to try to write out within that time limit. Actually, even "timed" is a little bit incorrect: it's a minimum time but no maximum time, so you can let it run for as long as you want, but it has to run for at least five minutes.
A: Yeah, it's possible I'm misremembering, because it's been a while. Let's just say ten; is ten reasonable, or five? I know I've hit... yeah, that was number one.
H: ...more reasonable. This PR may help if we also add some tricks to pre-fragment the directory, but I think that requires hints from the client that may not be allowed by the IO500 testing framework.
H: Quite possibly we can. We can.
A: Yeah, it looks like with the IO500 they let you do anything you want to the parent directory. So whatever directory all this stuff is going to end up in, the parent, you can set whatever xattrs or flags you want on it, or, you know, other things you can do. But it's the individual subdirectories that they don't want you touching, from what I've seen.
H: I'm sure there are some hints we could provide to the MDS through some xattrs that would improve our performance there, preemptively spreading the dirfrags. But just doing this, sharding the metadata by spreading the dirfrags out randomly across MDSes, is, I don't know, antithetical to the early MDS design for CephFS. I don't know how Sage really feels about it, but it could...
H: ...be a config option. I was also thinking what would be nice, although I've gotten pushback from Zheng in the past on this, is just providing a config change for a subtree, similar to what we're using for ephemeral pinning, just to say: you know, I want the metadata sharded this way. The random ephemeral pins were kind of one idea in that regard, and so were the distributed ephemeral pins.
H: You know, that would be something I'd be more willing to merge, because I'm a little wary of having such a large change in here; plus, I haven't gone through any of this code. I assume there must be some kind of config for turning this on, because you wouldn't want it in the general case.
A: You know, the dynamics of tree pinning: if we could figure out how to make it not get, almost, DDoSed when you get so big... That seems to be where it really falls apart: it can't actually distribute subtrees properly; you end up failing lock acquisition and it all just falls apart. If we can fix that, it might work better; I mean, it might even work well. But that seems to be where it keeps...
H: Yeah, we still have not had anybody who's really dug into the balancer recently, so I'm sure there are lots of little things that can be done to improve it.
E: Yeah, I missed the beginning of the discussion until you mentioned my name, so I wasn't sure what... whatever, I missed the first part. But are we looking at the... we're just looking at the list of...?
H: The PRs from Zheng. One of them adds... well, I haven't looked at the code yet; I haven't seen how it's configured or turned on, but one of them randomly distributes every dirfrag in a subtree across them. Yes, it's using the new consistent hashing we have for ephemeral pinning, like the nested...
H: But I think it avoids making subtrees, because everything is distributed. But yeah, I think it probably has some nice performance characteristics for this AI/ML workload; for general-purpose file system use, though, it's really not very good.
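For context, a rough sketch of the consistent-hashing placement mentioned above follows. It is purely illustrative, not the MDS's actual ephemeral-pin code; the ring structure, virtual-node count, and hash function are all assumptions. The property that matters is that every node computes the same frag-to-rank mapping with no coordination, and adding or removing a rank only moves a small fraction of dirfrags.

```cpp
// Illustrative hash ring mapping dirfrags to MDS ranks; all names invented.
#include <cstdint>
#include <iostream>
#include <map>

using mds_rank_t = int32_t;

// splitmix64 finalizer: a decent stand-in mixing hash.
uint64_t mix(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

struct HashRing {
    std::map<uint64_t, mds_rank_t> ring;  // hash point -> rank
    void add_rank(mds_rank_t r, int vnodes = 64) {
        // several virtual points per rank smooth out the distribution
        for (int i = 0; i < vnodes; ++i)
            ring[mix((uint64_t(r) << 32) | uint64_t(i))] = r;
    }
    mds_rank_t lookup(uint64_t key) const {
        auto it = ring.lower_bound(mix(key));
        return it == ring.end() ? ring.begin()->second : it->second;
    }
};

int main() {
    HashRing ring;
    for (mds_rank_t r = 0; r < 4; ++r) ring.add_rank(r);
    // Key each dirfrag by (inode number, frag id): every client and MDS
    // computes the same placement with no shared state.
    uint64_t ino = 0x10000000000ULL;
    for (uint64_t frag = 0; frag < 8; ++frag)
        std::cout << "frag " << frag << " -> mds."
                  << ring.lookup((ino << 8) | frag) << "\n";
}
```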
H: Well, so, Sage: this work is based on another PR, which is also in the list on this Etherpad, called pulling the subtree map out of the MDS journal. That's something Zheng was working on to improve the performance of the MDS when we have hundreds or thousands of subtrees, because the subtree map is written out with every journal segment, which can get prohibitively expensive. So here's a PR to pull that out, and this other stuff is based on that.
H: I think he was in the middle of reviewing the PR to remove the subtree map from the journal, but we haven't made progress on it since, because of its size; and with Zheng leaving, I was a little concerned about merging it, because it really just changes everything. Oh my god, it's huge.
E: Yeah, I mean, it seems like a deeper question of which direction we want to go. If we really want to go in a direction where we have a bazillion subtrees, then something like this is necessary. But if, instead, the thinking is to find a way to still keep the subtree map modestly or reasonably concise, then...
A: And in those IO500 tests that we ran, it was subtree map encoding, or journaling, on the authoritative MDS, and that's with just one directory with billions of files from lots of clients in it. It's just awful.
A: Okay, I had one really minor PR that helps us a little bit; it merged a while back. It's not even directly related to this: it was about the appender in bufferlist, sorry, implementing a dynamic append length. This actually might help a little. It's kind of stupid, but it's just taking care of some of the work that was being done over and over again every time we were encoding the subtree map.
H: Why does it need to be in every segment?
E: Yeah, I think it's just so that the trimming logic doesn't have to be super careful. I mean, when you start, really, when you replay, you have to have a subtree map, so as you are replaying...
E: ...you know what's authoritative, whatever, so that you can rebuild your cache appropriately. It was just simplest to write the whole thing, because I assumed it was going to be small. But as long as you don't trim too much, so that when you start over you still have enough context; as long as you have enough context to do replay, it'll be fine. It would just be a matter of working out what that context is.
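What Sage describes, trimming only as far as replay context allows, might look roughly like the toy model below if the subtree map became an occasional checkpoint instead of part of every segment. The segment and checkpoint layout are invented for illustration; this is not the MDS journal code.

```cpp
// Toy model: journal segments with occasional subtree-map checkpoints and
// a trimmer that never trims past the newest checkpoint at or before the
// trim target, so replay always starts from a subtree map.
#include <cstdint>
#include <deque>
#include <iostream>
#include <optional>

struct Segment {
    uint64_t seq;
    bool has_subtree_map;  // checkpoint segments carry a full map
};

struct Journal {
    std::deque<Segment> segments;

    void trim_to(uint64_t target_seq) {
        std::optional<uint64_t> last_ckpt;
        for (const auto& s : segments)
            if (s.has_subtree_map && s.seq <= target_seq)
                last_ckpt = s.seq;
        if (!last_ckpt) return;  // no checkpoint yet: cannot trim safely
        while (!segments.empty() && segments.front().seq < *last_ckpt)
            segments.pop_front();
    }
};

int main() {
    Journal j;
    for (uint64_t s = 1; s <= 10; ++s)
        j.segments.push_back({s, s % 4 == 1});  // checkpoint every 4th seg
    j.trim_to(7);  // keeps seq 5 (newest checkpoint <= 7) onward
    std::cout << "head after trim: " << j.segments.front().seq << "\n";
}
```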
E: They're still working on it? Yeah. I mean, maybe that is the direction to go in. I guess my assumption was always that the map would be small, but I think even if you have a pretty conservative view of things, you would have many directories in the file system that are hot, and so you're hashing even at that level, and so that would be N metadata servers for every one of those M directories.
E: That's already a lot, so maybe this makes sense, but...
E: It's not too terrible; it's not too bad. I mean, you could throw it away and rewrite it, and then nothing would change.
E: This is fundamentally changing the way that the replay dirty state, I think, is being tracked.
H: So I'm planning to go through these PRs, especially once my time frees up in the next month. But yeah, we can't merge these without someone fully understanding them who is regularly upstream. And maybe something like that, then, upstream; I mean, Zheng is working on stuff, but he's not really...
H: ...you know, regularly doing... he's not at standup, he's not triaging bugs. Yeah, I mean, it'd be nice...
E: ...if he showed up at standup to support these and move them through. And maybe also, I don't know where the testing stands, but having a set of tests that have pretty large numbers of MDSes, and a workload with thrashing or something, just to really push the boundaries here: aggressive trimming, or, I don't know, whatever it is that we think is gonna...
E: The risk here, I would assume, not having actually read any of the pull requests, would be: if the total subtree map state is spread over lots of different log segments, are we tracking that correctly, so that we don't trim something such that we can't rebuild that state, or end up without the important state before we need it?
H: Yeah, I think it's just inherent in... yeah, it's inherent with the distributed cache, with caps, the MDS's replication of trees, renaming; all that stuff just adds to it. But yeah, what Sage said: if we wanted to make it simpler, we'd probably have to go back to the drawing board with the architecture and think about how we might do it differently.
F: Patrick, while you're here, I was curious about the CephFS QoS efforts. I see that there was some activity on this one in the last month, where they say that they're actually using this in production now.
A: Another one to add, and maybe, Patrick, I can help out on this, is implementing the memory auto-tuning for CephFS. I know there's that outstanding PR that has been there for a couple of years, but I could maybe try to actually get it using the same one, the priority cache, that we're using for the other daemons.
A: Not really, actually. This is all written with the idea that you're kind of making suggestions more than demanding things; does that make sense? The whole architecture is based on: you ask the particular cache what it wants at different priority levels; it gives you back what it wants; then we go through this whole process saying, okay, here's what you should get, but please, please do this. It's not, you know, immediately trying to revoke everything.
A: It's: okay, here's what you should try to target. And then we go through this iterative process where we look at how much memory we're using; if we're still not under that, then we make new suggestions. You might end up kind of starving something that can release memory, but the whole goal is to keep things as much as possible below some memory threshold.
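As a rough illustration of that negotiate-and-iterate flow (hypothetical names, not Ceph's actual PriorityCache API), a tuner might hand out a global memory budget by priority level like this, treating the result as a target suggestion rather than an immediate revocation:

```cpp
// Hypothetical priority-cache negotiation; not Ceph's PriorityCache API.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Cache {
    std::string name;
    std::vector<uint64_t> wants;  // bytes wanted per priority (0 = highest)
    uint64_t target = 0;          // a suggestion, not an immediate revocation
};

void tune(std::vector<Cache>& caches, uint64_t budget) {
    for (auto& c : caches) c.target = 0;
    // Satisfy priority 0 for every cache before spending on priority 1, etc.
    // (A real tuner would also split fairly within a level and re-run this
    // loop periodically, comparing targets against observed memory use.)
    for (size_t pri = 0;; ++pri) {
        bool any = false;
        for (auto& c : caches) {
            if (pri >= c.wants.size()) continue;
            any = true;
            uint64_t grant = std::min(c.wants[pri], budget);
            c.target += grant;
            budget -= grant;
        }
        if (!any || budget == 0) break;
    }
}

int main() {
    std::vector<Cache> caches = {
        {"inode_cache", {64 << 20, 256 << 20}},
        {"dentry_lru", {32 << 20, 128 << 20}},
    };
    tune(caches, 300ULL << 20);  // 300 MiB global budget
    for (const auto& c : caches)
        std::cout << c.name << " target: " << (c.target >> 20) << " MiB\n";
}
```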
H: Yeah, okay. I mean, it's worth looking at, and mostly it would just be, I assume, changing the LRU that we use for dentries to the priority cache instead. Most of the memory tracking is done through a mempool, which, as I said, works pretty similarly to what we already do with the OSD.
A: Most likely what we'd do is the same thing we do in the OSD, which is basically just to make a really thin wrapper around your existing cache: either add the interface to it, or just make a thin wrapper around it that implements the priority cache calls that need to be made. It's fairly non-intrusive.
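A thin wrapper of that sort could be as small as the following sketch. The PriCache interface here is invented for illustration and differs from Ceph's real one; the point is just that the existing structure needs no rewrite, only an adapter that answers what it wants and acts on the target it is given:

```cpp
// Invented PriCache-style interface plus an adapter over an existing LRU.
#include <cstdint>
#include <list>
#include <string>

struct PriCache {
    virtual uint64_t request_bytes(int priority) const = 0;  // what it wants
    virtual void set_target_bytes(uint64_t target) = 0;      // the suggestion
    virtual ~PriCache() = default;
};

// A pre-existing cache we do not want to rewrite.
struct DentryLRU {
    std::list<std::string> entries;
    uint64_t bytes_used() const { return entries.size() * 256; }  // rough cost
    void trim_to(uint64_t bytes) {
        while (bytes_used() > bytes && !entries.empty()) entries.pop_back();
    }
};

// The thin wrapper: forwards priority-cache calls to the existing structure.
struct DentryLRUWrapper : PriCache {
    DentryLRU& lru;
    explicit DentryLRUWrapper(DentryLRU& l) : lru(l) {}
    uint64_t request_bytes(int priority) const override {
        // e.g. report current usage as high priority, extra headroom as low
        return priority == 0 ? lru.bytes_used() : lru.bytes_used() / 2;
    }
    void set_target_bytes(uint64_t target) override { lru.trim_to(target); }
};

int main() {
    DentryLRU lru;
    for (int i = 0; i < 100; ++i) lru.entries.push_back("dentry");
    DentryLRUWrapper w(lru);
    w.set_target_bytes(w.request_bytes(1));  // trim toward the suggestion
}
```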
H: It's certainly worth looking at. Right now, the one PR we have outstanding for the memory target is something Sidharth was working on, the mds memory target, which is supposed to be analogous to the osd memory target, and that was really just some logic to set the MDS cache size according to what the MDS thinks it needs to be in order to stay at its target.
H: You know, he had some remaining work to do on that, and the PR became stale, so it just really needs to be revived. That's low-hanging fruit as far as getting that worked on, and then I think the priority cache would be a good next step.
A: So the priority cache basically incorporates generic code for doing the same kind of thing. That's what we use in the OSD and in the mon, and I was hoping I might be able to get the RGW guys to use it as well. I don't know; in the MDS it might be more complicated and wouldn't work, but it might be a nice way to avoid having lots of independent implementations of the same thing.
A: Well, while I'm talking to you about that: do you guys, besides the one cache, have other caches or buffers or anything that need to be regulated to keep the client...?
H: Sure. The only way that we can actually force the kernel to release its references to an inode is to actually remount the FUSE mount. So it's got some cute logic: when the MDS revokes an inode capability, ceph-fuse will actually remount itself, causing the kernel to release all of its references.
H: And you know, this has actually been a long-standing problem with CephFS, because there's no great mechanism to tell the kernel, "hey, I need you to drop this reference to this inode."
H: We used to have some special API call, internal to the kernel, through a FUSE ioctl, to release a reference, but that got deprecated for kernel reasons, and I think that left us with remounting. Although I think there's been some work in the next version of FUSE to add some kind of support to do this again; I haven't looked into that carefully.
A: Was someone on the CephFS team... I think I remember someone was looking at FUSE in general, like trying to update stuff, yeah?
H: Actually, that sounds like a good project for our new CephFS team member, who was an intern elsewhere at Red Hat and just joined the CephFS team. I'm eager to pitch that as a startup project for him.
A: Cool. All right, well, let's see: is there anything else? I don't think I've got anything else. Guys, I'll open it up: anyone have anything they want to talk about in the last 15 minutes here?
F: I guess just going back to that QoS PR; and yes, Patrick, they also mentioned that for testing they implemented a round of MDS thrashing, essentially, it sounded like. I don't know; if you look through their presentation about this, it looks pretty interesting.
A: I will say that if we can fix some of the QoS problems in CephFS... The other big thing I noticed in the IO500 tests is that we were having some clients completing much, much earlier than others, and I think not necessarily strictly due to, like, weird balancer issues or other things, even with ephemeral pinning.
A: That was also a separate problem, but I think QoS also played into this: the more we can make things even, the better we're gonna do on that test.
H: If the MDS needs to ask the client to release something from its cache, or release a capability, so that it can do work on behalf of another client, it may first have to chew through a number of messages from that client before it can get to that cap release that the client is giving back to the MDS. And so bolting QoS onto the MDS is tricky business, because you have the potential for creating deadlocks.
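A toy model of that ordering hazard follows (not MDS code; the message names and one-message-per-tick budget are made up). The cap release that another client is waiting on cannot jump the queue, so throttling the noisy client's queue directly delays the victim:

```cpp
// Toy model of the per-session ordering hazard; names are made up.
#include <deque>
#include <iostream>
#include <string>

struct Session {
    std::deque<std::string> queue;  // messages must be processed in order
    int budget;                     // QoS: messages processed per tick
};

int main() {
    // A noisy client has a backlog queued ahead of the cap release that
    // another client's request is waiting on.
    Session noisy{{"getattr", "getattr", "getattr", "cap_release"}, 1};
    int ticks = 0;
    while (!noisy.queue.empty()) {
        for (int i = 0; i < noisy.budget && !noisy.queue.empty(); ++i) {
            if (noisy.queue.front() == "cap_release")
                std::cout << "cap released at tick " << ticks << "\n";
            noisy.queue.pop_front();
        }
        ++ticks;  // the throttled budget stretches out the wait
    }
}
```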
H: So in some ways we need to, you know, take a detailed look at the clients, and they definitely need to participate in this QoS nicely, so that they're not creating these deadlock situations by, for example, giving the MDS too much work. And we have to provide legacy support, and it's not simple.
H: But I have not, again, taken a detailed look at this PR, so I'm not exactly sure what they've done yet.
F: Yeah, it's definitely a hard problem, even for a simpler protocol like the OSD's. I mean, we're only just now starting to get to the point where we can implement and test the client-versus-client QoS in RADOS.
H: We often see that the data pool is set on hard disk drives, but these days it's very common to see all-SSD clusters, and so there are no guarantees anywhere; you can't make assumptions. That's why we just have the default advice of putting the CephFS metadata pool on its own set of SSDs that are exclusively for CephFS.
A: So here, let me rephrase it. Say you've got a bunch of NVMe drives, and you can either dedicate all of those to both data and metadata, or you could split them so that only some of them are serving data and some of them are serving metadata.
H: I haven't seen numbers for NVMes, so I can't say whether it's necessary in that particular case, but I imagine that even with NVMes the OSDs could be overwhelmed by clients doing large read and write workloads, especially on a smaller cluster, in which case the MDS is not getting any kind of priority treatment from the OSDs. So even simple things like writing to the journal to record file opens and closes, and cap updates, would be slowed down.
F: A good argument for making sure that RADOS QoS works well for CephFS clients, to be able to control the priority between the metadata server activity and the client activity.
H: Yeah, especially for edge clusters, where you only have a few OSDs, but they do have cutting-edge hardware like NVMes. It's even more important there to make sure that we don't need to carve out a number of OSDs for the metadata pool. Right, right.
A: What I was seeing during the IO500 tests is that in the really hard tests the OSDs seemed to not actually be doing a whole lot. It was kind of like we had a lot of contention on a single MDS, like an authoritative MDS. Even in the ephemeral pinning tests, where you are doing this round-robin stuff, the OSDs were working harder, but not as hard as they can work.
A: It looked to me more like what we saw was that you'd end up with certain MDSes hitting their kind of inherent limit, which is maybe 20,000 ops or something around that level of performance, and then others end up with a lower number of subdirectories, or, maybe, I don't remember how you've changed it at this point, a smaller proportion of the work to do, just by random, you know, bell-curve distribution, and end up... I think for...
H: When I do testing on Linode, I still put the metadata OSDs... I have a separate set of metadata OSDs, because I actually do see the problems when I'm doing something with a 16- or even 32-OSD cluster on enterprise SSDs. Even though they're VMs, I can easily create slowdowns by doing a large workload with 64 or 128 clients all hitting the CephFS cluster.
H: In that case, all of the metadata... no: the CephFS journal is all plain objects, but the directory objects are all omap, and then there are a few other data structures which I believe are omap; the open file table is another example of an omap store that the MDS uses.
A: Did you happen to notice what looked like it was slowing down more, whether it was omap or data accesses?
H: Yeah, I mean, the nice thing about Linode is it's easy to make even a large cluster for cheap. You can make VMs with a lot of memory, but then you have to start shelling out more money, which I try to avoid doing. So it's easier to just take a few OSDs and use them for metadata.
H: I don't want us to go down a huge rabbit hole of optimizations we can do. I think mostly we just need to try out the new QoS features of the OSD in this particular scenario of a small Ceph cluster, and if that works, then we can make appropriate recommendations in the documentation. But otherwise I don't think it requires a lot of...
H: ...you know, a lot of detail. Sorry; I don't think we need to investigate this too much. Just separating off a few OSDs for metadata in a large cluster is not a huge ask for the large production clusters we know of; it's certainly...
H: Anyway, I've gotta run to a conflict, so hopefully I'll see you all next week, or sometime in the future, depending on whether that other meeting conflict continues.
A: Okay, sure.