From YouTube: Ceph Performance Meeting 2021-05-13
B
Yeah, I hope so. I added another topic to the agenda, Mark, about a project that I'm thinking about for an intern starting in a couple weeks. Oh, excellent.
A
Yeah, I think so. We had an intern working on something that wasn't this, but it was kind of similar in a way, with CBT results. I think the trick to making something like this work well will be making it really easy to change the schema, or the data that's collected, and to regenerate whatever the central repository is from the raw data easily, without having it all be fragile and prone to falling apart when things change.
B
It actually has JSON support, so you can query different kinds of fields within it pretty easily.
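The database under discussion isn't named in the recording, so as a sketch of the idea (store the raw results as JSON, query fields out of them, and re-derive summaries whenever the schema changes), here is what it might look like with SQLite's JSON functions; the table and field names are hypothetical:

```python
import json
import sqlite3

# Hypothetical schema: each benchmark run is stored as a raw JSON blob, so the
# "schema" can change without migrating the table; summaries are simply
# re-derived from the raw documents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, raw TEXT)")

runs = [
    {"version": "master", "workload": "omapbench", "iops": 51000},
    {"version": "luminous", "workload": "omapbench", "iops": 9000},
]
conn.executemany("INSERT INTO runs (raw) VALUES (?)",
                 [(json.dumps(r),) for r in runs])

# json_extract lets us query individual fields inside the blob directly.
rows = conn.execute(
    "SELECT json_extract(raw, '$.version'), json_extract(raw, '$.iops') "
    "FROM runs WHERE json_extract(raw, '$.workload') = 'omapbench' "
    "ORDER BY 2 DESC"
).fetchall()
print(rows)  # [('master', 51000), ('luminous', 9000)]
```

PostgreSQL's `jsonb` columns support the same pattern with the `->>` operator.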
A
Cool, that's fun, yeah! That sounds like a really good project. It'd be super interesting to see what... oh yeah, we've got a couple of perf counters it would be really interesting to see over time.
B
I think it's about, like, distributions of different parameters; maybe distributions of I/O sizes, or...
B
Maybe even things at the BlueStore level, like distributions of onode sizes, or the number of extents, xattrs, or omap key-value pairs you have associated with the objects. Yeah, to kind of try to get a view of the aggregate data set distribution and the workload distribution.
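The kind of aggregation being described here, bucketing raw per-object statistics (omap key counts, onode sizes, and so on) into distributions, might be sketched like this; the helper names and sample values are made up for illustration:

```python
from collections import Counter

def pow2_bin(value: int) -> int:
    """Return the power-of-two bucket floor for a value (0 stays 0)."""
    return 0 if value <= 0 else 1 << (value.bit_length() - 1)

def histogram(samples):
    """Aggregate raw per-object samples into a power-of-two histogram."""
    return dict(sorted(Counter(pow2_bin(s) for s in samples).items()))

# Hypothetical per-object omap key counts gathered from a cluster scan.
omap_keys_per_object = [0, 3, 5, 90, 100, 130, 1000, 1500]
print(histogram(omap_keys_per_object))
# {0: 1, 2: 1, 4: 1, 64: 2, 128: 1, 512: 1, 1024: 1}
```

Keeping only the binned counts per cluster makes the aggregate data set distribution cheap to merge across many clusters, while the raw samples can be re-binned if the bucketing scheme changes.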
A
If
we
get
the
yeah,
if
we
get
the
age
binning
in,
we
can
also
then
start
seeing
like
how
how
cold
the
the
o
node
versus
o
map
data
is
relative
to
each
other,
like
is,
is
that
that
could
be
really
interesting
to
see.
A
Yeah, we have hit rate type stuff for the BlueStore caches, and I think Adam has something that tells us hit rates for RocksDB, right? Is that what yours does, Adam? I think so, right?
C
Yes, that's what I shared with you yesterday; it just tells the hit rate for the block cache.
A
Yeah, I didn't get a chance to try it yesterday; I ended up following up on some other stuff I was doing. But today I'm going to try applying those to master and run some tests with them.
A
All right, well, let's get started here. Okay, PRs! I did not see anything performance related that was new this week, so if you made something, I apologize, I did not see it. I was quite tired and didn't have coffee yet when I did this, so that's what I'll blame it on. Closed: two PRs that I saw. We merged the initial OSD support for SeaStore, yay! So that's super exciting, and Sam gave an excellent presentation on some of his work.
A
I don't know if that's shared publicly or not, but he went over a lot of his thoughts on, and plans for, that. The other pull request that merged... oh sorry, it was closed: it was this "throttle requests sent to monitors for logging" one, which I think is being closed in favor of a new PR, so we should see something replacing that soon. A couple of updated PRs, though. Adam, let's talk about your _do_...
A
Small
right,
sorry
do
right
boom,
so
smart,
well,
smaller
yeah
yeah!
I
again
I
did
this
before
coffee.
I
think
I
just
kind
of
five
words.
Let's
talk
about
that
after
we
go
through
the
rest
of
the
prs,
because
I'd
like
to
understand
it
better.
Okay,
this
optimized
client,
oh,
comes
knows
these
client
requests
parallelism.
A
I
think
they're
they're
they're
discussing
it
in
any
event,
let's
see
rgw
d3n
cache
changes
that
got
rebased,
but
nothing
else.
I
think
the
work
to
set
container
memory
limits
in
cepheum.
A
That
also
was
rebased
and
I
think
sage,
maybe
updated
it
a
little
bit
and
then
oh,
this
actually
was
supposed
to
be
enclosed.
This
work
to
improve
the
efficiency
of
ordered
meth
listing
by
eric
that
also
merged
okay,
lots
of
stuff
in
the
no
movement
category,
but
I
don't
think
anything
real
interesting
to
discuss
right
now,
all
right,
any
anything
I
missed.
A
All right then, Adam. Josh, I think, explained to me last week why your PR to change things and only do direct I/O inside _do_write_small makes sense, and I thought at the time that I understood what Josh was saying, but I was wondering: could you explain it again? Because I need to remember.
C
If we use buffered-mode writes, then we will pollute the page cache, and it would be better not to eat additional memory, just so that we can give more space to BlueStore buffers. And in addition, that's the only place where we even make it possible to do AIO writes that are buffered; all the other cases are never buffered.
C
So
it
was
more
like
a
cleanup
from
my
my
my
thinking,
because
either
we
we
cache
it
in
all
conditions
or
we
never
make
a
possibility
to
double
cash
in
system
pages
system.
Red
pages
that
that
was
my
my
thinking-
and
there
is
no
really
more
logic
behind
that.
That's
it.
I
mean
there
is
a
logic.
I
wanted
to
make
some
simplification
on
aio
rights,
but
that
was
a
secondary
goal.
D
Well, there is one exception here: if a write request...
D
But other than that... well, actually, I'm not sure if this flag is actually used by any client, but other than that, by default we have this bluestore_default_buffered_write config parameter, which is set to false by default, and this makes all writes...
A
Sorry, so Adam's PR changes it so that when we do an AIO write, instead of using the wctx (the write context) buffered flag, we just set it to false, right? All right. So right now we default to having bluestore_default_buffered_write disabled.
C
Okay, but I'm, like, confused here. There are two things. One is our control over caching data in BlueStore when we do a write, and we have a flag for that; let me verify... bluestore_default_buffered_write, and this tells us to cache the data in BlueStore buffers when we write. That's one.
A
Yes, not always, but usually. So I think I understand now why you want to do this, right? Because if you already have bluestore_default_buffered_write disabled, like we do by default, then you already don't use either cache: you do a direct I/O write, and you don't use our cache. But when you enable it, then most of the time you should be doing both, with the current code.
A
Are we better off caching at the BlueStore layer or at the page cache layer? My instinct would be that we're better off at the BlueStore layer, that we are better off having our own cache there. But what we saw with RocksDB is that, in fact, buffered I/O was far more important than we realized, possibly due to a code bug; we don't know. I just want to make sure we don't end up repeating that same mistake.
C
Okay, that's correct, and I share that concern. That's true. But why would we leave double buffering, basically, only for small writes? Do we remember the logic behind that?
A
Let's just say that maybe it was based on a theory that having a secondary cache at the page cache layer would be better, so we'd have a primary cache at the BlueStore layer and kind of a secondary page cache layer.
A
But maybe if you had one hot OSD and a couple of others that weren't as hot, maybe RocksDB... or sorry, RGW indexes on one OSD or something, and there are tons of omap entries to be cached there, then maybe. But then again, that's RocksDB... the page cache, so their page cache. Okay, so...
A
I don't think you should necessarily close the PR; I don't know that this is wrong. I just think we should test it, but you may actually be right. I'd like us to move away from buffered I/O, especially with libaio; it's not really the way it's intended to be used.
C
So, okay, on the testing front: I can devise a series of small-write tests, I mean, tests that will be comprised of a high percentage of small writes, and try to toggle performance when I crank up buffering in BlueStore. I can do that. I will have some problem with limiting the Linux kernel from buffering my extra data; I don't know how to do that.
A
I can help you with that. You can use cgroups seemingly effectively to do it. That's how I was doing the omap testing, by changing my...
C
Memory limit. So maybe let's split that off to our one-on-one talk, and you will teach me how to use cgroups to limit caching memory, and I will finish and make the tests, just to show what the benefits and costs are. Cool.
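A rough sketch of the cgroup approach Mark mentions, assuming the cgroup v2 layout where `memory.max` bounds a process's memory including its page cache; real use needs root and `/sys/fs/cgroup`, so the example below exercises the helper against a scratch directory instead:

```python
import os

def set_cgroup_memory_limit(name: str, limit_bytes: int,
                            root: str = "/sys/fs/cgroup") -> str:
    """Create a cgroup (v2 layout assumed) and cap its memory.

    The kernel counts page cache against memory.max, so running an OSD or a
    benchmark inside this group bounds how much page cache it can hold.
    """
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "memory.max"), "w") as f:
        f.write(str(limit_bytes))
    return path

# Exercised against a scratch directory; for real use, root would be
# /sys/fs/cgroup, and a process joins the group by writing its PID into
# the group's cgroup.procs file.
import tempfile
scratch = tempfile.mkdtemp()
p = set_cgroup_memory_limit("osd-bench", 4 * 1024**3, root=scratch)
print(open(os.path.join(p, "memory.max")).read())  # 4294967296
```

On cgroup v1 hosts the equivalent knob is `memory.limit_in_bytes` under the memory controller's hierarchy.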
D
...the page cache, this definitely makes sense. Maybe we should introduce an additional config parameter to control page cache...
C
Usage, yeah. That would be cool; that really makes sense. And then just put it everywhere.
D
Yeah, absolutely. And this way at least we would be able to benchmark the BlueStore cache against the page cache and decide which one is more efficient.
D
Yeah, so at this point I agree with those points, so it makes sense to have this patch in, and maybe, additionally, you might want an additional fix, an additional patch, to control the page cache.
C
Okay, so you think I should not reduce that logic, but extend it to provide an extra parameter, just to also allow caching in the page cache, so we could manipulate the parameters and get full control. All right, so I'm...
D
Complicated
so
well,
we
we
can
definitely
go
with
the
current
patch
and
maybe
additionally,
we
might
add
some
more
another.
One
parameter.
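One way to approximate the extra knob being discussed (keep buffered writes, but stop them from polluting the page cache) is to advise the kernel with `posix_fadvise(POSIX_FADV_DONTNEED)` after flushing. This is a Linux-only sketch of the idea, not BlueStore's actual code:

```python
import os

def write_without_polluting_cache(path: str, data: bytes) -> None:
    """Buffered write that then asks the kernel to drop the cached pages.

    Sketch of the "read benefit without write pollution" idea: the write goes
    through the page cache (no O_DIRECT alignment rules), but after fdatasync
    we advise the kernel the pages won't be reused, so they can be evicted
    instead of competing with BlueStore's own caches.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
        os.fdatasync(fd)  # pages must be clean before DONTNEED can drop them
        os.posix_fadvise(fd, 0, len(data), os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

import tempfile
with tempfile.NamedTemporaryFile(delete=False) as t:
    tmp = t.name
write_without_polluting_cache(tmp, b"x" * 65536)
print(os.path.getsize(tmp))  # 65536
```

The advice is best-effort: the kernel may keep pages that are still dirty or shared, which is why the flush comes first.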
A
All right, well, I suppose I'll go into... I've got just kind of a small update on the omap bench testing that I was doing. I'll put the link to the data in the chat window here.
A
So what we saw previously was that there was a really big difference between Luminous and master; we're far faster in master, especially when using buffered I/O, and there was some concern about how much faster we were.
A
So
I
went
back
and
started
looking
at
running
omapbench
against
various
versions
of
the
historic
code
that
we
have
in
some
cases.
I
did
need
to
change
it
somewhat
to
make
it
work.
The
object,
store
interface
has
changed
since
luminous,
so
I
think
everything
was
basically
right,
though.
What
I
saw
is
that
I
don't
think
there
was
any
one
pr
that
necessarily
improved
performance
overall,
but
it
looks
like
there
were
a
number
of
pr's
that
maybe
did.
A
The trick is to actually get stuff to compile, especially on CentOS 8. But in the Mimic time frame, I thought that maybe it was going to be kind of around this PR 2177, where we changed the flushing behavior in BlueStore and also changed the ObjectStore interface. But it turns out that had very little effect.
A
It
didn't
change
anything
at
all,
really
what
it
it
was
and-
and
this
maybe
should
have
been
obvious
to
me
since
I
wrote
this
code,
but
it
was
when
we
introduced
the
osd
memory
auto
tuning.
That
was
what
seemed
to
make
one
of
the
biggest
differences
in
all
of
this
and
and
the
reasoning
for
it
is
kind
of
obvious.
Prior
to
this,
we
we
statically
assigned
the
caches
to
the
owner
cash
and
the
kv
cache.
It's
not
entirely
true.
A
We
had
some
capabilities
in
the
old
code
to
kind
of
try
to
like
rob
from
one
and
give
it
to
the
other,
but
it
was
is
pretty
limited
and
it
didn't
work
right,
so
it
it
never
really
kind
of
did
what
it
was
supposed
to
do.
I
think,
just
from
what
I
remember
of
looking
at
it
at
the
time.
A
So
when
we
introduced
the
osd
memory
auto
tuning,
it
allowed
a
blue
store
to
allocate
almost
all
of
its
pre-available
memory
for
caches
to
say
roxdb
to
the
block
cache.
So
you
could
really
aggressively
cache
omap
if
there
were
very
few
oh
nodes.
So
in
this
case
we
we
only
have
like.
Oh,
this
is
actually
not
right
there.
A
We
only
have
a
hundred
thousand
objects
and
we
have
100
omap
keys
per
object,
and
so
in
in
this
case,
we
actually
did
not
need
a
whole
lot
of
cash.
We
primarily
need
everything
to
be
an
omap
or
caching
omap
entries,
and
so
what
we're
really
seeing
in
this
test
is
actually
that
the
amount
of
memory
that's
available
for
omap
cache.
A
So
something
very
strange
is
going
on
here
because
clearly
giving
roxtv
more
memory
for
the
block
cache
when
in
buffered
io
mode
helps
dramatically,
but
it
doesn't
appear
to
when
we
we
set
a
bluester
buffered
io
to
disable
our
bluefest
buffered
io
to
disabled,
so
there's
still
something
very
strange
going
on,
but
that's
the
pr
that
really
made
the
big
difference
in
terms
of
a
lot
of
these
numbers
that
we
saw
there
have
there.
There
are
some
other
ones
I
mean
still
set.
A
Keys
is
like
twice
as
fast
in
master
as
it
was
back
in
mimic
when
we
merged
that,
so
we've
definitely
had
some
additional
improvements
since
then,
but
you
know
that
was
the
one
that
I
really.
I
saw
that
kind
of
made
the
the
big
difference
in
in
a
number
of
different
places.
A
Comments? All right, well then, if there are none: next, I'll try to take Adam's work. He developed a PR that will record the RocksDB block cache hit rate numbers in our perf counters.
A
So I'm going to try to work on applying his to master and really try to dig into why we see this dramatic difference between buffered I/O and direct I/O in a test that should be reading everything from the block cache. So hopefully next week I will have some numbers there, and we can maybe figure out how to fix it, so that it's performing like we expect it to. Okay, that's it. Josh, would you like to talk about the topic...
E
You added? Oh sorry. Yes, sorry, just a question. I know in previous PRs... so, this is Joshua Bergen from DigitalOcean. I think you've interacted a little bit with Alex Maragon, one of my colleagues, as well, in one of the previous PRs kind of related to this series of topics.
E
I know one of the traces that you had looked at, if I remember correctly, kind of showed that we were pre-fetching over and over and over again. Do you have a memory of that?
A
It looks like, in fact, Igor is the one that fixed it. Igor, maybe you want to talk about this? This is, I believe, your investigation into delete... iteration during delete, where we were only doing 30 objects at a time.
A
Range scans, basically, that were causing extremely slow performance, and I believe it was actually the thing that you fixed, where we were basically re-scanning every 30 objects during deletion, in PG deletion.
E
The reason I was asking about this specifically is that we are running buffered I/O right now, but we've run into a number of corner cases, even with buffered I/O, where omap performance gets really, really bad with RocksDB on BlueStore, to the point that we're actually considering moving back to FileStore for our indexes right now. Oh okay. So in particular, there have been a few cases where we've seen this happen, but a very easy way for us to reproduce the issues that we see is: say you have a big omap.
E
Let's
say
it's:
2
million
objects
or
something
or
2.2
million
keys
in
the
old
map,
and
then
you
have
a
loop
external
that
is
doing
a
list
and
then
deleting
a
range
of
keys
list
during
delete
range
of
keys
over
everything
repeat
and
what
we
find
is
we'll
start
off
deleting
at
say
a
thousand
keys
per
second,
and
that
rapidly
drops
off.
I
should
say
rapidly
over
time
drops
off
down
to
like
100
keys
per
second
with
this
osd
100
cpu
usage.
A
I have seen something that looks a lot like that: when we were doing delete-range, you ended up with a ton of tombstones that you were iterating over, but as soon as you compacted, everything got good again. Exactly, yeah. Is that what you're seeing?
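A toy model of the tombstone behavior being described (not RocksDB itself): deletes leave tombstones in the sorted keyspace, each list pass must step over every tombstone written so far, and only a compaction makes iteration cheap again:

```python
# Toy LSM: live keys and tombstones share one sorted keyspace, as in an
# uncompacted run of RocksDB SST files.
class ToyLSM:
    def __init__(self, keys):
        self.entries = {k: "value" for k in keys}  # None marks a tombstone

    def list_first(self, n):
        """Seek-to-first then iterate: must step over every tombstone."""
        out, scanned = [], 0
        for k in sorted(self.entries):
            scanned += 1
            if self.entries[k] is not None:
                out.append(k)
                if len(out) == n:
                    break
        return out, scanned

    def delete_range(self, keys):
        for k in keys:
            self.entries[k] = None  # a tombstone, not an actual removal

    def compact(self):
        self.entries = {k: v for k, v in self.entries.items() if v is not None}

db = ToyLSM(range(10_000))
scans = []
while True:
    batch, scanned = db.list_first(1000)
    scans.append(scanned)
    if not batch:
        break
    db.delete_range(batch)
# Each pass re-scans all earlier tombstones, so per-pass work keeps growing:
print(scans)  # [1000, 2000, ..., 10000, 10000]
db.compact()
print(db.list_first(1000)[1])  # 0 after compaction: the tombstones are gone
```

This reproduces the shape of the reported slowdown: constant-size delete batches, steadily growing scan cost, instant recovery after compaction.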
D
Yeah, and just to add to Mark's comment: perhaps it's not necessary to have range deletes to get this degraded RocksDB state. It seems to me that multiple removals impact RocksDB performance.
D
So
originally
we
were
thinking
that
we
just
get
this
degradation
after
deleted
range
deletes,
but
it
looks
like
every
every
delete
impacts
that
maybe
single
deletes
not
that
impact
that
not
that
badly.
But
if
you
apply
multiple
of
them,
they
still
impact
the
performance.
D
And yeah, well, in my second PR, which is still pending, I reworked the omap removal a bit, and it triggers range deletes followed by a ranged compaction, but this is applied to each removal.
E
A mixture of Nautilus and Luminous. Nautilus: a mixture of 14.2.8, 14.2.11, and 14.2.18.
D
Just in case, what drives are you using?
E
The new generation are Intel NVMe, two-terabyte, four-terabyte... why am I blanking on this? I'm not sure, but pretty fast; the drives themselves are pretty fast. When we run into issues, it's almost always CPU.
E
Until compaction happens, yeah. So, you're talking about, like, triggering compaction. My recollection (I walked the RocksDB code on the compaction path a couple months ago, so I'm going by memory here) is that they do take some level of tombstones into account in their heuristics, in terms of what files they should be compacting in their background compaction.
E
So they do try. One of the worries we've had is that, at least in Nautilus, the default configuration for RocksDB might be hurting the background compaction processes in a way that a FileStore configuration doesn't, in that it sets an older option for the number of threads to use for compaction, whereas the RocksDB version...
E
...has the capability to have, like, a set of flusher threads, a set of compaction threads, more concurrency, etc. And I'm pretty sure FileStore by default does not override any of those options, and so it might actually have better background compaction behavior; maybe that's what's compensating here as well. We haven't had enough time to really experiment with that, though.
A
I don't remember what versions we were worried about, but there was a point at which we were worried about a data corruption issue with multiple compaction threads and older versions of RocksDB, but it might have been pre-Nautilus.
A
We
were
trying
to
be
really
careful
about
not
not
doing
anything
that
was
gonna.
Potentially
you
know
have
problems.
F
Yeah, I think you're right, Mark. I remember in Nautilus we did update the RocksDB version, because those corruption issues were fixed, yeah. And also we did increase the number of compaction threads, if I remember correctly.
E
Honestly, we haven't had enough time to really experiment with that at scale in production. We did a major discovery on this, I want to say, about a month and a half back, but that was all in controlled lab environments.
A
In the testing that we did back in the Nautilus time frame, we did see that there's a sweet spot where, if you have too many compaction threads, it actually slows things down slightly. It's probably like four, between four and eight, I'm guessing. Two was pretty good; at least in our tests, it looked like the benefit after two started, you know, slowing down significantly, and then after you went above, like, six or eight or something, it kind of plateaued or even decreased.
E
Right, okay, yeah. I think we probably just have to do a little bit more experimentation and digging on our side to come up with more solid "this works, this doesn't" sort of information, yeah.
E
The other thing I want to comment on, and I don't want to take over if there's other stuff you want to discuss in this meeting, but the discussion earlier about whether or not to write through the OS cache: that's particularly interesting to me as well. We have... we have...
E
We have older clusters where we're basically running RocksDB right on the spinners; like, we don't have separate flash for the RocksDB portion. And what we're finding there is a really weird pathological behavior, where you write through the OS cache and then do a flush range, essentially, right across a bunch of buffers, and tell it to flush out to the back end. The OS is writing those as a whole...
E
...bunch of, I think, 512-byte, so like sector-block-size, writes down into the lower layers, which are supposed to gather them back up into big writes again via the I/O scheduler. What's happening is that the dm-crypt layer is slow enough that little bits of those big writes are actually leaking through and being scheduled to the disk, and so, instead of having, say, a 512-kilobyte write to the disk...
E
...that's supposed to be just one big chunk, it actually leaks through as five or six, and then we're missing rotations on the disk, which is causing a massive, well, I shouldn't say massive, but a large increase in write latency for those writes to the disk. So I actually started writing a patch, and I kind of abandoned it, I got a bit nervous about it, where I stopped writing anything through the OS cache in buffered I/O mode, so we would get the read benefit from buffered I/O but avoid using the cache on the write. Interesting.
E
...to know, though: now, with Igor's fix, like, the reason we had buffered I/O on was because of the PG deletion, so we're waiting to upgrade that system to 14.2.18, and then we'll just be disabling buffered I/O on the spinners at that point, because that's really the only thing we know was biting us there, the PG deletion, and then that should address the write latency problem for us as well. But I thought it might be interesting for you to know that.
A
Absolutely. All right, Adam, sounds like that's an endorsement of your PR.
A
All right, Josh, I know we don't have a ton of time, but did you want to try to talk about your topic, or should we move on?
B
Let's see... next week. That's good timing from YouTube. Okay.
A
All right, well then, anything else, guys, that we should talk about this week, or should we wrap?
E
Okay. I posted this to the dev mailing list, and I'm sorry to bug you on this again. We do have some really old clusters that are still using FileStore on spinners, and we are constantly plagued by the, oh no, the inode iteration issue in XFS. At one point Sage said, "hey, a newer kernel fixes this," but unfortunately he didn't specify which one. I was really hoping that maybe someone somehow remembered which kernel has better inode scanning during flush range, and if the answer is no, that's fine.
A
Yeah, no problem. One question for you, actually, on your FileStore clusters: have you done okay with multiple AGs on XFS with FileStore? That was the thing that really kind of made FileStore fall over a lot of times.
E
Oh, the... our FileStore, well, FileStore falls over all the time for us. Both systems are awful, and rolling them to BlueStore is difficult, because they're so slow that we can pretty much only do one backfill at a time per OSD in FileStore, and yeah, they're difficult. So we're trying to get to BlueStore as fast as we can on those, but "as fast as we can" is probably a one-year-plus project on those clusters.
A
I don't... I have no idea what your FileStore deployments actually look like, but if they're like any other FileStore deployments I've seen with lots of objects, your directories are just fragmented to heck and back, you know, really, really bad, across multiple allocation groups, assuming that you haven't tuned it down to one.
E
Yeah, I don't think we've done any custom tuning on those; I'd have to look at what we have. And they do have very, very high object counts, yeah. So...
A
So, I don't know, I posted some stuff about this years ago, but the basic idea is that when FileStore does a split, a directory split, it will then move objects into those directories. But they originally were in a directory that was assigned to one AG, and the new subdirectories are going to be in different AGs; probably, most likely, they're in different AGs. It's not guaranteed, but most of the time they will be. And so then the first set of objects that ends up in that AG, sorry, in that directory...
A
In
that
subdirectory
are
from
a
different
ag.
Then
new
objects
will
be
put
into
it.
So
you'll
end
up
over
time
as
the
directory
tree
gets
deeper
and
deeper
and
spread
wider
with
just
a
completely
random
mix
of
objects
at
different
offsets
on
the
disk
inside
one
directory,
and
so,
if
you're
going
to
scan
that
directory,
the
whole
thing
is
just
insanely
bad.
It's
just
random.
I
o.
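A toy simulation of the split behavior just described, with made-up split thresholds and AG counts: each new subdirectory is assigned its own allocation group, but moved objects keep the extents they were originally written to, so leaf directories end up holding objects scattered across many AGs:

```python
import random

random.seed(1)
NUM_AGS, SPLIT_AT, FANOUT = 8, 16, 4

class Dir:
    """Toy FileStore collection directory."""
    def __init__(self, ag):
        self.ag = ag        # AG that new files in this directory allocate from
        self.objs = []      # AG where each stored object's data actually lives
        self.subdirs = []

    def insert(self):
        if self.subdirs:
            random.choice(self.subdirs).insert()
            return
        self.objs.append(self.ag)  # new object allocated in this dir's AG
        if len(self.objs) >= SPLIT_AT:
            # Split: each new subdir lands in a (most likely different) AG,
            # but the moved objects keep the extents they already had.
            self.subdirs = [Dir(random.randrange(NUM_AGS))
                            for _ in range(FANOUT)]
            for ag in self.objs:
                random.choice(self.subdirs).objs.append(ag)
            self.objs = []

    def leaves(self):
        if not self.subdirs:
            return [self]
        return [l for d in self.subdirs for l in d.leaves()]

root = Dir(ag=0)
for _ in range(5000):
    root.insert()

mixed = [len(set(leaf.objs)) for leaf in root.leaves() if leaf.objs]
print(max(mixed))  # leaves end up holding objects from several different AGs
```

Scanning any one leaf directory then touches extents spread across the whole disk, which is the random-I/O effect described above.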