From YouTube: Ceph Performance Meeting 2020-10-29
A
All right, looks like we've got folks trickling in now. That's good. All right, let's get this thing started. So I saw two new PRs this week. One was a performance improvement for creating mirror snapshots in RBD asynchronously; Jason did a review of that.
A
I think that's actually just a build change, but he reviewed it, so I'm not sure how much of a performance improvement that provides. But I picked it up this morning, so hopefully we'll at some point be able to track that. There are two PRs this week that closed. The first is the SSD RBD cache, I think; Jason finally thought that was good enough to merge, so that has been merged.
A
I believe that was quite a bit of code, so hopefully we'll be able to harden it and get to a point where folks can use it, but this is the initial PR, so good. What else? Oh, this bufferlist rebuild one — it looks like that has been replaced, and I did not catch the replacement in this, so I'll get that in here as well, but there's a new PR for that.
A
All right, updated. Adam, if you're here — other Adam — okay, we don't have core Adam. There's this BlueStore dynamic levels PR that has been there for a while, and they noticed that it was not behaving properly with Adam's column family sharding work, so I think they are requesting a review. Oh, Adam, great.
A
Hey Adam, I was just talking about this PR that changes — or provides an initial implementation for — doing dynamic leveling in RocksDB. They've requested a review from you because they said it's not working right with column family sharding, but they think it works right when they don't use it. So they requested a review from you on that.
A
Sure, yep — just if you have time; I know everyone's super busy. Okay, and then the other updated PR I have is just my very, very old cache binning code.
A
Most of that is actually in master already; it got broken up into multiple separate, smaller PRs, but the one remaining piece is the age binning code itself. All the other things that led up to it are in, so that is now being worked on in a separate branch.
A
I suppose I should probably include that here, but we'll talk about it a little bit more coming up, so I won't discuss it now. Let's see — otherwise there wasn't a whole lot. Adam, I think you've been reviewing Igor's changes, but it looks to me like it failed tests; otherwise I think you were really happy with it, right?
B
I'm very happy with Igor's improvements for deleting PGs, especially with the first part, which is really clean and gives a big improvement. I'm not so keen to go with the second part, which changes our format and only gives us a boost if our PGs have a lot of omap data. That's the only case where the second part really shows its performance boost.
B
But if you have a lot of omaps in your PG, then the deletion is like 10 times faster using the second part. But yeah, I would still prefer —
B
I think that when we change a format, we should extract the format change into a separate PR and then just make use of it with the additional code that takes advantage of it. Yeah, that's it. Okay, okay, cool.
A
Cool, okay. Let's see, what else? Okay, there's your code for making it so that we're more choosy about how we assign our cache to cache shards. I thought you tagged that do-not-merge — is that... yes, it's still —
B
It's still do-not-merge. Some of this, I think, should go into the final code, but I guess we should first see cache aging — cache binning — land before the actual algorithm for being more aggressive with allocating space to shards is revised. This is because with the current implementation, I feel like it will make it very slow for some previously under-allocated shards to exit that state, before they can, in subsequent iterations, get enough pressure to grow from a very low allocation.
A
Okay, yeah, that makes sense. Do you have a feeling one way or another about having the cache shards independently compete for age binning and priority cache?
B
Yes — I'm now considering that this is just making more elements, more caches, without any actual benefit for the process of properly allocating sizes to those entities.
A
Cool, all right. Well, I'm sure we'll have more to discuss on that one as we get more results from the testing we've been doing.
A
All right, well then, Josh, I see the PG autoscaler is the first thing on our list here. So do you want to take us away on it?
C
Yeah, sure. I wanted to talk about this first since it's getting late for Junior, so I wanted to get to it earlier. We've talked about this a bunch in the past and had various ideas, and we wanted to improve the out-of-the-box behavior, so that when you initially install stuff you can get good performance from the get-go. What we talked about doing in the past was using, like, the full budget of PGs —
C
— that we have, for all the pools, to start with, and then only scaling down the number of PGs we have when there's pressure from other pools that actually use them and need more space or parallelism.
C
So this essentially means changing the way the autoscaling algorithm works. Instead of going up by capacity, it starts with all the PGs allocated and shifts them around if there's a very large difference in utilization later, while keeping the behavior where it only makes a change to pg_num if it would differ by a factor of two, so we don't have any kind of oscillation or small changes — it's data movement, and that's pretty expensive.
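
As an illustration of the hysteresis rule just described, a minimal sketch (hypothetical code, not the actual mgr autoscaler): only propose a pg_num change when the ideal value differs from the current one by at least a factor of two, so small utilization shifts never trigger data movement.

    #include <cstdint>

    // Only adjust pg_num when the ideal value is at least 2x larger or
    // smaller than the current one; otherwise leave the pool alone so
    // small utilization changes never cause rebalancing.
    bool should_adjust_pg_num(uint32_t current_pg_num, uint32_t ideal_pg_num) {
      if (ideal_pg_num >= current_pg_num * 2) return true;   // grow
      if (ideal_pg_num * 2 <= current_pg_num) return true;   // shrink
      return false;
    }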
A
So, agreed 100% — I think that's a good thing. I still don't... well, I guess the thing that I keep coming back to is that it seemed like we could get some of the benefit by scaling the PG log length, rather than reverting back to having the autoscaler as the first line of attack.
A
Yeah, yeah, exactly — instead of first adjusting the number of PGs per pool to control the distribution of PGs across the OSDs, I wonder if it might be lower impact.
C
Are you describing — are you thinking of, like, having a higher default target for them, where the PG log length comes in? Because that's kind of like a secondary —
A
So I see two separate limitations that prevent us from having more PGs per OSD. One is the amount of memory that we consume for the PG log, and the other is the kind of overhead of just having too many PGs in the cluster in general. There's probably a third thing of, like, too many PGs per OSD, but I don't think we're actually anywhere close to that yet. I think it's those other two things that really limit us.
A
So it seems to me like we could eliminate some of the problems that we face by implementing a total, overall, cluster-wide PG count limit — and if we haven't hit that yet, then, you know, we're free to have more PGs — and then also scaling the PG log length on a per-pool basis, so that you could have more PGs overall on the OSD but still scale things back to control memory by scaling the PG log length. Right? You're not reducing the overall number of PG log entries; you're just changing the dynamics of whether you have long PG logs with few PGs in that pool, or more PGs in that pool with a shorter log length for that pool.
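
A rough sketch of the trade-off being described (hypothetical names and numbers, not actual Ceph code): hold the total number of in-memory PG log entries per OSD roughly constant by shortening the per-PG log as the PG count grows, with a floor for things like dup detection.

    #include <algorithm>
    #include <cstdint>

    // Keep (pgs_on_osd * per-PG log length) near a fixed per-OSD budget.
    uint32_t pg_log_length_for(uint32_t pgs_on_osd,
                               uint32_t entry_budget_per_osd = 300000, // assumed
                               uint32_t min_len = 250,    // floor, e.g. dup detection
                               uint32_t max_len = 10000) {
      if (pgs_on_osd == 0) return max_len;
      return std::clamp(entry_budget_per_osd / pgs_on_osd, min_len, max_len);
    }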
E
How do you deal with different kinds of pools, like ones that are supposed to consume more? Just as an example, how do you differentiate between a data pool and an index pool in RGW?
E
Do you start out with the same number of PGs and weight, or just the PG log length? I mean, technically you'll see different patterns in different pools, right?
C
So, like, for metadata pools we set this to, like, four, so today the autoscaler uses that to consider them as if they were four times their size. I think we could use that in the reverse direction — treating things with fewer PGs, where we say this pool would get, like, a quarter of the allocation that it would get if it were a data pool.
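
For reference, the knob being described here appears to be the pool's pg_autoscale_bias property, which can be set per pool with something like: ceph osd pool set <pool> pg_autoscale_bias 4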
A
My hope is that if we went down this route, we could have enough PGs per pool, for a reasonable number of pools, that concurrency would no longer be a problem — we'd be able to have a high enough minimum per pool that concurrency isn't really the problem anymore. It's more about making sure the data distribution is good, which we have the balancer for, so hopefully we don't have to worry about that either.
A
I think what we face right now is that if you were to create a large number of pools, either we have so few PGs in each one that we lack concurrency, or we have such a high minimum — say, 16 or 32 PGs per OSD in that pool — that with a large number of pools you end up with a ton of PGs, and then we end up in situations where, like with containers, the OSD could actually go out of memory.
C
That's kind of the important direction we're going, but we need to have some limits, right? I mean, even with a dynamic log length, there's still some minimum we need for dup detection, so we can't get rid of that memory use entirely.
A
Like, we're hard limited, right? Like you said — 10, you know, as an example — if we have 10 pools that each have a minimum of 32 PGs per OSD per pool, that's still like 320 PGs on the OSD, which, you know, yes, we can make work right now with four gigabytes; we can shrink the caches a little bit, and I don't remember what that would actually work out to be with our defaults for PG log length.
A
But I guess my thought was that if we kind of turn this around a little bit, we might be able to support a larger number of pools.
A
That, then, would have — I guess, yeah, like you said — shorter overall numbers of dup entries and PG log entries per pool.
A
Okay, yeah — as long as we really can just say 10 pools, you know, is, like, the max you can do per cluster. It just seems like it could hurt us in the long run. Maybe not.
C
You know, with other mechanisms — like namespaces for multi-tenancy, or other kinds of groupings for isolation — one of the things that Sam has talked about is kind of an overlay on top of CRUSH, where you have, for example, RBD images using a subset of the cluster, so they're not necessarily spread out: even though they're in a given pool, they're spread over a smaller area within the cluster, so they have a kind of reduced —
C
— potential fault area. This also helps with some other ideas, kind of respecting the total number of hosts that you need to talk to as well.
A
Oh, this will be my last attempt at this, and then I'll give up on it. But if we have, like, just completely cold pools — no one's been doing anything on them for a while; like, RBD hasn't been touched in months or whatever — do we still really need to keep those dup ops and other PG log entries in memory? Does it make sense, on a per-pool basis, to continue keeping those around, or could that memory be better used for other things?
C
I'm not sure that that's a very common problem, necessarily, but I think that's something that we could solve kind of orthogonally to what we've been talking about.
E
I mean, if those PGs that are on those pools are all active and clean, then probably, yeah, we don't need the PG logs as much. But in some scenarios there are these corner cases where there are unclean PGs, and there we definitely want to keep the PG log around, because whenever you want to use those pools again, you'd need them to recover, right?
E
Actually, going back to the initial discussion: let us assume for the moment that we don't have more than 10 to 15 pools. What is our current bottleneck, and what can we do to address it — like, whether this back-pressure mechanism is the only way to go about it, or is there anything else we need to do?
A
And just, like, create a pool with, like, 100 PGs per OSD starting out — you're great. It's when you've got, like, nine already and you're creating the tenth that it doesn't help as much.
C
Yeah, this still helps in that case — you still get that, assuming your cluster is large enough to handle that many pools.
A
You would presumably already be using some distribution of the available PGs for the other pools, and as soon as you add that tenth, it would then have to steal stuff from the other pools to add PGs for the new one.
C
So I think we wouldn't necessarily want to be stealing PGs; in many cases, maybe it's purely adding more.
C
I agree this isn't really that relevant for, like, a tiny cluster, where you're going to be constrained with each pool using the minimum, and it's going to get the full parallelism because there aren't that many OSDs — but for a larger cluster...
A
Yeah, let me make it clear: I'm not against any of this. This is actually — this is absolutely good. It's more just that the worst thing in all of this is the data movement, right?
C
Yeah, and you bring up some good points there. I think, orthogonal to this PG part, there's the way we store PG logs in memory; we could probably introduce ways to get rid of a lot of that memory usage when it's not necessary.
C
Like, make it a little more diversified — like having it use a different scheme.
A
Yeah — we really, someday soon, are going to need to be able to keep a lot more PG log entries and dup entries, I think.
C
Yeah, maybe with a very reduced subset in memory, just for the dup detection.
C
Yeah, I think one thing I'd like to do is test with kind of artificial maps that represent real clusters for different use cases, and see what the algorithm's behavior would be.
A
As you work on this — just, kind of like I said earlier — anything you can do to avoid data movement, I think, is going to be really, really useful and beneficial.
A
That's why the small movements are bad, right? Like, you know, adding a PG one at a time, slowly over time, constantly rebalancing stuff — a lot of people were complaining about that, like, a year or two ago when this was first implemented: they create a new pool, it gets like eight PGs or something, and then, you know, as they start filling it with data, it's constantly rebalancing and slowing down.
G
So I guess, like, if the target is, like, two times the amount of what we need to scale, we just have to — or we can — increase it to three times, I don't know, so that it avoids, like, repeating, as Joshua was saying.
C
Yeah, I think we can maybe, like, test out how it would function: if you had, like, a cluster that's set up originally with this many pools, and then you add a new pool and start filling it up, at what point would it start moving things around? And we can see whether that would be too much movement, and at that point, if we want to change that threshold, what the impact would be.
C
I think the nice thing about this is that the autoscaler is kind of relatively isolated — it doesn't need a whole lot of external stuff — so we can test this algorithm pretty much by itself, in a very interesting way. We don't need to, like, set up giant clusters to figure out what the behavior is.
A
It's pretty easy with the current code to see the impact when you create a new pool and just start, like, throwing data at it, too — you'll see it right away, with it on and off, the difference between them. So that's, like, you know, a good starting point to look at.
A
Oh, thank you. All right — well, let's move on to onode memory usage and cache age binning. Adam has been doing tons of work in this area.
A
He fixed a bug in the existing cache code that was preventing us from properly growing the data cache — the buffer cache, I believe — and also got the old implementation of the onode double-caching PR working with column families, and then on top of that implemented the old cache age binning PR, and has all of this kind of working again. Adam?
B
There is nothing new from my side since last week's performance meeting; I've been entirely dedicated to testing Igor's PR and fixing stuff with that. But I can replay what the result was. Basically, when I tested your bundle — meaning your original PR base, and also age binning, and also fixes — the limiting factor seems to be that the sizes of the KV caches, both the onode cache and the regular RocksDB block data cache, just kept growing, and they never yielded any data. And I made a detailed analysis —
B
— of what the actual elements in those caches are. I mean, not actually what the content is, but evidently RocksDB leaves entries in our caches that are just very old. They can be one or two hours old, even with intensive testing: only the new ones are being used, but the old ones are just lying around, and they eat up our space. So that's most of the new interesting stuff I can say about the caches.
A
And that was the exact same thing that I saw, Adam. Let me share my screen. So what we have here is a comparison of the different intermediate steps that Adam and I were taking, going through some of these PRs — older code and newer code that we've been working on. This first graph is master, and there are some problems in it, the biggest one being that this was before Adam's fix for the data cache.
A
The data cache was not properly growing, but also we get into a state where basically the meta cache and the KV cache are being distributed equally, even though that's maybe not ideal — you can see that the yellow line and the orange line are almost exactly the same.
A
As we move on through the different code that we did, you can start seeing some changes in this. I'm going to skip over graph number one and just go to graph number two here. That's with Adam's fixes for the data cache — so now it's actually getting more memory, as it should — and this is also including the onode double-caching fix.
A
So we see that the KV cache is now actually much smaller than the meta cache — it's getting much less memory; the yellow line is much lower than the orange line.
A
Number three is where cache age binning is first introduced. In this case we actually see that the green line and the yellow line — this is the KV onode cache and the KV cache, basically RocksDB's versions of these caches — are growing, like Adam said, kind of without bound, and it turns out that's all in the first priority level, which is for RocksDB indexes and filters. Not great.
A
It's not good, but it turns out that it looks like RocksDB doesn't actually delete these things; it just waits for them to be expired from the cache. And because we're prioritizing caching these things, they end up getting even more priority than BlueStore's own onode cache — they're kind of like the highest-priority thing in this implementation of the code — and so they just keep growing, because they're never deleted.
A
So if we keep scrolling down — I'm actually going to go all the way down to graph six here.
A
If we change it so that we don't prioritize RocksDB's indexes and filters in pri 0, and instead have them compete with all the other KV entries in RocksDB's caches, we can eliminate this. So instead of seeing, in graph four here, this kind of green and yellow line growing and growing and growing, when we have those compete with the other KV cache entries their share stays relatively small, and instead we give a much greater percentage of the cache to other things.
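
For context, a sketch of the RocksDB knobs involved (an assumed standalone setup, not BlueStore's actual wiring): caching index and filter blocks in the block cache with the high-priority flag set is what lets them crowd everything else out, and clearing that flag makes them compete with ordinary cache entries, as described above.

    #include <rocksdb/cache.h>
    #include <rocksdb/table.h>

    rocksdb::BlockBasedTableOptions make_table_options() {
      rocksdb::BlockBasedTableOptions opts;
      opts.block_cache = rocksdb::NewLRUCache(
          512 << 20 /* capacity */, -1 /* num_shard_bits */,
          false /* strict_capacity_limit */, 0.0 /* high_pri_pool_ratio */);
      opts.cache_index_and_filter_blocks = true;
      // false = indexes/filters compete with other entries instead of
      // sitting in a protected high-priority pool.
      opts.cache_index_and_filter_blocks_with_high_priority = false;
      return opts;
    }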
A
The available memory goes to the onode cache, which is this orange line. Before, the orange line never exceeded one gigabyte; in the new version, where the indexes and filters compete with other KV entries and the overall KV cache competes with the onode cache —
A
— then we see that the onode cache gets much more memory and is actually working a little bit better. Also, in these graphs you can see that everything's kind of working properly: initially we don't have a lot of onodes, so the data cache gets a lot of memory — that's the teal line.
A
The light blue line initially ramps up quickly during the pre-fill stage, but then, as we add onodes and grow the onode cache, we start seeing that the data cache shrinks and goes down. From some of Adam's testing, I believe he was actually seeing that if he alternated between workloads that had lots of onodes and few onodes, we would see cache memory being given back to the data cache, which is kind of exactly what we want in all of this.
A
So I'm hopeful that the code is actually doing what we want it to do; it's just a question of figuring out how to ensure that all of this works well. One thing we do still see in all of this is that tcmalloc very quickly fully fragments memory with all these small objects that we deal with — the creation and deletion, especially in separate threads, of all these small objects. The autotuning code more or less works.
A
But as long as we have all this fragmentation, we see that as soon as the 4K random write workload starts after the pre-fill stage, the amount of memory available for the cache just plummets down to a very, very low value compared to what it used to be, and then we kind of regrow it here, as tcmalloc works things out and maybe helps get rid of some of the fragmentation.
A
But still, we fragment memory terribly in the OSD with all the objects we create and delete. So if we want to improve this, we probably need to figure out how to avoid the kind of behavior that we currently have.
B
Mark, maybe I will rephrase, because I think that's not very clear, I'm sorry. The problem is that when we had pressure from the data cache that forced the metadata cache to drop some old entries, we did actually see the metadata cache properly drop some onodes — that was very good — but unfortunately we couldn't get the data cache to get the memory, because the memory left behind after being used for metadata was so fragmented.
A
Yes, yes — this is our problem. So, Adam, I'm thinking, if we can use Radek's implementation of the opportunistic ring buffer for allocating some of these things, maybe we can improve the situation. We can't fix it, but maybe we can sort of make it a little —
B
— better. Unfortunately, I am skeptical, because as I understand it, this allocating in a ring buffer works very well when we have short-lived objects that we can push into the ring buffer and then reuse it anew; but for the cache it's just totally the opposite — we expect to hold items there for a long time, and that's more problematic.
A
I propose that we take Radek's work and make a modified version of it that allocates the ring in chunks, and then, if we have a chunk that still has some old entries in it, we set it aside, and then over time we look at the old chunks and compact them back into the ring if they only have a couple of entries left in them.
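
A very rough sketch of that chunked-ring idea (hypothetical code, not the actual PR): bump-allocate from the newest chunk; when the ring wraps, a chunk that still holds live entries is parked on a retired list instead of blocking reuse, and drained retired chunks are reclaimed later.

    #include <cstddef>
    #include <list>
    #include <memory>

    class ChunkedRing {
      struct Chunk {
        static constexpr size_t kSize = 64 * 1024;
        std::unique_ptr<char[]> data{new char[kSize]};
        size_t used = 0;  // bump-allocation offset
        size_t live = 0;  // bytes still referenced by surviving entries
      };
      std::list<Chunk> ring_;     // chunks in allocation order
      std::list<Chunk> retired_;  // wrapped chunks still holding live data

     public:
      void* alloc(size_t n) {  // assumes n <= Chunk::kSize
        if (ring_.empty() || ring_.back().used + n > Chunk::kSize) {
          if (!ring_.empty()) {
            if (ring_.front().live > 0)
              retired_.splice(retired_.end(), ring_, ring_.begin());  // park it
            else
              ring_.pop_front();  // fully dead: safe to drop
          }
          ring_.emplace_back();
        }
        Chunk& c = ring_.back();
        void* p = c.data.get() + c.used;
        c.used += n;
        c.live += n;
        return p;
      }

      // Entry release (decrementing the owning chunk's `live`) is elided.
      // Reclaim retired chunks that have fully drained; compacting sparse
      // ones back into the ring would mean copying the survivors, which
      // needs cooperation from the cached objects themselves.
      void reclaim() {
        retired_.remove_if([](const Chunk& c) { return c.live == 0; });
      }
    };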
H
I think the concept people need to agree on here is that if you know that you can do that many IOPS, it means the amount of items in the system is pretty much constant. I mean, at your biggest size — think about a machine that could do, say, a hundred thousand IOPS: you don't need more than a hundred thousand objects, so you pre-allocate a hundred thousand objects, and then you don't need to reshuffle them. Now, you don't have to pre-allocate the full hundred thousand.
H
Maybe you do 50,000, which is going to come from the recycled pool, and if you grow beyond that, then you might need to do some different allocation.
H
So if we could assign them to pools, and every I/O could allocate stuff from the pools, then you have some close estimate of how many objects you need of each type.
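
A minimal sketch of that pre-allocated pool idea (hypothetical): if a machine tops out at roughly N concurrent operations, pre-allocate about N objects of each hot type once and recycle them through a free list, so steady-state I/O never touches the general-purpose allocator.

    #include <cstddef>
    #include <vector>

    template <typename T>
    class FixedPool {
      std::vector<T> storage_;  // allocated once, up front
      std::vector<T*> free_;    // recycled slots
     public:
      explicit FixedPool(size_t n) : storage_(n) {
        free_.reserve(n);
        for (auto& slot : storage_) free_.push_back(&slot);
      }
      // nullptr signals "grew past the estimate": fall back to a slower path.
      T* get() {
        if (free_.empty()) return nullptr;
        T* t = free_.back();
        free_.pop_back();
        return t;
      }
      void put(T* t) { free_.push_back(t); }  // recycle instead of delete
    };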
A
I agree with you regarding things like in-flight objects for IOPS, for getting those through. I agree with Adam, though, that I don't think we can use the ring as-is for long-term storage of LRU cache onode entries.
H
If the constant part all arrives from a common pool, then you could just keep recycling them, and the dynamic part you could allocate from something like you described — some kind of a buddy allocation system, where you can allocate stuff in power-of-two-sized objects, so when you stop using them, you can aggregate them back later.
B
Let me explain why I think this is the problem. When we store an onode representation in BlueStore's onode cache, it's a set of C++ objects which are linked together by pointers; some of them are stored in maps and some in intrusive lists.
B
At this point, I imagine they do the proper thing, and their bins are related to their respective sizes.
B
So if I have multiple of those objects loaded consecutively, then I have fully intermixed data from different objects in very close memory regions, and this is why I think we end up with fragmentation: when we arbitrarily delete some objects from the cache, we just create a lot of holes, but never contiguous regions, because we never had a chance.
B
So the easiest implementation — okay, the easiest implementation would be to have some mechanism that forces all allocations belonging to a single object to fall into a single bin.
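
A sketch of that single-bin idea (hypothetical): give each onode its own small arena, so the maps, lists, and buffers hanging off one onode land in one contiguous region, and evicting the onode releases that region whole instead of punching small holes across the heap.

    #include <cstddef>
    #include <memory>
    #include <vector>

    class OnodeArena {
      static constexpr size_t kBlock = 4096;
      std::vector<std::unique_ptr<char[]>> blocks_;
      size_t offset_ = kBlock;  // forces a block on first alloc
     public:
      void* alloc(size_t n) {  // assumes n <= kBlock (onode pieces are small)
        if (offset_ + n > kBlock) {
          blocks_.emplace_back(new char[kBlock]);
          offset_ = 0;
        }
        void* p = blocks_.back().get() + offset_;
        offset_ += n;
        return p;
      }
      // Destroying the arena frees every allocation of this onode at once.
    };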
A
Yeah, yeah, exactly. Maybe we say that these entries that were long-lived — they survived the rotation of the ring, so that means they weren't short-lived entries — we put them into a different memory space: because now we know that they're long-lived, we put them in long-lived storage rather than short-lived storage.
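
A sketch of that promotion idea (hypothetical names): an entry that survives a full rotation of the short-lived ring is copied into a separate long-lived region, so known-long-lived data stops pinning ring chunks.

    #include <deque>
    #include <string>

    struct Entry { std::string key; /* ... cached payload ... */ };

    class LongLivedArena {
      std::deque<Entry> slots_;  // stable addresses, grouped storage
     public:
      Entry* adopt(const Entry& e) { slots_.push_back(e); return &slots_.back(); }
    };

    // Called when the ring wraps and finds `e` still live in the oldest
    // chunk; the caller re-points references at the returned copy.
    Entry* promote_survivor(const Entry& e, LongLivedArena& arena) {
      return arena.adopt(e);
    }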
C
Mark, with all these allocations, do you have an idea of which kinds of objects those are, primarily?
C
I think it'd be really interesting if we could map those back to, like, which objects are using them, and see if it would make sense to do some kind of slab allocation for those kinds of objects, perhaps.
C
— that aren't allocated in bufferlists. So I mean, the buffer recycling thing makes sense for some parts, some objects, but not others, right?
A
Yeah, yeah. Some of the things that we probably should try to zero in on are any objects that are created in one thread and deleted in another. I don't have a really great understanding of where we actually do that, but I know it's been talked about in the past.
A
I mean, I would assume that it's related to stuff we create, like, in a tp_osd_tp thread and maybe delete in the kv sync thread — or, you know, maybe there are other threads that can do things like that.
A
Adam, I think the fact that you saw that — that when we do all these onode creations and then later have fragmented memory, so that we can't give it back to buffers — that's key. That was a really good insight, because that's kind of what I suspected but didn't have evidence of. That's really good.
C
Yeah, and we've been discussing several different aspects of it too: there's the caching aspect, the fragmentation aspect, and, yeah, the different threads and the use of, like, mempools — so in some ways independent projects, but with related effects.
C
I think it would be interesting to see how this works with Seastar and Crimson, when we use the Seastar allocator instead of tcmalloc.
A
Yeah, yeah — I need to understand better what Kefu and Sam and the other folks have been doing regarding how messages come in on the wire and then how that gets distributed to different worker threads. That might play a large part in what we see.
A
I'd like to — yeah, I know there have been different proposals over time regarding how that works. Maybe we can spend another whole performance meeting talking about Crimson and what you guys have been doing with it.
A
We only have a couple of minutes left, so let's quickly move on. So, io_uring — this has been a topic that's come up really recently. I went back and got kernel 5.9 installed on some of our test nodes and tried to build our code with it. When we include the liburing support in Ceph, it doesn't build; it looks like we're trying to use some of the syscalls through a C interface that doesn't exist. Maybe it used to — I don't know.
A
But anyway, if we switch those to syscalls, it works okay. So I don't trust these results that I've gotten yet — I'm actually not even quite sure we're invoking liburing correctly — but the end result is that I don't see any difference, at least in these tests.
A
What I did see, though, when I was doing this, is that the kv sync thread in these tests is like 90 to 92 percent utilized, and — again, like we've talked about in the past — I'm seeing that we are spending CPU time and wall-clock time doing key comparisons to keep the memtables ordered. So, just more circumstantial evidence, perhaps, that we're spending time in this particular thread doing that, and that may be a limitation we're actually hitting before we hit...