From YouTube: Ceph Performance Meeting 2021-10-28
A: All right, let's start this off. Okay, I saw three new pull requests this week. One sets the minimum allocation size to the optimal I/O size for certain devices. The idea behind this is that there may be new device types coming out (in fact, there already are new device types coming out) that want larger-than-4K allocation units. Having said that, they may not require it, so there's a trade-off between performance, wasted space, and all these different things.
A: So there's some discussion happening in that PR; definitely an interesting discussion if you're interested in this whole topic. So yeah, there's the PR. Another new one, from Igor, makes the shared-blob fsck process much less RAM-greedy. Igor, that was a huge win. It looks like, I think you said, 12 gigs down to like half a gig.
B: But yeah, in a couple of deployments I could see like millions of shared blobs in the RocksDB database in BlueStore, so...
B: Yeah, well, maybe we should do some additional investigation into why so many blobs are present there, but anyway, it shouldn't grow that much.
B: Nobody cared about the efficiency of this process during the initial development, and it looks like we missed this thing.
B: Well, I haven't tried that in the field; I did some artificial testing. Actually, I changed the design of this internal data structure, so I believe it should be fine.
A: Nice, excellent job, looks great, Igor. All right, the last PR that I've got here was from Mark Kogan, and that is work on the DBStore that we were just talking about a little bit ago. It looks like it adds config options to set SQLite performance tuning parameters. So yeah, active work on that; looks really neat. I'd definitely encourage the RGW folks to talk more about it and what they're doing.
A: Okay, let's see. I didn't see any closed PRs this week. It's possible I missed something, but I didn't see anything specifically performance-related. We did have three updated PRs, though.
A: Let's see, this one to optimize PG peering latency: is Neha here? No, I don't see her. Okay, so basically she's going to take a look at it. It looks like maybe it needs to be retargeted to master, but in the very narrow case that they have, it looks like they saw a fairly good reduction in peering latency. So yeah, in this very specific case it looks like it might be a win. Next PR: the MDS one, to remove the subtree map from the journal.
A: For this one, a week or two ago design documentation was provided. I haven't looked through it yet, but it looks like there's still ongoing work being done on it, so that's really good. It did receive an update recently; I'm not quite sure what changed, I didn't look that deeply into it, but it's still being actively worked on, which is really good. Okay, last one. Oh, this is an older PR.
A: From this past summer: optimize object memory allocations using pools. That one would have been kind of stale for a little while, but Gabi went through and added some discussion to the PR, and then I also added a little bit there, Gabi.
A: I really like the idea behind this PR and, more generally, the idea of doing more to allocate memory from pre-allocated regions (or at least use memory, I should say, from pre-allocated regions), or maybe even better, to pre-allocate objects and reuse them. Any more thoughts on that? Yeah.
C: So when you look at a big object like an onode, I think you're going to end up doing like 20 or 30 calls to malloc. So even with this optimization you're still going to pay all this lock/unlock overhead, fragmentation, and so on. If we could design something like a standard object for some kind of operation, which we can do, we could say: you know what, we should allow up to 75 entities, but normally we don't need more than eight.
C: Yes, and that's going to cover probably the vast majority, because while we do allow a lot of crazy options, the vast majority don't use them. So you should have something for the normal case: try to get a single call, recycle the object, and if you need more, then you say, you know what, in that case I'm going to do some allocation.
C: Then you don't pay fragmentation. And even if, say, you pre-allocate assuming you're going to have eight sub-objects while in many cases you only have two or four, keep in mind that every time you allocate an object, if it's a linked list, you still have to pay some extra work for all the link objects and such; but if you put them in an array, you don't have to pay for this. So it needs a little work.
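To illustrate the pattern C is describing, here is a minimal sketch (hypothetical code, not from Ceph; the names and the inline capacity of eight are assumptions) of a container that keeps a fixed number of entries inline for the normal case and only falls back to the heap for the rare oversized one:

```cpp
#include <cstddef>
#include <new>
#include <utility>

// Entries for the common case live inside the object itself: one
// allocation, contiguous memory, no per-node link overhead. Only when
// the inline capacity is exceeded do we pay for a heap allocation.
template <typename T, std::size_t N = 8>
class SmallVec {
public:
    SmallVec() = default;
    ~SmallVec() {
        for (std::size_t i = 0; i < size_; ++i) data()[i].~T();
        ::operator delete(heap_);
    }
    void push_back(const T& v) {
        if (size_ == cap_) grow();          // rare path: heap allocation
        new (data() + size_) T(v);          // common path: placement-new
        ++size_;
    }
    T& operator[](std::size_t i) { return data()[i]; }
    std::size_t size() const { return size_; }

private:
    T* data() {
        return heap_ ? static_cast<T*>(heap_)
                     : reinterpret_cast<T*>(inline_);
    }
    void grow() {
        std::size_t new_cap = cap_ * 2;
        void* p = ::operator new(new_cap * sizeof(T));
        T* src = data();
        for (std::size_t i = 0; i < size_; ++i) {
            new (static_cast<T*>(p) + i) T(std::move(src[i]));
            src[i].~T();
        }
        ::operator delete(heap_);
        heap_ = p;
        cap_ = new_cap;
    }
    alignas(T) unsigned char inline_[N * sizeof(T)];
    void* heap_ = nullptr;
    std::size_t size_ = 0, cap_ = N;
};
```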
A: And in wall-clock profiling we see the evidence of this regularly: new and delete, and object creation and object destruction, are often some of our biggest consumers overall when you look at the amount of time spread across the code base where we're doing work.
A: Gabi, I mentioned it in the PR, but Radek and I were working on, well, it doesn't have to be specific to this, but we were looking very narrowly at encode/decode: allocating memory from a thread-local ring buffer rather than allocating from malloc to start with, but using malloc as a fallback when you couldn't allocate from the ring buffer; then you just fall back. And, if I remember right, we saw some advantage.
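The scheme reads roughly like the following sketch (assumed sizes and names, not the actual experimental code): a per-thread ring of memory serves short-lived allocations without any locking, and anything that does not fit goes to malloc.

```cpp
#include <cstddef>
#include <cstdlib>

// Short-lived encode/decode buffers come from a per-thread ring; anything
// that does not fit falls back to malloc. Recycling relies on the caller's
// guarantee that every ring allocation is dead before the ring wraps back
// around, which is what the per-op lifetime of encode/decode buys you.
class ThreadLocalRing {
    static constexpr std::size_t kSize = 1 << 20;  // 1 MiB per thread (assumed)
    alignas(16) unsigned char buf_[kSize];
    std::size_t head_ = 0;

public:
    void* alloc(std::size_t n) {
        n = (n + 15) & ~std::size_t(15);           // keep 16-byte alignment
        if (n > kSize)
            return std::malloc(n);                 // fallback: too big
        if (head_ + n > kSize)
            head_ = 0;                             // wrap; old data assumed dead
        void* p = buf_ + head_;
        head_ += n;
        return p;
    }
    void release(void* p) {
        // Ring memory is reclaimed implicitly by wrapping; only fallback
        // allocations need an explicit free.
        if (p < buf_ || p >= buf_ + kSize)
            std::free(p);
    }
};

// One ring per thread: no locks, no contention with other threads.
inline ThreadLocalRing& local_ring() {
    thread_local ThreadLocalRing r;
    return r;
}
```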
A: Actually, it's been a while since I did this. The data is here, but it's very difficult to parse through it all. I'll paste it in case anyone cares enough to look at it, but the gist of it was that we saw an incredible amount of allocation, I mean many, many gigabytes per second going through this thing, and I think there's a huge amount of opportunity for us to do much better.
A: Or even the third tab, that total fallback length: that was, I think, more or less the amount of data that was coming from malloc; that was like column H. And then column D, the total alloc length: I think that was basically how much we were allocating from the ring buffer, if I remember right, aggregated across ring buffers.
A: Let's see, that was probably the aggregate in total, if I remember right. But maybe it was actually per second; I hope not. In any event, I'll have to go back and look at it again. Like I said, it's been a while since I did this, but that's the gist of it.
A: Overall, from what I recall when we were working on it, we had a huge amount of memory allocation coming through for encode/decode, and we did see the behavior we wanted to see: we were getting more allocated from the thread-local ring buffer as we increased the buffer size. We could get to a point where we were doing a lot of it from the ring buffer, and then you avoid the lock contention overhead and a lot of the problems you have with tcmalloc when you have to fall back to allocating from the central cache, which we did see happening.
A: I think what happens is that we see a lot of fragmentation because of the way tcmalloc has to handle many allocations across the entire process, whereas with the thread-local buffer, for encode/decode, we know we're guaranteed that these are short-lived; they only live as long as the current op. So if we do it in our own allocator, we know it's going to go away and that the ring is going to progress, you know, fairly reasonably.
A: We work one by one, but we do one memory allocation up front, because we pre-walk the entire structure, figure out what the sizes are going to be, and then we actually go and do one memory allocation. That was an optimization we made early on in BlueStore to avoid making lots and lots of small memory allocations for every single structure.
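The two-pass idea reads roughly like this sketch (hypothetical types, not Ceph's actual denc interface): walk the structure first to compute a size bound, then fill a single allocation.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Extent { uint64_t offset, length; };

// Pass 1: pre-walk the structure and compute an upper bound on the
// encoded size, so the output buffer can be allocated exactly once.
static std::size_t bound_encode(const std::vector<Extent>& v) {
    return sizeof(uint32_t) + v.size() * sizeof(Extent);  // count + payload
}

// Pass 2: one allocation, then a straight fill; no incremental growth,
// no per-field allocations.
static std::vector<uint8_t> encode(const std::vector<Extent>& v) {
    std::vector<uint8_t> out(bound_encode(v));            // single allocation
    uint8_t* p = out.data();
    uint32_t n = static_cast<uint32_t>(v.size());
    std::memcpy(p, &n, sizeof n);
    p += sizeof n;
    if (!v.empty())
        std::memcpy(p, v.data(), v.size() * sizeof(Extent));  // flat copy
    return out;
}
```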
C: And you don't need the exact size. If it's going to be 7.5 kilobytes, we don't need to have the exact size; just give it a constant 8K and keep allocating 8K, you understand? I mean, even the stack could have enough space... actually, sorry, take it back: the stuff that we encode and decode, do we pass it as a buffer list? Because if so, it couldn't stay on the stack.
A: But the entire way that we do encode/decode is kind of not ideal. Even just going through and doing that pre-computation of the entire, you know, memory usage is not cheap. Radek said he's seen cases where that ended up actually being slower than just doing the memory allocations up front.
C: For something like this I would definitely use a slab allocator or some kind of buddy allocator, where you keep using powers of two: you use them, and then you combine them afterwards when you free them. You don't need an exact allocation size here; it's okay to have some space wasted.
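A minimal sketch of the power-of-two size-class idea (illustrative only; a real buddy allocator also merges adjacent free blocks, which this skips):

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Round each request up to the next power of two and keep a free list per
// size class, so frees can be recycled without searching and fragmentation
// is bounded by the rounding waste.
class PowerOfTwoPool {
    static constexpr int kMinShift = 6;           // 64-byte minimum class
    static constexpr int kClasses = 16;           // largest class: 2 MiB
    std::vector<void*> free_[kClasses];

    static int class_of(std::size_t n) {
        int c = 0;
        while ((std::size_t(1) << (kMinShift + c)) < n) ++c;
        return c;
    }

public:
    void* alloc(std::size_t n) {
        int c = class_of(n);
        if (c >= kClasses) return std::malloc(n); // huge request: fall back
        if (!free_[c].empty()) {                  // recycle a cached block
            void* p = free_[c].back();
            free_[c].pop_back();
            return p;
        }
        return std::malloc(std::size_t(1) << (kMinShift + c));
    }
    void free(void* p, std::size_t n) {
        int c = class_of(n);
        if (c >= kClasses) { std::free(p); return; }
        free_[c].push_back(p);                    // keep for reuse
    }
    ~PowerOfTwoPool() {
        for (auto& fl : free_)
            for (void* p : fl) std::free(p);
    }
};
```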
A: We'll just need to figure out how to change it. It doesn't help that what we do in BlueStore with denc is really the only place we're using this new thing anyway, whereas most of the other parts of the code, like the MDS, are still using the traditional scheme where we basically just allocate as much space as we need for each individual structure.
A: Yep, yep. And even for MDS journaling we see both fragmentation and that the journaling process is very slow, because it's spending so much time in memory allocation.
C: And that goes back again to the first item. If the objects are not dynamic but static, with static tables, then for this kind of object you could just hand over the pointer and it gets stored as binary. The reason we need encoding and decoding is that we have a B-tree, so you have to walk the tree; then we have a linked list, so you have to walk the list; and the memory is not contiguous.
C: So if you're going to say your request is going to look like this, everything is going to... so when you use the version-two protocol, we could just use structures: do one allocation, get the whole struct, and when we need to do serialization, we just take it as-is and there is no work that has to be done. And we could still do backward compatibility.
C: If you don't conform to this new API, then we still do the old thing, but hopefully everybody would eventually move to this fixed-size thing. Like, if you look at the mainframe protocol, CKD: in theory you could define a lot of possibilities in how you're going to make the disk layout, but in reality...
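As a sketch of the fixed-layout idea with a version field for the backward-compatibility fallback (every name and field choice here is hypothetical, not an actual Ceph wire format):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// A packed, versioned request with a fixed attribute table: the struct
// *is* the wire format, so serialization is a bounds check and a memcpy
// rather than a field-by-field encode walk with per-node allocations.
#pragma pack(push, 1)
struct RequestV2 {
    uint8_t  version = 2;     // wire version, for mix-and-match fallback
    uint8_t  num_attrs = 0;   // how many of the 8 fixed slots are in use
    uint16_t flags = 0;
    uint32_t reserved = 0;
    struct { uint32_t key; uint64_t value; } attrs[8] = {};  // fixed capacity
};
#pragma pack(pop)

inline std::size_t serialize(const RequestV2& r, uint8_t* buf, std::size_t len) {
    if (len < sizeof(RequestV2)) return 0;
    std::memcpy(buf, &r, sizeof r);      // one copy, no walking
    return sizeof r;
}

inline bool deserialize(RequestV2& r, const uint8_t* buf, std::size_t len) {
    if (len < sizeof(RequestV2)) return false;
    std::memcpy(&r, buf, sizeof r);
    return r.version == 2;               // older versions take the legacy path
}
```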
A: I think a version two is a good idea too. Currently we have all these conditional situations for older versions of encoding, like in the MDS (it's not as bad in, like, BlueStore, since it's much newer), but there are all these things to support stuff from 10 years ago, right, encoding that was different ten years ago. You know, with a version two you could do away with all of that and just start over from scratch. It'd probably be better than trying to shoehorn in all this old stuff.
C: And I mean, it requires that people sit and look at the protocol and say: okay, in the past we gave a lot of flexibility, but what do people really use? Because if there are some options that people use once a day and others that people use all the time, then maybe you should eliminate the once-a-day possibility in version two.
C: If you need more than that, then you have to go and use the old API, and you could maybe even mix and match. Maybe per request you could say: this request is using the new API, that one was using the old API. So everybody who can fit within eight attributes would use the new API, and people who need, I don't know, 27 attributes would use the earlier one.
A: And really, is it more wasted space than we suffer right now from fragmentation?
C: Or, I don't know, the red-black trees that they have: every object has a lot of overhead, because you've got the right pointer, the left pointer, yeah.
C: You pay in the memory layout, and then you pay again when you're doing a dump or marshalling and demarshalling: you need to walk the whole thing, while if you keep everything as a vector, everything will be sequential.
C: If you look at the SCSI protocol, the structures are fixed. You could have some flexibility, but it's like 64 bytes for the structure in FCP, in Fibre Channel. Maybe you don't need 64, maybe you could fit in 48, but 64 is enough, and you cannot have more than 64; you're not going to have 47 bytes.
C: I think memory access is a big killer if every object needs to be fetched from a different location, especially if the memory is fragmented. Say you allocate a linked list with eight objects, but because the memory is so fragmented, the first one comes from one page, the second from another page, and you keep bringing in a cache line every time you access one as you walk the list.
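The cost C is describing is easy to demonstrate; here is a toy comparison (illustrative, not a rigorous benchmark) of walking the same data as a contiguous vector versus a node-per-element list:

```cpp
#include <chrono>
#include <cstdio>
#include <list>
#include <numeric>
#include <vector>

// Sum the same integers twice: once from a vector (contiguous, so the
// prefetcher streams cache lines) and once from a list (one heap node per
// element, so each hop can be a cache miss when memory is fragmented).
int main() {
    constexpr int N = 2'000'000;
    std::vector<int> v(N, 1);
    std::list<int> l(v.begin(), v.end());

    auto time = [](auto&& f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        return std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();
    };

    long sv = 0, sl = 0;
    double tv = time([&] { sv = std::accumulate(v.begin(), v.end(), 0L); });
    double tl = time([&] { sl = std::accumulate(l.begin(), l.end(), 0L); });
    std::printf("vector: %f s, list: %f s (sums %ld/%ld)\n", tv, tl, sv, sl);
}
```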
C: One by one; you don't have to do the whole system at once. You could start with one object type, measure what kind of impact it has, and then go from there. It can start easy, and once you do it for one object you'll realize how to do it, and then you can start doing it one object after another. But it's not going to be done as nicely and generically as that thing from IBM; the IBM thing was, you don't understand the system, and they don't claim to understand it.
A: Yeah, that was kind of the same way with the thing that Radek and I were working on. We're not going to change the entire code base; we're just going to try to make it a little better behind the scenes, because we have some knowledge that, at least for encode/decode, the memory is going to go away soon, as opposed to other things that may or may not go away soon. So maybe, with that knowledge, we can make it a little better.
C: And if you find which objects are your worst hitters, then you deal with them, and there will still be a lot of small things, I don't know, management and telemetry and all this stuff that doesn't happen so often. Okay, so that stays on dynamic allocation, but your big objects are going to be pre-allocated and used from a single pool.
C: Usually you build the system in a way that you say: I can service, I don't know, 1000 requests in parallel while giving you some predictable response time. I don't want to go to a spike of 5000, because it's going to cause the performance to be very slow. So once you reach 1000 you're going to start rejecting things, or just queue them but not process them, because processing them would impact your performance; so just keep the minimal request around, but don't start processing it.
C: But that also means that you know how much space, how much memory, you need, because you're going to set your own limit. You say, I don't know, 1000 parallel 4K IOPS are going to happen (I don't know, you set the numbers), and anything above that you just queue, and you don't allocate resources aside from the request itself, which is relatively small.
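A minimal sketch of that admission-control pattern (names and limits are assumptions): pre-allocate the per-request resources for a fixed number of in-flight requests, and queue only a cheap descriptor beyond that, so peak memory is known up front.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

struct RequestCtx { std::vector<unsigned char> io_buf; };  // per-request state

class AdmissionPool {
    static constexpr std::size_t kMaxInFlight = 1000;  // chosen limit
    static constexpr std::size_t kBufSize = 4096;      // e.g. 4K I/Os
    std::vector<RequestCtx*> free_;
    std::deque<int> waiting_;                          // queued request ids only
    std::mutex mu_;
    std::vector<RequestCtx> slab_;

public:
    AdmissionPool() : slab_(kMaxInFlight) {
        for (auto& ctx : slab_) {
            ctx.io_buf.resize(kBufSize);               // allocate once, up front
            free_.push_back(&ctx);
        }
    }
    // Returns a context if the request is admitted, nullptr if it must wait.
    RequestCtx* admit(int request_id) {
        std::lock_guard<std::mutex> l(mu_);
        if (free_.empty()) {
            waiting_.push_back(request_id);            // cheap: the id only
            return nullptr;
        }
        RequestCtx* ctx = free_.back();
        free_.pop_back();
        return ctx;
    }
    void release(RequestCtx* ctx) {
        std::lock_guard<std::mutex> l(mu_);
        free_.push_back(ctx);                          // recycle, never free()
        // a real system would now admit waiting_.front(), if any
    }
};
```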
A: It's nice, too, because if we do this right, it means that all of this memory auto-tuning stuff that I've written becomes obsolete, and I want it to go away, right? It's trying to dynamically account for changes that are happening to keep us below a certain memory limit, but if we're just smart about the allocations up front, then we don't need to do any of that. We just have our pool of memory that is static anyway.
C: That's like the basic design principle, actually. Since the early 80s people have been using this kind of principle: you know how your system behaves, and then, like in the old Unix, you have your limits, and when you change a limit it just assigns how many objects you can process at once. If you limit that, you cannot have more than that many processes happening, and everything is pre-allocated and everything works. If you need more than that, then you need to change the settings.
A: All right, I think we've almost beaten this one to death. Yeah.
A: Encode/decode is definitely a high-impact area. There are others too, but that's one of the ones where we have two ways of doing it, the old way and the new way, and both are kind of bad in different ways. So that'd be, you know, somewhere to start: to see what we are encoding and what we are decoding.
A: I get the feeling that for a number of these lists, or even unordered maps, that we have, we have the ability to hold an arbitrary number of entries, but we typically only use maybe one or two of them.
A: All right, well, we're almost done here, so I'll just add the two discussion topics. I'm doing a Q3 update for Crimson, just trying to get more data for that; I'll talk about it later on, but right now I'm just trying to collect it.
A: There are a bunch of PRs that just went in for CBT that add some support for Crimson, and other things kicking around that I needed to get in, so maybe some CBT improvements are coming. And that was all I really had. So, in the last eight minutes here, is there anything else we want to talk about before we wrap this meeting up?
A: You've looked at the BlueStore data structures more than I have. Do you remember, as you look through them, what the bigger dynamic elements were? Blobs and shared blobs and these things? I remember there's a fair amount, but it's been a little while since I looked at them.
E: Yes, we based a lot of our algorithms on entries being dynamically added and tweaked, then split dynamically, and changing that will really impact not just data structures but also even some decision processes, like when to recompress compressed chunks; that will also be affected if you change data structures.
B: Well, it definitely doesn't wait for additional confirmation, but it could postpone sending additional data: so instead of sending, like, eight megabytes, it might send less, in smaller portions. But it actually depends on the client; different clients might behave differently.
A: If we don't have the space, or we've already filled the number of items we can handle, or the amount of space we can handle for ingest, I think we just reject it. Reject it: that's what I remember happening.
B: I'm not sure about the details, but to be honest, the current bottleneck is not in the allocation itself. It's...
B: ...interlocking between different tasks and things like that. So these dynamic allocations bring some additional overhead, but it's not the primary one. That's my feeling and my experience from what I've observed before.
B: I agree, but I think the difference between this and other devices is that we've got a pretty powerful CPU here, and we are not getting that many IOPS at the moment, due to different reasons, and that's why we are not affected that much by these inefficient allocations and things like that. So Mark, please correct me if I'm wrong, but I think we can hardly reach more than 100,000 writes.
A: So that number, the 100,000 writes per second: you're right, Igor. And now, especially with Gabi's work on the allocation improvements, we've gotten to the point where the tp_osd_tp threads are actually very busy when we do that. The kv sync thread is still kind of a bottleneck, but those worker threads are actually very busy.
A: As for what they're busy doing: there's a lot of contention involved, there's memory allocation, there's object creation and destruction. It's spread over lots of stuff, but, I think, CRC is another thing that shows up. We've got, you know, various random things, but we are very busy. CPU is probably near the bottleneck: with 16 threads, a hundred thousand write ops will use almost all of it.
B: Well, at this point I would bring up a bit of a different idea. A while ago I was doing some experiments with the messenger, and it looks like we have pretty significant overhead in the messenger itself.
F: I think you're absolutely correct. We should find one component which is easy to change and then see how...
F: And, well, I...
B: Again, my feeling is that the messenger might be one bottleneck, the PG logic and all the interlocking there is another bottleneck, and RocksDB is the third one; but in the end, underneath all of them, is probably all this inefficiency in allocations.
A: Side note: I pasted in the chat the osd_client_message_size_cap and osd_client_message_cap options. I think those are what govern when we stop pulling new data off the network.
A: We should move forward on this, though, one way or another, even just trying to make small progress. I think it's a worthwhile pursuit.
C: Yeah, so I think at first we need to find the best candidate. Maybe that's the messenger that Igor was pointing at, maybe something else. So we should try to look for a good candidate which is simple: define a big fixed-size object, then set the number of them, then start using them, and see what kind of impact there is, if any.
A: Sure. I'll try to get some more wall-clock profiles available for folks, too, and then maybe that will help us determine areas that might be worth looking at in more detail.