From YouTube: Ceph Performance Meeting 2020-07-30
Description: No description was provided for this meeting.
A
All right, let's see. So it's been a couple of weeks since the last meeting, but it's also the summer, which means things are a little slow.
A
I think, if I remember right, it's avoiding a second iteration through the list or something, so that's fantastic. And then there's another one about D3N cache changes from upstream. I'm not quite sure what that is, but it looks like it's being reviewed and tested, so that's good.
A
Let's see, we had a couple of PRs closed in the last two weeks. Two of them are from majianpeng related to BlueFS. Those are both, I think, minor improvements in different ways. I've got a pull request for the MDS that merged.
A
I had opened that, I think, last week, but it's basically just a fix for cache trimming when you have lots of subtrees. Previously we were iterating over the entire list of subtrees just to calculate the number of subtrees, but in reality we already stored that in a variable, so we didn't need to redo that work, and the fix just avoids redoing it.
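A minimal sketch of that kind of fix, with hypothetical names standing in for the real MDS structures; the idea is just to reuse a count the container already maintains:

```cpp
#include <cstddef>
#include <map>

// Hypothetical stand-in for the MDS cache, not the actual Ceph code.
struct MDCacheSketch {
    std::map<int, int> subtrees;  // stand-in for the real subtree map

    // Before: an O(n) walk over every subtree just to produce a count,
    // repeated on every cache-trim pass.
    std::size_t count_subtrees_slow() const {
        std::size_t n = 0;
        for (auto it = subtrees.begin(); it != subtrees.end(); ++it)
            ++n;
        return n;
    }

    // After: the container already tracks its size; reuse it in O(1).
    std::size_t count_subtrees_fast() const {
        return subtrees.size();
    }
};
```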
A
So it's actually a fairly substantial improvement when it happens, but it may not happen that often. Anyway, that got merged, and then there was another PR that merged regarding cephadm that creates OSDs in parallel, rather than creating them one at a time. So, as you might suspect, that's a fairly substantial improvement for cephadm.
A
There was only one other PR I saw that got updated, and that was another one from majianpeng, about enabling the RocksDB pipelined write. He had responded to a question I had regarding performance.
A
All right. Well, maybe before I get into the work that Radek and I have been doing on bufferlist ring buffers: are there any topics that folks would like to bring up this week?
A
All right, well then. Radek, would you like to explain your ring buffer and what you've been doing?
B
Well, we're actually working on that together; I can go over the...
C
Those allocations are actually an offspring of the fact that the MDS doesn't use the denc framework; that was initially developed for BlueStore. The MDS uses the old encoding stuff, which means that there is no preprocessing, there is no reservation pass for the buffer list that would allow allocating memory in one single go. As a result of that, the bufferlist from time to time really needs to go to tcmalloc, and tcmalloc turned out to be extremely costly.
C
That would allow us to allocate a big chunk of contiguous memory and actually fragment it over the buffer::raw instances. The buffer::raw instances are the ones responsible for storing the physical data in a bufferlist.
C
The current... go ahead. Okay. We are managing the allocation at the moment using an extremely dumb, extremely simple idea: a circular buffer, a ring. It's a ring because we need to deal with one pretty nasty thing about bufferlist.
C
It allows deallocations to be performed from a separate thread, a different one than the one that made the allocation, and because of that we get some extra complexity. But we're still trying to fight it, and what we got is actually a pretty interesting drop in CPU usage: the md_submit thread of the MDS, the dedicated thread responsible for doing the journaling, actually drops significantly.
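As a rough illustration of the scheme Radek is describing, one big contiguous chunk bump-allocated from the owning thread, with releases allowed from other threads; every name here is hypothetical and this is not the actual patch:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: carve one big contiguous chunk up front and hand
// out slices by bumping a pointer, so small appends stop paying for
// individual tcmalloc calls. Frees may arrive from a different thread
// than the allocations, so the release side is atomic.
class RingArena {
    std::vector<std::uint8_t> buf_;
    std::size_t head_ = 0;                  // bytes handed out since last wrap
    std::atomic<std::size_t> released_{0};  // bytes freed back, any thread

public:
    explicit RingArena(std::size_t capacity) : buf_(capacity) {}

    // Called from the owning (allocating) thread only.
    void* allocate(std::size_t len) {
        if (head_ + len > buf_.size()) {
            // Can only wrap once everything handed out has been released.
            // A real implementation would fall back to the regular
            // allocator here, and would have to handle releases racing
            // with this reset.
            if (released_.load(std::memory_order_acquire) != head_)
                return nullptr;
            head_ = 0;
            released_.store(0, std::memory_order_release);
        }
        void* p = buf_.data() + head_;
        head_ += len;
        return p;
    }

    // May be called from any thread, e.g. a consumer freeing a buffer
    // that some other thread allocated; this is the nasty cross-thread
    // case that adds the extra complexity mentioned above.
    void release(std::size_t len) {
        released_.fetch_add(len, std::memory_order_acq_rel);
    }
};
```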
C
But we are going further with testing. Mark made a lot of runs showing it, and they showed that a huge, really huge amount of memory flows through the rings.
C
It goes and allocates in almost 4K units, and that might be really wasteful in scenarios where somebody appends just a few bytes. With the bufferlist... we're trying to estimate the waste there.
A
I put in the chat window a link to a spreadsheet that we're looking at. It shows what Radek had mentioned about the data flowing through the ring; this is kind of what he was talking about.
A
Basically, when you look at the mdtest easy results, it's like a five-minute test, and by the time you get down to the hard write results, all the tests together have been running in aggregate for maybe about 25 minutes or so. And we're seeing sometimes up to hundreds of gigabytes of data flowing through the rings, and this is just for the metadata journaling in the MDS.
A
It's not even, you know, data flowing through the OSDs or anything; it's just that portion for the MDS. So it's a little crazy. One thing, Radek, that I'm thinking here is that there's this tension, right, between what we're doing, where we're making these bigger allocations up front to fill in for the bufferlist, allocations that don't necessarily get used, versus...
A
But I wonder if this benchmark isn't really showcasing the real behavior very well, because maybe when you have all that memory wastage and you have to go back to the tcmalloc central cache, you can't use the thread cache anymore. Maybe that's work that we're not really capturing in the benchmark.
A
Yeah. So even though in the benchmark it looks much better to be doing the upfront, larger allocation that we can then use for future things, maybe in reality it's not as good as we think.
C
Yeah, that doesn't make a huge amount of sense. I believe it might be really worth spending some time trying to implement dynamic growing of the allocation unit, depending on the history: for instance, on the data stored inside the buffers.
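To make that concrete, a purely hypothetical sketch of what growing the allocation unit based on history could look like; nothing like this exists in the tree, and all names and constants are made up:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical: size the next backing allocation from a running average
// of recent append sizes instead of a fixed ~4K constant.
class AdaptiveAllocUnit {
    double avg_append_ = 256.0;              // arbitrary starting guess
    static constexpr std::size_t kMin = 64;
    static constexpr std::size_t kMax = 4096;

public:
    void record_append(std::size_t len) {
        // An exponential moving average is a cheap form of "history".
        avg_append_ = 0.9 * avg_append_ + 0.1 * static_cast<double>(len);
    }

    std::size_t next_alloc_len() const {
        // Leave room for a handful of "typical" appends, clamped so a
        // stream of tiny appends no longer burns a full 4K unit each time.
        auto want = static_cast<std::size_t>(avg_append_ * 8);
        return std::clamp(want, kMin, kMax);
    }
};
```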
A
I was really surprised at how, when we're making allocations the traditional way, by using the...
A
However we calculate the alloc length here, the ceph buffer alloc unit size... okay, so then it makes sense: we're doing, like, roughly 4K allocations. But when we don't do that, it drops all the way down to 190... sorry, yeah, 182 bytes.
C
Yeah, those are good questions. What I can recall from the history of bufferlist is that many, many years ago we introduced a concept called the append buffer.
C
It was an optimization to amortize the cost of allocations made during appending to bufferlists, and I think the big alloc size was actually defined there.
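A minimal sketch of the append-buffer pattern as just described, with hypothetical names and a placeholder constant standing in for the real alloc size:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch, not the Ceph implementation: keep one partially
// filled chunk around so a run of small appends pays for one allocation
// instead of one allocation per append.
class AppendBufferSketch {
    std::vector<std::uint8_t> chunk_;
    std::size_t used_ = 0;
    static constexpr std::size_t kAllocSize = 4096;  // the "big" constant

public:
    void append(const void* data, std::size_t len) {
        if (used_ + len > chunk_.size()) {
            // Amortize: one upfront allocation serves many small appends.
            // (A real bufferlist keeps the old chunk referenced by the
            // list; here we simply start a fresh one.)
            chunk_.assign(std::max(kAllocSize, len), 0);
            used_ = 0;
        }
        std::memcpy(chunk_.data() + used_, data, len);
        used_ += len;
    }
};
```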
B
But I don't know why; I have no idea why it's so big.
C
Maybe we should also re-evaluate the allocation policy for small appends.
A
And if it was just memory, right? If all you were doing was wasting memory, okay, fine: that's a trade-off between CPU and memory, fine. But because of the way that our allocation patterns work, and because we're multi-threaded, it's not just wasting memory. Now it's also wasting CPU: you gain CPU, but you waste CPU too.
C
Yeah, to allocate them and free them from different threads, that's going to be costly. And I bet it might actually be a typical pattern: let's say one thread makes some processing, makes the encoding, but finally those buffers are freed in, let's say, the messenger workers handling the output.
C
But on the other hand, I cannot recall anything like that from when I was tracing the OSD last time. Okay, it was a long time ago, but I haven't noticed things like that; I haven't seen anything like the MDS case.
E
That was something Matt was interested in... that was one of the things that he was interested in looking at: just how much cost we were paying for all the atomic increments and decrements.
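For context, the cost in question comes from atomic reference counting on the underlying buffers; a generic illustration of the pattern (not Ceph code) where a single atomic read-modify-write can dominate a profile:

```cpp
#include <atomic>

// Generic illustration: every copy/destroy of a refcounted buffer does
// an atomic read-modify-write, and on a contended cache line that one
// instruction can account for most of the cycles perf attributes here.
struct RefCounted {
    std::atomic<int> nref{1};

    void get() { nref.fetch_add(1, std::memory_order_relaxed); }

    bool put() {
        // The atomic decrement-and-test being discussed; returns true
        // when the last reference is dropped and the buffer can be freed.
        return nref.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};
```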
A
Yeah, I was just thinking the same thing, right. You were saying that when you were just annotating the raw TLS destructor, you were seeing 50% of the time spent in that single atomic exchange.
C
In allocate, okay: 58% of the cycles burned in the raw TLS allocate was on that. Pasting it... oh.
C
Oh, the self time is smaller than in the case of the raw TLS create; in this testing it was looking like five percent of cycles.
A
Okay, I just posted the perf call graph that I gathered from looking at basically the cache ring implementation, Radek's cache ring implementation. And this is the case when we are...
A
...just requesting what we need, rather than rounding up to whatever that constant is, or whatever.
C
Okay, I said that... okay, this was the five percent, and it was from the append bench test, not from the other bench, the one that is supposed to exaggerate the refilling path.
A
Well, I suppose we've probably exhausted all the things we've got so far, right, Radek? I think so, yeah. All right. Well, that's it for now, guys. We're still working on this; we don't have a lot to show yet, I think, but hopefully we'll make progress and be able to get at the long-term goal, right. The real goal here is to be able to make bufferlist, and specifically encoding and decoding, faster without actually having to change all of the existing encoding/decoding over to denc.
A
So I don't know, we'll see. Maybe we'll have progress, maybe not, but that's the goal.
A
Yeah, all right. Well, anything else, guys? Or is that it?