From YouTube: Ceph Performance Meeting 2021-06-10
A: Oh, we're waiting for Mark. Maybe we can go ahead and start talking about some of the tcmalloc cache size issues. Neha, do you want to talk a little bit about those?
B: Yeah, sure. So the summary here is that we've been trying to compare performance across Nautilus and Pacific, especially for RGW workloads, and we've been using two different kinds of workloads, one of which focuses mostly on smaller objects.
B: It's a distribution, so it's worth mentioning that the workload is COSBench, and the object-size distribution for the smaller-size workload is essentially 1 KB objects up to a max of 256 KB, with percentages controlling how many objects of each size we're writing. The other workload is a mix of small and big objects (or ops) in COSBench, so that you just have a different workload profile from the smaller objects.
B: Now, essentially, what we noticed is that with the smaller-size workload we are doing much better in Pacific in most scenarios. When I say most scenarios, I mean write operations: we run tests that just do a fill workload, which is essentially a fill operation in COSBench. We also do a mix of read and write called hybrid, which also does better in Pacific, and there's another workload that does writes and deletes, like eighty percent write and twenty percent delete.
B: Even there, we've noticed that Pacific is doing much better, and this is all with the small-size workload. But it turns out that in the other workload, where we have a mix of larger objects as well (when I say larger objects, the max is one gig or so), that's not the case: in most scenarios, Pacific is not doing as well as Nautilus.
B: Having said all of this, we started digging into what has changed, what kind of settings Pacific is using versus Nautilus, and we found a couple of very interesting things. One is about osd_memory_target. In Pacific it is a known thing that osd_memory_target is not overridden by any of the deployment methods at the moment; specifically, cephadm currently doesn't have the ability to override osd_memory_target. In general, we have this mechanism in ceph-ansible.
B: What it does is take the whole host memory, multiply it by a safety factor, and then divide it by the number of OSDs on that host to arrive at an osd_memory_target for each OSD.
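As a rough illustration of the calculation described above (the 0.7 safety factor and the host sizing here are assumptions for the sake of the example, not necessarily the exact ceph-ansible defaults):

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Example host: these values are illustrative, not from the meeting.
  const uint64_t host_memory = 128ULL << 30;  // 128 GiB of RAM on the host
  const double safety_factor = 0.7;           // assumed safety factor
  const unsigned num_osds = 12;               // OSDs deployed on this host

  // ceph-ansible style: host memory * safety factor / number of OSDs.
  const uint64_t osd_memory_target =
      static_cast<uint64_t>(host_memory * safety_factor) / num_osds;

  std::cout << "osd_memory_target = " << osd_memory_target << " bytes (~"
            << osd_memory_target / double(1ULL << 30) << " GiB per OSD)\n";
  // Prints roughly 7.5 GiB, in line with the "seven plus gigs" estimate below.
  return 0;
}
```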
B
But
this
kind
of
calculation
doesn't
happen
for
steph
adm,
so
it
is
expected
that
we
use
a
static
value
of
four
gig
in
in
the
pacific
deployments,
but
in
nautilus
the
expectation
was
that
this
calculation
should
have
happened
and,
according
to
according
to
the
amount
of
memory
on
the
host
we
actually
calculated,
it
should
have
been
somewhere
around
seven
plus
gigs,
but
turns
out.
These
experiments
were
running
with
2.5
or
2.6
gigs,
which
is
not
even
close
to
four
gigs.
B
So
this
seems
like
something
wrong
going
on
in
the
deployment
mechanism,
specifically
in
stephanie
with
the
nautilus
experiments.
So
we
decided
not
to
rely
on
the
results
and
we
want
to
repeat
those
but
the
other
aspect
that
also
came
out.
While
we
were
doing
this
investigation
was
maybe
I
can
just
add
a
couple
of
prs
that
have
come
out
as
a
result
of
that
in
the
chat
for
everybody's
information.
B
So
this
is
one
from
page
and
there
is
one
from
adam.
So
essentially
there
is
an
environment
variable,
the
tc
malloc
environment
variable,
which
is
called
pc
mallet
max
total
thread
cache
bytes.
B: We had assumed that ceph-ansible, or whatever deployed the cluster, was setting it. But it turns out that for 5.0 this was not set correctly (and we have no evidence that it was even for 4.2, and this is all in containers), so these experiments, or rather their results, are again not valid.
B: The idea is: the one from Sage is trying to set this value at a global level, cluster-wide for all daemons, just by using cephadm, and the one from Adam is trying to set it for the OSDs by means of the priority cache manager.
B: Essentially, the idea is that if the environment variable is set, and the Ceph config option that he's introducing is set as well, the Ceph config option will take precedence over the environment variable. Adam, do you want to talk more about it?
C: Yes, sure. I mean, there is not really much to talk about; it's a very simple thing. I just dug through the tcmalloc interface and found an API that sets the internal variable, the same one tcmalloc uses to control the size of the thread cache. On init, the value is taken from the environment variable TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES (let me copy it into the chat just for reference), and the same value can be changed programmatically, and it works.
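For reference, a minimal sketch of the programmatic path being described, using the gperftools MallocExtension API (the 128 MiB value is the default proposed later in the discussion; link with -ltcmalloc):

```cpp
#include <gperftools/malloc_extension.h>
#include <cstdio>

int main() {
  // On init, tcmalloc seeds this property from the
  // TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES environment variable if it is set.
  size_t cache_bytes = 0;
  MallocExtension::instance()->GetNumericProperty(
      "tcmalloc.max_total_thread_cache_bytes", &cache_bytes);
  std::printf("max_total_thread_cache_bytes at init: %zu\n", cache_bytes);

  // The same value can also be changed programmatically at runtime,
  // which is what the OSD-side change takes advantage of.
  MallocExtension::instance()->SetNumericProperty(
      "tcmalloc.max_total_thread_cache_bytes", 128u << 20);  // 128 MiB
  return 0;
}
```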
C: That's just it. The name of the variable is the way it is because I modified it for the OSD, so I made it this way. It could possibly be exported under some more general naming, and it could even be applied as part of global init; then the same would apply for any client. But yeah, I wanted to do something that's correct, and extending it, I guess, will be part of the discussion.
B: Yep, agreed. I think we need to at least make sure that this variable is set appropriately, and then we can figure out the best way to set it across the board.
B
I
think
that's
kind
of
the
summary
so
essentially
with
these
two
things,
these
two
findings,
those
results
or
trends
that
I
talked
about
earlier
kind
of
at
least
the
nautilus
ones
where
we
are
not
even
using
four
gig
ost
memory
target
stand
void,
so
we
are
going
to
be
repeating
those
tests
and
once
we
have
newer
numbers,
we
will
try
to
present
it
in
a
follow
up
meeting
here.
A: I just wanted to mention, I posted in the chat as well, for the osd_memory_target piece, what has been done in master for cephadm.
A: Yeah, most likely, and I want to take a closer look as well. It's not clear to me whether it's on by default, which I think it probably should be.
A: With respect to the tcmalloc thread cache, Adam, I think your fix is fantastic; I'm glad you found this API. I'm not sure we want to apply it to absolutely everything, including clients, but in global init we do have the concept of an entity type where we differentiate between daemons and clients, so we could apply it just to daemons. Things like RGW initialize themselves as a daemon, I believe.
A: Yeah. I'm actually not aware of any testing on daemons other than OSDs and RGW. Mark did some testing on RGW in the past week. I don't think he tested 128 megabytes, which is the value we're proposing to use by default here, but for RGW he was testing between 32 megs, which is the tcmalloc default, and 512 megs, which is much higher than our proposed value, and he did see a pretty dramatic performance improvement with that.
C: It takes some time to drop the caches from the maximum down to the current value, but it works. If it could be an admin command, then we could even test dynamically what the optimal value is, or at least test at all, because currently it's difficult.
A: Yeah, that'd be quite nice, if we could enable that to be dynamic: if it could be a config value that's observed, and just changing that config value would change it within tcmalloc as well. That would be ideal, I think.
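A minimal, self-contained sketch of that idea (this is not Ceph's actual config-observer API; the class and option handling here are hypothetical, and only the tcmalloc call is real):

```cpp
#include <gperftools/malloc_extension.h>
#include <cstddef>
#include <cstdio>

// Hypothetical observer: whenever the watched config value changes,
// push the new limit into tcmalloc at runtime, with no daemon restart.
class ThreadCacheObserver {
  size_t current_bytes_ = 0;
public:
  void handle_conf_change(size_t new_bytes) {
    if (new_bytes == current_bytes_) return;  // nothing to do
    MallocExtension::instance()->SetNumericProperty(
        "tcmalloc.max_total_thread_cache_bytes", new_bytes);
    current_bytes_ = new_bytes;
    std::printf("thread cache limit is now %zu bytes\n", new_bytes);
  }
};

int main() {
  ThreadCacheObserver observer;
  observer.handle_conf_change(32u << 20);   // tcmalloc's 32 MiB default
  observer.handle_conf_change(128u << 20);  // bump to the proposed 128 MiB
  return 0;
}
```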
A: Yeah, I think that's all I had for these findings. Are there other aspects that folks think are worth discussing?
B: I think Mark is not here, but maybe next week he can talk about some of the RBD tests that he's been doing. But yeah, that can be it until next time.
A: Yeah, I guess the other topic Mark wanted to bring up was his testing on RGW and mtp usage, but since he's not around, I think we'd love to postpone that one as well.
A: Are there any other topics folks would like to discuss?