From YouTube: Ceph Performance Meeting 2023-02-09
Description
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contrib...
What is Ceph: https://ceph.io/en/discover/
A
I have a feeling that we're probably not going to get them for at least another five or six minutes, so let's just get started on stuff.
A
All right. I have two new PRs this week that I saw. Igor, tell me about this allocator format version 2 PR from you. It looks huge.
B
It potentially might grow with fragmentation, especially for high-volume, heavily fragmented drives, and it can also overcome the four-digit-gigabyte maximum file size for BlueFS, which actually requires around 256 million allocation units.
B
With one bit per allocation unit, it uses more space on almost empty disks, but if fragmentation grows the space usage remains fixed: the same tradeoff as we had with StupidAllocator versus the bitmap one. Not to mention a bunch of cleanup and some performance improvements around this stuff.
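The space tradeoff being described can be sketched with some back-of-the-envelope Python. The sizes and encodings here are illustrative assumptions, not the actual BlueStore formats: an extent-list encoding of free space grows with fragmentation, while a one-bit-per-allocation-unit bitmap stays fixed.

```python
# Hypothetical sizing model (not Ceph code): compare the serialized size
# of a free-space map stored as an extent list (offset/length pairs that
# multiply with fragmentation) versus one bit per allocation unit.

def extent_list_bytes(num_free_extents, bytes_per_extent=16):
    # e.g. an 8-byte offset plus an 8-byte length per free extent
    return num_free_extents * bytes_per_extent

def bitmap_bytes(capacity_bytes, alloc_unit=4096):
    # one bit per allocation unit, independent of fragmentation
    return capacity_bytes // alloc_unit // 8

disk = 8 * 1024**4  # an assumed 8 TiB device

# Nearly empty disk: a handful of free extents, so the extent list wins.
assert extent_list_bytes(10) < bitmap_bytes(disk)

# Heavily fragmented disk (200 million free chunks, as in the unit test
# mentioned above): the extent list balloons, the bitmap stays fixed.
assert extent_list_bytes(200_000_000) > bitmap_bytes(disk)

print(extent_list_bytes(200_000_000), bitmap_bytes(disk))
```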
A
So it might be timely. Adam and I were just talking earlier this week about trying to revisit how the hybrid allocator works, maybe doing it a little differently than the current method, maybe something that combines trees and bitmaps in a little bit different way than we currently do it.
B
Yeah, so, well, generally this thing works like it dumps...
B
Well, it actually depends on the size of the allocation map, actually on the amount of free space chunks. So if it's large enough, then it provides some benefit.
B
There is a unit test attached which one can play with, and you can specify the amount of free extents in the allocator to save and restore. With around 200 million of these chunks, the difference is that the original implementation runs in 16 seconds, well, at least on my machine, versus the new one running in nine or so.
A
So it's a big, big pull request. We'll have to... I don't know if Adam maybe is the right person to review it. I mean, I can try, but this is a lot.
B
Well, yeah, you could put some optimization...
A
I figure Adam's probably the one that it will fall to to review. But, you know, if I can round up time, I'll try to take a look. I know I have another one that you've made where I said if Adam couldn't look at it, I'd try to, and that was like a week or two ago.
A
Yeah, absolutely, we should see if we can rope him into doing a review.
A
Cool, all right, let's see, next PR here. You contacted me earlier about this one. Do you want to talk about your pull request?
A
I'm not sure if it was maybe just on my mind or not. Was it...?
F
Looking at the PR, I'm just curious: does anybody know why the image context needs a copy of the config proxy? It's storing a CephContext, so it could just get the config through the pointer.
G
Okay, I think that maybe I can help a little bit, okay, but at any moment you can interrupt me, okay. So, what this whole request is about: basically, we need to provide some kind of configuration of custom labels for Prometheus metrics. Okay, when the user introduces custom labels on RBD images, we have the problem that recovering these images' labels, sorry, takes a long time.
G
Okay, basically, what we have done is to profile the different calls in the retrieval of the images, and what we have found is that the current image context, the ImageCtx, is taking a lot of time to be constructed for each image. Okay, so basically what we have done is to replace this big context, for some kinds of internal users, with a proxy that is getting a reference to the big one. Okay, and that is much cheaper to manage.
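The refactor being described, replacing a full per-image copy with a proxy holding a reference, can be sketched in plain Python. Class and option names here are hypothetical, not the actual librbd types:

```python
import copy

class SharedConfig:
    """Stands in for the process-wide configuration (many options)."""
    def __init__(self):
        self.options = {f"opt_{i}": i for i in range(1000)}

class HeavyImageConfig:
    """The expensive approach: every image deep-copies the whole config."""
    def __init__(self, shared):
        self.options = copy.deepcopy(shared.options)  # O(#options) per image

class ImageConfigProxy:
    """The cheap approach: reference the shared config, store only overrides."""
    def __init__(self, shared):
        self._shared = shared      # just a reference, O(1) to create
        self._overrides = {}       # per-image metadata overrides

    def set_override(self, key, value):
        self._overrides[key] = value

    def get(self, key):
        # A per-image override wins; otherwise fall through to shared config.
        return self._overrides.get(key, self._shared.options[key])

shared = SharedConfig()

# The heavy path duplicates everything, even for read-mostly users.
heavy = HeavyImageConfig(shared)
assert heavy.options == shared.options and heavy.options is not shared.options

# The proxy path keeps one shared copy and still honors per-image overrides.
proxy = ImageConfigProxy(shared)
proxy.set_override("opt_1", 99)
assert proxy.get("opt_1") == 99   # overridden per image
assert proxy.get("opt_2") == 2    # falls through to the shared value
```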
A
All right, cool. I don't think we have any of the RBD folks here today, well, that I can see, so we'll want to get their input too, but sounds good.
H
I'm just catching up here, but I think you're talking about having a per-image-context copy of the configuration. I think one reason that's done is to be able to override configuration with the image-specific metadata.
H
So there are ways to specify per-image configurations that RBD uses, and I think that's why it's getting that copy there.
H
Oh, it wasn't the question? Okay. So, Stephen, I think the reason why it was making copies was that we have per-image configuration: you can store extra metadata on an image that overrides some of the configuration options.
H
It's been some time since I looked at this, so that was before the context was turned into a whole ConfigProxy object as well. It may have been simpler before that point.
H
I'm
I'm
just
saying
that
the
I
I
think
the
behavior
here
might
have
changed
a
bit
because
when
it
was,
this
was
originally
implemented
with
RBD
config
proxy
didn't
exist,
but
then
it
was
changed
later
so
that
maybe
but
the
reason
we
didn't
notice
this
before,
like
the
the
performance
piece
that
you're
looking
at.
A
All right, well then, let's see, moving on. I did not see any closed pull requests this week. If I missed anything, let me know.
A
Otherwise
updated
this
week,
Igor
looks
like
Adam
reviewed
your
avoiding
using
whole
space
iterators
for
prefixed
access,
PR.
A
Ide
and
looks
like
he
actually
approved
it
and
just
had
one
comment:
I
think.
A
So
yeah,
hopefully
we'll
get
that
in
soon
this
rock
Stevie
one
we
should.
We
should
figure
this
out.
It
should
not
be
a
hard
problem,
but
for
whatever
reason,
there's
a
hang
up
here.
A
I
asked
radic.
If
you
could
take
a
look
at
it
earlier
this
week,
so
I'll
go
I'll
come
again
just
to
see
since
he's
updated
or
actually
be
in
the
past,
he
should
be
able
to
help
out.
If
not-
and
we
can
we
can
do
it
ourselves.
Up
like
this
is,
should
be
super
hard
to
get
done,
but
we
we
do
really
want
to
get
this
in
for
Reef,
see
next
that
rocksdb
iterator
bounds
for
blue
star
collection
list.
A
Adam found some bugs, so he requested some fixes for that, but otherwise I think people are liking it, so it looks good. Finally, there's an older PR here. This is a really simple PR: it's just disabling busy polling in QAT. This is from someone at Intel.
A
Kefu
had
marked
himself
to
review
it
like
a
month
or
two
ago,
I
think-
or
maybe
it's
even
longer,
I
wonder
if
we
should
just
merge
this.
It's
it's
a
really
simple
change:
qat's
Intel,
stuff
they're
recommending
that
we
disable
busy
polling.
Does
anyone
have
a
strong
opinion.
A
All
right,
if
I'm,
not
hearing
a
certain
opinion
on
this
I
think
I
might
just
I
might
just
merge
this.
It's
they.
They
give
performance
results
that
made
it
look
like
it's
an
improvement,
a
kind
of
trust,
their
judgment
on
it,
since
the
this
is
their
technology.
So,
oh,
oh
I'll,
probably
just
merge
this
one
other
than
that
I
didn't
just.
F
A
quick
question
about
qat
in
general:
do
you
know
if
there's
been
any
discussions
about
being
able
to
test
that
stuff
in
in
our
own
infrastructure,
because
it's
it's
hard
to
review
and
support
that
stuff
without
being
able
to
run
it.
F
No, but they have been doing some other QAT stuff that relates to RGW's compression, okay.
A
Yeah
I
completely
understand
I
mean.
Can
we.
C
Now, I think the functionality you got from the add-in cards got moved into the CPUs on the die, you know, on the socket, but I don't know what generation. I know the CPUs we have are one or two generations old, so I'm not sure if they have it, but they might well have the functionality built in without needing the add-in card.
A
I
suppose
our
our
homework
is
to
figure
that
out
and
then
theoretically
Casey,
if
they
support
it,
I
suppose
you
guys
have
one
of
those
right.
A
Yeah
that
might
that
might
be
a
path
forward
if
those,
if
they'll
support
it
now,
let
me
know
I
can
try
to
help
help
figure
it
out.
F
I'm
gonna
link
the
qat
AR
PR
that
they've
been
working
on
that
touches.
Rgw
put
it
in.
A
All right. For "no movement", I don't think there's anything interesting in this right now.
A
All right, I think that's it for PRs. Anything I missed, anybody?
A
All
right
for
discussion,
topics,
I,
don't
think
Corey
or
David
Orman
are
here,
but
I'll
just
give
a
quick
update
since
they
they
said.
I
could
share
some
of
the
things
that
they're
seeing
their
their
cluster
is
doing
really
really
well.
After
a
couple
of
things.
A
They
they
applied,
the
pr
that
I
had
for
basically
compacting
an
iteration
when
tombstones
were
encountered,
and
that
was
was
huge,
I
guess
for
them
they're,
seeing
a
dramatic
reduction
in
in
how
much
time
is
spent
in
iteration
and
that
allowed
them
to
remove
their
TTL
attempt
to
live
optimization
that
they
put
in
place
to
deal
with
this
in
the
past,
which
made
other
things
much
better
and
on
top
of
that,
they
they
applied
lz4
compression
to
roxdb
and
that's
been
a
huge
win
for
them
for
space
amplification.
A
So
all
these
things,
combined
together,
they're
seeing
dramatically
higher
performance
and
lower
disk
utilization
cluster.
So
this
they
talked
about
it
last
week,
a
little
bit
but
they're
continuing
to
see
a
lot
of
really
good
behavior
over
this
past
week
with
it
so
yeah
really
really
good
I'm,
hoping
that
that
reef
is
going
to
be
a
really
really
good
release
for
a
lot
of
people
and
that
that
was
all
I
had.
But
I
I
wanted
to
ask
Joshua.
If,
if
you
guys
have
any
update
on
on
your
stuff.
E
Yeah, I can briefly talk about our findings in our staging cluster. So after we met last week, there were three different paths to try. This is referring to the write amplification we've been witnessing in Pacific, so tracker issue 58530, not that anybody can get to the tracker right now, it looks like it's overloaded at the moment. Three different options. The first one we tried, per your suggestion, and this was just out of curiosity, because there's no way we're actually going to apply it in prod.
E
You
have
to
re-roll
all
your
osds
is
changing.
The
blue
FS
share,
download
sized
one
Meg
that
seems
to
work
as
expected,
reduces
the
inode
size,
because
the
extent
list
is
smaller
because
you're
doing
one
Mega
allocations
instead
of
64k
Mark's
suggestion
was
try
the
new
rocksdb
tunings
from
Maine
and
so
I
applied
that
to
the
system.
It
had
the
same
effect
in
this
case.
It's
because
the
wall
is
kept
smaller
and
so
the
eye
node
just
never
gets
that
big
again.
The
extent
list
is
shorter
because
the
wall
is
smaller.
E
That
was
positive.
We
we
have
an
internal
item
to
go
and
evaluate
those
settings
in
like
against
our
performance,
metrics
and
benchmarks,
and
that
sort
of
thing
and
then
we'll
see
if
we
want
to
actually
start
rolling
those
more
widely
in
our
infrastructure
ahead
of
that
quarter.
Reef
but
we'll
see,
and
then
finally,
it
suggested
that
the
bluefest
incremental
log
update
patch
that
landed
in
16-11
could
fix
this
as
well.
And
it
does
seem
to
be
the
case
too.
So
really
like
any
of
those
any
of
those
options.
E
...helps. I mean, and they're all improving things completely differently: either by keeping the WAL smaller, or by making inode size increases less expensive on the log, because we aren't rewriting the entire inode every single time, or by keeping the inode itself smaller. I did not then try these in combination, because, like, it would be interesting to see what happens if we keep the WAL smaller and also have the incremental inode update mode, but it's going down to, I...
A
That
point
so
yeah
it's
it's
kind
of
funny,
I
feel
like
we.
We
get
these
problems
and
then
we
kind
of
like
Nuke
them
from
orbit
from
like
four
directions
at
once.
Yeah
well.
E
And
the
thing
is
like
I,
think,
the
combination
of
the
the
roxdb
settings
and
then
the
incremental
blue
FS
updates
is
a
valid
thing
right,
because
the
fact
the
fact
that
wall
is
still
getting
so
big
is
I
mean
even
if,
in
the
steady
state
it's
not
showing
problems,
it's
bound
to
cause
some
sort
of
delay
somewhere
like
a
startup
delay
or
something
right.
So.
A
Yeah
yeah
and
it's
it
it.
It
took
a
long
time
for
us
to
figure
out
how
to
avoid
having
like
crazy,
Ray
amplification
in
rocksdb,
without
keeping
those
big.
You
know
with
the
way
that
we
do
PG
log
updates
it
just
it.
Yeah
it
and
I
have
to
give
credit
I
think
it
was
to
either
Intel
or
Micron.
That
came
up
with
like
seemingly
well
working
tunings
that
let
you
keep
it
smaller,
but
no
one
understood
why
so
yeah
it's
every
time.
E
Right, yeah. My naive sense is it's probably because, at level one or bigger, stuff just gets deleted at those levels internally. Is that why? Like, my understanding is, if you keep the WAL big, stuff is getting added but deleted within the WAL, so it just never gets actually committed to level zero. Yeah.
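That intuition can be modeled with a toy LSM front end (illustrative Python, not RocksDB): with a large memtable/WAL, a key that is put and then deleted before any flush never reaches level zero at all; with a small one, the put is flushed and the delete later lands as a tombstone for compaction to clean up.

```python
def run_workload(ops, memtable_limit):
    """Toy LSM front end: apply put/delete ops, flushing the memtable
    to 'level 0' whenever it holds memtable_limit entries."""
    memtable, level0 = {}, []
    for op, key in ops:
        memtable[key] = "v" if op == "put" else None  # None = tombstone
        if len(memtable) >= memtable_limit:
            level0.append(dict(memtable))
            memtable.clear()
    return memtable, level0

# Short-lived entries, e.g. pg-log-like keys: put, then delete soon after.
ops = [("put", "k1"), ("put", "k2"), ("del", "k1"), ("del", "k2")]

# Big memtable/WAL: the deletes cancel the puts in place, so nothing is
# ever committed to level 0.
mem, l0 = run_workload(ops, memtable_limit=10)
assert l0 == [] and set(mem.values()) == {None}

# Small memtable/WAL: the puts get flushed, and the deletes have to be
# flushed later as tombstones for compaction to chase.
mem, l0 = run_workload(ops, memtable_limit=2)
assert len(l0) == 2
```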
E
Hey, yeah, so that's our update. I am, like, literally right now evaluating the 16.2.11 patch list for anything that might concern us for upgrading in our environment. So, I mean, I won't know what this looks like in prod probably for another week or two. At that point I can, like, finally comment back on the ticket and say yes or no, this actually fixed what we're observing at the production level.
E
To observe the difference separately? Yes, yes, yeah. So I would like to, but that's going to come after. Okay.
A
We
have
a
decision
to
make
for
Reef
whether
or
not
we
blanket
tune
everything
to
the
new
tunings,
including
old
clusters,
so
that
they
automatically
start
using
those
new
tunings
when
they
run
out
new
SSD
files
and
and
use
the
wall
or,
if
we
like
kind
of
Flinch
and
and
make
it
so
that
only
new
clusters
that
are
deployed
use.
Those
tunings
and
old
clusters
continue
to
use
the
the
existing
tunings.
E
Yeah
I
mean
looking
at
the
tunings
and
especially
based
off
of
what
the
11
11
folks
were
talking
about
last
week.
I
wonder
if
it's
actually
worth
dropping
the
TTL
Edition,
yes
default,
tunings
yeah,
yes
other
than
that
I
mean
yeah.
I
I
can
understand
it's
kind
of
hard
to
say,
but
it
wouldn't
like
I
mean
it's
not
the
first
time
that
a
major
Ross,
TV,
Behavior
change,
has
landed.
I
mean
we've
upgraded
rocks
to
be
across
major
versions
right,
yeah.
A
You could always change it back; it shouldn't be a problem. And the SST files are going to look different: they're going to be sized differently and they're going to behave a little bit differently. But, you know, this can be a gradual thing. It's not like a, you know, incompatible data format or something, right? Yeah.
E
Exactly, yeah. It's like, I think, again, evaluating the changes, my personal concern, and we won't know this until it's rolled more widely, would be: is level zero, level one, big enough, or is some configuration going to exceed it and then also have spillover that causes write-out, right?
A
All right, that's all I had, guys. Was there anything else anyone wanted to bring up this week?
A
Well, all right then, thank you all for coming. It was a good talk, and we'll meet again next week. See you guys.