From YouTube: Ceph Performance Meeting 2022-07-14
Description
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A
All right, well then, for discussion topics, there's one here that I want to talk about with Laura regarding this BlueStore zero block detection PR. It turns out that's already been committed to Quincy, it's already there, but we turned it off in 17.2.1 and actually saw a performance regression relative to previous versions of Quincy. So we want to talk about that and try to understand what's going on there. But until Laura gets here...
A
Maybe we can talk about something else. One other topic that we've been looking at recently is memory usage in the way that we track dup ops for the PG log. It turns out that when you have a corrupted dup entry that looks like it's in the future, we stop trimming, and that allows dups to accumulate, basically until that corrupt entry is eventually trimmed, which could be very, very far in the future.
A
So we've done a lot of testing and quite a bit of work to try to figure out the right way to fix that. The current idea is basically to detect it and then, when an OSD is rebooted with the fixed version of the code and writes are issued, we slowly, well, somewhat slowly, trim, I think 10,000 entries at a time for every write, if I remember right, and then get back into a stable state. There was an earlier version of this PR that tried to trim everything on OSD boot, so that we didn't consume the memory at all on reboot, but I think there were some concerns about the safety of that approach. So this is the one I think we're looking at right now. Neha, do you know, do we have a final version of that PR that Radek did?
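A minimal sketch of the incremental trimming idea described above, in Python for illustration: the 10,000-entry per-write batch size comes from the meeting, while DupTrimmer, MAX_DUPS, and the deque representation are assumptions for the sketch, not the actual Ceph implementation.

```python
from collections import deque

MAX_DUPS = 3000          # assumed steady-state cap on retained dup entries
BATCH_PER_WRITE = 10000  # per-write trim batch size mentioned in the meeting

class DupTrimmer:
    """Illustrative model of draining a dup-op backlog gradually."""

    def __init__(self):
        self.dups = deque()  # oldest entries at the front

    def record_dup(self, entry):
        self.dups.append(entry)

    def on_write(self):
        # Instead of draining a huge backlog all at once on OSD boot,
        # trim at most BATCH_PER_WRITE excess entries for every write,
        # so memory is reclaimed gradually.
        excess = max(0, len(self.dups) - MAX_DUPS)
        for _ in range(min(excess, BATCH_PER_WRITE)):
            self.dups.popleft()
```

The earlier, safety-questioned approach mentioned above would correspond to draining the whole backlog at startup instead of a bounded batch per write.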
B
Yeah, I'll paste it. Why don't you continue; I'll paste it offline.
A
Right here. Okay, sounds good. So I think we're basically fairly convinced that that's the right way to go at this point. I've done some testing on it and it looked good in my tests. I think there are some more official tests of, you know, downstream releases or other packages that are being tested now. So as far as I know, that looks like it's pretty good and we're going to move forward with it. Does anyone have any other thoughts on that topic?
B
No, I think we are trying to do our due diligence from all sides to validate this, and at this point what we are doing is also validating it on HDDs, which is something, you know, we haven't done upstream. So that's where we stand. If things look good, this will be shipped in the next corresponding releases. This bug goes as far back as Octopus, and with Octopus at end of life, we don't want to risk the upgrade approach.
B
We've already shipped, I mean, we will be shipping the offline tool method in Octopus, but from Pacific onward we'll be shipping this fix.
A
Cool, all right. So then, now I see Laura, you're here. Do you want to give a quick update on what you guys found regarding the BlueStore zero block detection PR performance improvement?
C
Sure, so yeah. In the DFG storage team we've been studying the performance differences between 17.2.0 and 17.2.1, and we saw that 17.2.1 had a performance decrease from 17.2.0. We were trying to better understand that, so we checked the differences in commits and found that there were two major changes to BlueStore between the two versions, and one we ruled out.
C
But the other was BlueStore zero block detection being disabled by default. This was a feature added in 17.2.0, where it was on by default. Essentially this feature detects zero buffer lists and skips writing them to BlueStore, and the goal was so that we could perform some large-scale tests, mainly in teuthology, with many OSDs without filling the devices. But we found in 17.2.0 that this caused some unwanted side effects, so we disabled it for 17.2.1, so we're not seeing, for instance, side effects with RBD thick-provisioned images being thinned.
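Conceptually, the feature boils down to an all-zero check on the incoming buffer before the data write. A rough sketch, with hypothetical function names rather than BlueStore's actual code:

```python
def is_all_zero(buf: bytes) -> bool:
    # Comparing against a zero-filled buffer of the same length is a
    # simple way to test that every byte is zero.
    return buf == bytes(len(buf))

def write_maybe_skipping_zeros(buf: bytes) -> bool:
    """Return True if the data write was skipped."""
    if is_all_zero(buf):
        # A real implementation would record the range as unwritten
        # rather than issuing the write. As discussed later in the
        # meeting, BlueStore currently can't mark the extent as
        # logically used on this path, which is part of why the
        # feature was disabled by default.
        return True
    return False  # fall through to the normal write path
```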
C
However, in 17.2.0 it caused a performance boost, maybe in quotations, because, you know, there were obviously side effects that came with it. So essentially Mark was interested in understanding where the performance boost came from, and seeing if we can get those numbers back in a safer way in the future.
C
But for now this makes a lot of sense, because comparing the numbers between the last Pacific point release and 17.2.1, those numbers match, and the only thing in between is 17.2.0, where the BlueStore zero block detection feature is enabled, and that's where we're getting this quote-unquote performance boost. That was just kind of discovered this morning, so since I headed that PR, I'm working with the DFG storage team to, you know, further understand it. But now we understand where that performance boost in 17.2.0 was coming from. That's the gist.
A
Were those gains primarily in RGW workloads, or were they RBD workloads as well?
C
It was in the hybrid, it was a hybrid workload, I believe. I don't know exactly what that workload is, but I can check.
A
Yeah, Casey, I noticed that when I turned on RocksDB compression, for RGW tests we saw a huge reduction in write amplification.
A
I wonder if there's a bunch of fields where, if there's a bunch of zeros being written out, they aren't being compressed in any way, but this thing just kind of avoided writing them.
D
Yeah, I'm not sure exactly what integer fields would be unused and default to zero. There are a couple of strings which can potentially get long, so I could see some benefit from compressing those.
A
I'm wondering if there's any intersection between what the DFG team and Laura saw with this and the thing I was seeing.
C
Mark, those results that you were talking about, do you have them anywhere, or is this, you know...
A
It's in that big draft RocksDB thing, which you may not have access to. Let me get you on that. I'm going to publish it pretty soon, but I'll give you access to read it now.
A
And if anyone else wants access to it as well, I'm going to try to get it into the blog format, probably next week, so that we can post it. But if folks are interested, let me know and I can give you access to look it over; I'm happy to get feedback. But yeah, I guess what I'm trying to understand is why the zero block detection seemed to improve performance to the extent that it did.
C
Yeah, I think understanding that will help us potentially get those performance numbers back in the future.
C
The main topic of conversation around that, in terms of enabling it on clusters and not just in synthetic teuthology testing: Adam, I believe, said that there are some BlueStore limitations where we are currently unable to mark the extents as logically used, so when skipping writing a zero buffer list, we can't logically mark that the extent has been used. That's what we would need to look into doing, and that would make it safer to actually have it enabled on clusters. But yeah, there are just some limitations right now in what BlueStore is capable of.
A
C
No, I haven't checked into that. So you want to know if the workload is using zero buffer lists?
E
Yep, it's fine as long as the engine wasn't optimizing for it first, but well, now BlueStore is.
C
Yeah, so that is something to check into. I think, since we just kind of made this discovery this morning, the goal now, the next action item, is to better understand what's going on under the hood and figure out what the tests are using. But that's certainly a thing to check.
G
So the snap mapper here that I wrote, which skips RocksDB, doesn't seem to impact Paul Cuzner's performance testing, which is a mystery to me.
G
So I was wondering if you'd be available to help me construct and run appropriate, sorry, targeted testing to show, or maybe to find out that it's not there, the impact on snap trimming and on clone creation, any clone object creation. I cannot see how this thing would have zero impact, so there might be something else hiding the performance issue.
A
It is strange. I've been trying to keep up with Paul's emails that he sent out, and it looks like the last thing I saw was that he said that, through this constant workload, he just sees the OSDs consuming more and more CPU. Is that right?
G
Yes, he says the CPU percentage consumption is going up over time. I assume that's because they have more and more snaps accumulated, and the amount of work you do is related to the number of snaps.
G
Yeah, so that's what I need your help for. If you could set up a meeting with me where we could design a test, try to outline the steps that need to be done, and decide what would be considered an improvement, or what would prove that there is no improvement.
G
So, if you have many small objects and then there is a write touching all the objects... I know it's not a very normal flow, but that would be the best-case scenario for my code. So let's say the objects are very small, 64k each, and then you have some sequential or random writes jumping around doing 64k I/Os, so every write would touch an object.
G
Then every write would create a clone object, which in my code should be more efficient. Then I want to measure timing and CPU utilization, and if there is no difference, then I need to really understand what I'm doing wrong; I do expect to see a difference. And the other thing is, say you have a system with many objects, and now you run snap trim.
G
Sorry, a system with a lot of clone objects, and you run snap trim. So I would measure CPU utilization, just snap trimming and nothing else: CPU utilization and the time it takes to do the trim, with my code and with the base code. Again, I expect to see a difference. And last, in both cases I would also check write amplification, because I don't touch the disk, unlike the snap mapper today, where an entry is created in RocksDB for every object and then a tombstone has to be created.
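A sketch of the kind of targeted test described above, using the python-rados bindings. The pool name, object count, and snapshot name are assumptions; timings would be compared between the patched and base OSDs, with CPU utilization and write amplification sampled externally on the OSD hosts:

```python
import time
import rados

OBJ_SIZE = 64 * 1024   # very small objects, 64k each, as suggested
NUM_OBJS = 10_000      # assumed object count

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('snap-bench')  # hypothetical pool

payload = b'x' * OBJ_SIZE

# 1. Create many small objects.
for i in range(NUM_OBJS):
    ioctx.write_full(f'obj-{i}', payload)

# 2. Take a pool snapshot, then overwrite every object once, so each
#    64k write touches one object and forces a clone to be created.
ioctx.create_snap('bench-snap')
start = time.monotonic()
for i in range(NUM_OBJS):
    ioctx.write_full(f'obj-{i}', payload)
print(f'clone-creating writes: {time.monotonic() - start:.2f}s')

# 3. Remove the snapshot and let snap trim run with nothing else going
#    on, watching OSD CPU and the trim duration in both runs (e.g. via
#    top and `ceph tell osd.* perf dump`).
ioctx.remove_snap('bench-snap')

ioctx.close()
cluster.shutdown()
```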
A
So it kind of seems to me, based on what you just said, that it might be a good idea to write a benchmark that's really specifically tailored to do this, and then maybe some kind of more generic one. I know the ODF team at Red Hat has run benchmarks where they're looking at the impact of taking snapshots during a background workload, but they were basically DDoSing the OSD with snapshots. So I don't know if that's really what we want, but it seems like you could write a targeted test that maybe would give you an idea of how your code is impacting things. Gabi, what would you think about something like the gtest suite, just making something that really specifically targets this behavior?
A
Yeah, well, and the fact that the overhead increases over time... I mean, on one hand, sure, you could expect that in RocksDB things would slow down; maybe there's some fragmentation involved at the BlueFS layer.
A
You know, there are maybe some explanations for why it would get slower, but understanding that exact behavior would be good. I mean, it seems like Paul's making some kind of slow progress on that now; he's starting to figure things out, in the last email I saw. So maybe it's just going to take some time to understand that.
A
But yeah, in terms of advice, I don't have anything real specific on this, other than: try crafting something that you think would showcase the difference between what you wrote and what's already there, and in the process of doing that you'll probably figure out what makes sense or doesn't make sense. Certainly profile as you go, right, try to analyze why things are degrading over time. I think that will tell us a lot.
A
Sure, okay, sounds good. All right, one other thing I forgot to mention to the group: there's a chance that we may get funding for replacing the smithi nodes.
A
And there's a real tight deadline on this; it sounds like we need to have quotes submitted by tomorrow. I worked with David yesterday, David Galloway, to try to nail down potential configurations that might let us keep either systems or VMs that look a lot like the smithi nodes. The biggest concern, I guess, is power usage: in the community lab there is no available additional power.
A
And that's not per rack; that's the entire data center. So our hands are a little bit tied in terms of what we can do, but the current iteration of the design of these replacement nodes is very much based on trying to maximize power efficiency.
A
If anyone cares about this or has opinions on it, you know, please let me know ASAP, because we're trying to get this nailed down today so we can submit tomorrow. But we're running out of time very quickly on this.
A
So just let me know if you care. The current iteration of this hardware is basically 1U servers with, I think, 32-core, 64-thread AMD processors and a bunch of NVMe drives, which we'll try to split out into something like eight VMs per node. That looked like it gave us kind of the maximum speed for the lowest power envelope. I doubt that we'll be able to do a straight one-system-to-one-system swap with smithi, since they will take more power than the smithi nodes, but maybe we can do somewhere between half and three quarters of the same node count, with a lot of VMs per node. That's kind of what I'm hoping we'll be able to do, and I don't know if each VM will be quite as fast as a smithi, but I'm hoping it'll be pretty close, with a big increase in the count.
A
So anyway, that's kind of the current rough plan on this. Yeah, let me know if you care; that's basically it for that.
A
Yeah, Sam, that was kind of what I was thinking too. If we could get even close to the current test throughput with smithi, but then have more nodes available, or more VMs available, so that people aren't waiting in queues as long, that would probably be a win.
A
The current jobs, I think, are still spending a fair amount of time waiting on things like compression of logs or other random weird stuff that takes a long time on a single thread or single core. So increasing the number of VMs, even at a slight decrease in performance per VM, is probably a big win.
A
So anyway, that's basically it for that, unless anyone else has thoughts or questions on it.
A
All right then, does anyone else have anything they want to discuss this week?
A
All right, well then, have a great week, everyone, and see you next week.