From YouTube: 2017-MAY-25 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly
A: So, after looking at some pull requests — let's say I'll start with the closed ones — I think two interesting things happened; a lot of them settled here. The first one is that Haomai merged the write-lock contention patch in the async messenger. This is something that came up in the profiles: quite a bit of time was being spent waiting on this particular lock, so we rearranged the code a little bit in the messenger, and now we don't sit there.
A: We have a better idea of where that threshold should be, but it didn't appear to matter that much — I don't know how often we're actually calculating CRCs for zero buffers.
Anyway, let's see — we closed the Zipkin branch, because that library stuff was rebased and merged a week or a couple of weeks ago, and I think the v1 also just merged, so the original PRs closed; the pieces merged. So, yeah, that's the last one.
A: Maybe the most interesting one on the BlueStore side is the kv sync thread. That's gone through several iterations, originally starting with some work at Intel, and I think the latest version has merged. That shows like a 10% change on NVMe — well, less than the 30% Mark was seeing when it was also eliminating the finisher, but there's a deadlock that happens if you do that, so it needs more work before we can sort of go all the way.
A: Igor had a pull request that got pushed. Oh — any comments on the queue_transaction thing? I haven't seen that actually; I'm not sure where that came from — Alibaba, I think; they opened it up here. Anyway, [inaudible] — take a look. There's one for RocksDB that avoids a memory copy; I think I just mostly went through it. I would assume it's going to help marginally, but it hasn't been tested from the performance side. And there's an old one that adds a discard method for SSDs, yeah.
A: That one still needs work, and then there are some new ones. Sean has a batch throttle patch — I haven't tested yet how it actually performs, and I don't think that's going through testing right now. And there is a change to the encoding code that we turned up during the Big Bang testing: it behaves horrendously when the buffer is fragmented and you're trying to encode using the legacy code path and append stuff on top of it, so we need to fix that.
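To illustrate the fragmentation problem being described, here is a toy stand-in for a buffer list (purely hypothetical, not Ceph's `bufferlist`): if a legacy code path needs one contiguous region, a fragmented list of chunks has to be flattened first, which copies every byte already appended — and doing that repeatedly while appending is what behaves horrendously:

```cpp
#include <cassert>
#include <list>
#include <string>

// Toy chunked buffer: appends are cheap, but any code path that demands
// contiguous memory forces an O(total bytes) flatten (copy).
struct ToyBufferList {
  std::list<std::string> chunks;

  void append(const std::string& s) { chunks.push_back(s); }
  bool is_contiguous() const { return chunks.size() <= 1; }

  // Legacy-style access: collapse everything into one chunk, copying
  // all existing data each time it is needed.
  const std::string& flatten() {
    if (!is_contiguous()) {
      std::string all;
      for (const auto& c : chunks) all += c;
      chunks.clear();
      chunks.push_back(std::move(all));
    }
    static const std::string empty;
    return chunks.empty() ? empty : chunks.front();
  }
};
```

If an encode loop triggers a flatten on every append to a fragmented buffer, the cost becomes quadratic in the buffer size, which would match the "horrendous" behavior seen in the Big Bang runs.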
A: Yeah, nothing else here is worth mentioning. I guess the only one here that I think is still in play for luminous is the one that does the EC overwrites — there you go, the second one. It showed some failures in QA under that workload, but I need to fix those, and we should merge that for luminous, because it'll make BlueStore behave much better on EC overwrite pools. And I think that's it for the PRs.
D: I was just going to say, you missed the other one, which was a commit that improved the CRC calculation for zero buffers, which is great. And then I just wanted to let you know my plan is to rebase to incorporate those changes and then continue work on [inaudible] for Intel — [inaudible] faster. OK.
A: Oh yeah — let's go ahead and go to the discussion topics. So the first thing is BlueStore. Mark's been doing more testing with a number of the recent improvements, to the point where — we had set min_alloc_size to 16k on NVMe because it was faster, even though there's more write amp, but it looks like that may no longer be the case. It might be about the same, and so we have a fork in the road.
C: Yeah, that testing is just happening right now [inaudible], but one of the things I have noticed is that unless you restart the OSD, the overall memory usage looks like it's lower than it really is — it even looks like the 16k min_alloc_size case when it's not restarted. So it might be that even in the 16k case we're using more memory, and maybe there is some ceiling where it starts stabilizing, but [inaudible] the RSS before [inaudible].
C: It's from last week — [inaudible] — you can see there's something around line 1697 of the trace that was included there. But one of the reasons why the 16k alloc size is probably faster is because we do [inaudible] there.
A: That probably helps a lot of people, because even so, if you are trying to write and you aren't aligned to a page, then we have to wait for the I/O for the previous page to complete before you can write again. You can't have two I/Os for the same block outstanding, or else they could race in the block layer. So it's also sort of a serialization thing.
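A minimal sketch of that serialization constraint — a toy tracker (entirely hypothetical, not Ceph code) that refuses to issue a second I/O to a block that already has one in flight, so the caller has to wait for completion first:

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Toy per-block in-flight tracker: at most one outstanding I/O per block,
// mirroring the "can't have two I/Os for the same block" rule above.
class InflightBlocks {
  std::set<uint64_t> inflight_;
public:
  // Returns true if the I/O may be issued now; false means the caller
  // must wait for the earlier I/O on this block to complete first.
  bool try_issue(uint64_t block) {
    return inflight_.insert(block).second;
  }
  void complete(uint64_t block) { inflight_.erase(block); }
};
```

This is why unaligned small writes hurt: two logically independent writes that straddle the same page collapse into a dependent sequence, one waiting on the other's completion.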
A: So you'd seen that nine percent of the time in copying — you measured that with the 4k malloc size, and that went away?
A: Yeah, it shifted somewhere else. Sounds good. Let's see, other things going on: for BlueStore, we should check filling a disk up and then making sure that you can actually still run it — figure out what the memory envelope is, make sure it works. So.
A: There's a little bit of the Big Bang stuff that's been happening the last two weeks. We have a temporary cluster set up at CERN that has like 200 hosts or something — supposedly 10,000 OSDs, though we never actually got to the full 10,000. Right now we have like 6,500 OSDs in the system, and we're testing all the new code for the monitor and manager, with stats going to the manager, making sure it's stable and fixing issues with this stuff. So that work is ongoing.
A: On the cluster, I just wanted to see what happens — and then a whole rack went down, and so a bazillion PG mappings got remapped to pg_temp, with pg_temp entries, and then we started seeing the memory usage for the decoded OSDMap balloon, because it's a red-black tree and it's just super inefficient: it's a red-black tree of vectors, which then each have another allocation.
A: So it's only a handful of memory allocations, and much more compact and faster. We're testing that, but I haven't really gotten to sit here and watch what happens when it's deployed — I'm still working on that. It's sort of challenging to diagnose, because just having so many PGs means that random things that are totally reasonable to spend CPU time on suddenly take a long time, and then you start hitting weird timeouts. So you have to figure out what part is slow and why, and which part to debug.
A: So we're going to do that, but that's been good. Hopefully we'll be able to have a fully stable cluster at, you know, either sixty-five hundred or ten thousand OSDs, and sort of check that box — but we'll see.
A: That's it for me — any other topics people want to talk about right now? Right now, about that cluster at 6,500 OSDs: there are enough machines for 10,000 OSDs, but they aren't all added to the cluster. I'm not sure if that's because they didn't all get deployed or because we just didn't add them yet — I guess we've been working on too much other stuff in the meantime. So hopefully we'll get to ten thousand sometime in the week.