From YouTube: 2018-MAR-08 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly
A
There's an async messenger PR here by Haomai that I think is mostly just looking at reducing lock contention, and apparently it has pretty impressive latency reduction effects. I'm really curious about that, because we do actually see a fair amount of CPU time spent in the messenger for certain workloads. So that's good.
A
Let's see. I don't know very much about this StupidAllocator discard PR. I know we've looked at discard in the past and not really done much with it, but it looks like Igor's reviewing it, so maybe he'll update us on it next time we see him. I'm a little nervous any time we do anything to the StupidAllocator, though.
B
But it's still a work in progress anyway; it compiles, and the compilation on ARM and the memory cleanup are still on the board, so even if someone clicks the button there would be no Greek tragedy. The idea there is to remove a lock that looks perfectly unnecessary: it's a read-write lock that is taken only in non-exclusive mode, but taken on the main path, on the hot path of execution. Removing it unveiled high contention elsewhere.
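A minimal sketch of the pattern being described here, in illustrative C++ rather than the actual Ceph code: a reader-writer lock that callers only ever take in shared mode serializes nothing, so removing it changes no guarantees while cutting the atomic traffic on the hot path.

    #include <shared_mutex>

    // Hypothetical illustration, not the Ceph source: a lock that is only
    // ever acquired in shared (non-exclusive) mode.
    struct HotPathState {
      std::shared_mutex rw;   // never taken exclusively anywhere
      long value = 0;
    };

    // Before: every reader still pays for an atomic acquire/release and
    // bounces the lock's cache line between CPUs, yet gains no exclusion
    // because no writer ever locks it exclusively.
    long read_before(HotPathState& s) {
      std::shared_lock<std::shared_mutex> l(s.rw);
      return s.value;
    }

    // After: the lock is simply removed from the hot path.
    long read_after(const HotPathState& s) {
      return s.value;
    }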
B
We hit it on the spin lock guarding the communication path with the op history thread. Also, there are some fixes, unrelated at the moment, mostly for our spinlock implementation. It was using only the compare-exchange instruction in a loop, which is not the best idea, so I implemented some proposals from Intel and also from glibc.
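For context, a minimal sketch of that kind of spinlock improvement (my reading of the Intel-style guidance, not the code from the PR): rather than hammering an atomic read-modify-write in a tight loop, spin on a plain load with a pause hint and only retry the atomic exchange when the lock looks free.

    #include <atomic>
    #if defined(__x86_64__) || defined(__i386__)
    #include <immintrin.h>   // _mm_pause
    #endif

    // Illustrative test-and-test-and-set spinlock.
    class spinlock {
      std::atomic<bool> locked{false};
    public:
      void lock() {
        for (;;) {
          // Try the atomic exchange only when the lock appears free.
          if (!locked.exchange(true, std::memory_order_acquire))
            return;
          // Otherwise spin on a cheap load instead of repeated
          // read-modify-write operations that thrash the cache line.
          while (locked.load(std::memory_order_relaxed)) {
    #if defined(__x86_64__) || defined(__i386__)
            _mm_pause();   // tell the CPU this is a spin-wait loop
    #endif
          }
        }
      }
      void unlock() { locked.store(false, std::memory_order_release); }
    };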
A
All right, let's see. So there are a couple of closed ones here, the first one and the second one; I meant to talk about them a little bit more as part of the discussion topics below. And then that third one adds minimal tracing for cache activity; I think Sage just closed that. I think he maybe just didn't do anything with it, so that one is no longer there. A couple of updated ones: my PR, it sounds like it failed, I think for unrelated reasons. It looks like I need to ask you about that.
A
But yeah, it looked like some other stuff was going on, and then there were just a couple of other tests that it looks like maybe failed, or at least one, this EC back-end one. And then for the other ones, I don't think a whole lot was going on with them yet. All right, so maybe before I get into this: does anyone else have anything they'd like to talk about or bring up this week?
A
All right, so then I've got two things. The first is related to this change to bluefs_buffered_io = true. The gist is that we had a user on the IRC channel who migrated from FileStore to BlueStore and was seeing all kinds of background work happening. It was like 22,000 read IOPS with apparently nothing going on; he didn't have any client traffic. Well, it turns out that this was work created by the migration process.
A
Based on the instructions on the Ceph page, he removed the old FileStore OSD, waited for the cluster to heal, and then put the new BlueStore OSD back in, and did this for every single OSD. That, you know, created a lot of extra background work that needed to be done to clean everything up, and, as it turns out in BlueStore, when I got him to do a wall clock profile...
A
...it was reading all of the data from the SSDs into memory, and I don't know how much was thrashing, but I suspect it was thrashing quite a bit. That's the work that was going on: tons and tons and tons of little 8k reads. So I had him enable buffered reads, and that seemed to help, though not entirely; it was still doing a lot of work, and some of the threads were still really busy doing reads.
A
Not all of them were anymore, though, so I kind of suspect that maybe he didn't have enough buffer cache to really fully cache everything, but it was maybe caching some things, so it was better. But really the problem, right, is that it's doing these little 8k reads, and maybe it wants to, but maybe it's really just doing sequential reads.
A
That's a really simple PR; I think it's just looking at whether something is sequential, and then if it sees sequential stuff, it reads the sequential data ahead.
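A minimal sketch of that kind of heuristic, as I understand it (illustrative only, not the PR's code): track where a sequential stream would continue, and once a few consecutive reads turn out to be contiguous, start reading ahead.

    #include <cstddef>
    #include <cstdint>

    // Hypothetical sequential-read detector.
    class readahead_hint {
      uint64_t next_expected = 0;   // offset where a sequential stream continues
      unsigned streak = 0;          // consecutive contiguous reads seen so far
    public:
      // Returns how many extra bytes to read ahead for this request (0 = none).
      std::size_t on_read(uint64_t offset, std::size_t length) {
        streak = (offset == next_expected) ? streak + 1 : 0;
        next_expected = offset + length;
        // After a few contiguous little reads (e.g. 8k ones), fetch bigger chunks.
        return (streak >= 3) ? 128 * 1024 : 0;
      }
    };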
A
So what I'm doing now is modifying some of that in RocksDB to try to make it so that I can, you know, monitor and control that in a more flexible manner, and then in BlueStore I can piggyback on the mempool thread, where we're already flushing the cache, and do some rebalancing in there on kind of a periodic basis. So, you know, maybe that's once every few seconds, maybe once every 60 seconds; I don't know, we'll see how much overhead it is.
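As a rough sketch of what piggybacking on an existing trim thread could look like (the names below are invented for illustration and are not the actual BlueStore mempool thread code): the loop that already trims the caches also runs a rebalance step every N wakeups.

    #include <atomic>
    #include <chrono>
    #include <thread>

    // Hypothetical cache manager with an existing trim step and a new
    // periodic rebalance step piggybacked onto the same thread.
    struct cache_manager {
      void trim()      { /* existing cache flushing/trimming */ }
      void rebalance() { /* shift space between cache pools based on stats */ }
    };

    void cache_thread(cache_manager& mgr, std::atomic<bool>& stop) {
      using namespace std::chrono_literals;
      const int rebalance_every = 60;   // e.g. rebalance once per 60 wakeups
      int ticks = 0;
      while (!stop.load(std::memory_order_relaxed)) {
        mgr.trim();                              // work the thread already does
        if (++ticks % rebalance_every == 0)
          mgr.rebalance();                       // piggybacked periodic step
        std::this_thread::sleep_for(1s);         // existing wakeup interval
      }
    }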
A
I think it's going to be a lot better if we can kind of get into that mindset, so I'm going to try to have at least a prototype for that sometime in the next week or two here, and then try to get whatever changes in RocksDB I need submitted upstream. And between those, this new adaptive thing, the adaptive readahead, and then Radoslaw also has a PR for... oh sorry, I already forgot what it was that you made.
A
Then I'm going to try to start making decisions based on it, okay. But the good news is that I can actually get all these stats now, whereas before we didn't really have anything. So now I'm actually running different workloads and watching the high-priority cache items in RocksDB increase and sometimes spill over into the non-priority pool. But you can tweak that kind of stuff, like how much space you want for each, so part of this will just be kind of trying to figure out, you know, what...
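To illustrate the priority-pool idea (a hypothetical sketch, not the actual RocksDB/BlueStore cache accounting): high-priority items get a reserved share of the cache, and once that share is full, further insertions spill over into the ordinary pool.

    #include <cstddef>

    // Hypothetical two-pool cache accounting with a tunable high-priority share.
    class priority_cache_space {
      std::size_t capacity;   // total cache bytes
      double hi_ratio;        // tunable fraction reserved for high-priority items
      std::size_t hi_used = 0;
      std::size_t lo_used = 0;
    public:
      priority_cache_space(std::size_t cap, double ratio)
        : capacity(cap), hi_ratio(ratio) {}

      // Charge an insertion; high-priority items use the reserved share first
      // and spill into the non-priority pool once that share is exhausted.
      bool insert(std::size_t bytes, bool high_priority) {
        const auto hi_limit = static_cast<std::size_t>(capacity * hi_ratio);
        if (high_priority && hi_used + bytes <= hi_limit) {
          hi_used += bytes;
          return true;
        }
        if (hi_used + lo_used + bytes <= capacity) {
          lo_used += bytes;   // spill-over, or an ordinary low-priority item
          return true;
        }
        return false;         // full: the caller would evict before retrying
      }
    };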
A
Yeah, so there's not a whole lot there yet, but it's kind of the start, and I think, from what I'm seeing so far, it all seems like it kind of makes sense. I'm hoping we'll have a nice graph showing that the caches are rebalancing and the hit rates are better. Looking forward to it.
B
Maybe just a quick note on the mutex implementation from glibc: basically, it doesn't try to spin even for a second. All it does is check the atomic value inside, and if it's locked, it just goes to the kernel. As part of the abstraction work earlier, I put in a small thing called adopt-guard; I guess a better name would be try-guard or something like that.
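A minimal sketch of what such a guard could look like (my illustration of the idea; the name and the actual code in the PR may well differ): an RAII wrapper that attempts the lock without blocking, adopts it only if the attempt succeeded, and releases it at the end of the scope only in that case.

    #include <mutex>

    // Hypothetical "try guard": non-blocking acquisition with RAII release.
    template <typename Mutex>
    class try_guard {
      Mutex& m;
      bool owned;
    public:
      explicit try_guard(Mutex& mtx) : m(mtx), owned(mtx.try_lock()) {}
      ~try_guard() { if (owned) m.unlock(); }
      try_guard(const try_guard&) = delete;
      try_guard& operator=(const try_guard&) = delete;
      explicit operator bool() const { return owned; }   // did we get the lock?
    };

    // Usage sketch: take the fast path only when the lock is uncontended.
    // std::mutex mtx;
    // if (try_guard<std::mutex> g{mtx}) { /* work under the lock */ }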