From YouTube: 2018-FEB-15 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly
A: I'll maybe just give folks a minute more here to gather, and then we can get started. I've actually been on PTO for the last couple of weeks, so I haven't been here, and I have to apologize a little bit: I did not make it very far through the backlog of performance pull requests that were updated over the last couple of weeks. I'm going to have to finish that up today. I've got some of it done here, but there's a whole lot of new stuff that I don't have in here yet. So instead of going through these pull requests specifically right now, I'll mention some of the things that have been going on over the last couple of weeks while I've been gone. Radoslaw has been making a lot of progress on performance testing on the read side and uncovered a couple of different issues.
The big one, which he's already got a PR up for, was in the hot path: we were accessing configuration options through md_config_t, and it turns out this is actually really slow. I was seeing it in some testing I was doing as well while I was on PTO; it's pretty bad. So he's got a PR in place already that caches, or makes it possible to cache, these options, and it helps dramatically. It's really good.
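A minimal sketch of the caching idea (the names here are illustrative, not the API from the actual PR): pay the locked config lookup once, and serve hot-path reads from a local copy that a config observer refreshes whenever the option changes.

    #include <atomic>
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <mutex>
    #include <string>

    // Stand-in for an md_config_t-style lookup: a lock plus a map
    // lookup on every call, which is too expensive on a hot path.
    struct Config {
      std::mutex lock;
      std::map<std::string, int64_t> values{{"example_option", 64}};
      int64_t get(const std::string& key) {
        std::lock_guard<std::mutex> g(lock);
        return values[key];
      }
    };

    // Cached variant: hot-path reads are one relaxed atomic load; a
    // config-observer callback would call refresh() on changes.
    class CachedOption {
      Config& conf;
      std::string key;
      std::atomic<int64_t> value;
    public:
      CachedOption(Config& c, std::string k)
        : conf(c), key(std::move(k)), value(conf.get(key)) {}
      void refresh() { value.store(conf.get(key), std::memory_order_relaxed); }
      int64_t get() const { return value.load(std::memory_order_relaxed); }
    };

    int main() {
      Config conf;
      CachedOption opt(conf, "example_option");
      std::printf("%lld\n", (long long)opt.get());  // no lock, no map walk
    }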
So the good news here is that it's not too hard to fix. I've got a PR for fixing it, and David's got another PR that works just a little bit differently, so hopefully soon we'll have that fixed, and that should also reduce CPU consumption in the OSD, at least by a little bit.
A
There
was
using
vector
based
objects
rather
than
bufferless,
and
that
was
dramatically
faster
and
actually
really
easy
to
implement.
It
was
surprising,
just
kind
of
how
how
much
better
it
was
so
using
that
and
then
also
applying
some
of
these
other
performance
fixes
it
kind
of
gave
a
good
insight
into
what
the
the
bottlenecks
in
in
the
OSD
are,
and
the
the
really
big
thing
that
that
stuck
out
is
pg
log.
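As a rough illustration of why that can matter (a toy stand-in for the contiguous-versus-segmented trade-off, not Ceph's actual bufferlist):

    #include <chrono>
    #include <cstdio>
    #include <list>
    #include <vector>

    // A segmented buffer pays for a heap node and pointer chasing per
    // chunk; a flat std::vector with reserve() appends into a single
    // contiguous allocation.
    int main() {
      constexpr int kChunks = 100000;
      const std::vector<char> chunk(64, 'x');

      auto t0 = std::chrono::steady_clock::now();
      std::list<std::vector<char>> segmented;   // bufferlist-like stand-in
      for (int i = 0; i < kChunks; ++i)
        segmented.push_back(chunk);             // one node allocation each
      auto t1 = std::chrono::steady_clock::now();

      std::vector<char> flat;
      flat.reserve(kChunks * chunk.size());     // size once up front
      for (int i = 0; i < kChunks; ++i)
        flat.insert(flat.end(), chunk.begin(), chunk.end());
      auto t2 = std::chrono::steady_clock::now();

      auto us = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a)
            .count();
      };
      std::printf("segmented: %lld us, flat: %lld us\n",
                  (long long)us(t0, t1), (long long)us(t1, t2));
    }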
It's pretty high overhead. As an example: in, say, the replicated backend, removing the call to log_operation, just commenting it out and then fixing the asserts that get thrown because of it, lowers the CPU consumption of the OSD on my dev box when using a vector-based in-memory store.
It takes the CPU consumption from about five cores down to about three cores. That's at somewhere between 40,000 and 50,000 write IOPS for 4K random writes, so it's really significant: just the PG log is not quite half of the CPU consumption. And also, when you look at BlueStore, PG log updates are really significant in terms of how much work they make the KV store do.
This has really reinforced for me personally that, in addition to the work we're looking at for moving over to something like Seastar, we also really need to be looking at the PG log and whether or not we can do any of it better. I think the idea of using per-PG ring buffers is really good, at least on something like NVMe, where we don't care as much about contiguous writes as we do about lots of parallelism and reducing the amount of work that we make RocksDB do, in terms of record-keeping, and then also maybe the complexity of the PG log code itself.
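A sketch of the per-PG ring-buffer idea (the types and fields here are invented for illustration; this is not the real pg_log_t): a fixed-size circular log whose appends overwrite the oldest entry in place, so the footprint stays constant instead of growing state for the KV store to compact.

    #include <array>
    #include <cstddef>
    #include <cstdint>

    struct LogEntry {         // illustrative, not Ceph's pg_log_entry_t
      uint64_t version;
      uint64_t txn_id;
    };

    template <std::size_t N>
    class PGLogRing {
      std::array<LogEntry, N> ring{};
      std::size_t head = 0;   // next slot to write
      std::size_t count = 0;  // valid entries, at most N
    public:
      void append(const LogEntry& e) {
        ring[head] = e;               // in-place overwrite, no allocation
        head = (head + 1) % N;
        if (count < N) ++count;
      }
      // Oldest-first access for trim/recovery-style scans.
      const LogEntry& nth_oldest(std::size_t i) const {
        return ring[(head + N - count + i) % N];
      }
      std::size_t size() const { return count; }
    };

    int main() {
      PGLogRing<4> log;
      for (uint64_t v = 1; v <= 6; ++v) log.append({v, v});
      // The ring kept only the four newest entries: versions 3..6.
      return log.nth_oldest(0).version == 3 ? 0 : 1;
    }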
So anyway, that's kind of where I wrapped up with that project. There were some other things: there's a fair amount of encoding and decoding overhead that crops up once you get other stuff out of the way, for object_info_t and a couple of other data types. So improving that, maybe using the same mechanisms that we did in BlueStore to avoid appends to the buffers, is probably good.
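The BlueStore mechanism being referenced is, roughly, encoding through a pre-sized contiguous buffer instead of repeated small appends. A simplified sketch of that pattern (the types are invented for illustration, not Ceph's encoders):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Bump-pointer appender over memory that was sized once up front,
    // so per-field encodes do no bounds checks and no reallocation.
    struct Appender {
      uint8_t* pos;
      explicit Appender(uint8_t* p) : pos(p) {}
      void put_u64(uint64_t v) { std::memcpy(pos, &v, sizeof v); pos += sizeof v; }
      void put_u32(uint32_t v) { std::memcpy(pos, &v, sizeof v); pos += sizeof v; }
    };

    struct ObjectInfo {        // stand-in for object_info_t
      uint64_t size, version;
      uint32_t flags;

      static constexpr std::size_t encoded_size = 8 + 8 + 4;
      void encode(Appender& a) const {
        a.put_u64(size);
        a.put_u64(version);
        a.put_u32(flags);
      }
    };

    std::vector<uint8_t> encode_one(const ObjectInfo& oi) {
      std::vector<uint8_t> buf(ObjectInfo::encoded_size);  // sized once
      Appender a(buf.data());
      oi.encode(a);
      return buf;
    }

    int main() {
      ObjectInfo oi{4096, 7, 0x1};
      return encode_one(oi).size() == ObjectInfo::encoded_size ? 0 : 1;
    }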
Then pg_info_t: there was some overhead with that as well, but that's getting more into the higher-hanging fruit, I guess. So yeah, that's what I've got for the past couple of weeks while I've been gone. I don't know, Radoslaw and/or Adam?
B: My guess is it would be good to talk about some of the debugging facilities we have in the OSD that are affecting the main path. For instance, we have our wrappers over mutexes. They're responsible for lockdep, they're responsible for counting, and for providing a lot of our asserts with methods like is_locked_by_me().
What is interesting, when I'm running perf, is the difference between our Mutex wrapper and the pthread call that actually locks the underlying low-level mutex. It can be significant, especially on vstart when lockdep is enabled. However, even with that disabled, it's still not zero.
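A simplified sketch of the kind of wrapper being described (modeled on the idea, not Ceph's actual Mutex code), showing where the extra cost relative to the raw pthread lock comes from:

    #include <pthread.h>

    // Every lock() does bookkeeping around the raw pthread call; that
    // bookkeeping is the gap visible in perf between the wrapper and
    // pthread_mutex_lock itself.
    class DebugMutex {
      pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
      pthread_t owner{};
      int nlock = 0;
      const bool lockdep;   // only settable at construction time
    public:
      explicit DebugMutex(bool enable_lockdep = true)
        : lockdep(enable_lockdep) {}
      void lock() {
        if (lockdep) { /* record lock order, check for cycles */ }
        pthread_mutex_lock(&m);   // the only work a raw mutex would do
        owner = pthread_self();   // tracking for is_locked_by_me()
        ++nlock;
      }
      void unlock() {
        --nlock;
        owner = pthread_t{};
        pthread_mutex_unlock(&m);
        if (lockdep) { /* record release */ }
      }
      bool is_locked_by_me() const {
        return nlock > 0 && pthread_equal(owner, pthread_self());
      }
    };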
Moreover, it's turned on by default; it's on for almost all cases. There is nothing like the lockdep switch to disable this tracking globally. It's possible only by passing an additional parameter at construction, and it seems that most of the users don't do that.
It can definitely be quirky, because we have conditionals there based on time, so it could be that profiling affects the results as well. However, I started digging my way in with perf at some reduced frequency, and some overhead is still clearly visible.
There are a few players. First of all, those wrappers are locking and unlocking mutexes over and over. There is a call in the op tracker, get_duration, that is called very, very frequently and unfortunately takes a mutex and makes some comparisons on the last member of an std::vector, so there's a memory dereference there. Also, the comparison operator is far away from being optimal. I have some patches that switch from strcmp to the equality operator, which has an early exit based on size mismatch; unfortunately we are losing the information about the string size because we take a raw pointer to the C string.
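The point about the comparison, in miniature: std::string's operator== can exit early on a length mismatch because the size is stored, while strcmp() on the raw c_str() pointers has lost that information and must walk both buffers byte by byte.

    #include <cstring>
    #include <string>

    bool eq_fast(const std::string& a, const std::string& b) {
      return a == b;   // compares stored sizes first, then the bytes
    }

    bool eq_slow(const std::string& a, const std::string& b) {
      // The conversion to raw C strings drops the length, so there is
      // no size-based early exit; the scan runs until a mismatch or NUL.
      return std::strcmp(a.c_str(), b.c_str()) == 0;
    }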
Lockdep is disabled on my system. I had some profiling traces with an extremely high rate of CPU cycles burned there because of the mutex dance, and the mutex dance is extremely costly when you have lockdep enabled, like in all vstart runs without the no-lockdep parameter.
By the way, the profiling shows that on the write path we have significant overhead coming from the CPU front end. We are not only bound by waiting on memory, which is typical; we saw that in RocksDB, and we saw it in the write path. We are also constrained by front-end performance.
C-states were definitely disabled during the test. Of course it varies; it will vary with the processor, not even the microarchitecture but the specific model. Here, one CPU has eight megabytes of L3 cache and we fit in it; another one has half of that and we are lost.
A: I think we're starting to get there now, but there's so much going on, right, it's hard to narrow it down. Yeah.
Absolutely, absolutely. I mean, a lot of this, you know, in the hard-disk days, or even the hard-disk days with SSD journals, didn't matter. It's now that we're really starting to chase these high-performance deployments that all of this is starting to really make a big difference.
B: Okay, got it. Basically, each call to a function, each piece of code, regardless of whether it's coming from a shared library or from a translation unit that is part of Ceph itself, has to go through the PLT, which means one call and one jump before the actual code is reached. Those are indirect jumps that require resources from branch prediction and impose unnecessary pressure on the front end; maybe cache misses, maybe something linking-related, but we are paying the cost. Is there any good reason for that?
E: No, not really at all. The infrastructure is already there. It just needs tweaking of the global options: disable position-independent code globally, enable it locally, and add dependencies so that the pieces that are actually used in shared libraries are built as position-independent code. Nothing difficult.
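A rough illustration of the mechanics (this shows standard GCC/Clang behavior, not the specific Ceph build change):

    // plt_demo.cc: why calls inside one binary can go through the PLT.
    //
    // Compiled position-independent, a call to a default-visibility
    // function becomes an indirect jump through the PLT so the dynamic
    // linker may interpose it:
    //
    //   g++ -O2 -fPIC -shared plt_demo.cc -o libdemo.so
    //   objdump -d libdemo.so | grep plt        # shows helper@plt
    //
    // Avoiding the stub for internal calls: build internal objects
    // without PIC, pass -fno-plt, or hide symbols that need not be
    // exported, as below.

    __attribute__((visibility("hidden")))
    int helper(int x) {       // hidden: reachable by a direct call
      return x * 2;
    }

    int api_entry(int x) {    // exported entry point
      return helper(x) + 1;   // direct call, no PLT, no indirect jump
    }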
A: Okay, I was going to say, I can... yeah, I can try it out, since right now this version of pet store, I guess, is pretty good at uncovering CPU-related issues, or at least it has been so far for me. Are you testing it just with, like, BlueStore in vstart or something?

E: Yep.
A: This is just the base one that I applied those things to; it's actually based on the newstore revival branch that I had, so at some point this needs to be rebased on master and everything. But you can at least see pet store in there, which is not very interesting; it's really just like a memstore with some tweaks.
A: All right, well, yeah, I don't have anything else this week. Is there anything else, guys?
B: There are __attribute__ annotations, implemented in GCC and supported in Clang as well, that provide nice functionality. I mean here particularly the cold attribute: it allows you to request that the compiler put some code far away from your hot path.
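For example (a small sketch; both GCC and Clang accept the attribute):

    #include <cstdio>
    #include <cstdlib>

    // __attribute__((cold)) tells the optimizer this function is rarely
    // executed: it is laid out away from hot code, and branches leading
    // to it are predicted unlikely, keeping the fast path's i-cache
    // footprint small.
    __attribute__((cold, noinline))
    void fail(const char* what) {
      std::fprintf(stderr, "fatal: %s\n", what);
      std::abort();
    }

    int process(int fd) {
      if (fd < 0)
        fail("bad descriptor");  // cold callee, off the hot path
      return fd * 2;             // hot path stays compact
    }

    int main() {
      std::printf("%d\n", process(21));  // prints 42
    }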