From YouTube: Ceph Performance Meeting 2018-12-06
A: Some infrastructure related solely to debugging, to buffer tracking, to tracking c_str() calls, to tracking allocations down to their locations, unfortunately isn't cost-free. It appears we are spending quite a bit; just take a deeper look at the output from the profiler. The c_str() tracking in buffer::ptr is just one example. There are at least three things: tracking c_str() calls, tracking allocations, and tracking CRC-related stuff. All of those use atomic accounting and, of course, they are guarded with a conditional jump. But that's the situation we have. First of all, we have some branch mispredictions, and maybe the CPU is trying to speculatively execute those atomic instructions; of course they cannot retire, but I'm curious what the microarchitectural details might be. Does that mean we have cache-line ping-pong while the CPU is speculating?
D: So it just runs, you know; put it on whatever, yeah, anyway. That's a very good question; you can take a look.
It reminds me that I think we should consider killing the CRC caching too.
D: Yeah, maybe; I don't know if it would be usable. The intention is that we would avoid a double CRC when we first see things coming off the wire and then do a CRC check again on the data when we're journaling it in FileStore. That's the only time when you're seeing the same buffer twice with the same seed, and I think we care increasingly less about that FileStore case; it's just a bunch of crap in the code.
D: Yes, but it's done on a per-blob basis, so it's slices of the overall data, and so the granularity is usually 4K chunks, or in RGW cases bigger chunks; but I don't think it's actually hitting the cache in most cases, only exceptionally.
D: A very small write, if it's less than whatever the blob size is, which is like 512K or something. So if you put a small object, then it might hit the cache and get the same CRC, but that's a pretty narrow case, and I think it's also a small buffer to CRC in that case, so the impact is probably lower. Okay.
B: So, for 25421, there's a proposal to add a hard cap, a minimal OSD memory target, and then also to manipulate the cache max and the target value based on the mapped memory and fragmentation. I don't think we should actually do either of those, Sage, even though I know why they're in there.
B
I,
don't
think
so.
What
that
does
is
basically
increases
the
target
based
on
the
memory
base,
but
the
whole
idea
behind
those
two
things.
B
The
fragmentation
in
the
base
was
to
reduce
things
when
you're
near
the
upper
boundary
you
you
want
to
keep
the
memory
kind
of
below
the
target
at
all
times,
if
possible,
rather
than
letting
it
cache
shoot
over.
That's
kind
of
how
that
works.
What
his
PR
does
from
what
just
the
my
quick
look
at
it
makes
it
look
like
what
we're
doing
is
saying
well
if
the
targets
too
low,
let's
increase
the
target,
so
that's
kind
of
up
around
where
the
the
minimum
is
based
on
the
base
value
and
the
fragmentation.
B: Yeah, I'd like to know too. I mean, the one exception to it is setting the OSD memory cache min, right? You say: okay, we don't want the caches, in aggregate, to shrink below this value, because it will, you know, break things if it does, right? That's the exception that we make. I don't think we want to muck around with the target, because that's kind of what the user tells us they want.
B: I still wonder if we should also be looking carefully at the difference between sharding over column families versus sharding over databases, just based on other people that have done this kind of thing and have said that, at least on NVMe, sharding over databases is the way to go. But I'll defer to the people that are working on it.
D: Really interesting, sure. This has turned into basically a rewrite of the original module that John wrote, but it basically adds a couple of new pool properties. There's a pg_num_min that's optional, and there are target_size_bytes and target_size_ratio, also optional, that let the user basically tell the cluster how big it thinks the pool is going to get, so that even though the pool is empty, we can scale PGs based on what the eventual size is going to be.
D: It's sort of raising pg_num as we go; but in the absence of that, or incorporating that information, basically the system will look at each hierarchy of the CRUSH map that we're distributing data over, add up all the OSDs, and look at the target number of PGs per OSD to figure out how many PGs should be on those OSDs; and then it'll look at all the pools that are consuming those OSDs.
D: That's either off, which means it doesn't do any of that; or on, which means it will automatically adjust pg_num; or warn, in which case it will raise a health warning if it's off, if it wants to do something that it isn't doing. And then there's a new command that you run that basically prints a little chart showing all that information: the sizes, the target size if it's set, the ratio, the total capacity, the target ratio, the current pg_num, the target pg_num.
D
If
it
wants
to
change
it
in
the
mode,
so
you
can
just
dump
one
thing
you
can
see
sort
of
what
the
cluster
thinks
you
should
do.
You
can
either
do
it
or
don't
do
it.
You
change
the
mode,
that's
it.
So, the interesting thing I think from this perspective: I did a little bit of back-of-the-envelope math. Basically, if you create a pool with, like, one PG or whatever, you don't give it any information, and then you fill it up and write, say, a petabyte.
D: The question I wanted to answer was: how many times is it going to split, basically, and how much data is going to move when it splits?
What's the overhead of not tuning anything, versus telling it exactly how big it's going to get so it doesn't have to do any splitting or merging? And it turns out it doesn't actually really matter how many times it splits or makes adjustments.
D: That's sort of at the margin, because what's actually happening is that the amount of data movement is a series: 1/2 + 1/4 + 1/8 + 1/16 + ..., depending on how many times it splits, and in the limit that approaches 1. Basically, all data will move approximately once, which means that if you were to write a petabyte, it's going to write 3 petabytes, because it's triple-replicated, and then it's also going to move 3 petabytes, because every object is going to move approximately once if it's automatically managing everything.
D: So the nice result is that it is bounded. And if you want to do better than that, you can tell the system ahead of time how big the pool is going to be, and then we can avoid doing those adjustments.
D: Yeah, that's it, I think. The only thing to point out is also that when this makes adjustments, it just sets pg_num to what it should be, and there's already a piece of code in the manager that will basically make small adjustments to the pg_num; and it's throttled based on the percentage of degraded objects. They set a global threshold that you want no more than 5% of your cluster to be, not degraded, but misplaced; sorry, max misplaced, basically. That's a global setting.
C: Yes, I responded to some feedback, and thanks to Casey for pointing out that libfmt is already available; we've got that available now in RGW, and basically all the kind of gnarly append calls that I wasn't very happy about are now fmt::format, or rather libfmt. So it's ready for another look, if anyone has free time and wants to have some fun. That's all.
C: Where most of them were... and basically, you know, tracking them all down is just a bigger project, and I kind of picked this one partly because I had the blessing to do it and partly because I thought it would be a fair case study of what the typical usages and stuff look like. And sure enough, most of them are things we can probably replace with std::string, and for the most part the rest of them can be replaced with vector.
C: There was, I think, one case I ran into where I did, and Joelle pointed it out: I replaced a VLA with a vector, and I think the reason for that, as I recall, was that I had just written some string transform function in a way that wouldn't take that array. I can revisit that or leave it as is; it's really a 16-byte buffer in this case, so I don't think it'll matter.
C: It's passing whatever unit tests we have, and it's certainly ready for review. The kind of error that I would be suspicious of in this: it's possible I have introduced off-by-ones or something like that, just because juggling the null handling, especially when you're converting to a string, is a little tricky. I think I got it, but please keep an eye open for that.
B: Anything else we should look at here? I would again just mention, Sage, the shared, persistent, read-only RBD cache. We got new benchmarks where they tested cases where they didn't have as much cache as the aggregate volume size, and it was consistently better. Sometimes not nearly to the extent that the other tests showed, or it was just slightly better; but it doesn't look like it was ever worse in any of the tests that they did.
A: I would like to ask for review of the branch with the append buffer. Basically, it works, and it plays, I hope it will play, very nicely with the hypercombined bufferlist, because killing append_buffer means that I would expect the average value of the nref counter across the buffer::ptrs per buffer::raw to drop dramatically. And that's quite important, because hypercombined buffers have only one slot for raw content.
A: Also, there is another branch I just marked: I just removed the work-in-progress prefix and put the performance label on it. It's about optimizing atomic operations for buffer::raws; it can be useful for the case where a buffer isn't shared, I mean something like using a bufferlist instance just to do an encode or something like that.
D: I guess, yeah, I mean, yeah; but that seems like that's the default path. We should probably just remove XioMessenger itself from the tree at the same time. But I guess the question is: is this concept going to be useful elsewhere?
A: I'm proposing bufferlist as a kind of scatter-gather list implementation with shallow copy, and everything's fine. But at the moment we are paying: we are doing a lot of atomic operations even where there is no possibility of sharing. If somebody creates a bufferlist and calls append on it, it needs to create a new buffer::raw and the initial buffer::ptr owning that raw, etc. Unfortunately, even when they're freshly fabricated, buffer::raws are atomically refcounted by the owning ptr.
A: Well, bumping up the nref counter is done atomically, and it costs us dearly. That's the reason for introducing, for conveying, for adding ownership information to buffer::ptr's type system: meaning something like unique_ptr, but without its managing behavior.
A: The first commit is an optimization for ensuring that there is no buffer requiring copy-on-share, like the deep copy we do while copy-constructing a new instance of bufferlist: we were iterating over all our buffer::ptrs to verify whether that cloning is necessary or not, whether it was made with make_shareable. I moved some of those bits into the cloning, into the cloner of buffer::ptr; note we got that into the hypercombined thing, into the hypercombiner.
D: Anything else? I'll go ahead and review that first, but I mean, optimizing the reference counting seems like kind of a no-brainer. My only reservation is just maintaining the ptr-conversion stuff, or the shareable stuff, but we don't have to drop it now; we can always come back to that later. The other one, the logger one, the append-bench-and-buffer one, also looks good.
A: The idea is to make the assembly of the encode path short enough that the compiler wouldn't be prohibited from aggressive inlining. At the moment that's not the case: we have a lot of calls, even to append of buffers, that cannot be inlined because they live in the .cc file. Moreover, we are spending a lot of instructions, a lot of code, just on the call invocation: basically on preparing, on putting the arguments into registers, calling append, etc.
A: It's in the development branch. It requires a dependency on the linked append buffer to make this possible, yeah. Okay, and the same with the deduplication of zeros during append_zero: it's implemented; it is in the development branch, but the branch is pretty big, and I would love reviews.