From YouTube: 2016-SEP-07 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly
A: But if you didn't see it, it's really good — you can pack more. But after we're done with this, the other one that is really neat is the CRUSH optimization. It's a fun little way to get a nice performance boost for straw2, so that was exciting. Beyond that, there are a couple of other new PRs here. I haven't really looked too closely at the other ones, but I see a couple of different things. There's that 884 at the bottom, the one to use coarse time for the log — I'd really like to get that one in, actually, but I want Adam or somebody with some good C++ chops to deal with it, so it could be a runtime option. Okay, we talked about encode/decode, sure.
A: So in my quest to make all of this somewhat faster, I stumbled a little while ago on group varint encoding. Basically, instead of taking every single byte that you're walking through and using one of its bits to tell you that another byte follows — kind of chaining the encoding that way — the idea here is that you take a prefix, and in the prefix you store a certain number of bits saying whether or not a certain number of bytes follow for a given value. So, like, for 32-bit values you can store four length prefixes in one byte; for uint64 you could do two, or you can do five in two bytes. This is nice for a couple of different reasons. One is that it's pretty straightforward in terms of what the encoding and decoding look like, so it's quite a bit easier for the compiler. But in addition to that, one of the things we can do is take the 24-bit case and say, screw it, we're just going to use a 32-bit value for that, because it doesn't happen that often, and instead, in the prefix, encode that 0 means just a zero value. So now, if you have, say, a uint64 that actually has a zero in it, instead of encoding that in a full byte you can just encode the zero value right in the prefix — yep, three bits in the prefix. And you could do the same thing for a one: if you have a really common case where zeros and ones show up, but you potentially need the full 64-bit range — though usually that doesn't happen — then you could make a specific encode optimization that steals from the higher length codes.
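A minimal sketch of the variant described above, assuming four uint32 values per group, a one-byte prefix with a 2-bit length code per value, code 0 meaning "the value is zero, no payload", and the rare 3-byte case folded into 4 bytes. The function names and exact code assignments are illustrative, not Ceph's actual encoding:

```python
LENGTHS = (0, 1, 2, 4)  # payload bytes for each 2-bit length code

def encode_group(values):
    """Encode exactly four uint32 values as one prefix byte + payloads."""
    assert len(values) == 4
    prefix = 0
    payload = bytearray()
    for i, v in enumerate(values):
        assert 0 <= v < 2**32
        if v == 0:
            code = 0            # zero: encoded entirely in the prefix
        elif v < 2**8:
            code = 1            # one payload byte
        elif v < 2**16:
            code = 2            # two payload bytes
        else:
            code = 3            # four payload bytes (3-byte case folded in)
        prefix |= code << (2 * i)
        payload += v.to_bytes(LENGTHS[code], "little")
    return bytes([prefix]) + bytes(payload)

def decode_group(buf):
    """Decode one group; returns the four values and bytes consumed."""
    prefix = buf[0]
    pos = 1
    values = []
    for i in range(4):
        n = LENGTHS[(prefix >> (2 * i)) & 3]
        values.append(int.from_bytes(buf[pos:pos + n], "little"))
        pos += n
    return values, pos
```

Note there is no per-byte continuation-bit test in the decode loop — the lengths all come from the prefix, which is what makes this branch-friendly compared to a classic varint.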
A: The nasty thing about this is that there are all kinds of potential combinations of things you could do. Like, you could have maybe two 64-bit values and three 32-bit values that you want to encode — so how do you handle that smartly? I'm hoping that maybe with some template magic and overloading I could have it choose smart things behind the scenes and just kind of go off and do it. But maybe.
A: My test case right now is actually pextent_t in the existing BlueStore code, which is just a really basic one: it's a 64-bit and a 32-bit value. That's not an ideal case for this, which is one of the reasons why I'm using it here — to see if it's actually helping at all in a non-ideal case. But okay.
A: Agree with you there, Alan.
D: Most of the places that we're dealing with here — I mean, most of the literature is about encoding sort of random numbers that are uncorrelated. Most of our space optimizations are going to come from noticing correlations in the data and detecting those things: like, gee, there's no hole in the lextent array, so, you know, I don't need both the offset and the size — we need one of them.
D: If you look at that, I suspect it's going to be, you know, the old story: you can do micro-optimization or algorithm optimization, and usually the algorithm is the big winner, and I think that's clearly true here. I think we've got to pull all the tricks out of the bag to eliminate essentially redundant information in the data itself, and then, once that's done, we can worry about fancy ways to encode it. I suspect those will turn out to be secondary issues.
A: That's probably actually a good test case right now for this as it currently stands, because there's a bunch of, like, 64-bit values in there — you know, exactly — so if you can just walk through those, assuming that we needed or wanted to keep the current behavior, that would be a good case for this, right. Now that might be... yeah, well.
D: This might be a case where a custom merge operator gets you what you want, because you've only updated 3 out of 50 fields, so, you know, you record those three deltas and the rest are gone completely. I think the real issue is: do those fields really need to be in the PG stat, and what's their update frequency? The question really is: do the fields need to be updated on every transaction? That's the real core of the issue, and, you know, either they do or they don't.
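A toy sketch of the merge-operator idea above (this is not RocksDB's actual `MergeOperator` API, just the shape of it): instead of rewriting all fields of a stat structure, each update records only the fields it changed as a sparse delta, and a merge function folds the deltas into the base value on read or compaction.

```python
def apply_merge(base, deltas):
    """Fold a list of sparse {field: delta} dicts into the base stats.

    `base` is the last fully-written stat record; each entry of `deltas`
    carries only the fields one update actually touched.
    """
    merged = dict(base)  # leave the stored base record untouched
    for delta in deltas:
        for field, change in delta.items():
            merged[field] = merged.get(field, 0) + change
    return merged
```

So an update that touches 3 of 50 fields writes a 3-entry delta rather than re-serializing the whole structure, and the reader (or the store's compaction) pays the cost of combining them.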
D: But it would be nice to have. If you dig into it, I think it boils down to crash consistency: you know, if the stat is slightly off across a crash, that's either a problem or not a problem. If it's not a problem, then batching the things up and updating them, you know, every tenth operation, every 50th, is a pretty simple optimization that leaves all the rest of the code intact. But if it actually needs to be updated on every transaction for crash correctness, that's a different problem.
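A minimal sketch of that batching idea, assuming a slightly stale value across a crash is acceptable: the stat is updated in memory on every transaction but only persisted every Nth operation. The class and callback names here are illustrative, not Ceph's:

```python
class BatchedStat:
    """Accumulate updates in memory; persist only every Nth operation."""

    def __init__(self, persist, every=10):
        self.value = 0
        self.pending = 0        # updates since the last persist
        self.every = every
        self.persist = persist  # callback that durably writes the stat

    def update(self, delta):
        self.value += delta
        self.pending += 1
        if self.pending >= self.every:
            self.persist(self.value)  # one durable write per N updates
            self.pending = 0
```

After a crash, the persisted value lags the true one by at most N-1 updates — which is exactly the correctness question raised above.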
C: Exactly, yeah. Unfortunately, merge operators don't help, because we're several layers up the stack: this is an attribute on an object that's stored in the onode, so we can't really put it in its own custom key space without a bunch of, like, weird hackery. I think it's going to be sort of messy, but it's certainly doable to just separate the fields that update on every operation from the ones that don't. I think that's the way to approach this, and it's not going to be that bad.
C
But
it's
going
to
happen
up
in
the
OST,
but
even
when
we
do
update
the
part,
that's
not
frequently
updated
and
I,
don't
know,
I,
don't
know
if
it's
worth
it,
it
might
be
this.
The
group
encoding
might
help
there
I,
don't
know
that
it's
gonna
be
that
big
of
a
win.
If
it's
really
is
an
infrequent
update,
thanks
for
some
reparation,
sila
p
comment
anyway:
okay
well,
I
need-
maybe
that's
private.
That's
next
step,
though,
is
to
to
look
at
that
how
to
get
the
PG
stat
updates
separated
into
two
parts.
D: There's the atomicity issue — that's going to get super ugly, okay, and I suspect you're going to end up pushing that down fairly low in the code anyway. I'm just saying that it might actually be easier to do a merge operator, even though you might have more code to change. The complexity isn't terribly high, and it does support the old layouts.
C: The merge operator has to be, like, a clean summation over a vector or something in order to be sanely defined, and that structure is anything but. I mean, it does have a lot of uint64s, but they're buried and mixed in amongst a bunch of other objects that are variably sized, so it doesn't really lend itself to that yet.
A: I guess the question I have is: okay, we're talking about doing things like separating out things that need to be updated all the time versus things that don't, and going through all of that to figure out what does and what doesn't. Is it really less work to go through all that, or, you know, less work to see if some encode/decode optimizations are good enough for the time being, to kind of buy us some time? Yeah.
D: There are definitely some schemes you can do with the SIMD instructions that will — if you're willing to abandon, say, a minimum one-byte encoding; let's say that your minimum encoding is, say, two bytes for small values — you can do some things that are really fast with the AVX instructions. You know, where you create a byte mask that's one bit per byte, that tells you which ones are zeros or not. With a few SIMD instructions you can basically encode and decode — you know, again, with just a couple of instructions.
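A scalar sketch of that byte-mask scheme (the real win comes from doing this with SIMD mask/shuffle instructions such as a movemask; this plain-Python version only models the format): for each 8-byte value, emit one mask byte whose bit i is set when byte i of the value is nonzero, followed by just the nonzero bytes.

```python
def mask_encode(value):
    """Encode one uint64 as (mask byte, then only its nonzero bytes)."""
    raw = value.to_bytes(8, "little")
    mask = 0
    payload = bytearray()
    for i, b in enumerate(raw):
        if b:
            mask |= 1 << i      # bit i marks "byte i is present"
            payload.append(b)
    return bytes([mask]) + bytes(payload)

def mask_decode(buf):
    """Decode one value; returns (value, bytes consumed)."""
    mask = buf[0]
    raw = bytearray(8)          # zero bytes are implicit
    pos = 1
    for i in range(8):
        if mask & (1 << i):
            raw[i] = buf[pos]
            pos += 1
    return int.from_bytes(raw, "little"), pos
```

The minimum size is one byte of mask even for a zero value — the trade-off mentioned above of giving up a one-byte minimum encoding in exchange for branch-free SIMD decode.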
D: I suspect — you know, I'm wondering if it's actionable. So first of all, the existing varint itself could be sped up dramatically by unrolling the loop — first of all, guys, let's be realistic here — and that alone would be a huge performance boost relative to what's going on. But, you know, I guess what I'd say here is: have we examined it in the context of the sharded onode and the appender improvements, to understand...?
A: It did. So in the sharded map — Sage's sharded map code — with what I was able to get to run, because I can't get through a full test, unfortunately, due to it running out of memory. It still looks to me like potentially that will provide some speedup — maybe not to the extent that it did before, but we'll see. It was faster, but not by as much. Sorry — done, yeah. The appender, I think, will still provide some benefit.
D: No doubt of that. I think you have to assume that we're applying that, and I think you have to look at this particular question after all these things are done. I guess I'm skeptical that the low-level int encoding — once the current varint ugliness is fixed — is actually going to make that much difference in the greater scheme of things, because, you know, I think with the sharded encoding, cutting the shard size itself is the biggest way to go solve that problem.
C: I think we should — we just need to sequence these things: do all the stuff that we know we're going to do before we do the stuff that we're not sure is going to help. So I'll say: let's merge the sharding stuff. And obviously the current small-int loops can be unrolled; that seems like a no-brainer, and that's an easy baseline to do.
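As a sketch of that unrolling point: below is an illustrative little-endian base-128 varint decoder (not Ceph's code) written as a loop, next to an unrolled version that handles the common one- and two-byte cases without loop overhead. In C++ the unrolled fast paths are what help the compiler; Python only shows the structure.

```python
def varint_decode_loop(buf):
    """Classic loop: one continuation-bit test per byte."""
    value = shift = pos = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:        # high bit clear: last byte
            return value, pos
        shift += 7

def varint_decode_unrolled(buf):
    """Same format, with the 1- and 2-byte cases peeled off."""
    b0 = buf[0]
    if not b0 & 0x80:                           # 1-byte fast path
        return b0, 1
    b1 = buf[1]
    if not b1 & 0x80:                           # 2-byte fast path
        return (b0 & 0x7F) | (b1 << 7), 2
    # Rare longer encodings fall back to the loop for the tail.
    value, pos = varint_decode_loop(buf[2:])
    return (b0 & 0x7F) | ((b1 & 0x7F) << 7) | (value << 14), pos + 2
```

Both functions decode the same format; the unrolled one just turns the common cases into straight-line code.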
C: I would do that right away too, and then let's focus on getting the new framework — the one that includes fixing the appender stuff — in place. I think those are going to be the biggest wins, and then, after that, we should look at where our time is being spent, so we can invest our time sort of intelligently. Yeah.
A: A vector of stuff? No — I mean, there are, like, the internal stats in BlueFS that are doing that, but it's literally a vector that isn't even encoded, and it has a merge operator, so it's already cheap. So I don't see any case right now that isn't possibly — probably — already going to be addressed by some other change that we have in flight. Yeah. So I would wait — but unroll all the encoding.
A: I mean, no, that was actually kind of the route that I had gone with it. The optimization I had ended up with — which I didn't like that much, but it was okay — was actually switching to this model where you have a prefix in, like, the first two bits, for, like, a uint32 or whatever, and then using either six bits to fit it into one byte, or 14 to fit it into a uint16, or not.
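A sketch of that layout, assuming a uint32 with a 2-bit length code in the low bits of the first byte, so the remaining six bits of that byte already hold value bits (the exact codes and the full-width fallback here are illustrative choices, not necessarily the precise scheme being described):

```python
def prefix_encode_u32(v):
    """2-bit length code in the low bits; value bits packed above it."""
    assert 0 <= v < 2**32
    if v < 2**6:
        return ((v << 2) | 0).to_bytes(1, "little")   # 6 bits, 1 byte
    if v < 2**14:
        return ((v << 2) | 1).to_bytes(2, "little")   # 14 bits, 2 bytes
    if v < 2**22:
        return ((v << 2) | 2).to_bytes(3, "little")   # 22 bits, 3 bytes
    return bytes([3]) + v.to_bytes(4, "little")       # full 32-bit fallback

def prefix_decode_u32(buf):
    """Decode one value; returns (value, bytes consumed)."""
    code = buf[0] & 3
    if code < 3:
        n = code + 1
        return int.from_bytes(buf[:n], "little") >> 2, n
    return int.from_bytes(buf[1:5], "little"), 5
```

Like group varint, the length is known from the first byte alone, so there is no per-byte continuation test; unlike group varint, each value is self-contained rather than packed four to a prefix.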
D: That, I think, is where we're going to find the real win, because I suspect you're going to find, for example with a disk address, that your disk addresses are almost always going to be five bytes — almost always. You're not going to get any value from even supporting the case of something that's three bytes or two bytes; it's just a waste of encode expense.
D: Which plays to the current scheme of looking at it on a per-field basis, rather than trying to combine multiple fields with some kind of template wizardry — which I think could certainly be done. I'd like to — you know, again, I think there's a fair investment of energy there, and I'd like to see us get these other things out of the way before we go tackle that. Which is to say, it'll keep — that work won't go away.
C: Varada just sent me his latest branch. It's not building yet; he had some questions. I'm going to review that next, and probably get Sam to take a look at it too, because I think that's probably going to be the next big step. But I think in the meantime, the thing that we can do that is a known good win, or whatever, is just unrolling those loops. Well, yeah.
C: Okay — okay, okay.