Ceph Performance Weekly, 22 Mar 2017

From YouTube: 2017-MAR-22 :: Ceph Performance Weekly

Description

Weekly collaboration call of all community members working on Ceph performance.

http://ceph.com/performance

For full notes and video recording archive visit:

http://pad.ceph.com/p/performance_weekly

A

Good morning, everyone.

B

C

How are you doing?

B

I'm perfectly fine, thanks. And you? Good, good.

A

It's finally starting to get warm here, so that's nice.

B

We have absolutely the same in Poznań. Yesterday was terribly rainy; however, now the world is loaded with some nice weather. Really very, very good.

A

All right, so I think we're going to have fewer people today, since the Vault conference is this week in Boston and a number of folks are out there, but let's get started anyway. So, let's see, we have a lot of new pull requests this week.

C

Big one is a cleanup of Igor's.

A

PR, actually, that was based on an earlier PR, I think from Ma Jianpeng, to separate the kv sync thread in BlueStore into two parts. As we saw last week in Igor's slides, it looks like that has a pretty substantial impact on performance. Unfortunately, it is not compiling right now, at least with certain versions of GCC. Possibly it works with newer ones, but we need to refine that a little bit more before we can start testing it.

A

Let's see, there's been some ongoing zero-copy work for RDMA in the async messenger, and there's a new one from Haomai there, and there should be, I think, a couple of additional ones coming.

A

Let's see, other random ones here. There's a very big one for explicitly remapping PGs.

A

This is really, really exciting, because it lets you, on a per-pool basis, say that a distribution looks bad or doesn't look optimal after the PGs have already been mapped. For that pool, you can go in and say: okay, now I want to remap these PGs in some kind of user-specified way. And there's a little script,

A

I think one that Sage wrote, that will go through and actually look at how poorly distributed the PGs in the pool are, and then basically rebalances them so that you have a perfectly even distribution. So really, really exciting.
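Editor's illustration: the rebalancing idea described above can be sketched as a greedy loop that moves PGs from the fullest OSD to the emptiest one and records each move as an explicit override. This is a toy under stated assumptions, not the script mentioned in the call: each PG maps to a single OSD (no replicas, no CRUSH), and the names `rebalance` and `pg_counts` are hypothetical.

```python
from collections import defaultdict


def rebalance(pg_to_osd, osds):
    """Greedy toy rebalancer (hypothetical, not Ceph's script).

    Returns a dict of explicit {pg: new_osd} overrides that evens out
    the number of PGs per OSD. The input mapping is not modified.
    """
    by_osd = defaultdict(list)
    for pg, osd in sorted(pg_to_osd.items()):
        by_osd[osd].append(pg)
    for osd in osds:
        by_osd.setdefault(osd, [])

    overrides = {}
    while True:
        fullest = max(osds, key=lambda o: len(by_osd[o]))
        emptiest = min(osds, key=lambda o: len(by_osd[o]))
        # Stop once the spread between OSDs is at most one PG.
        if len(by_osd[fullest]) - len(by_osd[emptiest]) <= 1:
            break
        pg = by_osd[fullest].pop()
        by_osd[emptiest].append(pg)
        overrides[pg] = emptiest
    return overrides


def pg_counts(pg_to_osd, overrides, osds):
    """Count PGs per OSD after applying the explicit overrides."""
    counts = {osd: 0 for osd in osds}
    for pg, osd in pg_to_osd.items():
        counts[overrides.get(pg, osd)] += 1
    return counts
```

The point of the sketch is that the original mapping stays untouched and only a small table of exceptions is stored, which matches the "remap after the PGs have already been mapped" idea discussed in the call.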

A

We've needed something like this for a long time. I had been thinking of something different, along the lines of making it so that you could have different CRUSH buckets occupy a place in multiple hierarchies at the same time, which I think could have accomplished something like this. But frankly, I think Sage's solution is probably cleaner and nicer in a lot of ways. So yeah, it's very, very exciting.

A

What else here? Apparently we had forgotten to set the min alloc size for SSD in BlueStore to 16 KB. We had planned to; we thought we had done that last fall, but apparently we had forgotten to or something, so we just did it now. It might be that we can reduce that down to 8 KB or 4 KB as we reduce the metadata load in RocksDB, or if we switch over to something else like ZetaScale, but at least for the moment we're bumping that to 16 KB, just because there's too much metadata, both in memory and in RocksDB.

A

Otherwise, let's see, there's been a lot of discussion on this one for the extent mapping code. I think Igor has been doing a lot of review on that. I'll admit I haven't actually looked at it really closely yet.

A

The fix for deferred writes PR merged last night. In some cases it seems to be maybe improving performance a little bit, but in other cases it might actually be a little slower. It does also fix a whole bunch of issues that Sage noticed once he started looking into it, so it's definitely needed. We probably should try to understand why the performance is not universally better, so I'll probably look into that a little bit more. And then this RCU locking one was by Dan Lambright a while back.

A

It turns out that it actually doesn't appear to be very much faster, or any faster, than our current locking mechanism, and that has all been moved out of the hot path now, from what I understand, so it might not really matter in the end. Anyway, there was a note in the PR that some folks are trying to resurrect some code that the CohortFS guys had for instrumenting the OSD to get benchmark or latency stats out of certain parts of the code. So that would be a nice effort.

A

If that happens, then we could at least try to start better understanding where we're spending time here. The LTTng work also might kind of provide the same thing, but I haven't heard much about what's going on with that these days, so anyway.

D

I think it was two parts. One part is, like you were saying, part of the idea was to introduce a whole lot of LTTng tracing into the OSD, and then also to have the direct messenger / MemStore thing, so you could basically just have the OSD hot path more or less by itself, without confounding variables, and then just focus on finding latency in those areas, with one on the top and one on the bottom, more or less. Oh, fantastic.

A

Yeah, that's great. I think, at least in my opinion, we've kind of needed to do that kind of thing for a long time: separate out the code as much as possible, so you can just do micro-benchmarks like this. Very good. Any idea, Adam, do you think that's going to happen anytime soon, or is it...

D

So it is in our interest. Right now we're working on trying to finish up a bunch of RGW-related stuff, but we are going to be trying to move into improving RGW performance, and I think we want to focus on improving OSD performance as a means to improve RGW performance. It's not like we're starting it right now; we've been thinking about it, but we are sort of hoping to reallocate people and push on that effort

D

when we can. Yeah, that's probably going to be some time after Luminous drops, I would

A

guess. Honestly, in my opinion, that was kind of a big reason for BlueStore. With all these things that we look at when we look at RBD, and especially RGW performance, a lot of it is actually the OSD, so yeah, I think you're on the right track there, guys.

B

If I may interrupt on the matter of RADOS Gateway performance, I guess I can add something. I personally think that RADOS Gateway is the type of guy that simply orchestrates data transfers between two file descriptors. One descriptor is dedicated to the RADOS layer, towards the OSDs. The second one is for the HTTP client. At the moment we are doing a lot of bulk memory copies there, and maybe we could...

B

We could avoid that. Adam Kupczyk some time ago made a pull request bringing a buffer::raw implementation built on top of standard Linux zero-copy facilities. I mean here pipes and syscalls like splice and tee.

B

This is quite fine, because at this moment we are writing a proposal to ceph-devel about extending the interface of bufferlist to get even more savings during the management of data transfers. I think this might be quite interesting in many situations. RADOS Gateway tends to move a lot of data. Maybe pulling that data from the kernel only to put it back is something we could try to avoid, I think.

D

It's a very interesting idea; I'd certainly like to see it. The part that I had been wanting to look at the most myself, since I'm aware of a bunch of inefficiencies, from allocation to having very large critical sections where a bunch of stuff happens, is the OSD client. I was also hoping to both improve the implementation

D

there and come up with sort of a variant interface that would be a bit less allocate-y, basically, for lack of a better word.

B

Eradication of dynamic memory allocations, yes.

B

Some time ago there was even another initiative related to that: static pointer. Actually, I would say the code is in place, but to even start thinking of productizing this concept, we will need to write a lot of unit tests.

D

Well, I actually have a bunch of unit tests for something similar, but for a different purpose in a different project. But static pointer is an interesting idea, and it works very well for the polymorphism in things like the allocator case. For most of the improvement stuff I was thinking about in the OSD client interface per se, I

D

don't think that we actually have the kind of polymorphism that would necessarily require that, as far as I can recall, but it does make a lot of sense in some of the other parts of RGW that I'm aware of. Okay, got it. I would certainly be happy at some point to help try to get static pointer production-ready.

B

Might be interesting, but it has to be checked against a lot of different cases. Yes.

C

So so even it was all of that work and.

A

I do think this can be really good, especially since it seems like we just burn through CPU in RGW to do anything. The other things that I've been seeing, at least, that point more to the OSD side are the time spent in bucket index updates, which specifically relates to how quickly

A

we can do reads and writes, and then also more FileStore-specific things, although that's also kind of a filesystem issue: the way that we split PGs, and the number of objects that get created, especially if you look at something like erasure coding with RGW, where you have lots and lots of chunks getting created, potentially with smaller objects. In RGW it's just really nasty. Do either

D

of those persist in the brave new world of BlueStore? Because I was under the impression, from other episodes of this performance call, that we didn't want to spend too much time working on FileStore-specific stuff. I mean, if we do, that's fine, I just thought we were... I know, yeah.

A

We're not fixing these in FileStore, to be honest. In BlueStore those issues are not nearly as bad. Well, I should rephrase that: in BlueStore some of those issues basically just go away, specifically the PG splitting issue, but we do inherit new ones. Specifically, as the amount of metadata in RocksDB increases, you have a lot of compaction overhead, and so we'll just need to be mindful of the fact that, potentially, if you've got lots of

A

omap going on, RocksDB could start becoming a problem. And in general, if you have lots of objects and lots of extents and other things, you potentially could have RocksDB again becoming kind of a bottleneck. So in my mind, there's all the work that's going on with trying to reduce the amount of metadata in BlueStore, anything that we can do to reduce metadata size. I know Igor has been working on that, and folks from SanDisk have been looking into that too.

A

That's all a very, very good effort. I think we will see the benefit from that both in terms of small random RBD writes and also as you get lots and lots of RGW objects. I think we're going to see that pay off. But anyway, that's my take on it.

D

From what you're saying, it sounds like anything that makes a whole lot of heavy omap use could degrade the performance of all aspects of BlueStore, just by forcing these sorts of compaction. I think so. I suspect

A

that will be the case.

D

I know one thing that Matt has been talking about was wanting to try to partition out either very, very large sets of omap prefix keys, or something like that, into a separate database. I don't know how that would handle consistency, or if there was some way that we could hint it to basically have multiple compaction domains, perhaps. I'm not an expert on RocksDB.

A

So what you're kind of talking about there is column families, where basically you share a write-ahead log, but then you split it off into separate sets of SSTs, essentially, for each domain that you have. And we did play around with that a little bit.

A

There was actually a PR where we were doing it, and it worked. I don't know why we ended up never really merging it, but that's totally doable, and it might be that that actually does help us in the long run, especially in situations like you just mentioned, where we get tons of omap traffic.

D

It sounds like, well, I'd have to look at the key schema, but it sounds like if there were some way, somewhere in the OSD or in the method calls that we have there, to have some sort of hinting to be able to create a column family for a part of the key space that we expect to become, say, a large bucket index, that might actually be good.

D

So we can have different bucket indices, or a bucket index and the general BlueStore metadata, not interfere with each other. Yeah.
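Editor's illustration: the "separate compaction domains" idea from the exchange above can be modelled with a toy store in which keys matching a hinted prefix go to their own family, each with its own memtable, flushed runs, and compaction counter. This is only a conceptual sketch and does not use or imitate the RocksDB API; `ToyFamily`, `ToyStore`, and the key names are hypothetical.

```python
class ToyFamily:
    """One compaction domain: a memtable plus a list of flushed sorted runs."""

    def __init__(self, memtable_limit=4, max_runs=2):
        self.memtable = {}
        self.runs = []  # each run is a sorted list of (key, value) pairs
        self.memtable_limit = memtable_limit
        self.max_runs = max_runs
        self.compactions = 0

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush the memtable as a new sorted run.
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}
            if len(self.runs) > self.max_runs:
                self._compact()

    def _compact(self):
        merged = {}
        for run in self.runs:  # oldest first, so newer values win
            merged.update(run)
        self.runs = [sorted(merged.items())]
        self.compactions += 1

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):  # newest run first
            for k, v in run:
                if k == key:
                    return v
        return None


class ToyStore:
    """Routes keys to a family by hinted prefix; everything else is 'default'."""

    def __init__(self, hinted_prefixes):
        self.families = {prefix: ToyFamily() for prefix in hinted_prefixes}
        self.families["default"] = ToyFamily()

    def family_for(self, key):
        for prefix, family in self.families.items():
            if prefix != "default" and key.startswith(prefix):
                return family
        return self.families["default"]

    def put(self, key, value):
        self.family_for(key).put(key, value)

    def get(self, key):
        return self.family_for(key).get(key)
```

In this model, heavy writes under a hinted bucket-index prefix trigger compactions only in that family, while the default family holding other metadata is left alone.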

A

Yeah, you know, I would think that will probably help with the locking issues, because as I understand it, in RocksDB you still only have a single thread for level 0 compaction, and then all the other compaction, level 1, level 2, whatever, can happen in other threads. When you have column families, I would think, though I don't really know, that

A

you then have separate level 0 compaction threads for each column family, so I think you're better off in that way. But the other big thing that we see is just that you have so much write amp in general with RocksDB that you pretty quickly overwhelm the device, at least during those compaction cycles.

A

So it might end up being that the biggest limitation is write amp and read amp, that you're just kind of hammering the device. That's kind of why the SanDisk guys were so vigorously looking at trying to get ZetaScale working, just because you're not doing that kind of compaction anymore.

D

ZetaScale is another KV database, or...? Yes,

A

yes, B-tree based, if I remember right. All

D

right, thank you. Yep.

A

So yeah, anyway, lots and lots of stuff. What else is on here? I think there are a couple of other things in here that I don't know too much about: this crc32c for PPC architectures, it looks like Kefu reviewed that, and then Sage is now testing this rm_range_keys operator interface for RocksDB, which hopefully is better. I don't really remember, I guess, but RocksDB has some weirdness regarding some of these interfaces, where they're slower than they should be.

A

So either this is faster than the alternative, or it's slower than the alternative; I don't remember which. But anyway, I guess Sage is testing it now. All right, so I think that's it for pull requests.

A

It looks like builder answered James's question here in the chat about what Sage's PG remapping work is, and yeah, to me at least, that's super, super exciting. This is something that we've needed a solution for for years, and it ended up being way simpler than I guess any of us ever realized. So yeah,

A

it's pretty amazing. I was hoping that Sage would be able to join to talk about it, but he's not able to make it; he's at Vault. The other thing I was hoping to maybe get Sage here for would be to talk about the deferred write changes. I did go through, and I think I mentioned I tested that, and it is not really doing quite what we maybe hoped we'd get out of it, but there's more testing to be done on that. I think that's basically all I have this week.

A

Would anyone else like to jump in with anything?

B

One thing, basically touching the same area we were talking about last week: the data structures we are using in BlueStore. We had some discussion basically related to sequential structures, which require some copy operations, some memory moving on writes, versus dynamic structures like red-black trees and similar stuff we have at the moment. I was looking for some benchmarks and found a very interesting page. Guys, just take a look at that.

B

I already put the link in the chat, and I intend to also post it at Friday's BlueStore meeting, but I guess it could be interesting for a broader audience. What is interesting is that using sequential, memory-copy-based data structures for handling small data items, like 24 bytes, which is what our extent instance is, might be faster even on random insert. That's terribly interesting, I think.

C

I don't see any graphs. Yes, I know. The

B

page is pretty old, unfortunately. The script, I guess the script for making those graphs, is located outside the HTTPS domain, so you will need to tell your browser to get content from an untrusted source. In Chrome I have a small icon on the right of the address bar: "This page is trying to load scripts from unauthenticated sources", with "Load unsafe scripts" right under it. Yep.

A

Yep, you're right, that fixed it.

B

Guys, please take a look, especially at the random inserts. It's terribly interesting.

B

For small items, there is nothing better than that simple, really stupid structure preserving memory continuity. This is because nowadays CPUs are all about cheating. Main memory is slow, so you have multiple levels of caches. You have protected mode, which requires a lot of address translations, a lot of page walking, but you have TLBs: instruction TLB, data TLB, shared TLB and other stuff. All about cheating.

B

Unfortunately, when someone is using dynamic structures that scatter data around the whole memory, the prefetchers, for example, disable themselves and you are losing cache assistance. You are alone with the terribly slow main memory. I guess that's the case of encode_some. I was poking at it for some time, without changing data structures.

B

I think that most of the overhead we actually have after the encode_some pull request is related directly to data structures, to the memory arrangement of data. We have a lot of misses

B

on L3. I think that data prefetchers at this level simply disable themselves because they observe a pattern of memory accesses that they consider a random one. I know that Adam Kupczyk is also working on the decode path on reads and observed similar behavior.

E

I can corroborate this. I took the previous patch for encode_some and tried it. He introduced there some kind of slab allocators related to each onode, so the extents were kept close together. I tried to play with that and make one big slab allocator for the entire BlueStore, and that caused it to allocate extents in definitely random places after some time, and I've seen performance degrade tenfold. So I guess

E

that leads to the conclusion that we have a problem when we don't have continuous, sequential access to memory in data structures. Yeah.

A

B

Guys, this is especially true in the case of extent_map_t. extent_map_t is a typedef over boost::intrusive::set, and boost::intrusive::set is based on red-black trees. We are using this powerful machinery for just a 24-byte-long data structure, including the padding between members. I would

C

B

guess that switching to something really stupid, with terrible theoretical complexity, could be a better way than using those fine algorithms, balanced trees.
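Editor's illustration: the "really stupid" contiguous structure discussed above can be sketched as a sorted flat array with binary search, where an insert shifts the tail (a memory move) instead of rebalancing a tree. This is a sketch under assumptions, not BlueStore's extent map: extents do not overlap, fields are plain unsigned integers, and the name `FlatExtentMap` is hypothetical. Python's `array` module keeps the values packed in one buffer, but the sketch shows the shape of the structure only and says nothing about actual cache behavior in C++.

```python
from array import array
from bisect import bisect_left, bisect_right


class FlatExtentMap:
    """Extents kept in parallel, contiguous, sorted arrays (hypothetical sketch)."""

    def __init__(self):
        self.offsets = array("Q")  # logical offsets, kept sorted
        self.lengths = array("Q")
        self.blobs = array("Q")    # blob ids

    def insert(self, offset, length, blob_id):
        # Binary search for the slot, then shift the tail: O(n) memory move,
        # but on one contiguous buffer.
        i = bisect_left(self.offsets, offset)
        self.offsets.insert(i, offset)
        self.lengths.insert(i, length)
        self.blobs.insert(i, blob_id)

    def find(self, offset):
        """Return (offset, length, blob_id) of the extent covering offset, or None."""
        i = bisect_right(self.offsets, offset) - 1
        if i >= 0 and offset < self.offsets[i] + self.lengths[i]:
            return (self.offsets[i], self.lengths[i], self.blobs[i])
        return None
```

The trade-off is the one raised in the call: insertion is linear in the worst case, yet for small items and modest sizes the sequential access pattern may still win over pointer chasing in a balanced tree. Only measurement can settle that.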

A

I think it's at least worth trying.

B

We plan to take care of that. At the moment I'm finishing the authentication rework in RADOS Gateway. I promised that to Matt, I promised that to Yehuda, but I'm almost done with it. I will switch back to BlueStore very, very soon. Okay.

A

Oh, one thing, maybe, since you guys are looking at this closely right now: what do you think the possibility is of being able to encode multiple varint values at once, so that we can start doing SIMD encode and decode? Yeah. That's been bugging me for a long time.

B

Very good idea, but this means we would also need to change the serialization formats we have right now. But it's certainly not a problem: we can just simply bump the version by one and preserve the current code for the legacy path. Yeah, we can try. The idea is to serialize blobs first, then extents, rather than mixing between them.

D

B

D

Have situation.

B

Yeah, I guess it's quite a reasonable idea.

A

If we do that, not only can we do SIMD, but then we can also use a better encode/decode format than varint, I'd guess, one that has both lower CPU overhead and slightly better encoding.

E

So yeah, well, I actually tested the performance of our varint decoding, and I was surprised when we had a huge amount of varints to decode. I extracted them by logging all the varints that were decoded by BlueStore during a test. I extracted 45 million varints into a file that was 100 megabytes, and on my not very powerful machine

E

they all decoded in 200 milliseconds. So I'm not really sure that, when we decode them sequentially and just reuse all the machine caches and that kind of thing, we will have to rework the varint format.
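Editor's illustration: the kind of test described above, decoding many varints back to back from one contiguous buffer, can be sketched as follows. The encoding here is a generic 7-bits-per-byte scheme with a continuation bit (LEB128 style); it is an assumption that this resembles the format discussed, and the function names are hypothetical. Timings from Python say nothing about the C++ numbers quoted in the call.

```python
import random
import time


def encode_varint(value):
    """Encode a non-negative int, 7 bits per byte, high bit = 'more follows'."""
    if value < 0:
        raise ValueError("only non-negative values are supported")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_all(buf):
    """Decode every varint in buf, walking the buffer front to back."""
    values = []
    pos = 0
    n = len(buf)
    while pos < n:
        result = 0
        shift = 0
        while True:
            byte = buf[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                break
            shift += 7
        values.append(result)
    return values


def time_decode(count=100_000, seed=0):
    """Build one buffer of random varints and time a sequential decode."""
    rng = random.Random(seed)
    values = [rng.randrange(1 << 32) for _ in range(count)]
    buf = b"".join(encode_varint(v) for v in values)
    start = time.perf_counter()
    decoded = decode_all(buf)
    elapsed = time.perf_counter() - start
    assert decoded == values
    return len(buf), elapsed
```

Note that each byte's continuation bit decides whether the next byte belongs to the same value, which is why decoding one varint stream is inherently serial and why batching several values was raised as a precondition for SIMD.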

C

E

Decoding, not encoding. Maybe encoding is broken, but...

A

E

A

Was that 45 million in 200 milliseconds?

E

Yes, 45 million in 200 milliseconds. Well, actually, I think I could share this test, because I made it a part of a unit test for denc, I guess. Okay.

B

Adam, have you had any chance to run those tests under perf? I would love to take a look at the IPC factor coming from there. I suppose it's much more than one, maybe three or even more.

E

Well, no, I didn't, but I can. I ran it on my laptop, so I don't

B

E

B

have PMU access from this machine yet. But guys, I think nowadays CPUs are really, really powerful beasts. They are superscalar, able to execute many instructions per cycle. All you need to do is to feed them with data, and dynamic structures are bad in that matter.

B

Adam, in your tests, is the data contiguous in memory? Yes,

E

I'm reading from one huge buffer pointer. Okay.

B

E

A

Well, I don't disagree with anything you said. It all seems very reasonable to me, if we can make it work.

B

I hope so, but you know, at the moment that is only a theory. We definitely want to have code, to have changesets, and verify them in practice. Yeah.

A

And I won't speak for Sage, I guess, but there's always the concern, too, that it is much more difficult to follow what's going on. Right now we have all these things in memory that sort of make sense, and with a lot of the things that we're talking about, potentially, maybe not, but potentially, it could be a bit harder to understand what's going on.

A

A mutant not sure pure insistently I.

E

Rather, keep aside to show okay.

B

A

B

I was muted, sorry. No.

A

All right, well, does anyone have any other things that they'd like to talk about this week?

A

Don't see nikka and I was going to see if he had any updates for us. But why not.

A

All right, well then, have a good week, guys, and hopefully next week I will have the multi kv sync thread stuff worked out, or Igor's split kv sync thread working better, and we can get some feedback on that. And then, by the way, I think I mentioned to you, I did some testing on your bitmap allocator work and it definitely is improving things.

A

So I'd like to see us get that merged and then also get that tested with Igor's split kv sync thread stuff, and just kind of see where we land after that, and then hopefully get some more perf results at that point, to get a better understanding of what the lay of the land looks like.

B

Cool. We haven't finished with the allocator yet. We also want to make the internal structure, at least of the bottom layers, which are very heavily populated with data, contiguous in memory, because right now we are jumping from area to area just to take a look whether the place is empty or not. Yes.

A

Yes, I think that work you're doing on the bitmap allocator has been very fruitful, so continuing to work on that will be good. But so is the encode/decode work, so you're hitting all the right areas.

B

By the way, Mark, there are some controls you can feed the system to show or hide some particular costs we have. For example, if you want to test the memory allocator overhead itself without too much cost of traversing data structures in encode/decode, then going with smaller RADOS objects is reasonable. Small, I mean here around 4 kilobytes each.

B

However, if you want something completely opposite, if you want to expose the overhead of encode_some, you can try using small writes, 4 kilobytes each, across pretty big RADOS objects, like 4 megabytes long. Yeah.

A

B

In fact, there is a shift of importance between those bottlenecks we have at the front end of the pipeline. Yeah.

A

Yeah, and at least from all the testing that I've done, those have been the two things that always stuck out at me: the bitmap allocator locking issues, which you jumped right in on, and then also the effect of encode_some and all the associated things that come with our metadata overhead, ranging from memory scattered all over the place to just the amount of metadata we're shoving into RocksDB and how much work RocksDB ends up doing because of it. So, you know, you're doing good; keep up the good work.

C

Thanks, we will. Very good, all right.

A

B

Looking right now at Adam's changes: small, but very fruitful. Adam, just speak for yourself. Well,

E

actually, I'm trying to squeeze the structures in BlueStore a bit, and I did two, I mean, well, actually three things. One is that I changed decode varint and got rid of checking whether we are going out of the input buffer with each iteration.

E

So now it's just checked after decoding the varint whether we somehow consumed an additional superfluous byte, and then we throw; we're not checking and throwing on every byte. I also realigned it a bit, so it got shorter. That was number one. Then I squeezed out of the extent map the reference to the onode, because all the extent maps that we have are always located in an onode.

E

So it's possible, when you operate on an extent map, to actually locate where you are without that additional pointer, which was a bit confusing for the CPU, I guess, when it operated.

E

So now we just subtract the bytes and find where it was. That's one thing. And the second one: I made lazy creation of shared blobs. Now, when you create a blob, you just have a blob, and the shared blob is created later, the first time you actually need to do something with it. That made about an eight percent improvement in random write tests using fio, so I was satisfied with that.
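Editor's illustration: the lazy creation described above can be sketched as a blob that allocates its shared part only on first use. This is a minimal sketch assuming a simple two-class model; `Blob`, `SharedBlob`, and `ref_map` here are hypothetical stand-ins, not the BlueStore classes.

```python
class SharedBlob:
    """Stand-in for the expensive shared part; counts how many were built."""

    created = 0

    def __init__(self):
        SharedBlob.created += 1
        self.ref_map = {}


class Blob:
    """Blob whose shared part is created only when first accessed."""

    def __init__(self):
        self._shared = None  # nothing allocated up front

    @property
    def shared(self):
        if self._shared is None:
            self._shared = SharedBlob()
        return self._shared
```

With this shape, creating many blobs that never touch their shared part costs no extra allocations, which is the saving the speaker attributes the improvement to.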

A

The first part of what you just said, about varint decode: do I understand correctly, you're saying that previously we were checking, every single time through the loop, to make sure that we weren't exceeding the buffer? Yeah. Well, that's

E

the current implementation of decode varint: with each byte we take from the input stream, we check whether we're about to throw an exception, because we throw an exception if we try to read after the last accessible byte. Oh yeah, okay.

A

E

I moved it outside. I mean, the small side effect would be that if, for some reason, our bufferlist was corrupted and we try to read one byte after that corrupted bufferlist, then instead of throwing that we are out of data, we could get a segmentation fault. But it can happen only if you are decoding broken data.
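Editor's illustration: the change described above, moving the bounds check out of the per-byte loop, can be sketched with two decoders. This is a sketch under assumptions: a generic 7-bits-per-byte varint, a logical end position `end` that may be smaller than the underlying buffer, and hypothetical names (`OutOfData`, `decode_varint_checked`, `decode_varint_check_once`). In Python, running past the real end of the buffer raises `IndexError`, which stands in here for the segmentation fault mentioned in the call.

```python
class OutOfData(Exception):
    """Raised when decoding runs past the logical end of the input."""


def decode_varint_checked(buf, pos, end):
    """Bounds check on every byte, as in the 'before' variant."""
    result = 0
    shift = 0
    while True:
        if pos >= end:
            raise OutOfData("read past end of input")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_varint_check_once(buf, pos, end):
    """No per-byte check; a single check after the value is decoded."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]  # may read beyond `end` on broken input
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    if pos > end:  # detect the overrun after the fact
        raise OutOfData("read past end of input")
    return result, pos
```

On well-formed input both variants return the same value and position. On truncated input the second variant only reports `OutOfData` if there happen to be readable bytes past `end`; otherwise it fails harder, which is the trade-off being weighed in the discussion.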

C

E

That's a trade-off that I would like to take note of, and if that's acceptable, then maybe we can do this. I think that we can go with having a segmentation fault instead of throwing an exception, especially since we actually don't catch them very much.

B

Yep, even if we would catch it, I guess the obvious decision is to just abort the whole program, the whole OSD process.

C

B

C

B

C

B

Segmentation faults, I think... so having a segmentation fault, especially if you provide a handler for it like we do, is not so bad. I would say it's comparable to just aborting. Well,

E

actually, it would be easier to debug, because when you try to debug from the point of catching the exception, you are actually nowhere.

A

Well, ultimately we'll run it by Sage and see what he thinks, but yeah, sure.

B

Anyway, guys, I think we have two separate items we can work on: the write path and the read path. At the moment, the read path is not so well optimized, but we typically don't see it because of the presence of huge metadata caches. However, in production deployments, having such large caches might not be acceptable. I suppose that in production people tend to have many, many OSDs, even based on NVMe devices,

B

on the same node. So I would say that

B

we cannot spend thousands of CPU cores and gigabytes of memory to overcome some issues we have in lower layers. We will need to fix them. Well,

E

and to give some numbers: when I tested it, encode_some was 10 times faster than decode_some, so if we fall out of the cache, we will get strongly degraded performance.

E

B

In other words, it looks like the write path, which we thought was the slowest one, is faster than the read path once you remove some of the support from metadata caching in the BlueStore layer. Interesting.

A

Ah, I wouldn't have expected it, to be honest, but maybe it's because you've been spending time optimizing the write path. Yes. Yeah, crazy. All right, well, keep up the good work, guys. It's definitely good to have multiple eyes on this kind of thing.

C

We're gonna do all right.

A

Well, I would say then, unless anyone else has anything else that they want to bring up, let's wrap this up and we can meet next week. Anyone?

A

All right, well, have a good week, everyone. Talk to you later, thanks.

B

The same for you. Bye, guys.