From YouTube: 2018-Jun-28 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
B: All right, not a whole lot happening in terms of pull requests over the last week. There's a new one by Ma Jianpeng that looks interesting: it changes some of the behavior of the kv sync thread in BlueStore and the way that the finisher thread works. I have not looked closely at this; the only thing I asked Ma Jianpeng for was whether he could run it through the wall-clock profiler, because that might tell us a little bit more about the change in behavior.
C: Well, actually, if the store is not fragmented, then the stupid allocator probably was better, because it properly allocates the contiguous segment first, while the bitmap allocator looks from the beginning and just collects any available extents; the resulting list might be fragmented, and it doesn't care about that.
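To illustrate the contrast being drawn, here is a toy C++ sketch of the two strategies (hypothetical names and a simplified free list, not the actual BlueStore allocator code):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Toy free list of (offset, length) extents, purely for illustration.
using Extent = std::pair<uint64_t, uint64_t>;

// Contiguous-first strategy: satisfy the whole request from a single
// extent, so the returned allocation is never fragmented.
std::vector<Extent> alloc_contiguous(std::vector<Extent>& free_list,
                                     uint64_t want) {
    for (auto& [off, len] : free_list) {
        if (len >= want) {
            Extent got{off, want};
            off += want;   // shrink the free extent in place
            len -= want;
            return {got};
        }
    }
    return {};  // fails: no single extent is big enough
}

// Bitmap-style strategy: scan from the beginning and collect whatever free
// extents turn up until the request is covered; the result may be many
// small pieces, and the allocator does not care.
std::vector<Extent> alloc_any(std::vector<Extent>& free_list, uint64_t want) {
    std::vector<Extent> got;
    for (auto& [off, len] : free_list) {
        if (want == 0) break;
        uint64_t take = std::min(len, want);
        got.push_back({off, take});
        off += take;
        len -= take;
        want -= take;
    }
    // A real allocator would roll back on failure; omitted in this sketch.
    return want == 0 ? got : std::vector<Extent>{};
}
```

The first strategy fails outright when no single extent is large enough, while the second almost always succeeds but can hand back a fragmented list, which matches the trade-off described above.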
B: There's been a ton of work on optimizing make check performance, and Ceph build performance in general, over the last couple of weeks, and on IRC this morning he was talking a little bit about it. It looks good: especially with the work he's been doing on caching recently, it looks like he can reduce a build from like 30 minutes down to like seven minutes. The gist of it is, well...
B: There are a couple of other semi-recent pull requests here that haven't had a whole lot of movement. This shared persistent read-only RBD cache has been around for a while; I'm not sure what the current state of it is, but apparently it's continuing to get discussion and updates here.
A: Basically, it's ready to go. It has some drawbacks, mostly related to the involvement of an administrator to configure the size of the huge page pool. However, I'm afraid we wouldn't be able to overcome that; it's just a system limitation: if you want to use explicit huge pages, you need to tell the kernel about the, you know...
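For background on that limitation: explicit huge pages on Linux come out of a pool the administrator must reserve ahead of time (e.g. via the vm.nr_hugepages sysctl), and an allocation simply fails when the pool isn't there. A minimal sketch, assuming 2 MiB pages:

```cpp
#include <sys/mman.h>
#include <cstdio>

int main() {
    const size_t len = 2 * 1024 * 1024;  // one 2 MiB huge page
    // MAP_HUGETLB draws from the kernel's pre-reserved huge page pool;
    // it fails with ENOMEM unless the admin sized the pool beforehand
    // (e.g. `sysctl vm.nr_hugepages=...`).
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");  // pool not configured or exhausted
        return 1;
    }
    munmap(p, len);
    return 0;
}
```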
A: We talked yesterday and had a good, strong discussion regarding the op tracker and the testing environment, noting that Peter had seen a huge difference here. I'm using CBT, which by default disables cephx; that's very similar to what the Intel and Samsung guys are doing in their own maximum-IOPS test scenarios.
A: First of all, the cost of a division operation on x86 is variable: it can vary from just nine cycles in the optimistic scenario up to ninety cycles. Ninety cycles, for instance, is about the cost of an L3 cache miss that has to go all the way to system memory. In the exact case of the op tracker divisions, I feel they are worth at least forty or fifty cycles, which is comparable to an L2 cache miss covered by L3.
A: But of course the optimization itself is not the most important part of the pull request. What has been done in the continuation of this pull request is integrating our p2 wrapper for powers of two; it's a facility for conveying, through the language's type system, the information that we are dealing with a power of two.
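A minimal sketch of that idea (hypothetical names, not the actual Ceph wrapper): encoding the power-of-two property in a type lets call sites replace the hardware divide with a shift, and modulo with a mask:

```cpp
#include <cstdint>

// Hypothetical wrapper: records "this value is 1 << shift" in the type,
// so a non-power-of-two divisor cannot be constructed by accident.
struct PowerOf2 {
    unsigned shift;
    constexpr uint64_t value() const { return uint64_t{1} << shift; }
};

// With a plain runtime divisor the compiler must emit a div instruction,
// which is what costs tens of cycles on x86.
inline uint64_t div_plain(uint64_t v, uint64_t d) { return v / d; }

// With the typed divisor the same operation is a single shift, and the
// corresponding modulo becomes a mask.
inline uint64_t div_p2(uint64_t v, PowerOf2 d) { return v >> d.shift; }
inline uint64_t mod_p2(uint64_t v, PowerOf2 d) { return v & (d.value() - 1); }
```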
A: Well, I have one huge branch with all those things: basically everything that is in the op tracker pull request, plus optimizations related to create_request in the op tracker, plus some optimizations for the IO throttler, while making a bond between shards of the op tracker and shards of our main work queue. There are a lot of comments there. There is a...
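If the bonding being described is a one-to-one pairing of op tracker shards with work queue shards, a toy C++ sketch of that pattern (all names hypothetical, not code from the branch) might look like:

```cpp
#include <array>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical sketch: give every worker shard its own tracker shard, so
// tracking an op never takes a lock owned by another shard.
constexpr std::size_t kShards = 8;

struct TrackerShard {
    std::mutex lock;
    std::vector<int> ops;  // stand-in for tracked in-flight ops
};

std::array<TrackerShard, kShards> tracker;

void track_op(std::size_t worker_shard, int op) {
    // Worker shard i always uses tracker shard i: no cross-shard contention.
    auto& shard = tracker[worker_shard % kShards];
    std::lock_guard<std::mutex> g(shard.lock);
    shard.ops.push_back(op);
}
```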
B: Let's see, the only other thing I was going to mention is that there has been some discussion recently about memory allocators again, and whether or not it would be worthwhile to reconsider the libc memory allocator. So over the past week we ran some tests taking a look at that again; here, I'll share my screen.
B
B
That
includes
the
the
T
cache
optimizations,
so
the
gist
of
it
is
that
we're
still
yeah,
Lib
C
malloc
is
still
is
still
really
still
really
bad
at
using
lots
of
RSS
memory,
presumably
due
to
fragmentation,
you
know
we
don't
have
direct
evidence
of
fragmentation,
but
you
know
this
is
you
know
probably
indicating
that
there
is
some
TC
malloc
really
I
think
probably
continues
to
have
kind
of
the
the
best
behavior.
Here
you
could
make
an
argument
that
J
malloc
is
maybe
is
probably
pretty
close.
B: The virt size is higher, but the actual RSS memory used is maybe a little bit lower than tcmalloc's, and this is kind of an old version of jemalloc, so a newer version might be better; I don't know. But both are using significantly less memory, both virt and RSS, than libc malloc, and there's no real performance advantage for libc malloc either in this case.
B
So
these
were
our
BD
results
for
K
random,
writes
using
an
nvme
device
with
3d
by
Bluestar
cache
one
OSD,
and
then
also
we
looked
at
our
GW
small
object
creation
and
you
know
the
stories
may
be
actually
a
little
bit
better
for
Lipsy
Malick,
but
it's
still
not
great
and
still
using,
not
quite
double
the
memory
of
the
other
two,
but
but
still
really
high.
So
yeah
I
mean
the
the
gist
of
it.
Is
that
I?
Don't
think?
There's
a
real
compelling
technical
case
right
now
for
using
whoopsie
Malick.
B: The only real advantage is that it's the default, pretty much available on any Linux distribution; no extra setup is necessary for it. But given these results, I think we need the libc folks to figure out ways to help us reduce this. One thing that they recommended was disabling fastbins via LD_PRELOAD-ing a shared object, because that's the only way you can disable fastbins right now.
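For context, a sketch of how such a preload object can work with glibc (my reconstruction, not necessarily the exact object they used): mallopt(M_MXFAST, 0) sets the fastbin size threshold to zero, and running it from a constructor in an LD_PRELOAD-ed library gets it in before the application allocates:

```cpp
// no_fastbins.cpp -- build: g++ -shared -fPIC -o no_fastbins.so no_fastbins.cpp
// use:   LD_PRELOAD=./no_fastbins.so <application>
#include <malloc.h>  // glibc mallopt

// Runs at library load time, before main() and most allocations.
__attribute__((constructor))
static void disable_fastbins() {
    // M_MXFAST is the upper bound for fastbin-handled request sizes;
    // setting it to zero turns fastbins off entirely.
    mallopt(M_MXFAST, 0);
}
```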
B: So we did that, and also increased the tcache count, and that actually increased memory usage a little bit. I think basically what happened is that the performance went down a little bit because we didn't have fastbins anymore, and the memory usage went up because we were now using a higher-than-default tcache count; but yeah, it didn't really help. Oh, I also tried reducing the number of arenas. Kind of the community wisdom out there is that by using only two or even one arena you might see increased thread lock contention but potentially lower fragmentation, and that really did not work: I didn't even get to the random write case before the node ran out of memory or whatever, because decreasing the arena count made the memory usage spike very, very quickly just pre-filling an RBD volume. It was up to like 11 or 12 gigs of RSS before we even got to the test.
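For reference, the arena experiment can be reproduced with glibc either through the MALLOC_ARENA_MAX environment variable or programmatically; a minimal sketch:

```cpp
#include <malloc.h>  // glibc mallopt

int main() {
    // Cap glibc malloc at a single arena, equivalent to running with
    // MALLOC_ARENA_MAX=1; fewer arenas trade higher lock contention for
    // (hoped-for) lower fragmentation.
    mallopt(M_ARENA_MAX, 1);
    // ... the workload would allocate from here ...
    return 0;
}
```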