Ceph Performance Weekly, 12 Jul 2018

From YouTube: 2018-07-12 Ceph Performance Weekly

Description

Weekly collaboration call of all community members working on Ceph performance.

http://ceph.com/performance

A

We might have a small crowd today. Josh said he couldn't make it, and I have not heard from Sage yet this morning, so I guess we'll see what happens.

B

Braddock I.

A

I think we might have a very small crowd today. We'll see.

A

Right, well, I guess maybe we should just get going here. I did see Sage just posted on IRC, so I don't know if he's coming or not.

A

All right, I'll give him one more minute, then we'll just get going.

A

All right, let's get this show on the road.

A

There is a new PR from Aaron 85 regarding the erasure coding stripe cache. I don't know much about this one at all, other than that I saw it is modifying the legacy config opts, which I'm not sure about. Sage, hey, how's it going?

B

All right, do we actually have anything on the agenda?

A

Not a whole lot. I just figured we should do it.

A

We skipped last week.

B

All right. Yeah, I don't have anything. I've got to run in a minute here.

B

On the pull request update, I think the only main thing going on in pull requests is your binned LRU cache stuff, which looks great, but I just need to verify. So do I understand right that Kefu fixed that? I think he fixed the failure my run turned up.

B

Right, I'll rebuild and retest. Okay.

A

Yes, the sooner we can get that merged the better, since I go on vacation in a week and I want to do the backport before that. But we'll see if it happens. Yep.

B

Yeah, I'd push the build right now, but there's a merge conflict.

A

No, there shouldn't be anymore. I just fixed it this morning.

B

When merging in the wip RocksDB binned LRU cache and merging it with the OSD memory target and whips emailing standards, did you update? Yeah, I'm just running the build integration branch script on which whip saved.

B

I wonder what merged? In the last hour I mean you tried out just like McMaster, but you can fight against Keith pull request uh yeah just on one minor and Howard Globes library, snippy, libraries.

B

Well, let me just manually resolve it now. I'll sort it out. Okay.

A

He was saying if it's almost green, then it's good.

B

Yeah, I mean, I could just test his first and then we can rebase on top of it. Yeah, I did notice that the

A

Other, like whip three run, it's.

A

All right, I thought I saw this in something else that you ran, but the

B

The class rgw failure is all over the place. It looks very similar to that last time. Okay, that's probably what happened.

B

Why is ALLOC_LIBS... what's ALLOC_LIBS? Did you make a change to ALLOC_LIBS?

A

Not that I know of.

B

And in kv/CMakeLists.txt, the target_link_libraries for kv, that's changed too.

A

Were you having it removing ALLOC_LIBS? I made a change there. Let me look at it. What exactly was... I needed to pull in stuff for the heap profiling. Okay, I'm gonna assume nothing.

A

Yeah, though in the CMakeLists for the binned LRU cache one, it should be just adding a couple of .cc files. Anyway, you probably resolved it, never mind.

A

All right, guys, there's not really a whole lot here. I can show some graphs and things for this memory allocation stuff, if folks are interested. Otherwise, if people want to get more of their day back, that's fine too. Any opinions, one way or the other?

A

I'm interested. Okay, let me pull up a graph, then.

A

Yeah, I don't think there's anything else really interesting going on with new PRs here to discuss. Maybe the only one is that Ma Jianpeng is looking at the latency of the kv finalize thread, which is good, but I don't know if there's any info to report there yet. Okay, so for my stuff, there are two PRs coming in that work with each other. One is to pull RocksDB's LRU cache into our tree instead of using their default one. There are a couple of reasons for doing this.

A

One is that it doesn't work quite exactly how we want it to work. They have a low-priority pool and a high-priority pool for the LRU. The high-priority pool stores indexes and filters (optionally, but that's how we're using it), but it can also store data that has been recently read. By default it doesn't store data there when it's written, but the first time it gets read it moves into the high-priority pool. We don't actually want that.

A

We just want the high-priority pool reserved for indexes and filters. That way we can look at how much it has filled up and say: this is how much cache we need for indexes and filters to keep them guaranteed in cache. So in our own version of the LRU cache in this PR, that's a change we make. We are no longer putting regular key/value pairs in the high-priority pool's portion of the cache.

A

Instead, they always stay in the low-priority pool, but get promoted to the top of it.
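
The pool behavior described here can be illustrated with a minimal Python sketch. It is not the RocksDB or Ceph code: the class name, method names, and byte sizes are hypothetical, and it only models the rule that index and filter blocks own the high-priority pool while everything else is promoted within the low-priority pool.

```python
from collections import OrderedDict


class TwoPoolLRU:
    """Toy LRU cache with a high-priority pool reserved for index/filter blocks."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.high = OrderedDict()  # index and filter blocks only
        self.low = OrderedDict()   # everything else; most recently used is last

    def insert(self, key, size, is_index_or_filter=False):
        # Drop any stale copy, then place the entry in the pool chosen at insert time.
        self.high.pop(key, None)
        self.low.pop(key, None)
        pool = self.high if is_index_or_filter else self.low
        pool[key] = size
        self._evict()

    def lookup(self, key):
        # A hit moves the entry to the top of its own pool; it never changes pools.
        for pool in (self.high, self.low):
            if key in pool:
                pool.move_to_end(key)
                return True
        return False

    def high_pool_bytes(self):
        # How much cache the index and filter blocks currently occupy.
        return sum(self.high.values())

    def _evict(self):
        # Evict least recently used low-priority entries first, then high-priority ones.
        while self.high_pool_bytes() + sum(self.low.values()) > self.capacity_bytes:
            victim_pool = self.low if self.low else self.high
            victim_pool.popitem(last=False)
```

Because reads never move regular entries into the high-priority pool, `high_pool_bytes()` reflects only index and filter usage, which is the number the speaker wants to be able to read off.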

A

That also gives us the ability to add interfaces for that cache without going through RocksDB's public cache interface.

A

Originally, we made a modification to that interface and tried to submit it upstream to RocksDB. It's been about two months and they never actually moved on it, I suspect just because they really don't want to make modifications to the cache interface. That's fine, I understand why. There's probably no one else using RocksDB who is trying to get access to the kind of cache statistics that we are. So by pulling the whole LRU cache into our tree,

A

we no longer have to use the public interface. We can add and remove our own methods for getting whatever we want ourselves, and that is probably better for everyone. So that's the one PR that does both of those things. Then there's another PR that looks at the amount of heap memory that the OSD is using and the amount of unmapped memory, and from those two values we can compute the amount of mapped memory that the OSD is consuming. This is from tcmalloc.

A

We don't have the ability to do this with libc malloc or jemalloc yet; maybe in the future we will. But with those values we can start dynamically adjusting the size of the caches in BlueStore to try to keep the OSD's RSS memory usage close to a target value. Because tcmalloc uses madvise with MADV_DONTNEED, and the kernel is not guaranteed to reclaim any memory that has been unmapped,

A

we can't limit the RSS memory usage of the process to a specific value. But at least we can keep the mapped memory under that, and the RSS memory tends to follow it fairly closely, maybe within 10 to 15 percent or less in most cases. So let me pull up some data from that.
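
As a small worked example of that arithmetic, here is a Python sketch. The function names and the sample readings are hypothetical and are not measurements from the call; the only relationships taken from the discussion are that mapped memory is heap size minus unmapped memory, and that RSS tends to sit somewhat above mapped memory.

```python
GIB = 1024 ** 3


def mapped_bytes(heap_bytes, unmapped_bytes):
    """Mapped memory is the heap size minus what the allocator has released."""
    return heap_bytes - unmapped_bytes


def rss_overshoot(rss_bytes, mapped):
    """Fraction by which RSS exceeds mapped memory (0.10 means 10% above)."""
    return (rss_bytes - mapped) / mapped


# Hypothetical allocator readings, chosen only to make the arithmetic easy to follow.
heap = 5 * GIB
unmapped = 1 * GIB
mapped = mapped_bytes(heap, unmapped)
```

The tuner can only steer `mapped`; how far RSS sits above it depends on whether the kernel has reclaimed the released pages.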

A

I have way too many windows open.

B

Okay, here we go.

A

Let me zoom this in. Can folks see this at least? Is it presenting yet? Not yet. Keep going. Better. A little bigger, a little bigger. Okay, maybe a little... here, I'll move my thing. How's that looking? That's fine. Excellent. Okay, so the second graph here is showing roughly the behavior of what we had maybe a month ago. It's not exactly the same, but it's really close.

A

That's without doing any kind of autotuning: a three gigabyte fixed BlueStore cache. The values are different for kv, meta, and data. Previously we were trying to optimize for the RBD case, so we really aggressively favored metadata over the other two. But this is just showing the behavior when you don't do any autotuning and use fixed-size caches. It also isn't playing any tricks. It's giving you exactly what you request: a three gigabyte fixed cache size and those ratios, and it doesn't do anything else.

A

Previously, we had tried to play some games to

A

reallocate memory when it wasn't used, but it didn't work very well and it was kind of broken. So it didn't look exactly like this. But in the case where you don't do any autotuning, where it's disabled, this is now what you get. In that bottom graph, you can see it's doing what is requested

A

as far as the cache sizes go, and the RSS memory usage in this case is going up to 4.8, or let's say 4.7, something like that, gigs of RSS memory used. Okay.

A

So with both of the PRs that are in the works right now, the upper graph shows you the behavior when autotuning is enabled. The targets are the same, and it uses those as guidelines for what it should default back to if it can't make any smarter decisions. But when it can,

A

it will use memory in smarter ways. Say one cache doesn't need all of the memory that you've specified via the ratio: it will borrow that memory for another cache and let that cache go over its target value. In this case, there's also a new parameter, an OSD mem target parameter, where you specify how much memory you are trying to keep this particular daemon limited to. With the autotuning and that target,

A

you can now see that the RSS memory is sticking pretty close to four gigs, which was our target value. The mapped memory, in light blue, is actually staying very consistently just below that target.
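
The borrowing between caches can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the logic in the PR: the function name, the percentage ratios, and the idea of a per-cache "wanted" size are all hypothetical stand-ins for however the real code decides what each cache needs.

```python
def split_cache(total_bytes, ratio_pct, wanted_bytes):
    """Split total_bytes by percentage ratios, lending unused shares to caches that want more."""
    # Start each cache at its ratio share, capped at what it can actually use.
    grant = {
        name: min(total_bytes * pct // 100, wanted_bytes[name])
        for name, pct in ratio_pct.items()
    }
    spare = total_bytes - sum(grant.values())
    # Hand spare bytes to caches that still want more, letting them exceed their share.
    for name in ratio_pct:
        extra = min(spare, wanted_bytes[name] - grant[name])
        grant[name] += extra
        spare -= extra
    return grant
```

With an almost empty store, for example, kv and metadata need very little, so nearly all of the budget ends up with the data cache even though its configured ratio is the smallest. When every cache wants more than its share, the result falls back to the plain ratio split.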

A

It can peak over it a little bit, but the more it shoots up over the target at any given point in time, the more aggressively it will try to get back below it, and overall it does a pretty decent job. The RSS memory value can change depending on how much memory pressure there is.
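
One simple way to get the "bigger overshoot, stronger correction" behavior is a proportional step, sketched below in Python. This is an assumption for illustration only; the call does not say how the PR computes its adjustment, and the function name and the `gain` parameter are hypothetical.

```python
def next_cache_size(cache_bytes, mapped_bytes, target_bytes, gain=0.5):
    """One tuning step: move the cache size by a fraction of the distance to the target."""
    error = target_bytes - mapped_bytes  # negative when mapped memory is over target
    return max(0, int(cache_bytes + gain * error))
```

With this rule, a mapped size 400 units over the target shrinks the cache four times as much as one 100 units over, and a mapped size under the target lets the cache grow again.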

A

My understanding is that the kernel reclaims unmapped pages more aggressively when there's a lot of memory pressure, and it may choose not to if there's no memory pressure. So potentially, in a scenario where there's higher memory pressure, RSS might stick closer to the target, or it might not. You can see here that the cache size is bouncing around a little bit to try to keep up with that, and here, this is kv.

A

This is one of the first times where we've really seen the RocksDB kv cache peaking up and going back down. Presumably those are periods of time where it has maybe created a new level. Before that point it has stayed within the same level, merging and adjusting SST files, and then I suspect those points are when it has created a new level and is moving stuff into it.

A

Just a guess on my part, but it's interesting to see that behavior, whereas when you've got these fixed cache sizes, RocksDB is not allowed to borrow any cache like that. It just sticks at a fixed cache size. Actually, maybe that's not quite the right way to say it.

A

It's always at a higher cache size in the fixed case. It's at that peak size continuously, whereas in the autotuned setting it can get back down to just what it needs, and you can reclaim the rest for metadata or data cache. So this is the RGW workload. Performance in this test was almost the same.

A

It turns out that I had updated this test machine for doing some memory allocator testing, and pulled in the changes for Spectre and Meltdown when I did that, and now this system is not fast enough to actually show any performance differences between either of these. It basically runs at roughly the same throughput. So I need to retest this on a machine that's faster, I think, because right now I'm basically CPU limited on this box. In any event, they look about the same. For RBD tests,

A

again, we're staying right underneath the memory limit with autotuning enabled. RSS is just a little bit above that, about 10% above, and we're dynamically adjusting the size of

A

the caches here. When there's very little use at the beginning of the test, when it's empty, both the kv and metadata usage is super low, so data spikes way up. Then, as the amount of data in the volume grows during the prefill stage here, those spike up, data goes down, and we end up at this equilibrium point. Versus the behavior when there's a fixed-size cache and no tuning: it's all just static.

A

It grows in the prefill stage and then just kind of levels off at whatever values you set. So that's it. That's all I've got right now. In my opinion, the more interesting change is coming next, which is the actual LRU binning, the age-based binning for the LRUs. That will hopefully let us see changes in the cache ratios depending on the workload that you're doing.

A

If you go from an RBD to an RGW and back to an RBD workload, the goal is to make it so that the caches adjust based on what's happening at that time and adjust back to the previous state. Right now this doesn't do it, or probably doesn't, I'm guessing. Last time I looked, it didn't. So that's it. Any questions on any of this?

A

I just have a couple of quick questions. I love it, by the way; it looks really good. The only question I have is how big a change this is. Is this a really complicated change, or was it relatively straightforward? And if the latter, what is the chance that we could get this into, say, Luminous? Go ahead. My goal is to have all of these changes backported for 3.2. We'll see how hard it is, but I don't think anything is too terrible. It's not one change, right?

A

There are going to be about four PRs. The PR where we look at tcmalloc's stats to compute the overall cache size is pretty straightforward. That should be an easy backport, I think, and that will get us at least the automatic tuning of the overall cache size. The tuning of the ratios within that cache size is a little bit more complicated. I think that we can do it. The biggest change is pulling the RocksDB

A

LRU cache, our version of it, into our tree. But the good news is that it doesn't look to me like the public cache interface has changed in the last year or two, so I think that our changes won't rely on a new version of RocksDB. I think we can pull those in without pulling in a new RocksDB. So I think that backporting might just kind of work, so long as we resolve any other weird things.

A

There was a G comp change just recently, in the last day or two, that we'll have to go through and fix everything for when we backport. But otherwise, it's kind of complicated, but not so much so that I think we can't do it. I think we might be able to backport it at least to 3.2. Luminous might be tougher; we'll have to see. Oh, I'm sorry, my bad, I got confused.

A

I thought 3.2 is Luminous, and you're really saying it's a different upstream release. Yeah, well, that's a good question. I should actually find out what that's actually based on. But yeah, the good news... I didn't want to put... okay, yeah, sure, no worries. I guess the gist of it is, I think we maybe can do it.

A

It's kind of a big thing to backport, but I'm hopeful that we'll get it in for 3.2 at least. It seems very significant, because some of the problems that we have in the field have to do with hyperconvergence, where you're trying to run Ceph and applications in the same box, and this makes it a lot more manageable. That's the goal. I really would like it.

A

If we could make it so that the only thing users need to change is how much memory they want to devote to the OSD and that's it kind of looks like you are pissed.

A

You know, we can't control RSS memory usage, but yes, exactly right. The kernel is involved, so we can't force the kernel to reclaim pages. We can only say that we've unmapped stuff and the kernel's free to reclaim it. We can't guarantee it's not fragmented, but we can at least say that there's stuff here the kernel can grab if it wants to, and I think that's the best we can do. Well, the only concern I have,

A

in addition to... I mean, this is great, but there's also the question of what happens when you go into backfilling mode. That's a separate issue, right?

A

So that's a really good question, because I have not run any tests with this when you go into backfill. But presumably we will be watching the heap stats, and it should dynamically adjust the cache size down, as far as whatever minimum is set, to try to compensate for any memory usage that we haven't accounted for in the heap. So theoretically, this might actually level out memory usage during recovery too.

A

I guess the only downside is that the cache might get shrunk dramatically during that, and it might slow it down. But pick your poison, right? Do you want more memory usage, or do you want stuff to be faster or slower?

A

Yeah, I mean, I don't mind getting a little slower. But when you run out of memory, then it gets a lot slower. Exactly, exactly. I would much rather eat the performance hit if it meant that we could, to the best of our ability, keep the memory usage of the OSD within certain bounds. The performance issue we can fix later on. Once we've got the OSD set so that it's behaving in a consistent way, then we can start looking at, well,

A

why do we need so much memory? Are there things that we can do to speed it up in memory-limited scenarios, etc.? But if we're just letting the OSD use an unbounded amount of memory, all bets are off. The user has no ability to understand or know what's going on at any given point in time. So my hope is that this will... it's not perfect, right?

A

We can only adjust the cache sizes. Once you get down to the minimum cache size, there's nothing more we can really do, with this PR anyway. So if other things in the OSD are using tons of memory, this will compensate to an extent, but right now the minimum size is set at 128 megabytes. If we get down to that, then there's nothing more we can do, and the process will then grow.

A

Yeah, well, exactly. This will at least level things out in a lot of scenarios, but it can only compensate so far.
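
The limit being described can be shown with a short Python sketch. Only the 128 megabyte minimum comes from the discussion; the function names, the notion of a single "other usage" number, and the sample figures are hypothetical simplifications.

```python
MIB = 1024 ** 2
MIN_CACHE_BYTES = 128 * MIB  # minimum cache size mentioned in the discussion


def tuned_cache_bytes(target_bytes, other_bytes, min_cache_bytes=MIN_CACHE_BYTES):
    """Cache size that would fit under the target, but never below the minimum."""
    return max(min_cache_bytes, target_bytes - other_bytes)


def resulting_mapped_bytes(target_bytes, other_bytes):
    """Total mapped memory once the cache has been resized."""
    return other_bytes + tuned_cache_bytes(target_bytes, other_bytes)
```

With a 4096 MiB target and 1024 MiB of other usage, the cache gets 3072 MiB and the total lands on the target. If other usage climbs to 4500 MiB, for instance during recovery, the cache is pinned at 128 MiB and the total is 4628 MiB, so the process grows past the target no matter what the tuner does.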

A

Thank you. Yeah, no problem. Any other questions on this?

A

All right, that's all I've got. Does anyone have anything else they want to discuss this week, or should we wrap up?

A

All right, well, thanks for coming, guys. See you again next week. Have a good week. Bye.