From YouTube: 2020-02-20 :: Ceph Performance Meeting
A: I'll get started here. All right, so we've got like three weeks of PRs here, although it ended up not being too bad — although I may be missing some things in here. Igor has a couple of new PRs that are both really exciting. One is using deferred writes to avoid blob fragmentation when you have a small min_alloc size; I think he's going to talk about that later, so I'll just leave it at that. And then he also has one that's a hybrid allocator based on both AVL and bitmaps.

Likewise, I'll leave him to talk about that, and he's got some interesting performance numbers too that I think he was planning on showing today. Let's see... there's also a PR here for improving SPDK performance. It looks like it's just changing some of the code in NVMEDevice to use best practices; I guess we were doing some things that weren't great. So anyway, there's that — I would be really interested to see benchmarks on it. Closed PRs: we've got a bunch.
Affinity improvements, I guess, for RGW here. Not a whole lot of other stuff — inline BlueStore small-object work, yeah, and I don't remember too much about these other ones, but it looks like they were closed by the stale bot. Okay: updated, more BlueStore stuff, Igor's simplification of the onode pin/unpin logic. That looks like it's probably much better than what I was trying to do, but it is not passing tests right now, so we just need to figure out why; overall it looks really good.

There's an RGW admin one from Casey that incorporates my other PR; it just does a little bit more as well. Adam's big objecter revamp PR needs a rebase. I didn't see any updates after Patrick was seeing asserts in the MDS, so hopefully whatever he changed has resolved that, but after the rebase it needs to be retested.
B: The major difference between these two allocators is that the AVL allocator is pretty good at finding contiguous blocks that fit: it maintains a sorted tree and hence it can search efficiently, while bitmap needs a sort of sequential search to find such blocks. But on the other hand, the AVL allocator might consume a pretty high amount of memory.

We've got multiple complaints about its memory usage in production after a while, while the bitmap allocator has constant RAM usage and is very good in this case. So the idea I decided to try, which I called the hybrid allocator, is based on the existing AVL one, and uses it for fast search for long ranges, basically until we...
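(A minimal sketch of the two search strategies being contrasted here — hypothetical names and structures, not Ceph's actual AvlAllocator/BitmapAllocator code:)

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

// Tree-style free list: extents indexed by length, so finding a contiguous
// range of at least `want` bytes is a single O(log n) lookup. The cost is
// one node per free extent, so memory grows with fragmentation -- the
// production complaint mentioned above.
struct TreeFreeList {
  std::multimap<uint64_t, uint64_t> by_len;  // length -> offset
  std::optional<uint64_t> find(uint64_t want) {
    auto it = by_len.lower_bound(want);      // smallest extent >= want
    if (it == by_len.end()) return std::nullopt;
    uint64_t off = it->second;
    by_len.erase(it);                        // re-inserting any remainder omitted
    return off;
  }
};

// Bitmap-style free list: one bit per allocation unit, so memory is a fixed
// fraction of device size, but finding `want` contiguous units means
// scanning for a run of free bits -- O(n) in the worst case.
struct BitmapFreeList {
  std::vector<bool> used;                    // one bit per alloc unit
  std::optional<uint64_t> find(uint64_t want_units) {
    uint64_t run = 0;
    for (uint64_t i = 0; i < used.size(); ++i) {
      run = used[i] ? 0 : run + 1;
      if (run == want_units) return i + 1 - run;
    }
    return std::nullopt;
  }
};
```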
So that's it: in this hybrid allocator, when we fall back to the bitmap one, we still use the AVL tree allocator implementation as a sort of cache for searching for contiguous blocks. It's a sort of workaround to bring fast searches and a limited, constrained memory footprint to the allocation scheme.
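(Continuing the sketch above, the fallback logic as described would look roughly like this — the cap value and names are made up, not taken from the PR:)

```cpp
// Hypothetical hybrid dispatch: the tree gives fast contiguous-range search
// while it stays under a memory cap; the bitmap is the constant-memory
// authority that everything falls back to.
struct HybridFreeList {
  TreeFreeList tree;               // bounded cache of long free ranges
  BitmapFreeList bitmap;           // constant-memory fallback
  size_t tree_node_cap = 1 << 16;  // made-up cap; the real PR picks its own

  std::optional<uint64_t> find(uint64_t want_units) {
    if (auto off = tree.find(want_units))  // O(log n) happy path
      return off;
    return bitmap.find(want_units);        // O(n) fallback scan
  }

  void release(uint64_t off, uint64_t len_units) {
    // Keep long freed ranges searchable in the tree while under the cap so
    // future contiguous lookups stay cheap; the bitmap is always updated
    // too (bit clearing omitted here).
    if (tree.by_len.size() < tree_node_cap)
      tree.by_len.emplace(len_units, off);
  }
};
```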
When we do regular writes, we have fast writes, but subsequent reads are slow due to the resulting fragmentation. On the other hand, if we apply deferred writes, write performance drops, but subsequent reads are fast. So for the results that I've got, I tried the hybrid allocator with deferred writes, and this is an attempt to preserve the current performance numbers we have for 64K min_alloc size.
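(For reference, the knob behind this tradeoff in BlueStore is the prefer-deferred threshold: writes at or below it are journaled into the RocksDB WAL first and later written in place, while larger writes allocate fresh space immediately. A sketch of the relevant options — values are illustrative, not recommendations, and defaults vary by release:)

```sh
# Writes at or below the threshold take the deferred (WAL-first) path.
ceph config set osd bluestore_prefer_deferred_size_hdd 32768
ceph config set osd bluestore_prefer_deferred_size_ssd 0
# A large min_alloc_size (applied at OSD mkfs time) is what makes
# small-overwrite blob fragmentation visible in the first place.
ceph config set osd bluestore_min_alloc_size_hdd 65536
```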
B: Maybe I'm not very good at explaining this right now, but, well, again: sometimes we see a pretty interesting difference when restarting allocators, both the bitmap and the AVL one. When I run initially, they tend to return contiguous blocks, but after a restart, once some releases have happened, they start to return more fragmented space.

When you perform some releases to this allocator due to two different writes, it tends to have a list of short extents just created all over the dispersed disk space. And again, when you try to allocate two extents independently — I mean, not as a single write but as two different writes, so two separate calls to the allocator occur — the resulting extents are not contiguous.
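(A toy illustration of that end state, reusing the hypothetical bitmap sketch from earlier: after interleaved allocate/release traffic, the free space is a comb of one-unit holes, so even a two-unit contiguous request fails although half the device is free.)

```cpp
int main() {
  BitmapFreeList fl;
  fl.used.assign(16, false);
  for (uint64_t i = 0; i < fl.used.size(); i += 2)
    fl.used[i] = true;   // every other unit is still allocated
  auto off = fl.find(2); // wants 2 contiguous units
  // off == std::nullopt: the longest free run is 1 unit,
  // even though 8 of 16 units are free in total.
  return off.has_value() ? 1 : 0;
}
```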
A: Yeah, sorry, yep — just looking at more of your numbers here, I mean, some of these are really impressive. The random write — not the 4K, but the bigger ones, 8K, 128K — is sometimes a three to four times performance improvement on the first run, and the second run is still 50%. I'm looking at rows 37 and 38 — or sorry, 36 and 37.
A: I've got it in the etherpad here, but I'll link it in anyway. This is the 10-node challenge from Supercomputing 2019 and, as you can see on it, the list is there. We were represented, but we're kind of down here at the bottom: I think SUSE was able to get a score of around 12 and a half, and that puts us at place eighteen of the top 25 here. So I started working on it using our new officinalis nodes, and so far I've gotten us up somewhere around place 13 or 14.
Pretty variable results: I see us typically getting anywhere from about 20 up to around 34, which is kind of the highest I've gotten it most recently. I did see it once higher than that, I think, and we did beat the 13th-place result, but I have not yet been able to repeat it. So we're close — kind of hovering pretty close to 13th place — but that's where we're at right now.
That's one issue that's really hurting us, because in this test it's about 17 gigabytes per second, and we should be able to do far, far better than that with this hardware. With highly parallel reads with librbd we can actually hit about 60 to 70, so lots and lots of room for improvement on that front.
The other big thing that I'm seeing is that I'm having a really difficult time trying to consistently get balanced inodes across MDSes, and also balanced requests across MDSes — probably due to that, but sometimes I've even seen weird behavior that doesn't seem to match the distribution of inodes.
A
Theoretically,
we
should
have
an
equal
number
of
files
represented
on
every
single
MDS
and
then
doing
you
know,
stats
and
other
things
across
all
of
them
across
the
whole
cluster.
It's
often
times
when
I
do
a
small
set
of
tests.
It
works
the
way
it's
supposed
to,
but
in
certain
cases
I
haven't
really
know
down
exactly
all
of
them.
I'll
see
only
one
MDS,
getting
a
huge
number
of
I
knows
and
then
occasionally
some
other
ones
popping
up
here
and
there
it's
almost
like
it's
reverting
to
the
balancer
behavior,
but
it's
not
good
balancer
behavior.
A
It's
really
unbalanced,
so
yeah
I,
don't
I,
don't
get
it.
I
do
need
to
probably
update
my
kernel
client,
so
I
can
make
sure
that
all
the
pending
happened
properly,
because
right
now,
I
can't
actually
check
them.
Getting
the
X.
Adder
doesn't
work.
So
there's
there's
hats
you,
let's
see.
Oh,
we
missed
the
femoral
pinning
PR.
Thank
you
to
Patrick.
He
found
a
bug
in
it
that
was
causing
even
more
unusual
behavior,
though
that's
that's
at
least
figured
out
Thor,
not
using
that
at
the
moment.
But
potentially
it
was
like
it's
fixed.
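(For reference, both pinning mechanisms are driven through virtual xattrs on directories, roughly as below; the distributed one comes from the ephemeral pinning PR under discussion, so its exact name and semantics here are as proposed at the time:)

```sh
# Static export pin: attach this subtree to MDS rank 2.
setfattr -n ceph.dir.pin -v 2 /mnt/cephfs/mydir

# Read it back -- this is the xattr read that fails on older kernel clients.
getfattr -n ceph.dir.pin /mnt/cephfs/mydir

# Ephemeral distributed pinning from the PR: hash this directory's
# immediate children across the active MDS ranks.
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/mydir
```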
We can try that again as well. I also did try the FUSE client, and the results are maybe slightly hilarious: everything runs quite a bit slower — you know, maybe 10 times slower — but the one case that was even worse was the mdtest hard delete case, and I need to actually verify exactly what that is doing. I was seeing about 21 ops per second across all eighty — no, I'm sorry, 100 — MDSes that I have set up, and ls was hanging.
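(For context: the io500 mdtest "hard" phases hammer one shared directory with many small files, then stat and delete them in separate invocations — roughly as below, with illustrative flags and counts rather than the exact io500 invocation:)

```sh
# Create phase: 8 ranks, 1000 files each, 3901-byte writes, shared directory.
mpirun -np 8 mdtest -C -F -n 1000 -w 3901 -d /mnt/cephfs/mdt_hard
# Delete phase: the case that collapsed to ~21 ops/sec under ceph-fuse.
mpirun -np 8 mdtest -r -F -n 1000 -d /mnt/cephfs/mdt_hard
```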
You know, the status commands, and also looking through the perf data and correlating that — that's on my list of things to get looked at as well. That's basically it right now. We've got numbers that are non-qualifying, but we do have numbers that are looking much better than what's on the IO500 list right now, and I think we can do much, much better.
We've got the performance in the OSDs needed to do it; we just need to figure out what's going on here. Or at least, yeah, I need to figure out if it's something I haven't set right, or if it's things that we need to change in the code. So that's it — Patrick, anything you wanted to add based on what we've seen so far?
F: Yeah, I mean, the only thing is, you know, we definitely want to grab some of the behaviors we're seeing so we can figure out what's going on in terms of, you know, the MDS behavior. The ephemeral pinning, yeah — that had some weird bugs, and that can invalidate all the results, I expect. I'd expect the normal export pins to work pretty well in terms of distribution — we exercise those pretty heavily — and I'd expect to get linear metadata performance scaling in the Intel cluster, at least.
F: No reason why not. I don't know if you mentioned it, but one interesting thing that Mark's doing is he's got many MDS ranks in his cluster, which is not something I've done before. So it's conceivable he's uncovering some new issues in CephFS by scaling that large — hence, maybe, some of the craziness in what some of the graphs produce. At the very least, we're going to try to get some more people involved on the performance side so that we can try to see.
F: 88 is certainly an interesting number of MDSes. I mean, I don't have any reason to believe it wouldn't work, so if you can make it scale that far, that would be — you know, I'd love to see what kind of results we get. But if it's just falling over, and the interest is actually in getting results that are useful, we might want to go down and then...
A: Patrick, one other thing that I did look at was lock contention in the MDS when I was doing these tests. I was sometimes seeing really high CPU usage — high as in, like, two hundred percent per daemon, so two cores being used, maybe up to three — but when actually looking at a wall clock profile, it seems like we're waiting on locks a lot, and I don't know the MDS code that well. So I don't know what exactly we're waiting on, but the per-MDS throughput never really got very high.
F: In terms of lock contention, there's mostly just one big MDS lock that you might be seeing in your analysis. We have broken out some of that, and then, yes, maybe the journaling, the flush thread: depending on the workload, you might see that thread get pretty hot, and then there are a few other small ones, but it should not be that significant. Most of the work is going to be done in the messenger threads and in the MDLog flush thread, or whatever is using MDLog.
A: Yeah, that was what I was seeing: the messenger threads were all quite busy, and they were also spending a fair amount of time waiting on — you know, probably that big lock that you're mentioning. And then, in the two different wall clock profiles I looked at, each one had different amounts of time spent in a couple of other threads. I can go back and look at them again, but that was what I was typically seeing.
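(A note on reproducing that kind of wall clock profile: a plain gdb "poor man's profiler" loop over the running MDS is enough to see which threads sit in lock waits — nothing Ceph-specific here; sample count and filenames are arbitrary:)

```sh
pid=$(pidof ceph-mds)
for i in $(seq 10); do
  gdb -batch -p "$pid" -ex 'set pagination off' \
      -ex 'thread apply all bt' > "mds-sample-$i.txt" 2>/dev/null
  sleep 1
done
# Threads repeatedly parked in pthread_cond_wait/pthread_mutex_lock across
# samples are the wall-clock waiters (e.g. queued on the big MDS lock).
grep -ch 'pthread_cond_wait\|pthread_mutex_lock' mds-sample-*.txt
```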