From YouTube: 2016-SEP-21 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly
A: Okay, Sage's basement is flooding, so he's trying to take care of that. So he's not going to make it today, and I sympathize, living in the Midwest; it's not fun. All right!
Let me find my window here where I've got the etherpad and we can go on. All right, so: world of pull requests this week. We've got a couple of new things that look interesting.
Paige has another pull request here for reducing the amount of data we write when we dirty metadata. I haven't looked at that real closely yet, but hopefully that will help us in our quest for writing less data to RocksDB. I have not followed David's patch much here, for reducing deep scrub impact on normal operations, but I imagine that there will be a lot of folks that will be very interested in that, so it might be worth one's time to check out.
One thing I will mention is that the FIO engine for ObjectStore got merged, so that's great. A lot of folks are starting to use that for doing BlueStore testing and other ObjectStore testing, so that's exciting. I have to rather embarrassingly admit that I have not looked at it myself yet, and I probably should, but it's good that it has been merged.
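
[Editor's note: for anyone wanting to try the newly merged engine, here is a rough sketch of a job file, modeled loosely on the sample shipped in the ceph tree under src/test/fio; the library path, conf file, and values are illustrative, not exact.]

    [global]
    # the engine is built as an external fio plugin; point fio at the
    # library by full path or via LD_LIBRARY_PATH
    ioengine=external:/usr/local/lib/libfio_ceph_objectstore.so
    conf=ceph-bluestore.conf      # ceph.conf selecting the ObjectStore backend
    directory=/mnt/fio-bluestore  # used for the object store's data
    rw=randwrite
    iodepth=16
    size=256m

    [bluestore]
    bs=4k
    nr_files=64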
A: And beyond that, I guess we've just got a whole bunch of stuff that doesn't have much movement yet. I think Sage is hoping to soon get the fast denc code, the new encode/decode stuff, merged. That seems like it's passing at least our kind of initial functional tests and performance tests here, so it needs to go through the more extensive suites in teuthology, but I think after that it'll probably get merged.
So that's very good, yeah. All right, so the other stuff that I've got for this week: we were seeing about a fifty percent regression in random read performance. At first I thought it was something that we had done ourselves, or maybe the tuning or something, but after going through and bisecting, it turns out it was when we merged making the async messenger the default. And so I've been running through some tests.
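
[Editor's note: the bisection mentioned above follows the standard git workflow; this is a generic sketch with placeholder revisions and a hypothetical test script, not the actual ones used.]

    # mark a regressed and a known-good build, then let git binary-search
    # the history; the script must exit non-zero when the random read
    # regression reproduces
    git bisect start
    git bisect bad HEAD
    git bisect good <last-known-good-commit>
    git bisect run ./check_randread_perf.sh   # hypothetical script
    git bisect reset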
A: With that, increasing the thread count in async messenger seems to help, but it's not enough to kind of get us back up to simple messenger performance. And also, if you increase the thread counts too high, it actually makes the OSDs and the ceph commands segfault. The GDB backtrace is a little bit strange; I'm trying to remember exactly where it was, but it wasn't something that was immediately, obviously wrong.
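
[Editor's note: the thread count being raised here is presumably the async messenger's worker pool; a minimal ceph.conf sketch, assuming the ms_type and ms_async_op_threads options of this era (the default pool was small, on the order of 3 threads); the value shown is illustrative.]

    [global]
    ms_type = async           # async messenger, now the default
    # more worker threads helped throughput, but very high values
    # triggered the segfaults described above, so raise with care
    ms_async_op_threads = 5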
A: So I'll probably need to just sit down with it at some point. But yeah, in general I'm trying to see if there's anything else that can be done here easily to kind of get around this; otherwise it'll probably take a lot deeper investigation. From one perspective at least, that's good, I guess.
In other news: with BlueStore, we are seeing a lot better performance than we had back in the Jewel timeframe. All the work that's been done on encode/decode, trying to reduce the amount of data that's getting shoved into RocksDB, has paid off for random writes. So that's good news.
Really, the big remaining thing beyond just the async messenger stuff is sequential read; we've known for a while that's been problematic, and I think that's something that we're going to need to focus on again. There are other things here that we can probably improve on, just kind of general performance tweaks in various places. I'm still kind of interested in how much we're seeing blocking slowdowns in the bitmap allocator, since that's, at least in my traces, been showing up more lately,
now that we've kind of resolved other things. But overall, I think on the write side we're actually doing really well now; we're typically faster than FileStore, or as fast at least. So I think sequential read is probably the next big thing that we need to figure out what we're going to do about. So that's kind of it, yeah. I guess that's all I've got this week.
B: A question about async messenger, if you don't mind. You said that it's still a little slower than simple messenger, but doesn't it use less resources than simple messenger, and are there situations where that would become a significant factor?
While you're doing that: the reason I'm asking is because my group did some testing of, like, hyperconverged Ceph/OpenStack storage, and the Ceph OSDs use up a ton of memory, you know, and they open a lot of sockets and so forth, and I was wondering if async messenger might lower the overhead.
A: Yeah, that's a good question. So, one of the things that I did notice when I was doing this is that adding more threads to async messenger seemed to help. So, take that for what you will; it might just be that we're not keeping them busy, or that we're waiting on them, I'm not totally sure. It doesn't help enough to get us up to the performance of simple messenger, but that is one effect I did notice. I've only kind of started looking.
So right now, do you see this throughput, megabytes per second, graph, with BlueStore versus FileStore?

B: Yep.

A: So if you look at the yellow and the green line: that's from master in late July versus this wip branch plus some other patches I had added. That yellow versus the green line there, really what that's showing you is actually async messenger versus not async messenger. Even though there's all this other stuff also going on, really, that's the performance difference there. And you can kind of see that at those middle IO sizes, from like 64 or 128 KB up to, well, really up to kind of like the large IO sizes, the green line is higher. That's async messenger.
You'd expect it to use fewer resources and fewer threads in a large system, because simple messenger creates a bunch of threads for every connection. But what simple messenger does get to do is it has one thread that just sits there and reads its socket and shoves data into the OSD, whereas the async messenger has to, like, switch through and do epoll and stuff to try and watch a whole bunch of sockets, and then pass them off to a worker thread.
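
[Editor's note: a toy Python sketch of the two threading models just described, not Ceph code: simple messenger dedicates a blocking reader thread to each connection, while async messenger multiplexes many sockets over a small pool of epoll-driven event loops and hands work off to worker threads.]

    import selectors
    import socket
    import threading

    def handle(data: bytes) -> None:
        pass  # placeholder for dispatching a decoded message

    # Simple-messenger style: one dedicated reader thread per connection,
    # each blocking in recv() on its own socket.
    def simple_style(listener: socket.socket) -> None:
        def reader(conn: socket.socket) -> None:
            while data := conn.recv(4096):  # blocks until data arrives
                handle(data)
        while True:
            conn, _ = listener.accept()
            threading.Thread(target=reader, args=(conn,), daemon=True).start()

    # Async-messenger style: one loop (of a small fixed pool) watches many
    # sockets with epoll (via selectors) and dispatches whatever is ready.
    def async_style(listener: socket.socket) -> None:
        sel = selectors.DefaultSelector()
        listener.setblocking(False)
        sel.register(listener, selectors.EVENT_READ)
        while True:
            for key, _ in sel.select():  # one syscall polls all sockets
                if key.fileobj is listener:
                    conn, _ = listener.accept()
                    conn.setblocking(False)
                    sel.register(conn, selectors.EVENT_READ)
                elif data := key.fileobj.recv(4096):
                    handle(data)  # real code passes this to a worker thread
                else:
                    sel.unregister(key.fileobj)
                    key.fileobj.close()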
G: Is this a scenario where we've got, like, pretty fast OSDs and not that many of them?

A: Right. So in this case, this is kind of testing how quickly a small number of OSDs can take in a fair amount. I mean, not a ton, but, you know, it's like a queue depth of basically 64 incoming random read IOs in flight at a time, I guess, per OSD roughly. Is that right?
H: [inaudible]

A: I'll see if I can do that here.
But so, anyway, Ben, to answer your question about performance: it's definitely not always slower, you know, at least in this particular case. At larger IO sizes it actually looks like it's probably faster, kind of, for these middle ones; and then, for whatever reason, at 64K and below, that's where it seems to do worse. But it's not necessarily... I guess the reason why it seems to do worse is because, for whatever reason, simple messenger seems to spike back up in performance.
It's interesting that we see this kind of difference in pattern here, and there were some differences like this when we changed TCMalloc settings too. If I remember right, when we went up to 128 megabytes of thread cache with TCMalloc, we actually saw a degradation in these middle IO sizes. I don't know if that tells us that with simple messenger the allocator is just having a lot of trouble, and that's why we saw that, and we're not seeing that with async messenger, which is overall just kind of a little slower. But it is kind of interesting behavior, so there's probably more of a story going on here than we know right now.
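
[Editor's note: the TCMalloc thread cache is tuned for Ceph through an environment variable; a sketch assuming the sysconfig file used by the packaging of this era, and reading the "128" above as a roughly 128 MB cache.]

    # e.g. /etc/sysconfig/ceph (exact path varies by distro)
    # 128 MB thread cache, up from TCMalloc's much smaller default
    TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728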
D: Yeah, just to briefly mention that I've been running cbt via teuthology, and we're going to start adding some more data points for cbt to collect regarding, like, the different intervals for recovery; for example, how long the different phases of recovery are, that sort of thing. So it'd kind of be interesting hearing folks give us thoughts about what other things would be interesting to add in that sort of framework.
A: Cool, yeah. One of the things that cbt doesn't do as well is, like, the thrashing tests; it kind of has this canned...
D: Yeah, I think there are probably some things we could add to it, things beyond just simply taking down an OSD: things like how various kinds of other background operations behave, like scrubbing, or increasing PG count, or increasing your replication count, that kind of thing. Yeah.
B: I'm pretty close to having that done, I think; I've been finishing that up, and I'm hoping to get something for you to look at today and see what you think. Basically, for people that aren't aware of what was going on there, I was doing some work to try to make sort of the monitoring more configurable, so you could pick and choose which monitoring tools you wanted to run.
A: So the reason I brought it up was because I was thinking that, with Josh's stuff that he's working on here, it might be useful if he could tie into your work. Basically, if we were going to start playing around with, like, [inaudible] or other things, then we could use that new framework you've got.
B: So, I mean, I think you can get the basic idea of it from looking at the... I think there's something in the cbt pull request area or something; there's a pull request visible
that's there, and it gives you the basic flavor of it. And so, if you want to add a new monitoring tool, you just write a small class that inherits from a cbt monitoring base class, and it's pretty simple to write, you know. And then you basically just have to edit your YAML to reference it and it will pull it in. Is that what you were talking about?
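
[Editor's note: a minimal sketch of the plugin pattern Ben describes; the base class, method names, and YAML key below are hypothetical stand-ins for whatever his actual cbt pull request defines.]

    class Monitoring(object):
        """Hypothetical cbt monitoring base class."""
        def __init__(self, directory):
            self.directory = directory  # where output files get written

        def start(self):
            raise NotImplementedError

        def stop(self):
            raise NotImplementedError

    class MyTool(Monitoring):
        """A new monitoring tool: inherit and fill in start/stop."""
        def start(self):
            pass  # launch the tool, logging under self.directory

        def stop(self):
            pass  # shut the tool down and flush its output

The YAML that references it might then look something like this (key name is a guess):

    monitoring:
      mytool: {}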
A: Yeah, yeah, I figured that. So, like, right now, at least when you run a recovery test in cbt, there's some hard-coded thing that basically just, I think, runs, like, 'ceph health' or something over and over again and dumps out the lines into a flat file somewhere. But presumably we'd move that and any other new things over into, like, your framework, where it's just a module that you load that, you know, grabs health information or whatever, your degraded object count.
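
[Editor's note: a sketch of what that hard-coded 'ceph health' polling could become as a module in such a framework, continuing the hypothetical base class above; the interval and file name are illustrative.]

    import subprocess
    import threading
    import time

    class CephHealth(Monitoring):  # hypothetical base class from the sketch above
        """Poll 'ceph health' periodically, appending each report to a flat file."""
        def start(self):
            self.stop_event = threading.Event()
            self.thread = threading.Thread(target=self._poll)
            self.thread.start()

        def _poll(self):
            with open('%s/ceph_health.log' % self.directory, 'a') as log:
                while not self.stop_event.is_set():
                    out = subprocess.check_output(['ceph', 'health'])
                    log.write(out.decode())
                    time.sleep(5)

        def stop(self):
            self.stop_event.set()
            self.thread.join()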
B: Yeah, so, yeah. Sorry it's taking me a little longer than I had hoped, because I had to juggle a bunch of things, but I posted the pull request, which is sort of not totally up to date but gives you the basic idea of it, in the chat window for anybody that's interested. And what I've been doing since then is I've pulled some of the common code that's in all these different benchmarks into the...
A: Yeah, well, cool. Yeah, thank you, Ben, for working on that one, and thank you, Josh, for the stuff that you're looking at, because they're both going to be really, really nice. All right, anything else, guys?