From YouTube: 2019-06-27 :: Ceph Performance Meeting
A
Yeah, before I forget: David was just asking about changing the debug level of the, of trim in BlueStore to something higher than 20. So, go.
A
Wait, so I actually started looking at this this morning. I've got really recent wallclock profiles, and a lot of it is, like, destroying strings in an event. So right now I'm just switching over to an enum that's referenced from when you register in the op tracker, and then just, like, using it.
C
I'll see how much that helps. Okay, okay, so there's that. There's now a pull request from Igor that makes the, you know, deletion stuff in BlueStore faster, and this one surprised me a little. You must have seen this, where the OSD is spending a lot of time deleting PGs after they get rebalanced or something. But it's basically a little bit more clever about, when it's scanning, prefetching and preloading stuff into the cache so it doesn't go back and read it again later, and so on.
C
So it makes sense, but it needs a closer review. A couple of things merged: adding SSD and HDD variants for the recovery max active option, based on our discussion last week; that merged. The open question there is whether turning up just the recovery max active is enough.
C
That's not what this is. osd_recovery_max_active is the number of in-flight recovery writes, essentially messages, at once. So if it's one, then there's one message with, like, 4 megs or whatever going over the network; you wait until you get a reply, and then you send the next one. If it's 100, you've got literally a hundred messages of four megs, so four hundred megs in flight: like the queue depth, or pipelining, of the recovery.
C
Max backfills is basically how many PGs you're spreading it across. Yeah, so max backfills would be, like, are we going to two PGs that each have 50 in-flight operations, as opposed to one PG with 100 operations. But the only real difference is that if you're bottlenecked on the source, like if you're backfilling to the same... or, yeah, I guess it does actually matter.
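For reference, a minimal sketch of adjusting those two knobs on a running cluster via the ceph CLI (this assumes a Nautilus-or-later cluster where `ceph config set` is available; osd_recovery_max_active and osd_max_backfills are the options discussed above, and the values are just examples):

```python
import subprocess

def ceph_config_set(who: str, option: str, value: str) -> None:
    """Set a config option on a running cluster via the ceph CLI."""
    subprocess.run(["ceph", "config", "set", who, option, value], check=True)

# osd_recovery_max_active: in-flight recovery messages at once
# (the "queue depth" of recovery described above).
ceph_config_set("osd", "osd_recovery_max_active", "16")

# osd_max_backfills: how many PGs an OSD backfills at once
# (how many PGs those in-flight ops are spread across).
ceph_config_set("osd", "osd_max_backfills", "2")
```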
C
If
you
think,
if
you're
bottlenecked
on
the
read
side,
then
having
multiple
backfills
means
that
two
O's
geez
might
be
reading
and
writing
to
the
same
location
but
might
go
faster
versus
having
one
OS,
do
you
reading
and
writing
to
the
same
location
but
I'm
a
little
bit
I
think
there
are
cases
where
increasing
max
backfills
will
speed
it
up
that
most
of
the
time,
just
increasing
max
active
should
enough.
It.
A
It didn't seem to work that way when I was testing it. Okay, it looked... the behavior seemed to be that, with max active bumped way up, we could get roughly the number of OSDs divided by two active PGs being worked on in the cluster at the same time, though that was with... that was with the osd max, sorry, the osd recovery max active, right.
C
Is that... I forget. I mean, the goal is to have... is the overall throughput of recovery... No, max backfills controls how many PGs are recovering, but that's orthogonal to the throughput, right? So we don't care about how many PGs are actually recovering; we actually want that to be a smaller number so that they can sleep sooner. What we want is the recovery throughput to be as high as possible. So don't pay attention to the PG count; pay attention to throughput.
C
We should repeat that experiment, basically, and crank up max active really high and see how high the recovery throughput goes. That's the thing we want to maximize. We don't care at all about the number of PGs; we want to maximize the throughput and see if it's really necessary to increase max backfills, or if we already get higher throughput without it.
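A rough sketch of that experiment, under the assumption that `ceph status -f json` exposes a pgmap recovery-rate field (the field only appears while recovery is active, and its exact name may differ between Ceph releases):

```python
import json
import subprocess
import time

def recovery_rate_bytes() -> int:
    """Cluster-wide recovery rate from `ceph status`.

    Assumes the pgmap section exposes recovering_bytes_per_sec;
    treat the key name as an assumption.
    """
    out = subprocess.check_output(["ceph", "status", "-f", "json"])
    return json.loads(out).get("pgmap", {}).get("recovering_bytes_per_sec", 0)

# Crank up the recovery queue depth, then watch aggregate throughput.
subprocess.run(["ceph", "config", "set", "osd",
                "osd_recovery_max_active", "64"], check=True)
samples = []
for _ in range(60):  # sample for ~5 minutes
    samples.append(recovery_rate_bytes())
    time.sleep(5)
print("peak recovery MB/s:", max(samples) / 1e6)
```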
C
Higher than that... if it's a significant improvement, like two or four, then maybe.
C
Sure, and that's going to mean that those PGs... So probably what we really should do is run that experiment, but actually measure the start and end times of recovery for each PG, and then look at how much slower the recovery times for individual PGs are. Or actually, even just look at the overall recovery time for the whole cluster, because maybe PGs take a little bit longer, but, because they can start sooner, the overall recovery is sooner.
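One hedged way to collect those per-PG numbers: poll `ceph pg dump` and record when each PG enters and leaves a recovering or backfilling state. The pg_map/pg_stats key path is an assumption that varies across Ceph releases, and the loop assumes recovery is already underway:

```python
import json
import subprocess
import time

recovery_window = {}  # pgid -> (first seen recovering, last seen recovering)

def recovering_pgs():
    """PG ids currently in a recovering or backfilling state."""
    out = subprocess.check_output(["ceph", "pg", "dump", "-f", "json"])
    stats = json.loads(out).get("pg_map", {}).get("pg_stats", [])
    return {p["pgid"] for p in stats
            if "recover" in p["state"] or "backfill" in p["state"]}

while True:
    now = time.time()
    active = recovering_pgs()
    if not active and recovery_window:
        break  # recovery finished
    for pgid in active:
        start, _ = recovery_window.get(pgid, (now, now))
        recovery_window[pgid] = (start, now)
    time.sleep(5)

for pgid, (start, end) in sorted(recovery_window.items()):
    print(f"{pgid}: recovered in {end - start:.0f}s")
starts = [s for s, _ in recovery_window.values()]
ends = [e for _, e in recovery_window.values()]
print(f"overall: {max(ends) - min(starts):.0f}s")
```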
A
So, Sage, my alternate proposal is: rather than doing all of this work and, like, trying to figure out optimal values for these, which actually may not make sense, based on how variable different NVMe drives are, let's just make it, like, automatic; like, trying to auto-tune this based on client I/O and wall-clock time. Let's do it that way, and then we don't have to try to, like, figure out stuff. We just have it tuning itself. Yeah.
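Nothing like this existed at the time, but as a toy illustration of the auto-tuning idea: back off recovery when client latency suffers, speed it up when it doesn't. osd_recovery_sleep is a real option; the target threshold, step sizes, and the use of `ceph osd perf` as a stand-in for client I/O health (and its JSON layout) are all invented for the sketch:

```python
import json
import subprocess
import time

TARGET_LAT_MS = 20.0  # made-up client-latency target for the sketch

def osd_commit_latency_ms() -> float:
    """Average commit latency across OSDs, from `ceph osd perf`.

    The osdstats/osd_perf_infos layout is an assumption and varies
    across releases.
    """
    out = subprocess.check_output(["ceph", "osd", "perf", "-f", "json"])
    data = json.loads(out)
    infos = data.get("osdstats", data).get("osd_perf_infos", [])
    lats = [i["perf_stats"]["commit_latency_ms"] for i in infos]
    return sum(lats) / len(lats) if lats else 0.0

sleep_s = 0.0  # current osd_recovery_sleep value
while True:
    # Back off recovery when latency suffers; speed it up otherwise.
    if osd_commit_latency_ms() > TARGET_LAT_MS:
        sleep_s = min(sleep_s + 0.01, 0.5)
    else:
        sleep_s = max(sleep_s - 0.01, 0.0)
    subprocess.run(["ceph", "config", "set", "osd",
                    "osd_recovery_sleep", f"{sleep_s:.2f}"], check=True)
    time.sleep(10)
```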
C
Another pull request related to performance that was merged, that's good: there's one from Jianpeng Ma that adds a new op, create. This is basically a BlueStore-related optimization that allows us to avoid some lookups in the cache when we know we're creating an object that doesn't exist, and that's merged. There was one bug that fell out of that, but I think that's getting fixed now, so...
C
Roman had a messenger change; it was sort of a micro-benchmark win, and it's unclear whether that's going to have an effect on real workloads. So there's some discussion there, and it's unclear whether it's worth proceeding with. And then your cache pinning, which I guess is going to follow after... that's going to be based on top of your cache refactor, Mark. Yeah.
D
We talked about trying to set up a sort of, like, continuous performance benchmarking for Ceph development: so, testing against things that are merging in release branches, and setting up, like, what defines an actual regression. And I think there are some machines available for this. It would involve some CBT work for sure, and integration with... I mean, ideally, integration with GitHub, so that we would be able to publish results immediately, like what the bench... what the test is actually spitting out. I think it's very interesting work.
A
No... Mark, do you know that? So, Sage, I think the idea would be that, for the stuff that Kefu and Radek want, it would be entirely through Jenkins, basically just doing PR builds, and then the stuff in teuthology would be potentially even the same. It could be even the same tests that are run through CBT, or a different set of tests; it doesn't matter, but they would be, you know, the stuff that's in teuthology would be kind of...
C
Yeah, ideally something that's simple enough that it'll run on a single box: multiple OSDs, but one box, no network variation. So if this is through... so we're talking about doing this through Jenkins, and it's, like, hooked into GitHub. All of the current plugins have, like, a one-liner, basically, that you can show after the test name. You know, currently it's things like "all commits are signed".
D
Just a comment? Yep, absolutely; maybe even better. Now, I think that... but yeah, sorry, I think that we just need to concentrate on getting the benchmarks working in there, and for sure I'll bring it up so that we can discuss, like, you know, what do you guys think is the best thing.
A
We can add attachments to... like, I don't know if you can do that, but if you can, like, insert attachments into the comments, with, like, a bunch of wallclock profiles, we could have automatically generated wallclock profiles for every single, like, performance-tagged PR. That would be amazing, yeah.
C
Yeah, there are, like, a handful of, I don't know, two, three, four tests that run, or even just one. I could imagine that the comment would be, like, a little table that has the test name and then a delta to the latest master and the latest stable branch that it's based on. So, like, latest master, latest Nautilus, you know: plus 10%, minus 2%, whatever it is.
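A sketch of how such a comment might be posted, assuming the CI job hands the script a dict of per-test deltas and a token in GITHUB_TOKEN; the endpoint is GitHub's standard issues-comment API, and the repo/PR values in the usage note are placeholders:

```python
import os
import requests

def post_benchmark_comment(repo: str, pr: int, results: dict) -> None:
    """Post a markdown table of benchmark deltas as a PR comment.

    results maps test name -> (delta vs master %, delta vs stable %).
    """
    rows = ["| test | vs master | vs nautilus |", "| --- | --- | --- |"]
    for name, (d_master, d_stable) in sorted(results.items()):
        rows.append(f"| {name} | {d_master:+.1f}% | {d_stable:+.1f}% |")
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/issues/{pr}/comments",
        headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
        json={"body": "\n".join(rows)},
    )
    resp.raise_for_status()

# Hypothetical usage with made-up numbers:
# post_benchmark_comment("ceph/ceph", 12345,
#                        {"radosbench_4M_write": (1.2, -0.5)})
```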
A
I was thinking we could do, like, different tests based on the tags. If the tag is performance and RBD, we run some RBD tests; if it's performance and rgw, we can do different tests, that kind of thing. Yeah, that'd be pretty cool. I don't know if that's hard to do or not, but it seems like that would be something that, hopefully, the Jenkins GitHub integration and stuff can do.
C
If we had both a baseline with the current master, like if you just keep track of what the latest master run was, and also the latest for each stable release branch, and you showed both, then it'll also just let you sort of keep a running tab on, like, how you're doing overall. Because as the development cycle progresses, you'll see you're, like, plus or minus 1% on master and, you know, plus 20% on, um, Nautilus. The other thing that's nice is just having it to see your progress over the course of the release.
A
Alfredo expressed some interest in possibly also helping out, working on, like, the data-parsing side and graphing, and making, you know, nice, pretty results from it. He's done some of that before, I think. It doesn't look like he's here today, but I might try to get you guys hooked together too, so that you can...
A
Doesn't he work for... he does now; he's from Intel, right? The goal that we had for CBT output data was to have, like, a hash-encoded directory structure for the output data, so that you'd be able to have that kind of be the unique key that defines the run, based on all of the parameters that were used to run the test. And then you could build some kind of index for it, have it, you know, do some kind of query against your set of output directories.
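A minimal sketch of that hash-encoded layout (the parameter names are invented for illustration); the point is that the directory name is a stable hash of the full parameter set, so identical configurations always map to the same path:

```python
import hashlib
import json
import os

def run_dir(base: str, params: dict) -> str:
    """Map a benchmark's full parameter set to a unique output directory.

    Serializing with sorted keys makes the hash stable, so identical
    configurations land at (and can be looked up from) the same path;
    params.json is kept alongside so an index can be rebuilt later.
    """
    key = json.dumps(params, sort_keys=True)
    digest = hashlib.sha1(key.encode()).hexdigest()[:16]
    path = os.path.join(base, digest)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "params.json"), "w") as f:
        f.write(key)
    return path

# Hypothetical parameters for one CBT-style run:
out = run_dir("/data/cbt", {"benchmark": "radosbench", "op_size": 4194304,
                            "osds": 8, "branch": "master"})
```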
B
Right, I've got it. Yeah, I've got a quick note. We did discuss this meeting with the MOC students and invited them to periodically drop in and maybe even present some of their stuff or ask questions. I mean, I know, you know, we are trying to be a little more engaged with them. I know Mark has been working with, or at least talking to them a little bit, but there's just been a lot of discussion with BU and Northeastern faculty...
B
...about what's the right engagement model with the students. I think they're a little bit protective of them on the one hand, and on the other hand, Matt and I were stressing that if they're going to actually be doing things and, you know, even if we take the code and ultimately get it merged, the right way to engage is with our upstream meetings. So that might be happening.
A
One thing I did want to mention is that Intel is, like, ready to ship us the nodes for the new officinalis cluster. David's given them the shipping address and everything, so they've got everything they need, so those might just be showing up at any random time here. I don't know what the status of the network is. Yeah, I think we've got hardware in; hopefully, I think we have it on. Okay.
B
And that node configuration, with just the swapping of TLC for QLC, is probably what Intel is going to recommend slash provide for MOC, for both rgw caching, like the read-only S3-level caching that we're trying to get merged, as well as just straight RBD storage. Peter Desnoyers did some math, and he was convinced that the QLC is going to be as good, as far as data retention and reliability, as the high-capacity drives he's got. You know, he almost guaranteed it, though he wasn't quite sure of his math, but anyway, they're gonna...
A
So if those come in soon, we're going to have a ton of work to do to get those set up, and then we've got some tests that Intel had been showing a lot of interest in running, so, testing on those. And then we'll start migrating the incerta nodes into, probably, teuthology, so we can lock them and then assign them for Jenkins then, Alfredo; at least some of those will go to you, for Jenkins, for yourself.
A
Yes, that would be... yep, that would be. The goal is to migrate incerta nodes, probably like three through eight, to those, because incerta three through eight have probably had less wear on the NVMe drives. Incerta one is probably going to die soon, because that's the one I use a lot, but we'll keep maybe one or two of them just around for, like, one-off testing. I think, yeah, maybe like six of them for Jenkins.
A
Intel also expressed a lot of interest in having some of the officinalis nodes potentially participate in, like, PR testing, so you may get a couple of those as well. We have ten of them, which we'll have to figure out the right way to divide up, but we might at least be able to get you a couple of those.