From YouTube: CDS Jewel -- RADOS Tail Latency Improvements
A
B
One of the problems is that when an OSD notices a problem itself and asserts, it goes through the same sort of failure detection as a host physically dying, and we can do a little bit better than that. We don't actually need to wait for the cluster to detect us as down through the heartbeat process: everyone reports to the monitors, the monitors decide enough reporters have reported, publish a new OSD map marking us down, and finally we can start doing I/O somewhere else. That process takes, you know, 30 seconds or something.
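As a rough illustration of that detection path, here is a minimal sketch of the monitor-side decision, with hypothetical names and illustrative values standing in for the usual knobs (a heartbeat grace period and a minimum number of distinct reporters); it is not the actual Ceph implementation.

    import time
    from collections import defaultdict

    # Illustrative values standing in for the usual knobs
    # (heartbeat grace, minimum distinct down-reporters).
    HEARTBEAT_GRACE = 20.0   # seconds without heartbeats before peers report
    MIN_DOWN_REPORTERS = 2   # distinct reporters needed before marking down

    class FailureDetector:
        """Sketch of the monitor-side 'mark this OSD down' decision."""
        def __init__(self):
            self.reports = defaultdict(dict)   # target osd -> {reporter: time}

        def report_failure(self, reporter, target, now=None):
            now = now if now is not None else time.time()
            self.reports[target][reporter] = now
            return self.should_mark_down(target, now)

        def should_mark_down(self, target, now):
            # Keep only fresh reports, then require enough distinct reporters.
            fresh = {r: t for r, t in self.reports[target].items()
                     if now - t < HEARTBEAT_GRACE}
            self.reports[target] = fresh
            return len(fresh) >= MIN_DOWN_REPORTERS

Once that returns true the monitors publish a new OSD map marking the OSD down; everything before that point is what costs the 30 seconds being described.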
B
So if we're going to fail an assert because of an EIO, or because something bad happened in the software, we could instead do the same thing we do in the graceful shutdown case, which is send a message to the monitor on the way out saying "I certify myself as being dead and I will not respond to further messages." That allows the monitor to mark it down without waiting for reporters.
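The graceful-shutdown path already tells the monitor to mark the OSD down on the way out; a sketch of wiring the same self-certification into the fatal-error path might look like the following (the message and API names here are hypothetical, not the actual OSD code).

    import os
    import sys

    def send_mark_me_down(mon_client, osd_id):
        """Ask the monitor to mark this OSD down immediately (hypothetical API)."""
        mon_client.send({"op": "mark_me_down", "osd": osd_id, "final": True})
        mon_client.flush()   # make sure the message leaves before we exit

    def fatal_error(mon_client, osd_id, reason):
        # Same idea as graceful shutdown: self-certify as dead instead of
        # waiting for peers to report us and for the failure grace to expire.
        try:
            send_mark_me_down(mon_client, osd_id)
        finally:
            sys.stderr.write("osd.%d aborting: %s\n" % (osd_id, reason))
            os._exit(1)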
B
So that would be a small but very tractable change that someone could do. There's going to be a later session on some peering speed improvements, so I'll just summarize that as: there are some peering situations we can get through with fewer messages than we currently spend. Well, actually, that's the next session.
B
So the big one, though, is automatically detecting slow OSDs. If you're used to running a decently sized Ceph cluster, you'll notice that if an OSD gets slow, the whole cluster will tend to slow down to that speed, because CRUSH partitions data based on whatever weight you gave it, not based on any kind of real-time evaluation of OSD performance.
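To make that concrete, here is a toy weighted placement sketch in the spirit of CRUSH (it is not the real algorithm): the choice depends only on the object name and the static weights, so a device that is temporarily slow keeps receiving its full share of the data.

    import hashlib

    def weighted_pick(obj_name, osds):
        """Toy deterministic weighted choice over (osd_id, weight) pairs.
        Higher weight wins more often, but observed latency never enters
        the decision, which is the point being made here."""
        best, best_score = None, -1.0
        for osd_id, weight in osds:
            h = hashlib.sha1(("%s:%s" % (obj_name, osd_id)).encode()).hexdigest()
            draw = int(h[:8], 16) / float(0xFFFFFFFF)   # pseudo-random in [0, 1)
            score = weight * draw
            if score > best_score:
                best, best_score = osd_id, score
        return best

    print(weighted_pick("rbd_data.1234", [("osd.0", 1.0), ("osd.1", 1.0), ("osd.2", 0.5)]))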
B
That part probably can't really change, because you don't want to be dynamically moving data around just because an OSD happens to be a little slow right now, but we might want to shift primariness away from such OSDs, and we might want to preemptively mark them down if we think that the slowness is possibly a symptom of a failing disk.
B
There's also a patch out to read... I don't know if anyone read the paper, from Yahoo I think, but for an EC pool it reads all of the chunks and uses the fastest k returned chunks to reconstruct the read, and they found there was a pretty substantial improvement. One of the problems is that it still goes through the primary, so if the primary is slow, it'll still be slow. It doesn't really add a benefit for writes, and there's nothing analogous we can... well, I guess, what about... yeah, there really isn't anything analogous.
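The idea in that approach is roughly the following sketch (hypothetical names, not the actual patch): issue reads for all of the chunks at once and reconstruct as soon as any k of them have returned.

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def ec_read_fastest_k(chunk_readers, k, obj):
        """Read all erasure-coded chunks in parallel and reconstruct from the
        first k that arrive. chunk_readers is one callable per shard, each
        returning (shard_index, bytes)."""
        pool = ThreadPoolExecutor(max_workers=len(chunk_readers))
        futures = [pool.submit(read, obj) for read in chunk_readers]
        shards = {}
        for fut in as_completed(futures):
            idx, data = fut.result()
            shards[idx] = data
            if len(shards) >= k:
                break                     # enough shards; ignore the stragglers
        pool.shutdown(wait=False)         # don't wait for the slow shards
        return ec_decode(shards, k)       # stand-in for the EC plugin's decode

    def ec_decode(shards, k):
        # Placeholder: with k data shards present this is just reassembly;
        # a real implementation would call the erasure-code plugin.
        return b"".join(shards[i] for i in sorted(shards)[:k])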
B
B
So, a fair point about that: it may be faster to go through the FileStore, even if it is under load, than it would be to go to the replica. One thing we're working on that might help would be to concurrently read, from the client, from multiple replicas. I'm working on some patches to make replica reads work properly, so it wouldn't be a large step from that to being able, at the client, to read from all the replicas and use the fastest response.
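A minimal sketch of that client-side idea, assuming replica reads already work and using hypothetical API names: send the same read to every replica and take whichever response comes back first.

    from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

    def read_fastest_replica(replica_reads, obj, off, length):
        """Issue the same read to the primary and every replica and return
        the first response. replica_reads are callables wrapping a
        hypothetical per-OSD read call."""
        pool = ThreadPoolExecutor(max_workers=len(replica_reads))
        futures = [pool.submit(r, obj, off, length) for r in replica_reads]
        done, _pending = wait(futures, return_when=FIRST_COMPLETED)
        pool.shutdown(wait=False)          # leave the slower replicas behind
        return next(iter(done)).result()

As noted later in the session, this does not make the slow OSD any faster; it only hides the tail from the client at the cost of extra total read load.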
B
B
C
Except for the slow disk: later we found that most of the slow requests are caused by the FileStore [inaudible], because something like reading all of the info has to look it up from disk, and we found it can spend nearly 500 milliseconds to read it, to complete a read of the info for a request. So for this problem, I think maybe we could be more focused on something like this.
B
Well, that is... Sage is actively working on the replacement for the FileStore, so it's true the FileStore could be faster, and we're going to fix that. Also, do you feel that this is causing generally slower reads, or a big spike in 99th percentile reads, that is, something like the slowest reads being much, much slower than the average case? Do you have a sense of which of those is true?
C
B
C
B
So if it actually is the case that the object store itself has an extremely large read latency distribution, then we could get some benefit by sending the read in parallel to the replicas from the primary. So that would be an interesting approach. As far as actually improving the FileStore performance, we may not choose to do that; I think what we're going to choose to do instead is rewrite it.
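A sketch of that primary-side variant (again with hypothetical names): the primary starts its local object-store read and a read to one replica in parallel, and replies with whichever finishes first.

    from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

    def primary_speculative_read(local_read, replica_read, obj, off, length):
        """Run the primary's local read and one replica read concurrently and
        return whichever completes first; the slower result is ignored."""
        pool = ThreadPoolExecutor(max_workers=2)
        futures = [pool.submit(local_read, obj, off, length),
                   pool.submit(replica_read, obj, off, length)]
        done, _pending = wait(futures, return_when=FIRST_COMPLETED)
        pool.shutdown(wait=False)
        return next(iter(done)).result()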
B
Much of the complexity in the FileStore comes from trying to use the file system's directories to facilitate collection listing, which wasn't a good idea, as it turns out. That's why we have the hash index; that's why we have to do all those directory traversals. So it's not so much that we couldn't make the existing implementation better; it would be better to not have to do that in the first place, which is why Sage is working on NewStore, I know.
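For context on those traversals, the FileStore's hash index nests a PG's objects in directories derived from the object hash so that collection listing stays bounded; a toy sketch of that mapping (not the actual HashIndex code) looks like this.

    import hashlib

    def hash_index_path(pg_root, object_name, depth=3):
        """Toy version of a hash-index layout: nest the object under
        directories derived from the leading nibbles of its hash. The real
        HashIndex splits directories dynamically; a fixed depth is used here
        only to keep the sketch short."""
        h = hashlib.md5(object_name.encode()).hexdigest().upper()
        dirs = "/".join("DIR_%s" % h[i] for i in range(depth))
        return "%s/%s/%s__head_%s" % (pg_root, dirs, object_name, h[:8])

    # Listing a collection then means walking this tree in hash order,
    # which is exactly the directory-traversal cost being described.
    print(hash_index_path("/var/lib/ceph/osd/ceph-0/current/1.0_head", "rbd_data.1234"))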
C
No more, no more, yeah. I just came up with some common questions, and I don't know how much sense they make to you... so, before NewStore.
B
B
Oh yeah, oh, I was looking at the wrong one; sorry, I was looking at the wrong Ceph IRC channel. Oh, I see it here: "How will an OSD know to mark itself as slow?" So that is one way it could do it. That's a little bit tricky, because it needs to know what the other OSDs consider to be a normal throughput. That question came from... Guang? Just a quick thing, Guang: I was talking about replication.
B
In that case, doing it for EC is harder, because the client needs to have access to the erasure coding library used, which isn't impossible, just harder, so that would be different. The client is already capable of performing replicated reads from the primary or from replicas; it's just that there are holes in the implementation that make it not a good idea to use in the general case, but once that's fixed, which should be relatively soon...
B
So, let's see, as far as marking itself slow: it's odd, because you could also have multiple classes of OSDs, so we'll probably have to create some kind of a process where the OSD benchmarks itself and stores what it believes its speed to be, and then over time it would be able to detect the degradation, perhaps. I'm not sure; we'll certainly be looking for input on that. Do you have any thoughts on that?
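One plausible shape for that, sketched with hypothetical names: benchmark at startup to establish a baseline, then track an exponentially weighted moving average of observed op latency and flag the OSD as degraded when it drifts well past the baseline.

    class SlownessDetector:
        """Sketch of an OSD self-check: compare an EWMA of observed op
        latency against a baseline measured when the OSD was healthy."""
        def __init__(self, baseline_ms, alpha=0.05, degraded_factor=4.0):
            self.baseline_ms = baseline_ms      # e.g. from a startup self-benchmark
            self.alpha = alpha                  # EWMA smoothing factor
            self.degraded_factor = degraded_factor
            self.ewma_ms = baseline_ms

        def record_op(self, latency_ms):
            self.ewma_ms = (1 - self.alpha) * self.ewma_ms + self.alpha * latency_ms

        def is_degraded(self):
            # Persistently running far above the healthy baseline is the
            # symptom worth acting on (lower primary affinity, or pre-mark down).
            return self.ewma_ms > self.degraded_factor * self.baseline_ms

    det = SlownessDetector(baseline_ms=8.0)
    for lat in [9, 10, 12, 60, 80, 75, 90, 85] * 20:
        det.record_op(lat)
    print(det.is_degraded())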
D
Oh yeah, no, no, that's fine! I just want to make sure you get... so, on this topic specifically, couldn't we take a look at statistics regarding things like the I/O size and the throughput, and kind of get a general sense of what we think the back end is? You know, there are certain classes of drives, right, like 7200 RPM drives, that have fairly consistent performance characteristics. You might be able...
B
D
D
B
B
D
B
Might... that's a little bit... well, actually, let me get to "prioritize primary requests to an OSD". What do you mean by prioritize?
C
B
With that... anyway, that's possibly a feature, not a bug. For one thing, RADOS already limits the size of objects you can have, so you can't have a one-gigabyte write and a one-kilobyte write. You could have a one-kilobyte write and a four-megabyte write, yeah.
C
B
I would argue that the sizes you naturally want your RADOS objects to be are such that you wouldn't want to break a write up, because it'll be dominated by the seek time and not by the throughput. Apart from that, if the large write got to the journaling code first, then it's unlikely we'd want to break it up; it's likely that we'd want to simply finish it, and likewise for the part where it actually applies it to the file...
C
B
D
D
B
B
There are two kinds of requests: there's the kind that comes from the client, where you haven't committed locks or serialization resources yet, and then there are requests you need to complete as soon as possible. That would be requests from a primary, or from the replicas back to the primary, that you've already committed resources for. So no, you wouldn't... you would always prioritize requests from the primary.
B
D
B
The replicas then send back sub-op replies, which you then process through the same queue, and then you send back to the client an op reply. The first one, the original client request, has a priority of 63, which is the highest non-strict priority; everything after that has a priority of highest, or something. They always prioritize ahead of client requests that haven't been seen yet, because a client request is actually blocked on them and, furthermore, we're holding a lock. So once we've...
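A minimal sketch of that queueing rule (illustrative only, not the actual OSD op queue): new client ops enter at priority 63, and sub-op replies and other work that an in-flight client op is already blocked on are queued above them, so committed work always drains first.

    import heapq
    import itertools

    CLIENT_OP_PRIO = 63      # highest non-strict priority for new client ops
    BLOCKED_WORK_PRIO = 127  # illustrative value for work in-flight ops wait on

    class OpQueue:
        """Tiny priority queue: higher priority dequeues first, FIFO within
        a priority level (heapq is a min-heap, so priorities are negated)."""
        def __init__(self):
            self._heap = []
            self._seq = itertools.count()

        def enqueue(self, op, priority):
            heapq.heappush(self._heap, (-priority, next(self._seq), op))

        def dequeue(self):
            return heapq.heappop(self._heap)[2]

    q = OpQueue()
    q.enqueue("client write A", CLIENT_OP_PRIO)
    q.enqueue("sub-op reply for an in-flight write", BLOCKED_WORK_PRIO)
    q.enqueue("client write B", CLIENT_OP_PRIO)
    print(q.dequeue())   # the sub-op reply drains before the new client ops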
B
D
B
D
D
B
B
D
B
The only way to really mitigate that is either to reduce the work the primary is doing, period, which means not sending the request to your own store but sending it to a replica instead, or sending it to both and using the fastest return (that, by the way, doesn't make the primary less slow; it just means the client doesn't see it, and it actually increases overall load), or the client could send it from the client to multiple replicas and use the fastest return, which, again, is the same thing.
D
One thing that I've toyed with in the past, which I don't know if it's even worth getting into here, is the thought of having some kind of local decision-making process: if a particular disk is slow, having an OSD be able to potentially kind of spill over to another one. But it's kind of a... it could probably all happen right below Ceph, so that Ceph doesn't even have to be involved, but...
B
We can do it above Ceph, actually, and that's... there are two kinds of things you can do when you have a slow disk: things that require a lot of effort and things that require a very small amount of effort. Things that require a very small amount of effort include letting someone else be the primary for your PGs, and we...
A
B
We have machinery in the monitor for that: the primary affinity. If you set such an OSD's primary affinity to 0, it would still hold the same data, so you wouldn't shift any data, but CRUSH would tend to output a different result: for all the PGs you're in the acting set for, a different OSD would wind up as the primary, so you wouldn't see requests. That's the kind of thing we could do. What are we looking for... quickly, as in response to transient shifts in I/O? Well, somewhat quickly, in response to relatively short transient shifts in slowness.
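Primary affinity is already settable per OSD (for example, ceph osd primary-affinity osd.3 0). A toy sketch of the selection rule it feeds into (not the real CRUSH/OSDMap logic) shows the point being made: the acting set, and therefore the data, is unchanged, and only the primary role moves away from affinity-0 OSDs.

    import hashlib

    def choose_primary(pg_id, acting_set, primary_affinity):
        """Toy choice of which acting-set member serves as primary, honoring
        a per-OSD primary affinity in [0, 1]."""
        for osd in acting_set:
            h = hashlib.sha1(("%s:%s" % (pg_id, osd)).encode()).hexdigest()
            draw = int(h[:8], 16) / float(0xFFFFFFFF)
            if draw < primary_affinity.get(osd, 1.0):
                return osd
        return acting_set[0]   # fallback if every member opted out

    affinity = {"osd.1": 0.0}  # a slow OSD we no longer want serving as primary
    print(choose_primary("1.2f", ["osd.1", "osd.4", "osd.7"], affinity))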
D
B
As long as all of the pools have a uniform distribution, or as long as your I/O is partitioned over your pools such that it's uniformly distributed over the PGs, then it still works. The problem you get into is when the pool that receives all of the I/O doesn't have enough PGs, and then that's not a skew problem; that's simply that you don't have enough PGs. It's not that you have too many pools; it's not that you have...
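As a concrete illustration of "enough PGs", the commonly cited rule of thumb is on the order of 100 PGs per OSD, divided by the pool's replica count and rounded up to a power of two; a quick sketch of that arithmetic:

    def suggested_pg_count(num_osds, pool_size, target_pgs_per_osd=100):
        """Rule-of-thumb PG count: about target_pgs_per_osd PGs per OSD,
        divided by the replication factor, rounded up to a power of two."""
        raw = num_osds * target_pgs_per_osd / float(pool_size)
        power = 1
        while power < raw:
            power *= 2
        return power

    # e.g. 40 OSDs with 3x replication: 40 * 100 / 3 = 1333 -> 2048 PGs
    print(suggested_pg_count(40, 3))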
D
B
This is kind of a big area; there are lots of ways we could consider doing the OSD speed thing. It's going to be kind of a lot of work, and the nice thing about it is that it's not that internal to the OSD or the monitor: it'll be in the code, but it'll be a new chunk of heuristic that will only interact with the existing code minimally. So this would be a pretty good project for someone who had a good design they thought might be worth implementing; it'd be pretty easy to prototype, also.
B
D
B
B
So that should help a lot, because before, we were relying badly on the pthreads, on the OS's own scheduling, to decide when we were going to do recovery versus when we were going to work on client I/O, and that was never all that good an idea. So I think we'll get better performance under contention once people start testing out those changes. After that, it'll be, I mean, as Haomai is pointing out...
B
D
Yeah, we are seeing that, actually. Interestingly enough, with large I/Os, with like large sequential reads, NewStore is significantly better than the FileStore. The place where NewStore really falls apart is on RBD overwrites primarily, but really on object overwrites in general, like large object overwrites. So that's kind of what we need to focus on there, but in a lot of other cases NewStore is actually looking really good. Cool.
D
D
You know, I've got data; I haven't looked at it. I should go back and do that, but I don't have an answer for you right now on what the CPU usage looks like, although I suspect that, from a performance perspective, it's no worse. This machine that I've been doing testing on is not super fast in terms of CPU, and NewStore does better.
D
Maybe, maybe one question here would be: both with the FileStore and with NewStore, you know, the LevelDB or RocksDB or whatever we have for key-value stores is becoming more important, and latency there is a really big deal, especially during compaction. So that's, in terms of reducing our long tail latency...
C
I have an idea: could we reduce the number of PG log entries written in the normal state? By default it is one thousand PG log entries, if I remember correctly, so we could make the PG log buffered [inaudible], and we could keep the PG log state in memory and not flush it to disk. So maybe we can reduce some of the PG log writing.
B
Yeah, that probably wouldn't hurt. So, the number of PG log entries that are configured is not a magic number; you need enough so that you don't spuriously go into backfill when you reboot a host. Beyond that, I'm not sure it matters that much. You could also tweak the LevelDB settings so that it's caching a much larger amount, which would be a good idea; there's nothing magic about the LevelDB settings either, yeah.
C
B
Another option, if it's something you want to try, is to try running RocksDB under the OSD instead of LevelDB. I believe there's a way to configure it; I don't remember what it is. Note that this is: don't run NewStore, run the FileStore with, yeah, RocksDB.
B
D
I don't remember if LevelDB gives you statistics like this, but with RocksDB, if we run the OSD using RocksDB, it will give us semi-regular statistics on kind of what data is hitting what level, and kind of the size of the key-value pairs and all this other stuff. Do we have a good sense right now of what the workload looks like from the OSD, or are we kind of in the dark still, do you think?
C
D
Do you know, have we seen any evidence in the recent past that LevelDB is causing any kind of latency or other issues on the OSD? I remember kind of looking at this a couple of years ago, and I didn't think that it was, but I'm now questioning myself a little bit; I don't...