From YouTube: CDS Infernalis (Day 2.2) -- OSD: Peering / Latency
Description
Videos from Ceph Developer Summit: Infernalis (Day 2.2)
04 March 2015
https://wiki.ceph.com/Planning/CDS/Infernalis_(Mar_2015)
B
In the blueprint I put in the peering state just so we get a feel for which steps are necessary, but there are three particular steps that are the main contributors to round-trip latency in peering. When peering first starts, the first thing we do is compile a list of everyone we could conceivably need to talk to, and then we ask them for a PG notify: we send a PG query, a PG info query or whatever it is, to all of them.
B
They send back a PG notify message, and we wait for all of them before we proceed. We don't send these queries, of course, to down OSDs. That means that if an OSD is not going to send us a notify because it's dead, this is going to stall until it gets marked down and we stop waiting for it; that affects the prior set and causes us to restart peering. So we'll assume from here forward that the OSDs don't die during this part of the process.
B
So first we wait for everyone to send us notifies. Those notifies contain the PG info for each OSD, and we use that to make a decision about which OSD has the authoritative PG info, the one we are going to go forward with. We then ask that OSD for its info and log, because we are going to use that to adjust our own log and missing set, and then to adjust
B
everyone else's missing sets once we get their logs. So once we have that, we adjust our own log and then ask everyone in the acting set, and everyone we could conceivably need to pull an object from, for their log and missing set, so that we know which objects they actually have on disk, and, for our own acting set, which objects we need to recover over to them. That's log-based recovery, for those following along at home.
B
So if we're keeping track, that's three round trips. At the beginning we have to flush all of the IOs from the previous interval before we can send anything to the primary, or before the primary can send anything to anyone else; and at the end, before we can accept writes, we have to persist the update to our own info and log, lest we tell the clients things we're not allowed to. So that's two distinct flushes and three round trips. So the question is how much of that is necessary, and can we make it faster?
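(For anyone following along, a rough sketch in Python of the sequence just described, from the primary's point of view; the message names and helpers here, send_query, merge_authoritative_log and so on, are made-up stand-ins for illustration, not the actual Ceph interfaces.)

    def peer(pg, prior_set, store, net):
        # Flush the previous interval's IOs before any peering traffic.
        store.flush_previous_interval(pg)

        # Round trip 1: ask everyone we could conceivably need to talk to
        # for a PG notify; this stalls if one of them is dead and unmarked.
        notifies = {osd: net.send_query(osd, "pg_notify", pg) for osd in prior_set}

        # Decide which OSD holds the authoritative PG info.
        best = max(notifies, key=lambda o: (notifies[o].info.last_update,
                                            notifies[o].info.last_epoch_started))

        # Round trip 2: fetch that OSD's info and log, and adjust our own
        # log and missing set from it.
        auth = net.send_query(best, "pg_log", pg)
        pg.merge_authoritative_log(auth.info, auth.log)

        # Round trip 3: get log and missing from the acting set and from
        # anyone we might pull objects from, so we know what is on disk.
        for osd in set(pg.acting) | set(pg.recovery_sources):
            reply = net.send_query(osd, "pg_log_and_missing", pg)
            pg.record_peer_missing(osd, reply.log, reply.missing)

        # Second "flush": persist our updated info and log before we are
        # allowed to accept writes.
        store.persist(pg.info, pg.log)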
B
So when we send out the first round of requests, if we're in a situation where we can heuristically guess who that is, where we feel like we have a good shot of being right, we can also ask for the log and info, or the full log, or the log and missing set, from that OSD. That saves us the get-log step if we happen to be right, and we can check this once we get back all of the notifies: we can verify that we asked the correct one, and if it happens that we have it, then, yay.
B
Similarly, for the get-missing: we already know which OSDs we're going to need missing sets and logs from. From the first batch we know everyone we need to go active. We only actually need the missing sets and logs from the acting and up sets; we don't actually need them from the people we're going to recover from. We can do that part subsequently.
B
Well, we can ask for the logs and missing in the get-info step as well. So in the best case, we might be able to get this down to one flush: do one flush and one round trip and then one additional commit.
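(A sketch of that best case, again with made-up names: the speculative get-log for the guessed-at authoritative OSD and the get-missing for the acting set are folded into the first round, so a correct guess leaves one flush, one round trip, and one commit.)

    def peer_fast(pg, prior_set, guess, store, net):
        store.flush_previous_interval(pg)

        # One combined round: every query is a notify, plus the log for the
        # OSD we guess is authoritative, plus log+missing for acting members.
        replies = {}
        for osd in prior_set:
            replies[osd] = net.send_query(
                osd, "pg_notify", pg,
                include_log=(osd == guess),
                include_missing=(osd in pg.acting))

        best = max(replies, key=lambda o: (replies[o].info.last_update,
                                           replies[o].info.last_epoch_started))
        if best != guess:
            # Guessed wrong: fall back to the normal extra round trip.
            replies[best] = net.send_query(best, "pg_log", pg, include_log=True)

        pg.merge_authoritative_log(replies[best].info, replies[best].log)
        for osd in pg.acting:
            pg.record_peer_missing(osd, replies[osd].log, replies[osd].missing)

        store.persist(pg.info, pg.log)   # the one additional commit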
B
A related piece is that we currently need to flush the stuff from the previous interval, where ideally we would actually like to only commit it, that is, only make sure that it's in the journal. The reason we need to flush it is that, when we go ahead later and serve reads, we don't know which objects are dirty, we don't know which objects have pending IOs. So we could track object contexts across intervals, that is, remember from the previous interval which objects had in-flight IOs. We have an in-memory structure for this.
B
We only keep it around while we're the primary, though. If we extend that to keep that structure around when we're a replica as well, then instead of not going active until all of the IOs have finished flushing, we would not have to wait for them to apply, only until they commit, which might save us a little bit more time. Okay.
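(A minimal sketch of that structure, assuming nothing about the real ObjectContext code: remember per object which IOs are still in flight across the interval change, so the new interval only needs the commit, and reads only block on the dirty objects.)

    class InflightObjects:
        """Tracks which objects still have unapplied IOs; kept across intervals."""

        def __init__(self):
            self.pending = {}                  # object -> set of in-flight txn ids

        def start(self, obj, txn):
            self.pending.setdefault(obj, set()).add(txn)

        def applied(self, obj, txn):
            txns = self.pending.get(obj)
            if txns is not None:
                txns.discard(txn)
                if not txns:
                    del self.pending[obj]

        def dirty_objects(self):
            # Kept on an interval change instead of forcing a flush:
            # these are the only objects a read would have to wait on.
            return set(self.pending)

        def readable(self, obj):
            return obj not in self.pending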
C
It seems like in many cases there actually wasn't anybody going down or coming up, but we have a forced interval change anyway, like because the pg_temp record was set, for example, but it actually set it to what we were before. In that case, everything we already know is still correct, and in fact a lot of times we already have the peers' info. We don't need to request it, because we know they didn't go down or come back up again.
B
Well, it's more like when we go through, what is it, start_peering_interval? Maybe in start_peering_interval we observe that the acting set didn't change, and that therefore we should not flush our missing, info, and log sets, and we go through a truncated peering process, because all of our information is still authoritative.
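(A sketch of that shortcut, with illustrative names only: when the interval change did not actually move anything, skip the flush and take a truncated path.)

    def on_new_interval(pg, old_acting, new_acting, old_up, new_up):
        unchanged = (old_acting == new_acting and old_up == new_up)
        if unchanged and pg.was_active and pg.is_primary:
            # e.g. a pg_temp record that re-asserted the same set: our own
            # info is still authoritative and we already have the peers'
            # info, so no flush and no full info/log/missing exchange.
            pg.truncated_peering()
        else:
            pg.full_peering()   # the normal path sketched earlier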
B
Yes, I think so. Basically we would have to remember, we would have to know, that the previous interval actually went active. I guess, okay, if we know that the previous interval went active, because we can look at our own state, and we were the primary when it went active, then we know that our own info has to be authoritative, yep.
B
We do something like bound the amount of time the monitor spends on it, and then whatever we come up with goes into the next map, and hopefully that cuts down on the work. Nate doesn't want me to go over that more, because there's that, okay, cool. One other thing we can do: there is a thing where, if you take the primary of the placement group down and bring it right back up, it'll be missing a few writes, but it'll still be primary, and some IO will tend to kind of hang on those objects, because we have to recover them before we can serve reads or writes on them.
C
So this is actually a problem with the pg_temp repopulating too. It's also going to, wait, I'm confused. Okay, let me see if I can remember. So if your map says, say, [1, 2, 3] and CRUSH changes it to be like [3, 1, 2], you set pg_temp to be the old thing, and it goes through peering and it's like, well, I could be [1, 2, 3], that's good enough. But it will sit there and block, waiting for yet another OSD map update cycle, instead of continuing.
C
Right, so what I'm saying is that that's a totally generic thing. Basically, when we get to the wait-for-up_thru decision, if we could go active with our current acting set, even though we want it to be something different, we should continue and go active, and handle it asynchronously, yeah.
C
Okay.
D
Yeah, the first one is the peering one, and another one, yeah, another one is ungraceful shutdown. As I mentioned in the blueprint, the down state of the OSD can only be noticed by the cluster or by its peers via heartbeat, and that could take up to 20 seconds to cause the map change so that the client could retry. In that regard, I'm wondering if there are any plans to make that better.
C
Yeah, I wonder if a more general thing would be, I mean, basically: if there's any situation where we know for certain that the OSD is down, one that has no false positives, then great, we can immediately mark it down. So the things that might work would be: if you are another process on the same host, and you know that the OSD was a specific PID and that that PID disappears, then that would be the trigger. Yeah, I mean.
C
Exactly right. So maybe the Calamari agent could do it. Maybe other OSD processes on the same host could monitor each other's PIDs that way.
B
I think that it would need to be wired through upstart or the other thing, systemd, so that it actually is the thing that does the starting and is the parent. Yes.
C
Maybe it was a list, but okay, does that make sense? Like, if we can figure out a way, yeah, if we can figure out a way with the systemd hook to identify that it's that specific process, so that when we send the OSD down command it only marks it down if the one that's currently up in the OSD map is that same one, then we're golden.
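(A sketch of what such a hook might look like, purely illustrative: something the service manager runs when the ceph-osd process it started dies, which immediately asks the cluster to mark that OSD down instead of waiting on heartbeats. The "is the daemon currently up in the map the same instance that just died" check is only a placeholder here.)

    import subprocess
    import sys

    def same_instance_still_up(osd_id: int) -> bool:
        # Placeholder: compare some identity of the dead daemon (boot epoch,
        # run uuid, PID recorded at start) against what the OSD map records
        # for this OSD, so a freshly restarted OSD is never marked down.
        return True

    def report_osd_dead(osd_id: int) -> None:
        if same_instance_still_up(osd_id):
            # No false positives here: the parent knows for certain the
            # process exited, so marking it down immediately is safe.
            subprocess.run(["ceph", "osd", "down", str(osd_id)], check=False)

    if __name__ == "__main__":
        report_osd_dead(int(sys.argv[1]))   # e.g. invoked by an ExecStopPost= hook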
D
Previously I sort of had two options: one is that, outside of the OSD process, we have another watchdog process or some other thing which detects the failure of the OSD and reports it immediately, as soon as the failure is detected. Another one is, as I mentioned in the blueprint.
D
Right, okay, yeah. The second one is that there are some slow OSDs. Actually, this includes an OSD going down for some reason, and currently we have a patch, I think it's already in and has tests, where the idea is that we read all the chunks, both data and coding chunks, for erasure coding, and use only the first chunks that return OK to serve the request.
D
That definitely avoids it: if there is an OSD that is slow or even stuck, the request can still be successful and at low latency. But the problem is that that doesn't work for the scenario where the slow OSD is the primary one, all right, yeah. In order to address that problem, it seems like we need to shift the responsibility from the primary OSD to the client side, yep.
C
The other sort of hurdle, though, is that the librados client then needs to be able to link in all the erasure code plugins, mm-hmm, and currently it doesn't. So we have to change the way that they're packaged, right; it's another part of the Ceph packaging, I guess. We'd have to change it, and we have to switch around the way that the package dependencies work, basically, so that, yeah, we can do that at least in some cases; if it fails, I can always fall back to reading from the primary, but yeah.
B
So another thing is: if it's the storage that's slow, and not the request handling on the primary, then you can still go ahead and try to reconstruct when you have the first K come in, and you don't have to wait for that one other one to finish, yeah.
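(A sketch of that read path as it might look on the client side, with stand-in read_chunk/decode callables rather than the real erasure-code plugin interface the speakers say librados would need to link in: fan the chunk reads out to every shard and reconstruct from whichever k chunks come back first, so one slow shard, even the primary's storage, does not add to read latency.)

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def ec_read(shards, read_chunk, decode, k):
        """read_chunk(shard) -> (shard, bytes); decode() needs any k chunks."""
        got = {}
        pool = ThreadPoolExecutor(max_workers=len(shards))
        try:
            futures = [pool.submit(read_chunk, s) for s in shards]
            for fut in as_completed(futures):
                shard, data = fut.result()        # error handling omitted
                got[shard] = data
                if len(got) >= k:                 # first k to arrive are enough
                    return decode(got)
            raise IOError("fewer than k shards responded")
        finally:
            pool.shutdown(wait=False)             # don't block on stragglers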
B
Similarly, for writes, at least on replicated pools, it seems to me that the analogous thing would be waiting for min_size replies. I don't know what anyone would think about that, because we already have a min_size parameter, which sort of defines how many writes we require to be persisted before we accept reads and writes, so we actually could possibly wait until we have min_size replies on a particular write.
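(A sketch of that ack rule and nothing more: the primary acknowledges the client once min_size replicas have committed the write, rather than waiting for all of them. The peering assumption raised next is exactly what this glosses over.)

    def submit_write(write, replicas, min_size, send_to_replica, ack_client):
        state = {"committed": 0, "acked": False}

        def on_commit(replica):
            state["committed"] += 1
            if not state["acked"] and state["committed"] >= min_size:
                state["acked"] = True
                ack_client(write)       # ack before the slow replicas finish

        for r in replicas:
            send_to_replica(r, write, on_commit)
        return on_commit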
C
Hmm, I think it might work, but I'm a little bit worried, because we assume that we touch all replicas. Because of the way that we do peering, if we have at least one member from the previous interval, then we assume that any write we didn't see was not acked. Yeah.
D
Okay, that is pretty much what I have to offer on this. Thanks a lot. I will go ahead and provide more information for the first item, the ungraceful shutdown. Okay.
C
So I think, I mean, just to throw this out there for the slow OSDs, for the EC thing: I think the thing that's going to best solve the problem for the EC case is going to be the client doing the reads. In latency-sensitive environments, I think that's the way to go, because you don't care about racing writes and so forth. So yeah, okay, okay, unless someone disagrees, I don't know if you'd like, oh.