Videos from Ceph Developer Summit: Infernalis (Day 1)
03 March 2015
https://wiki.ceph.com/Planning/CDS/Infernalis_(Mar_2015)
B: So this is a continuation of the one I did in Hammer, so I'll just start with a status update. There were five items that I need to complete to be able to support running LIO on multiple targets, being able to access them, and export rbd devices through them. Some of the stuff that's done, or that's set, is the configuration and distributing of the device state, and the management GUIs that the distros have. I'm also working with the libStorageMgmt people to modify their library and plugin and their tool to be able to create targets, so that basically our plugin would just call those pcs commands and people don't have to do it directly, and that's pretty simple.
B: The part we're working on that's more complicated is being able to create their concept of pools out of rbd devices, and that's just more details that we have to hammer out, because in order to do it generically across like a NetApp target and those other ones it's kind of weird. So those two items are pretty much done and set. The next one was persistent group reservations.
B: Originally I was looking at using just DLM and corosync to do the locking across the cluster and protecting this PGR metadata, which basically just holds the initiator information, the I_T nexus information, what type of reservation it is, and things like that. Since we last talked I have implemented kernel rbd locking, so it's basically what we have in librbd in user space right now, only it's based on the kernel libceph stuff and you can make those calls from the kernel.
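The PGR metadata described above (initiator information, I_T nexus, reservation type) could be sketched roughly as below. This is only an illustrative model of SCSI persistent-reservation state; all the names are hypothetical and are not the actual Ceph or LIO data structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Registration:
    initiator: str   # e.g. the iSCSI IQN of the initiator
    it_nexus: str    # the I_T nexus this registration belongs to
    key: int         # the PR registration key

@dataclass
class PRInfo:
    generation: int = 0                       # bumped on every PR state change
    reservation_type: Optional[str] = None    # e.g. "WriteExclusive"
    holder: Optional[str] = None              # I_T nexus holding the reservation
    registrations: List[Registration] = field(default_factory=list)

    def register(self, reg: Registration) -> None:
        self.registrations.append(reg)
        self.generation += 1

    def reserve(self, it_nexus: str, res_type: str) -> bool:
        # Only a registered I_T nexus may take the reservation.
        if not any(r.it_nexus == it_nexus for r in self.registrations):
            return False
        self.holder, self.reservation_type = it_nexus, res_type
        return True
```

Each gateway node would hold a copy of this state, with the authoritative copy living in the rbd image (as discussed next).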
B: No, no. And the second part of that is being able to store the metadata that I need. I'm currently doing that by storing it in the rbd header, so I just added some new methods in that class, the cls_rbd.cc file, and I'm going to send a patch for that soon as an RFC, because I'm not sure about the new call. I called it something like set scsi pr info, because I wasn't sure...
B: ...if I should do it more generically, or just say forget it and store exactly this data structure. So I want any comments on that, to see how people want to handle it, and that one's almost done. And then the next item was compare and write support; that's needed by ESX for its atomic test and set command, which is needed for their VAAI feature. And for that one it's pretty much set.
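The atomic test and set command mentioned here is SCSI COMPARE AND WRITE: read a block, compare it against an expected buffer, and only if it matches write the new buffer, all as one atomic step. A toy in-memory illustration of those semantics, not the LIO or kernel implementation:

```python
import threading

class Device:
    """Toy block device illustrating COMPARE AND WRITE semantics."""

    def __init__(self, nblocks: int, block_size: int = 512):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(nblocks)]
        self._lock = threading.Lock()  # stands in for the device's atomicity

    def compare_and_write(self, lba: int, expect: bytes, data: bytes) -> bool:
        """Atomically: if blocks[lba] == expect, write data. True on success."""
        assert len(expect) == len(data) == self.block_size
        with self._lock:
            if self.blocks[lba] != expect:
                return False        # miscompare: the initiator retries
            self.blocks[lba] = data
            return True
```

ESX uses this to update its on-disk heartbeat/lock records without taking a whole-LUN SCSI reservation, which is why VAAI support depends on it.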
B: I'm just waiting for some other upstream people that need similar functionality to lay down some of their infrastructure, so I don't have to do it.
B
Cuz
I
have
all
this
other
stuff
to
do,
but
I
I
get
done
with
my
stuff
first
to
know
late
on
that
infrastructure,
but
right
now
I'm,
just
hoping
that,
though,
needed
sooner
than
me
and
implement
it,
that's
just
like
adding
a
bunch
of
fields
and
breaking
up
these
fields
and
going
through
all
the
black
drivers
and
doing
it,
and
so
it's
like
a
upstream
chicken
I
guess
we're
playing,
and
so
the
big
one,
that's
oh,
the
big
one
that
needs
to
be
done
still
is
the
scuzzy
task
management
and
like
the
unit
attention
and
those
type
of
handling,
and
for
that
one
I've
been
I
haven't
worked
on
it
at
all.
B
I've,
mostly
there's
been
lots
of
bug
reports
and
the
boat
commands,
timing
out
and
the
air
handlers
failing
and
so
for
that
one
I've
just
been
looking
on
to
just
originally
we're
just
going
to
handle
for
what
happens
now
when
l
io
gets
an
award
from
the
initiator
and
abort
task
task
management?
Is
it
just
waits
in
hopes
that
the
underlying
device
completes
it
in
time
before,
like
the
initiator
test
management
award,
timeout
occurs,
and
for
that
one
I
want
to
do
something
a
little
bit
more
intelligent.
B
At
least
log
try
to
narrow
down
where
it's
hung
and
log
a
message,
because
it's
really
been
difficult
to
debug
that
on
the
list
and
it
seems
to
be
happening
a
lot
but
yeah
for
the
most
part,
I.
Don't
think
we
can
do
a
lot
to
really.
If
the
commands
like
on
the
device,
then
we
can't
really
enjoy
MIT,
and
so
at
least
log
a
message
if
it's
Jim,
somewhere
else
and
I
haven't
been
able
to
I,
still
need
to
look
into
more
how-to
yeah
I'm
jam
it
in
the
stuff
like
an
OST
layer.
B
So
when
the
initiator
sends
a
command,
it'll
normally
be
around
30
seconds
to
a
minute.
If
that
command
isn't
illegal
down
time,
it's
MJ
abort
task,
the
abort
asks
timeouts
is
anywhere
from
like
five
to
30
seconds
to
a
minute.
Again,
it
just
depends
on
the
operating
system
and
how
the
person
configured
it,
though
okay.
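The escalation being described can be sketched as a simple timeline: a command gets its own timeout, then ABORT TASK gets another window, then recovery escalates further (device reset, and for iSCSI eventually session logout/login). A toy sketch; the timeout values are only illustrative defaults, since, as noted, they are OS- and configuration-dependent:

```python
def escalate(elapsed: float,
             cmd_timeout: float = 30.0,
             abort_timeout: float = 30.0,
             reset_timeout: float = 30.0) -> str:
    """Return which initiator recovery step is active at `elapsed` seconds."""
    if elapsed < cmd_timeout:
        return "waiting"              # command still within its timeout
    if elapsed < cmd_timeout + abort_timeout:
        return "abort_task"           # ABORT TASK task management sent
    if elapsed < cmd_timeout + abort_timeout + reset_timeout:
        return "device_reset"         # try to kill everything on the device
    return "session_recovery"         # iSCSI: log out and log back in
```

The problem in the discussion is that LIO can only wait during the "abort_task" and "device_reset" windows if the command is stuck below it, which is why the later steps keep triggering.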
B
Yeah,
okay
and
so
for
the
big
one
for
that
is
still
handling
at
the
board
and
how
to
do
device
resets
and
for
device
reset.
You
know
you
want
to
kill
all
the
commands
on
the
bed
of
running
on
the
rbd
device
again
for
that
one
l
io
currently
it'll
just
wait
for
all
those
commands
to
complete,
and
so
there's
not
a
lot.
It
can
do
in
a
lot
of
paces
like
yeah
and.
B: If those fail, then it'll go to a higher-level error recovery; for iSCSI it'll log out the session, try to log back in, and wait for the commands to complete. So I need to look into some way to actually do something to unjam these commands, or something like that, to make progress. That's what we're seeing a lot of.
C
You
we
had
a
conversation
several
weeks
ago,
where
you
were.
You
were
basically
saying
that
the
OSD
failure,
detection
timeout,
is
like
20
seconds,
and
so
you
can
it's
not
that
uncommon
to
get
it
niƱo.
That
will
take
that
30
seconds
and
if
you
do,
if
you
do
trigger
that
timeout,
then
the
in
general,
you
would
normally
like
start
fencing
and
failing
doing
all
that
stuff
yeah
which,
if
overall
the
sub
cluster,
is
just
gone
away
for
a
minute,
then
is
sort
of
wasted.
C
Effort
I,
wonder
if
it
makes
sense
to
have
an
out-of-band
communication
between
the
between
the
OIO,
the
gateways
or
whatever.
Just
for
that
that
reason
right
yeah,
because
I
mean
if
both
the
gateway
is
actually
haven't,
failed
and
it's
actually
stuff
that's
going
slow
than
doing
a
failover
between
them
is
like
yeah.
C: Even if they are doing that failover, if the gateways can recognize that the gateways are actually fine, that they're still both alive and cooperating and fencing isn't necessary, then they can do sort of a lightweight handoff between them, yeah. So they can do a graceful lock break or whatever.
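The "graceful lock break" idea above can be sketched as: if out-of-band health checks say the peer gateway is alive, ask it to release the lock cooperatively instead of fencing it. All the names here are hypothetical, a sketch of the protocol shape rather than any real rbd locking API:

```python
class Gateway:
    """Hypothetical iSCSI gateway node holding (or not) the rbd lock."""

    def __init__(self, name: str):
        self.name = name
        self.alive = True        # as reported by out-of-band health checks
        self.holds_lock = False

    def release_lock(self) -> bool:
        # A cooperating, healthy peer releases the lock voluntarily.
        if self.holds_lock:
            self.holds_lock = False
            return True
        return False

def take_over(me: Gateway, peer: Gateway) -> str:
    """Acquire the lock, preferring a cooperative handoff over fencing."""
    if peer.alive and peer.release_lock():
        me.holds_lock = True
        return "graceful_handoff"
    # Peer is dead or unresponsive: break the lock forcibly (fence it).
    peer.holds_lock = False
    me.holds_lock = True
    return "forced_break"
```

The point of the discussion is that the forced path (fencing, blacklisting) is wasted effort when the cluster was merely slow, so the graceful path should be tried first whenever the peer is reachable.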
E
Amusing
pacemaker,
you
can
store
extra
metadata
within
pacemaker,
that's
more
than
just
it's
up
and
running,
so
you
could
potentially
have
your
your
resource
script
store
that
extra
information
act
on
it
that,
on
that
information,
as
part
of
its
fell
over
logic,
yeah.
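The Pacemaker suggestion boils down to: have the resource agent's monitor action publish richer health attributes than a binary up/down, and have the failover decision consult them. A toy model of that decision logic; the attribute names are hypothetical, and a real resource agent would be a shell script using Pacemaker's attribute tools rather than Python:

```python
def decide_failover(attrs: dict) -> bool:
    """Fail over only if the peer is actually unhealthy, not merely slow.

    `attrs` stands in for per-node attributes a monitor action might
    publish (hypothetical names, not real Pacemaker attributes).
    """
    if not attrs.get("service_running", False):
        return True                      # service is down: fail over
    if attrs.get("iscsi_layer_ok", True) is False:
        return True                      # target stack is broken: fail over
    return False                         # running and healthy: stay put
```

This is exactly the extra signal needed to distinguish "gateway dead" from "Ceph cluster briefly slow" in the earlier discussion.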
B
I
can
do
that
so
on
each
LOL
node
I
can
detect.
If,
like
the
other
icicles
gateways,
are
you
can
detect
like
if
the
service
is
running
and
in
like
the
monitor
call-outs,
you
can
do
various
checks
to
see
it
like
the
ice,
cozy
layer
or
layers
running
in
things
like
that,
and
you
could
even
do
checks
like
a
can.
You
do.
I
saw
that
there's
like
this
code
for
the
lip
stuff
and
stuff
so
like
from
you.
The
pacemaker,
userspace
monitor
could
act.
C: Okay, so in that case, as long as the kernel rbd is actually going through that whole protocol for lock break, then handling a reset in the standard way, where we just break the lock and take the lock, it'll cooperate and do it. Okay, so that actually might get us ninety percent of the way there.
B: For SCSI and ordering, for what we support now, it'll be okay if it gets reordered in that type of sequence; we rely on the upper layers to synchronize it correctly. There are different types of ordering that people could set at the SCSI level, but no one does that, that I know of, I think.
C: So if you think of one RADOS client and what an OSD has outstanding... there are a lot of different cases where we can conclusively know that we canceled it successfully. For example, if we haven't sent it anywhere yet because the host is already down, then obviously we just take it out of our queue and we never send it. Or if we do have it outstanding to an OSD, then maybe we can send it a cancellation or something, yeah. I don't know; anyway.
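The cancellation cases just described fall into a simple classification: an op that is still queued (never sent, e.g. because the OSD is down) can be conclusively canceled by dropping it from the queue; an op already in flight can at best be asked to cancel, with an uncertain outcome; a completed op is too late. A purely illustrative sketch, not the Objecter's real logic:

```python
def try_cancel(op: dict) -> str:
    """Classify how conclusively a client-side op can be canceled."""
    if op["state"] == "queued":
        op["state"] = "canceled"   # never sent: cancellation is conclusive
        return "canceled"
    if op["state"] == "in_flight":
        # Already on the wire to an OSD: we could send a cancellation,
        # but it may race with completion, so the result is uncertain.
        return "cancel_requested"
    return "too_late"              # already completed on the OSD
```

Only the first case lets the gateway tell the initiator an abort definitely succeeded, which is why abort handling above mostly reduces to waiting.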
C: Okay, sounds great, man. Okay, sounds good. And...
B: I had one question about distributing my PGR metadata. I just did that when I was adding that new cls rbd command, and it's really easy to set the data, but I wanted to be able to cache that data on each LIO node so I don't have to go back and read it every time someone does a command. And so, you know, people can do a watch on that object and be notified.
B
But
I
was
looking
for
a
way
to
like
before
I
was
using
coral
sink
and
you
could
run
like
a
command
on
all
the
nodes
and
be
notified
when
that
command
is
completed,
and
so
I
just
wanted
a
notification
that
everyone
has
updated
their
cash
so
like
unknown,
1
I'll
set
the
metadata
and
then
other
people
be
watching
and
they'll
be
notified
and
they'll
update
their
cash.
But
I
wanted
to
be
able
to
be
notified
that
they
have
to
do
their
cash.
So
I
can
tell
the
initiator
that
we're
all
set
across
the
cluster.
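The desired semantics are: one node sets the metadata, every watcher refreshes its cache and acks, and the updater only reports success to the initiator once all watchers have acked. Below is an in-memory simulation of that flow; it mimics the shape of RADOS watch/notify but does not use librados, and all the class names are illustrative:

```python
class SharedObject:
    """Stand-in for the RADOS object holding the PGR metadata."""

    def __init__(self):
        self.metadata = {}
        self.watchers = []

    def watch(self, node):
        self.watchers.append(node)

    def set_and_notify(self, key, value) -> bool:
        """Update metadata, notify all watchers, return True once all ack."""
        self.metadata[key] = value
        acks = [node.on_notify(self) for node in self.watchers]
        return all(acks)  # cluster-wide "all set" the speaker wants

class GatewayNode:
    """Stand-in for an LIO gateway caching the metadata locally."""

    def __init__(self, name):
        self.name = name
        self.cache = {}

    def on_notify(self, obj) -> bool:
        self.cache = dict(obj.metadata)  # re-read and cache the metadata
        return True                       # ack back to the notifier
```

In real RADOS, notify does complete only after every watcher has acked (or timed out), which is what makes this pattern usable as the cluster-wide barrier being asked for.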
E: You can also even go above and beyond that if you want, if you had additional metadata somewhere that the other guys know they're supposed to be watching for. As far as the notification response goes, they can send whatever they want back and say, well, I'm fine, XYZ, that sort of thing.
C: Jason, is this, is this sort of what the lock leader or whatever would do? Right, so I mean, in the old pattern you would set it on the object and then you'd send a notify, and only when that completes would you know that everyone has seen it. But in the new world you actually just send a message to the leader and you let them do all that. Would this just fall in that category, or, I mean...
E
If
I
delete
you
know
in
the
cases
where
you
actually
looking
for
leader,
even
if
it
does
timeout,
I
say
well,
I
got
a
response
from
leader.
That's
all
I
really
cared
about
anyway.
So
I
just
got
gotta
clear
out
like
nor
the
the
timeout
error
code
that
comes
back
from
watch
notify.
I
say
I
got
my
response.
I
looked
was
looking
for
from
the
leader,
I'm
good
to
go
with
whatever
the
leader
said,
was
the
real
status
as
part
of
his
notify
ack.
D: In the new upstream stuff that Jason's written, it also has different commands that you could send when something happens; one of the commands could be, you know, "reread the persistent reservation."
D
I
did
and
then
send
me
back
explicit,
notify
response
not
just
working
out
if
I
act
but
I
suppose
I
get
different
notification
when
you
finish
reading
that,
because,
typically
you
want
that
whatever
you
do
before
you
send
back,
notify
active
some
kind
of
quick
operations,
you
don't
potentially
timeout
like
blocking
for
some
io
somewhere.
D
So
you
just
say:
okay,
I
plated
a
thing
something
internal
state
saying
I
need
to
update
the
cache
act
that
notify
and
then
later
once
you
have
to
actually
update
the
cache,
send
a
different
application
back
to
the
original
guy
Oh.
Everyone
watching
really,
but
and
all
this
stuff
is
all
in
user
space
right
now
and
that
Colonel
doesn't
even
have
the
newer
what
freshmen
for
wash
notify
yet
so
then
take
some
effort
to
actually
get
it
all
worked
up.
The
colonel
okay.
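The deferred-ack pattern suggested here can be sketched as: do only quick work before acking the notify (record that a refresh is needed), then perform the possibly-slow cache refresh afterwards and send a separate "cache updated" notification. Hypothetical names throughout; this is the shape of the pattern, not the librbd or cls API:

```python
class DeferredAckWatcher:
    """Watcher that acks a notify quickly and refreshes its cache later."""

    def __init__(self):
        self.needs_refresh = False
        self.cache = None
        self.sent = []   # messages sent back toward the notifier

    def on_notify(self, payload):
        # Quick path only: record the state change and ack immediately,
        # so the notify never blocks on (or times out waiting for) I/O.
        self.needs_refresh = True
        self.sent.append("ack")

    def refresh(self, read_metadata):
        # The slow part runs outside the notify callback.
        if self.needs_refresh:
            self.cache = read_metadata()
            self.needs_refresh = False
            # Explicit second notification: "my cache is now up to date".
            self.sent.append("cache_updated")
```

The notifier then waits for the `cache_updated` messages, rather than treating the initial acks as proof that every node has re-read the metadata.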