From YouTube: CDS G/H (Day 1) - RBD review
Description
https://wiki.ceph.com/Planning/CDS/CDS_Giant_and_Hammer_(Jun_2014)
24 June 2014
Ceph Developer Summit G/H
Day 1
RBD review session
A
Alright, so this is the RBD review session. Looks like we're going to give Sage a little bit of a break; you won't have to run things. We'll put Josh in the spotlight here. It looks like the focus was mostly on journaling and mirroring, so Josh, if you can give us a little bit of an overview there. But after that, if there are any questions on RBD work that was in flight for Giant, we can also answer those. So Josh, take it away.
B
Sure. So it's asynchronous replication for RBD, possibly from one datacenter to another, or within one cluster from one pool to a different pool. The general structure would be to have a journal of all the data written to an image, striped over RADOS objects, similar to the way the MDS stripes its journal, and to have another process somewhere reading that journal and replaying it on a different cluster, or a different pool, or even a different site.
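A minimal sketch of the replay loop being described, assuming a hypothetical JournalReader and record format; librbd had no journaling API at the time of this session, and only rbd.Image is a real python-rbd binding:

```python
# Hypothetical mirror-agent replay loop. JournalReader and its record
# format are assumptions for illustration.
import rbd

def replay_journal(reader, dest_ioctx, image_name):
    """Apply journaled writes, in commit order, to the mirror image."""
    image = rbd.Image(dest_ioctx, image_name)
    try:
        for record in reader:                     # one journaled write
            image.write(record.data, record.offset)
        image.flush()                             # persist before acking
    finally:
        image.close()
```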
B
And so there were a number of things we talked about at the Giant summit for how that could work, in terms of the journaling. But what we didn't really define well was how the actual mirroring would be done and how that agent would be structured, so I want to talk a little about that.
B
To my mind, the simplest way might be to have sort of a scheduling process that keeps track of all the images that exist that need to be mirrored, and runs subprocesses that actually do the mirroring for each individual image. That wouldn't really scale past the network interface of one node, though, but it might be a good starting point.
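As a rough illustration of that starting point; the rbd-mirror-image helper command here is purely hypothetical:

```python
# Sketch of the simple scheduling process: one subprocess per mirrored
# image. "rbd-mirror-image" is a hypothetical helper, not a real tool.
import subprocess

def run_mirror_scheduler(images_to_mirror):
    workers = {
        image: subprocess.Popen(["rbd-mirror-image", image])
        for image in images_to_mirror
    }
    # All traffic funnels through this one node, which is the scaling
    # limit mentioned above.
    for image, proc in workers.items():
        if proc.wait() != 0:
            print(f"mirroring failed for {image}")
```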
B
And that would make the most sense to run in a pull model, where that kind of agent runs in the destination cluster, mirroring from the source cluster to the destination cluster. If it's run in the destination cluster, and it's mirroring a bunch of images, then you're limited by the network interface of each destination cluster, rather than only by one node in the source cluster. Perhaps.
C
Yeah, to chime in here briefly, we had a discussion about this a couple of months back too, just sort of looking at what the I/O model for this is. So it's definitely not free, right. The cost is basically that for each image you want to enable mirroring on, you're also writing a journal for it, and that's journaling inside of, or on top of, RADOS.
B
Yeah, another thing we mentioned there was that we could have an option to ack writes after they've been written to the journal, rather than after they've been written to both the journal and the normal location, and perhaps use the already-cached data to serve reads for any in-progress writes.
C
And it might be that you put the journal in a different pool; like maybe you have the base images on hard disks and the journals on SSDs, just sort of a familiar theme, I guess. And then you get sort of low write latency, and the reads would be coming from the disks, which might be okay, maybe.
B
Yeah, that's something you'd want to be able to tune, just enabling it to be in a different pool, since I think it might be a common thing to maybe have the journal not replicated three times, for example, while the RBD image itself is, or maybe even have the journal erasure coded.
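A sketch of that kind of tuning: the pool commands below are real, but directing RBD journals at the separate pool is the hypothetical part, since none of this was implemented at the time:

```python
# Create a separate, more lightly replicated pool for journals, while
# image data stays in the default 3x-replicated pool.
import subprocess

subprocess.check_call(["ceph", "osd", "pool", "create", "rbd-journals", "128"])
subprocess.check_call(["ceph", "osd", "pool", "set", "rbd-journals", "size", "2"])
```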
B
Yes, and I think the general model I've been thinking about is one where you mark an RBD image as wanting to be mirrored, and it sets some flag on that image. Then there's a question about whether we want to support more than one thing reading from a journal, and how we handle trimming the journal in that case. If there's only a single follower of the journal, say one replaying it onto a different cluster or separate location,
B
it's easy enough to just update a single position marker. But otherwise we have to know how many there are, which replication processes are reading the journal, and what the position of each of them is, and then something else has to determine when to trim as well; maybe that could be the same process that's doing the replication.
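A sketch of that bookkeeping, with the per-follower commit positions assumed for illustration; the journal can only be trimmed up to the slowest reader:

```python
# Illustrative trim bookkeeping for a journal with multiple followers.
class JournalPositions:
    def __init__(self):
        self.commit_positions = {}          # follower name -> offset

    def update(self, follower, offset):
        self.commit_positions[follower] = offset

    def safe_trim_offset(self):
        # Everything below the minimum committed position can go; with
        # a single follower this degenerates to one position marker.
        return min(self.commit_positions.values(), default=0)
```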
C
Questions are around where the priorities are going to fall, as far as focusing on current issues, you know, improving performance or robustness or usability or whatever, versus sort of biting off a big thing. I think this is definitely something that would be very valuable for a lot of different use cases.
C
You can do replication right now by using RBD snapshots and snapshot mirroring. If you periodically take a snapshot of the image, there's a moderately efficient way to get a diff between two snapshots, essentially, and stream just that diff over the network, and one of the other blueprints can actually help improve that somewhat. But this would be more near-real-time, where, you know, you have a mirror that's seconds of delay or whatever behind, which you can configure.
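That existing snapshot-based approach can be driven with the real rbd export-diff and import-diff commands; the wrapper below is just an illustration of the periodic loop, not Ceph tooling:

```python
# Ship the delta between two snapshots to a remote cluster using the
# existing rbd CLI.
import subprocess

def replicate_snapshot_diff(image_spec, prev_snap, new_snap, remote_host):
    export = subprocess.Popen(
        ["rbd", "export-diff", "--from-snap", prev_snap,
         f"{image_spec}@{new_snap}", "-"],          # "-" = stdout
        stdout=subprocess.PIPE)
    subprocess.check_call(
        ["ssh", remote_host, "rbd", "import-diff", "-", image_spec],
        stdin=export.stdout)
    export.stdout.close()
    if export.wait() != 0:
        raise RuntimeError("export-diff failed")
```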
C
But it is sort of a nice enterprise-y feature that's not present in any other, you know, open-source software-defined storage type systems, kind of. I don't know if there are questions about that; we can talk more about it. I can also use this as a catch-all slot for any of the other just random stuff in RBD that we want to address over the balance of the cycle. I think one of the things we talked about was enabling caching by default, or at least the option.
B
There's an option, writethrough until flush, that's meant to be safe with older guests that don't necessarily send flush operations down. The cache stays in writethrough mode until it sees the first flush come through, and then it switches to writeback, since that shows the guest supports it.
B
And for a lot of things, people will have to configure it in the system that's managing their VMs anyway, since those will always have their own default value there that kind of overrides whatever default we set. But it would help for more general operations, like command-line import and export, that kind of thing.
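For reference, these cache behaviors map to real client options, which can be set in ceph.conf or passed as librados config overrides, e.g.:

```python
# Enable the RBD cache with the writethrough-until-flush safety net,
# using real option names via librados config overrides.
import rados

cluster = rados.Rados(
    conffile="/etc/ceph/ceph.conf",
    conf={
        "rbd_cache": "true",
        # Writethrough until the guest's first flush, then writeback.
        "rbd_cache_writethrough_until_flush": "true",
    })
cluster.connect()
```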
C
Let's see, one of the other things: there was an ObjectCacher fix that Haomai found when he was doing performance testing, where we fixed the easy one, but there's a different variant. It still had an iteration, when doing writeback, to identify which objects were dirty, or something like that. Yeah.
B
That ties into that. Then I think we discussed a little bit about maybe making librbd keep better track of which objects had dirty blocks in the first place, to avoid even iterating through everything.
B
There are some generic things that we might want to change for the performance work, things like adding more tracepoints, both for the ObjectCacher and for librbd itself, to see where things are slowing down.
C
Yeah, yep, I was going to say, to talk about tracepoints and tracing: Adam Crume is looking at some of that over the summer too. I think one of the cool goals is that you could have an existing running QEMU/KVM process, or any other librbd client, attach to it over a socket, and start slurping off trace information. So you can capture a workload and then replay it later, model it, or do whatever.
C
Let's see, an rbd df command is something that has come up. RBD images are thinly provisioned, so if you do ceph df you find out how much space is actually used in the pool, but you don't know how much you've sort of promised to users by creating images that are, you know, hundreds of terabytes in size or whatever. So rbd df could add up that total separately.
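A sketch of the provisioned total such an rbd df could report, using the existing python-rbd bindings:

```python
# Sum the sizes users have been "promised" (thin-provisioned capacity),
# as opposed to the actual usage that "ceph df" reports.
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")

provisioned = 0
for name in rbd.RBD().list(ioctx):
    image = rbd.Image(ioctx, name, read_only=True)
    try:
        provisioned += image.size()
    finally:
        image.close()
print("provisioned bytes:", provisioned)
```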
C
You can also run LIO to re-export RBD as iSCSI, but if you want an HA situation, where you do multipath to multiple targets and that's handled properly in the backend with locking and blacklisting and so forth, so you do the failover in a safe way, there's a bunch of, probably, Pacemaker-like stuff that needs to be done to do that correctly, and I don't think there's anything,
C
at least anything open, that does that right now. But having sort of a robust iSCSI backend is something that people definitely want to see for getting legacy environments migrated over.
B
The interface for that might be like a memory-mapped region that you read and write to, for receiving commands and setting responses once they're complete.
C
It might be useful for RBD, or for the OSDs, to just have sort of an ephemeral map in memory that's sort of tallying up the I/Os they're doing, broken down by client, that the Calamari bits could slurp up and publish somehow. So if you wanted to, you could get a snapshot of, you know, over the last 10 seconds, or a minute, or whatever: these are the clients that are doing the most work in the cluster.
C
What do you think about having the OSDs, since they're counting all this stuff internally anyway, just keep track of, like, these are the top 10 clients that are doing I/O with me, and then having that be something that Calamari or something else could slurp up and aggregate across the cluster? So you could see these things aggregated in, like, a ceph iotop or whatever.
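A sketch of the aggregation half of that idea; where the per-OSD top-client counters come from is the assumed part, since no such interface existed:

```python
# Merge hypothetical per-OSD client-op counters into a cluster-wide
# "ceph iotop"-style top-N view.
from collections import Counter

def top_clients(per_osd_counts, n=10):
    """per_osd_counts: iterable of {client_id: op_count} dicts."""
    totals = Counter()
    for counts in per_osd_counts:
        totals.update(counts)
    return totals.most_common(n)

# Example with three OSDs' counters:
print(top_clients([
    {"client.4201": 1200, "client.4388": 300},
    {"client.4201": 800, "client.5120": 950},
    {"client.4388": 400},
]))
```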
F
I mean, in Lustre you do it with the job ID, and then you look that up via the scheduler. But that's obviously use-case specific, and it's easy because you already have the scheduler server.
C
Yeah, I mean, I wonder if it would make sense to actually bump it up a layer. So the client says, I have a workload X that I want to associate my I/O with, and it goes to, not the monitor, because Greg will beat me about the head and shoulders, but somebody, and says I want to register this workload, and it gets an ID back.
D
You can either compress it or just use the token itself, sure. There's no real need to have a global lookup table.
D
I don't think so, I suspect; if you just compress the resulting blob, you wouldn't wind up with all of that token explosion. Yeah.
C
What else? So, high-impact RBD priorities: what other things are there?
B
There's been some interest in, and work on, copy-on-read, which means that for clones, instead of only copying data from a parent image when you do a write, also when you're reading, you read the entire object instead of just the section you need, and then write that out to the child image. So it's kind of like an online flattening, rather than the standard offline flattening.
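A sketch of the copy-on-read path just described; the object-level helpers here are assumptions for illustration, not librbd API:

```python
# Copy-on-read for a clone: on a parent hit, pull the whole backing
# object into the child, so reads incrementally flatten the image.
def read_with_copy_on_read(child, parent, object_no, offset, length):
    data = child.read_object(object_no, offset, length)   # assumed helper
    if data is not None:        # child already owns this object
        return data
    whole = parent.read_object(object_no, 0, parent.object_size)
    child.write_object(object_no, 0, whole)               # online flatten
    return whole[offset:offset + length]
```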
A
That was great.
C
Yeah, I was just going to say, librbd already lets you do fancy striping, where instead of just chunking the image across objects, you can stripe across sets of objects with a small stripe size. That's mostly useful for workloads where you have something that is itself like a journal, say a database that's doing a bunch of small writes sequentially, and you want to spray those across multiple disks, instead of just hammering on one disk for a while and then hammering on the next disk.
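That fancy striping is exposed at image-creation time; for example, with the real rbd CLI flags:

```python
# Create a format-2 image striped across sets of objects with a small
# stripe unit; the flags shown are the real rbd CLI options.
import subprocess

subprocess.check_call([
    "rbd", "create", "rbd/db-volume",
    "--size", "102400",           # 100 GB (size is in MB)
    "--image-format", "2",
    "--stripe-unit", "65536",     # 64 KB per stripe unit
    "--stripe-count", "8",        # spread across 8 objects at a time
])
```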
C
So it's just adding that support to the kernel side too; right now it still just does the chunking, not the fancy striping modes. That one is a bit of work, though, because the code basically needs to be restructured to do more scatter/gather, and sort of piece I/Os together, to map one request onto multiple requests and dispatch them in parallel, and it doesn't do that yet. The file system supports this,
C
even though RBD doesn't, but the file system does it inefficiently: it will just do each little piece at a time as separate I/Os, so it'll just go really slowly. So hopefully, when we do this work, we can have that generic infrastructure in there to do it effectively, and that'll benefit both the file system and the RBD use cases. That's on Ilya's plate, I think, but again, it's a chunky piece of work, while he's also focusing on some of the...
E
Well, I don't have anything parallel working yet, so I have, I guess what you'd call, one thing semi-working for the non-parent case, when there's just a single image without the need to go to the parent, because the parent can have a different layout, which is important. So that's all complicated. Yeah, yep.