Ceph CDS G/H, 24 Jun 2014

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: CDS G/H (Day 1) - Cold Storage Pools

Description

https://wiki.ceph.com/Planning/CDS/CDS_Giant_and_Hammer_(Jun_2014)

24 June 2014
Ceph Developer Summit G/H
Day 1
Cold Storage Pools

A

Alright- and we are on to cold storage, pool discussion, this one's going to be a friend of ours now at hgst, Roger weeks Roger. If you want to give a rundown of what you had in the blue corner, there.

B

You might want to know I, don't forget to.

A

Now go get neck, oh.

B

B

Now is that I don't know how was it I can't keep my.

A

Your winner, I, can hear you I just want to make sure we couldn't hear ourselves. I fairway.

B

Yeah, sorry, I'm on a laptop with no headset, so um so this is a. This is a really high level overview blueprint. So what we wanted to get a discussion going on is what is it going to take for a pool, type or other other type and SEF to exist where something can be written and effectively, never rebalanced or rewritten, or if it has to happen very infrequently so we're talking about you know, I make the point in here. We talk about cold storage.

B

Think of this is the level above tape, but not actively access data. This isn't something you're spending a lot of time looking at, but it's still data you need to get to a lot faster than you need to get to tape and we could go into the use cases for that there. There are many, but that's the general idea here since I'm not a developer. I didn't put anything technical in this detailed description. I was hoping we could get some of that out of this discussion, but that's kind of the 50,000 level of you.

C

Is it possible to adjust mmm I'm sorry, okay, I'm gonna start talking now? Is it possible that looking for is more like an interface where the OSD will notice extremely cold objects and demote them to something that is insane like that's three, my Kevin three or a different interface? It could be plugging base. Make it to my lesson for agnostic.

D

I think I think what I think with the way I read this at least is what Roger has in mind is that this would be the pool that you're demoting, they're really really cold stuff too yeah.

C

D

Imagine a data center full of Ethernet drives, but you're powering them down.

C

Yeah, that's the hardest thing: I'm dead! Well, something like that mass over the OSD. You know to me well,.

D

So that the thought I had was what right now, when the pairing process gets going and it's setting up that it sets at PG temp mappings to avoid moving stuff just so, we can remain available until it has sort of rebalanced data. But what, if the that logic was policy based instead of doing the normal thing in the cold storage pool it? Just that's at the PG tech mapping to be where it was, and it doesn't change it until it drops below the desired redundancy at which point it.

D

You know inserts a new item that is sort of aligned with the current crush map, but not before that.

C

Well, it does, but we can't do it using the PQ come having.

D

Why is that? Because.

C

Anything linear the nipple of PCs is not a good thing.

D

Yeah I mean yes and no I mean that the OST map is linear in the number of pg's, no steeze right, it's just a small constant, so this is bumping up the content by something less than 100. Hopefully maybe this that's that's a price that you paid on your cold storage here right I mean that, ultimately, that's that's! The crash is forcing you to run into the spectrum where you are have very little metadata because you calculate it all go.

C

D

It's only by like a factor of 100, like literally, you could store every PG mapping expertly and to be under times bigger, but.

C

There could be a thing: I mean you get almost the same result by just not working. Do SDF right.

D

Kind of you you, you still want the system to to wake up and add because and move them around when either balance becomes problematic or because, because you drop below your your desired redundancy level, all right. So there's still our reasons when you do want to make things move you just want to be sort of, instead of like always going all the way to the sort of the moving target of crush. You drag your feet as much as possible. Is.

C

This a right application, / right stand in a problem. Is that what we're talking about fun? Valley?

C

Usually it's not that much I mean the whole point of that. The location is that it's a whole bunch of these performing a little work. Rebounce. The data ray.

D

Yeah I think it's just still more work than you'd want right. I think you add, if you had, if you have a thousand disks and you add 20 more then every oh steve is going to wake up and move a little bit of data and then shut down again. Yeah.

B

And that's good, that's kind of the that's kind of what we're trying to avoid here.

E

B

If you don't remove, is that data at.

E

All then, how do you put any new data on the OS days? If you're I mean the painting temp, don't ethics, that because then you just don't have.

F

E

On the new OS seus yeah, that's.

C

F

C

At all, because if you add another, the additional 2003, so let's there pepper Raven with you either you rebalance the data where you have 20 health, standbys right, yeah,.

E

So I mean it sounds.

C

E

This use case you pretty much need an explicit object mapping and maybe there's some clever work around, but I'm struggling.

F

Known it no say it ain.

E

T mapping, because you don't want to move the data, which means you can't move any prior p.

C

If you try to minimize data movement in the face of disk failure to employ, they have a pool of some number of hops dem boys when the mamaroo text. That knows he has failed. It cycles one of the hot standbys into that crushed position, if not the OSD ID, and that new disc gets logically the same data right. That's that's! What we're talking about well,.

E

I mean if you read the blueprint I mean: maybe your honor should start talking and but.

F

C

Also there well.

D

That's another.

E

Option Yeah right like trying to.

C

Tell me before officially.

E

Sprint or more evenly spread data around and stuff.

C

Well, yeah, so the goal is that so it's a trivia system for fly so you're trying to handle this value right. So when you is, is it that you think the disk might come back and therefore you don't want to waste the workout or you just generally, don't want to risk reshuffling data because you're only interested maintaining a hundred deaths. Just even if you have a tap connector.

A

There are some questions coming in. Irc might be good to roll into this, or we let them fall too far. Behind.

A

E

A

Placement groups could be range partition rather than hash partitioned asking about layering data.

A

B

So someone explained why range partitioning would be how that differs because I, that's beyond my Ken I guess.

C

He wants to be to that, but basically ranges are in the same. Tv is out to be racist.

F

Yep, if you arrange partition, you can say that you will never touch the stuff, that's written initially. What once it's written? It's filled up. You know the placement groups filled up. Then you move on.

D

Assuming that the that the range is the range of things you're talking about ours, it's like the object names like they're they're named by time stamp, or something like that. Ok, so.

F

D

If you're, inserting randomly within the range than it, which is sort of what happens currently, then doesn't help right.

F

I think notes you.

D

Do that, though, once once you have PT groups that are range partitioned, I assume you mean sort of arbitrary ranges where you can say this: PG is a little bit bigger a little bit smaller or I would adjust the range for this specific PG. All right.

F

The advantage of being on a blue now we can make of it yeah. Yes, maybe a welded partition, maybe everything's fixed, sighs, a lot of unknowns, whitening right.

D

I mean so if it is a if it is a so if it's a partition, that's explicitly specified which is sort of what I think a lot of systems do, or they say, like this partition of the hash range maps to this server in this partition, the hash range, a range of the hash range met whatever, then, you have metadata for every PG, in which case you're sort of where we started with. Where you have you say this PG is statically mapped to these.

D

These devices I think, basically, that the difference is that right now we're partitioning a range of hashes and and sort of what you're suggesting is you would partition based on something else like the object name which conceivably you could do, you would have to have per PG metadata, that's sort of specifies the start and end key for every ethnic group, but you could do that. I think that the challenge is that.

D

Yeah, that would be a bad idea, though, because you could have you know every too many objects that start with foo and they all pile up in one PG and suddenly I have to split that and then the PG splitting becomes totally ailing, non-regular and yeah I think that'd be pretty hard I'm, not sure it really buys us anything here.

F

Choice yeah, it might only work for a small number of devices know when I was at Yahoo. We did some range partitioning on some of our out early yet, but let's pretend it's just either a hex range or or date staff, and you know you're, never going to go back again. You know you can split it. A.

D

F

Don't know for sure, there's a large number of use cases but I think there's a few of them. Yeah.

D

Yeah, I think that the trick for me is that, right now we don't ever contemplate the idea of moving objects between place and groups, except in that sort of context of a split.

D

And so, if we did, this, we would have to have we'd probably have to like put restrictions on the way that those splits happen like, for example, the last TG in a pool, as always like some key to the end of the possible key range but like if all your objects were prefixed by a date, then that last PG would be like the current time to you know the infinite future and then, when you have, whenever you do a quote-unquote, PG split you'd be carving off the PG.

D

That's at some boundary look at that range and once it's sort of no longer the end one, it's sort of fixed for all time, which I think captures the use case. You're talking about and would conceive the way, I think fit into the framework. But it'd be it'd, be it'd, be tricky but again, I think that's sort of a little bit different than what what we're talking about here.

D

It seems like the thing I mean the sort of the general motivation for what Rogers talking about is wanting to have devices powered off so that, when you add, add new stuff you're, not sort of jeseline data to rebalance yeah.

B

Or if or if they're not powered off their devices, where the overhead of doing the rewrites from rebalancing is sufficient, that we don't want to do it on on that box and.

D

At the general problem with that is, if you have sort of one big pool that we, if you ever need to expand the pool, then you you don't want to not rebalance. You do want to rebalance, because otherwise all the new stuff is empty and you need at least you need some separation within the new category in order to get the like separation between racks and replication and so forth for the variability and redundancy you're looking for um actually.

B

Guys I know in some ways this is actually kind of diametrically opposed to house f is designed so that that's right I wanted to talk about it and see if it even make sense, but but.

D

But, but if this, if that's actually what you do want, then I think what you actually want to do is just create it. When you add, like your next three racks of cold storage, you just create a new pool, that's just those three racks and you fill it up and when it's full, then you, you know, turn off the spigot and then at deploy your next three racks and you fill those up right.

B

So we had a toilet when you deploy, we have a pool yeah when you deploy, you have a pool, you fill it up, you don't add anything to.

D

F

C

B

C

But the goal, though, when you're, adding stuff to these cold pools, be what's what I've seen? Is that because you don't plan on reading back at a these objects anytime soon, you actually just want to write to wherever you wrote the last night right. That's that's the biggest thing, because you want to wake up. Whatever is already awake. You don't want to wake up random stuff in the past that pretty much completely rules out any kind of past base placement, mmm I, cracked. So far, pretty.

B

Much yeah with.

C

The with the one pool, / rack thing: you're waking up the rack. You just woke up, but it's still possibly a larger unit than you really want to wake up right. No, that's not just for the digitally. That's that's! Why you're a cop yeah! um That's.

C

Yeah, that's going to be really strange, yeah.

D

I mean it's at the end of the day, it comes down to how much metadata you're going to spend to find the data, because if you have sort of infinite metadata that you're willing to spend to locate stuff than just you know, wake up three, two at a time right, dislike stream of stuff, two, three of them until they fill up and then go into the next three and carefully choose them, and you just have this huge database that says where everything is well. Madame is a scalar right.

D

Exactly you get are exactly the right pattern you want, whereas.

B

A piece of middle so basically we're basically we're creating an index then, which.

C

Cell does it out today right if you're, if you're, if you are, if you do, have control of the key things and if you're doing range finishing it sounds like you do, then you can still use crush to locate the pg's right, you're, just changing the way objects map on the pgs and you're, making the number of gigi's very dynamic, yeah right. So the hard so bet that all fits in okay to the OSD.

C

The part that doesn't fit in is that the OS DS are constantly looking to maintain balance, and that means they don't want to go to sleep. That's the problem not sure about right.

D

I think so, yes, and now I mean I. Think that the simplest thing this is sort of shifting topics to like spending discs, but the simplest way to implement spin down would be that the OSD stays awake and the little arm thing on there, like you, can a drive stays awake, but the platter spins down. So you like power off most of the drive, but it's still awake and means as I'm alive I'm just I'm. Just you know, spun down we're.

B

D

Next layer, where you like literally power off the whole disk- and you tell Seth like this- is a weird hook that like does it power and ethernet and pokes the rack to turn it on or whatever, which I think it'd be more more work. To give you sort of that extra level of a power seat. But.

C

I think I'm to Android thinking about.

D

The first one and to.

C

Do that Theo SDS need to be able to defer persisting things like those d maps until client requests actually forces them to do so.

D

C

D

Think Molly these ones would have dude have like a very small amount of flash or something where they can like buffer some metadata or something right right.

B

And not to put too fine a point on this, but I also kept this blueprint high level, because there are other technologies that were interested in here. That we can't talk about yet because they haven't been announced. So I wanted to see if there was something that could work across multiple types of cold storage and.

C

B

Unfortunately, all I can say at the moment, but.

C

That's that's why I'm taking any plug-in architecture, that's capable of taking the information that comes out of scrub and arpit set stuff and offloads the data to something that isn't necessarily stuff might get interesting direction so.

B

Basically I'm I plug in that it generates an index to do what sages.

C

Alakina Bogut's perfectly, when we do a flush, we flush to a rate of cool, but we don't necessarily have to write.

D

So I mean I, think I think what you're you're going towards damn is sort of the the other tearing blueprint they like the cold tier blueprint or basically the object becomes a symlink that just says over there. It's sort of right there and that the way we caught deflated before was another rate of pool, but it could be, you know, tape at offset. You know, take number 237 what.

C

Would the way I could be a bubble, but I I think there is actually some knowledge be well something.

B

Because that's well what one of it one of the motivations one of the motivations I had reading this was looking at the cached hearing that you guys added recently and wondering ok. Well, there's there's this to layer tier now and looking at the architecture might even sure if it's possible, but kid could it be made a three level tier 2, where you've got basically hot, yet hot data warmed in and cold data. That's.

D

That's that's kind of what we are. Yes, that's what we had originally contemplated. So you have the base tier that the cashier, which has that stuff that based here, which it flushes to you and then when stuff gets really really cold. It gets punted off somewhere else and there's like a pointer that says, go look over there right.

B

D

Like a symlink that could be into another rate of spool, it could be into a you know, totally different storage system. It could be onto like a tape. You know archive or something like that. Ok,.

B

So let's say that you know: let's save the drive, had an APA, let's see the dr had an api exposed for a key value store. Could it be by there that's.

C

It definitely I'm actually, that's that's also straightforward concept: yeah the radios in talks with storage is abstracted and we actually have a couple of backup and set are capable of speaking key values. So thank you.

D

Right so I think so going back to the going back to the beginning. Ok,.

F

D

When again, I think you have yeah these extremes, you have the extreme. Where you have this huge index, that's that's exactly where the object is and you get to decide exactly where it goes and you carefully choose drives. You know that already powered up something like that versus where stuff is where it's like I, don't want to store any metadata, so I'm going to calculate everything on the fly and and move things around when I need to and you're you're sort of. I think what you're looking for somewhere in the middle, where you're.

B

Willing, I don't want to we Eric attend Olivia I, don't want to Ryoga tech to all of stuff. So.

D

I think I think that the easy, the first easy thing would be some sort of explicit PG mapping or you can limit some of the EG stuff moving around, maybe that'll that'll. That might give you like, maybe twenty percent or thirty percent, maybe more or sort of the the phyllo thing that we talked about where, when you deploy big chunks of storage, you create just separate pools and you sort of fill them up and then fill up the next one, because it's sort of like right once and rarely delete type archive data. That's.

B

D

Option there's a suggestion on IRC channel just a moment ago, where you know that the problem is that your your hash, distributed storage is spinning up random disks, as you have. The sort of trickle of data coming in I mean this sort of. Obviously some there is that you just have it buffered somehow, so you have like a cache tier, even like a ratos cash dear, that's getting all the ingest stuff and that everyone's wanna go and you flush the whole thing.

D

If they spin up the a bunch of racks and you do it to rights, and then you spin it all down again, um which I think is conceptually simple and architectural II simple as long as you don't architect things so that, like it's, some stupid thing where you can't power up more than five percent of the drives without glowing your power circuits or something which right.

B

So basically you're limiting so basically you'd be using a cached. Here is basically, you write things to the cash at host. Sick, yes, host, sequential, writes, and then it writes all that stuff sequentially out two disks yeah.

D

It's not strictly sequential, but it's a it's a buffer, so you can leave everything powered off on the back end. Yes, this is assuming you have you ingest, but never you're, not reading any of the data right.

F

D

Then, whenever you're casting fills up, you tie up all the back ends and you do a big flush and you just you know, push it all out your cache of empty. You get any powered all down, and then you let it fill up over the next couple days or whatever.

D

That's what I think that's another another model that would work here. Okay, I think one sort of one last thing that I want to mention. This is mostly for Sam I'm, back up on the PG temp topic, but maybe PG PG temp, isn't the tool or like a policy around how you use PG temp, is in the tool, but you'll notice.

D

There was a there's a similar query on the on the email list earlier today about disabling crush in sort of using a different algorithm I'm wondering if what we want to add in the postie map is a as an exception. Mapping similar to PG temp. That's called like PG force where you can, just if you want to have just leave the crush map empty and just literally enumerate where each PG goes, and so the admin could guess. I I forced it over the I, want it there and it would do it right. Little B and.

C

Then you can have my grad also say: don't feel the catch is crush. Metrically handles the failure case.

D

C

D

Not all I, just it you have to you- have to manage it, but I think there are cases where you're like I, think you're like they're, like you want it. You want to prod at the cluster like make it do something for some reason that, like we aren't contemplating right now- or you literally just want to have like essential thing, that's just deciding it sees the failure and it says: okay, move that piece of data over there. Like you really want some external agent, that's orchestrating all the movement. You don't want to.

D

Do you don't want to sort it if you tap it.

C

And get the crushing odds if you do that? What you do, if you do, that, you don't need a hug excuse for a fee.

D

Exactly yeah right or say you see, you are using crush and it gives you some it gives you some. You know a bell curve of how that what the utilizations are and you're literally like okay I'm just kind of this agent, that's going to move ten percent the pgs to make my balance perfect right. Instead of doing the sort of exceptional case, that's whatever so I think the only the only caveat there is that I think those mappings need to.

D

In addition to like what the actual desired mapping is, they need to have a priority associated with them, so you can identify like who set the mapping and therefore who's able to change it guys right now, PG temp is sort of one hundred percent of them by the OST peering code, and you.

C

Can use it anyway as an admitted it overriding the PG deals. These will need the ability to write to override the echo exactly.

D

Yeah right right, yeah, so anyway, okay, so I think that's that's another thing that that we should do at some point, but another piece of the puzzle that could be used in this. In this context, okay,.

D

Let's see oh I.

B

Did this on Spears.

D

Thing up that was interesting. That and I think we talked about that before, but that I thought. That was a good idea too. Yes,.

B

I mean it's basically a raid methodology, but that doesn't mean it won't work. No right.

C

Well, it's it's creating some capacity for a static or more set on having it.

D

Means your recovery is slower because you're bottlenecked on that one disk, but it gives you the static map. Well,.

C

Sort of you were going to recover back to that one disk anyway, so you recover you to get revenge, but the data loss window. It is the whole time instead of the part of them right. That's the doll yeah.

D

Yeah anyway, okay.

A

We were wrapping this one up. Are we recovered.

B

Everything you wanted.

A

B

I think so I got some good notes in the etherpad.

A

Cool beans, cool all right. Well, that's the case! Then I'll get the video flipped here and we'll do the last one, which should be short, excellent.