Ceph CDS Hammer, 29 Oct 2014

From YouTube: CDS Hammer (Day 2) - CephFS: Forward Scrub

Description

http://goo.gl/U4b70r

29 October 2014
Ceph Developer Summit: Hammer
Day 2
CephFS: Forward Scrub
Greg Farnum, Sage Weil

A

All right, the next session we have here is CephFS forward scrub. Sage, that was you and Greg, so one of you can fight for the right to party.

B

All right, I'm going to nominate Greg.

C

Right, yeah.

C

So we've been talking about this off and on for, oh man, a year and a half now. The first email went out in January last year, and we're finally doing work on it.

C

So this is forward scrub, and backward scrub or repair, whatever. The two are the mechanisms that are going to compose CephFS's fsck. The name comes from the fact that with forward scrub we start with the file system hierarchy at the root of the tree, just slash, and then descend through all the metadata, finding everything in the hierarchy and verifying that the hierarchy is self-consistent. That will detect things like write errors that happened because of a bug or something.

C

It will detect things where the objects, the data objects, are gone. Wait.

C

Well, eventually it can detect things where the data objects are gone but we still have references to them. But it won't do things like say, oh, we have files which have been removed from the tree and no one remembers them anymore. So the backward scrub is a system in which we go look at the raw data in the RADOS pools and then try to map that back into the hierarchy, and that's a really hard problem.

C

So we started with forward scrub, and there's some code up for review now, which is the first part of that. I put a link in here somewhere to the actual pull request; it's the wip-inode-scrub branch. That includes a bunch of mechanisms that promote the MDS internal operations into things that we can work with the same way as client requests, because architecturally we've had a problem.

C

We had this thing where, basically, a client request starts at the beginning of an operation and then runs through and grabs all the metadata locks and things it needs to process.

C

If it runs into something, we put it on a list of "I'm waiting for this lock", and then, when it's possible to get that lock, the request gets retried. But internal operations all just assume that you've already set that up and that it's going to succeed, so there's no retry or anything. So we have a bunch of infrastructure that allows you to create first-class internal operations that look to the system

C

Basically like client ops do, so you can have admin sockets or other MDSs or whatever say "I want you to do this thing", and it will gather up all the locks, wait until it can do so, and then run through the operation. So there's that infrastructure, and then the actual code to validate a single inode, and an admin socket interface that lets you say "I want you to scrub /home/greg", and it'll go and look that up.
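
The park-and-retry behaviour described for client requests, extended to internal operations, can be sketched in a few lines of Python. This is a toy model, not MDS code: `LockTable`, `Request`, and `dispatch` are hypothetical names invented for illustration, and real MDS locking is far richer than one holder per named lock.

```python
from collections import defaultdict, deque


class Request:
    """A client request or an internal operation; both take the same path."""

    def __init__(self, label, needed, internal=False):
        self.label = label          # human-readable description
        self.needed = needed        # names of the locks this request needs
        self.internal = internal    # only a tag; dispatch does not care


class LockTable:
    """Toy lock table: each named lock has at most one holder."""

    def __init__(self):
        self.held = {}                      # lock name -> holding request
        self.waiters = defaultdict(deque)   # lock name -> parked requests

    def try_acquire_all(self, request):
        """Take every needed lock, or park the request on the first busy one."""
        for name in request.needed:
            holder = self.held.get(name)
            if holder is not None and holder is not request:
                self.waiters[name].append(request)
                return False
            self.held[name] = request
        return True

    def release_all(self, request):
        """Drop the request's locks; return parked requests that may retry."""
        retry = []
        for name in [n for n, h in self.held.items() if h is request]:
            del self.held[name]
            retry.extend(self.waiters.pop(name, ()))
        return retry


def dispatch(locks, request, log):
    """Run the request if it gets all its locks; otherwise it stays parked."""
    if locks.try_acquire_all(request):
        log.append(request.label)
        return True
    return False
```

In this sketch a scrub of /home/greg that collides with a client request on /home is simply parked, then retried from the top once the client request releases its locks, which is the property the internal operations previously lacked.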

C

It'll look up that dentry or that path and say, oh, this is a directory inode, so it'll go through and validate the backtrace on the directory object, and that the directory is self-consistent with respect to its rstats, and things like that. And work has started on what I'm calling the scrub stack, which is how we convert the idea for the algorithm we've had for a long time, for recursively doing this to everything in the system, into actual working code.

C

But that's not ready for a PR yet. I'd been hoping to have it in before the session, but several things were going on over the last week, week and a half. But in the blueprint I have defined the key data structures that go with each of the cache objects, the CInode and CDir and CDentry: the scrub metadata associated with them, how those interact and are related to each other, and how we use that to go through and spin off these validations on each individual item and then work our way back up through it.
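
One way to picture the scrub stack is as an explicit stack that descends into a directory before validating it, so that every child is checked before its parent. The sketch below is an assumption about the general shape only; `Node`, `scrub_started`, and `forward_scrub` are hypothetical names and do not correspond to the actual CInode, CDir, or CDentry fields.

```python
class Node:
    """Stand-in for a cached inode; `children` is empty for plain files."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.scrub_started = False   # True once children were pushed


def forward_scrub(root, validate):
    """Validate every node under root, children before their parent."""
    stack = [root]
    order = []
    while stack:
        node = stack[-1]
        if node.children and not node.scrub_started:
            # First visit to a directory: descend before checking it.
            node.scrub_started = True
            stack.extend(reversed(node.children))
            continue
        # Leaf, or a directory whose children are all done.
        stack.pop()
        validate(node)
        order.append(node.name)
    return order
```

Keeping the traversal state on the nodes and in a stack, rather than in recursion, is what lets a long-running scrub be paused and resumed between items.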

C

I'm not sure how much of that we should talk about now, versus doing more interesting things like talking about the stuff that we haven't planned as well. Is John here? Hi, John. Okay, thanks for waking up; we appreciate it, from the United Kingdom. So I guess it was last Thursday that several of us sat in a room together and started talking about how we could set up the reverse scrub. Or actually, not how we can set up the reverse scrub.

C

We talked a little bit about it, but a lot of it was actually about what individual pieces we can build to start recovering from certain cases we identify, pieces that will also be useful for the final backward scrub as a whole. We talked about that for a while, and I really want to pass this off to John or Sage, because they actually wrote down tickets and I did not.

D

So, I mean, the first bit of backward scrub stuff we were talking about was filtering by path, for when particular dirfrag objects are lost, and then inserting those recovered paths directly into the backing store.

D

I think quite soon after that we're probably going to want to insert via a running MDS as well. It kind of sucks to have to take your whole file system down and flush your journals in order to insert the updated metadata. So I think that will probably have a similar kind of infrastructure requirement: the ability to inject that recovered metadata by some kind of backdoor.

D

The other add-on to that which I can imagine coming quite soon is wanting some kind of object class for when we're doing this iterating. Yeah.

C

Yeah, so there's actually a mechanism for that. It's PGLS filters, which still exist in the code base, right, Sage?

B

Yeah, it needs to be checked to make sure it's correct, but yeah, the basic infrastructure is there.

C

So it's actually possible. I think we need to write a new filter class, or, I mean, it's not a class, but write a new filter function that we can call. It says, you know, "decode the backtrace filter", and we pass along the data to it, and it will select. So when you do a PGLS with that filter, it'll look at the backtrace on every object, decode it, and compare it to the one we've sent along.

B

Yeah, I mean, it could be that for the initial version

B

Yeah, we'd probably use the filtering.

B

Okay, yeah: need a new PGLS filter for backtraces. One of the tricky bits there is that right now the filters aren't class-based or whatever. They're not in a

B

Class; they're just hard-coded into the OSD. So far, for everything that the metadata server has needed, we've just done that, because it's all, well...

D

All have long time before we diving yeah yeah.

B

So in this case we probably want to push that into a RADOS class. That means the PGLS filter op will have to specify which class and which filter function you're going to use, and the filter will get registered in the class mechanism to do that.
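
The idea of a listing operation that names a class and a filter function, with the filter registered under that pair, can be sketched as follows. Everything here is hypothetical and runs in memory: the `FILTERS` registry, the `register_filter` decorator, the `"cephfs"` / `"backtrace_prefix"` names, and the simplified backtrace (a root-to-leaf list of path components) are illustrative assumptions, not the OSD's actual interfaces or on-disk encoding.

```python
FILTERS = {}


def register_filter(cls_name, filter_name):
    """Register a filter under (class, filter) so a listing op can name it."""
    def wrap(fn):
        FILTERS[(cls_name, filter_name)] = fn
        return fn
    return wrap


@register_filter("cephfs", "backtrace_prefix")
def backtrace_prefix(obj, wanted):
    """Keep objects whose stored backtrace starts with the wanted path."""
    backtrace = obj.get("backtrace")
    return backtrace is not None and backtrace[:len(wanted)] == wanted


def list_objects(objects, cls_name, filter_name, arg):
    """Toy listing: return names of objects accepted by the named filter."""
    accept = FILTERS[(cls_name, filter_name)]
    return [o["name"] for o in objects if accept(o, arg)]
```

Because the filter runs next to the data, only the matching object names travel back to the caller, which is the point of doing the comparison on the OSD side rather than fetching every backtrace.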

B

Let's see. I think the other thing that we talked about was the idea of taking a backup or a snapshot of your metadata pool before you start doing any of this scrubbing and repair stuff. We talked about fixing rados export and import yesterday, and I think everything that we talked about there should apply, so you could actually make a full backup of your metadata pool, including all the omap, all the metadata stuff.

B

I think we could also use a pool snapshot, but we'd need to make a pool rollback function for rados that iterates and rolls back for that to be workable. Either way, there's that piece of being able to go back if you try a repair and it doesn't work out.

B

Let's see. So...

D

Just to be clear: we're talking about the import/export tool in addition to snapshots? We're not looking to make a choice between those?

B

Yeah, or we do one or the other, I think. Right? You can either do a snapshot, try the repair, and roll back if it fails, or you could make a backup, and either one would work.

D

If we're choosing between the two, I lean strongly toward making a backup, but I don't know; I mean, that's kind of a subjective thing. My feeling is that once you're in disaster recovery land, you want to be using something as brutish and simple as possible.

B

Yeah, well, the rados export is needed for all manner of other things also; it's totally non-specific to this, so we get that for free. But the pool snaps also... I mean, the only thing that we would need to make pool snaps a viable way to do it is to make rollback. That's all. Well, I mean, it's easy: you list objects and you call rollback on every object.

B

Every object, right. All right, okay, so it's not that... yeah, okay.

C

Brutish and simple, yes. But I thought you were thinking of, like, a pool rollback function; then, yeah, it would be

B

A little bit less easy, sure. Yeah, okay.
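
The "list objects and call rollback on every object" approach can be illustrated with an in-memory stand-in for a pool. `ToyPool`, `rollback_object`, and `rollback_pool` are hypothetical names; this does not use or imitate the librados API, and the object names are made up.

```python
class ToyPool:
    """In-memory stand-in for a pool with named snapshots."""

    def __init__(self):
        self.objects = {}   # object name -> current contents
        self.snaps = {}     # snapshot name -> copy of all objects

    def snapshot(self, snap):
        """Record the current state of every object under a snapshot name."""
        self.snaps[snap] = dict(self.objects)

    def rollback_object(self, name, snap):
        """Restore one object to its snapshot state, or remove it if new."""
        old = self.snaps[snap]
        if name in old:
            self.objects[name] = old[name]
        else:
            self.objects.pop(name, None)


def rollback_pool(pool, snap):
    """Roll back every object that exists now or existed at snapshot time."""
    names = set(pool.objects) | set(pool.snaps[snap])
    for name in sorted(names):
        pool.rollback_object(name, snap)
```

One design point the toy makes visible: iterating only over the objects that exist now would miss anything a failed repair deleted, so the sketch walks the union of current and snapshotted names.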

C

Oh right, okay. Sorry, I was going to say: I actually did have a couple of things. Well, no, there was one important thing that we haven't talked about much that would actually be good for forward scrub, which is how we should be surfacing scrub errors to administrators for them to do anything with. I mean, we don't want to crash the MDS, but we can't just throw it to the central log. So I think maybe what we need is... like, the MDS health checks

C

don't do much right now, and I actually don't have a good idea of how those health checks are structured.

B

We kind of talked about this, I think, the other day. Or we talked about something similar. Well, maybe not. Yeah.

C

I guess we actually do have a health system now, because John wrote it for the client capability watching, right?

D

So, I mean, that relies on the ability to have something that is going to keep poking whatever the health check is, to keep populating it in the beacon. I think the most challenging thing about surfacing scrub errors is going to be having them indexed in such a way that when you go and repair an error, you can clear the flag for the defect.

D

So it has to be something where, when we run this metadata insertion from backtraces or whatever it is, that tool can work out which scrub errors no longer apply. I haven't thought it through, so I don't know whether that's going to be a by-inode or a by-path kind of table of scrub errors.

D

But it has to be something that the computer can look up in to clear errors, rather than relying on the user to go through and check, I think, because there could be a lot of them.

C

Is it sufficient to just have an error pin that pins anything we find a scrub error on in cache, and, like, an xlist or something, so we can just enumerate them on every health warn? Because that'll prevent them from being flushed out, right, so you actually maintain both versions of it.

B

I mean, you could, but then you have to put an xlist member thing in every object, and that's yet another thing that you're never going to use. I would be more inclined to just set a flag to do the preventing part. So the thing that we talked about the other day that I like somewhat is this: we were talking about backward scrub tools at the time, but we make them so that they write a structured log or journal of the things that they find, which can then be consumed by the next stage of the repair pipeline.

B

If we did the same thing for forward scrub, where a byproduct of the scrub itself is a structured log of the errors it finds that you could reread later, then that's something that the backward scrub could refer to, or that the administrator could query when they get the health warning

B

That says there was an error. You ask what the error is, and you can go look at it and read it in something better than a big log file.

B

So it seems like the key is just that we need to have enough information in memory to have intelligent health warnings.
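
A structured scrub log of the kind discussed could be as simple as one JSON record per finding, which a later repair stage or an administrator's query can parse back. The record fields (`ino`, `path`, `kind`, `detail`) and the function names are assumptions made for this sketch; nothing here is an existing Ceph format.

```python
import io
import json


def record_error(log, ino, path, kind, detail):
    """Append one scrub finding as a single JSON line."""
    entry = {"ino": ino, "path": path, "kind": kind, "detail": detail}
    log.write(json.dumps(entry, sort_keys=True) + "\n")


def read_errors(log):
    """Parse the log back into a list of dicts for a later stage or a query."""
    log.seek(0)
    return [json.loads(line) for line in log if line.strip()]
```

For example, a repair tool could call `read_errors` and select only the records whose `kind` it knows how to fix, leaving the rest for the next stage of the pipeline.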

C

In the case that we actually find a bad scrub, where what we have in memory totally disagrees with what's on disk, because our backtrace versions don't match or something, or, you know, our backtraces diverge but they're the same version or whatever, we don't want to overwrite what's on disk, because we don't actually know which of them is correct.

D

So I think having the log is important, but once you've generated that log inside a running MDS, you're also going to have to have an in-memory version of it in order to serve up, like, EIOs when people try to access those paths. You have to have something in the MDS cache. The persistence might just be the log, but you're still going to have to have an in-memory thing.

B

If we pin the things that we find errors on, then we don't need to use an xlist. Okay, that would be simpler. And then, if it's a set, an STL set of, you know, MDSCacheObject* or whatever, then the health warning can actually say there are 37 items that are, you know, whatever.

D

Does it have to say it in that tone of voice?

B

Yes.

D

There is an assumption running through this stuff that the set of inodes on which we find forward scrub errors will fit in RAM, which is fine. We just have to bear it in mind and maybe set some kind of artificial limit on the size of these structures, so that in pathological cases we don't fill up the RAM on the machine.
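
An in-memory set of damaged items with an artificial size limit, a count for the health warning, and a way for a repair tool to clear entries might look like the sketch below. `ScrubErrorSet` and its message format are hypothetical; only the `HEALTH_OK` and `HEALTH_WARN` status words are borrowed from Ceph's health reporting.

```python
class ScrubErrorSet:
    """Bounded in-memory set of damaged items, keyed by inode number."""

    def __init__(self, limit):
        self.limit = limit
        self.items = {}     # ino -> short description
        self.dropped = 0    # findings not stored because the set was full

    def add(self, ino, description):
        """Track a damaged item unless the set has reached its limit."""
        if ino in self.items or len(self.items) < self.limit:
            self.items[ino] = description
        else:
            self.dropped += 1

    def clear(self, ino):
        """Called by a repair tool once the item has been fixed."""
        self.items.pop(ino, None)

    def health(self):
        """Summarise the set as a one-line health message."""
        if not self.items and not self.dropped:
            return "HEALTH_OK"
        extra = f" (+{self.dropped} not tracked)" if self.dropped else ""
        return f"HEALTH_WARN {len(self.items)} items with scrub errors{extra}"
```

Counting the findings that did not fit, rather than silently discarding them, keeps the warning honest in the pathological case while still bounding memory.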

B

It could be that, I guess, we don't really have to pin them. It depends on what the goal is, right?

C

I mean, we don't want to overwrite the backing object data, right? And the only mechanisms we have for that are keeping them in the journal and pinning them.

B

Yeah, I mean, what inconsistency is this going to be? It's going to be, like, the directory inode that points to a directory object that's not found, or it's going to be the file inode where the backtrace on the file says it's linked in a totally different part of the tree. Yeah, the first case, and

C

We don't know that we have the versions in

B

The first case, where it's just a missing directory object: I don't know that it matters that we have it pinned. But in the second case we certainly would, because we would want to know later. If we do happen to come across another primary link to the same file that does exist, then we would still have the first one. So I think that it might be useful.

D

So what about the cases where we're detecting inconsistent rstats and that kind of thing? That can happen at any level in the tree.

B

That seems less... that's just the thing where you make a note and fix it.

D

Oh, you think you'd always fix those inline during the scrub?

C

For rstats, I think so, yeah. I mean, if you're comparing rstats and you've already validated the ones on the children, so you know that they're right, then clearly yours are wrong.

B

I think you'd always want to have a scrub that doesn't make any changes, yeah. But if you're doing a normal scrub in the normal case, there's no reason you wouldn't just fix it up and, you know, make a loud note in the log that says there be bugs lurking.
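
The rstat reasoning (children already validated, so a mismatch means the parent is wrong) and the choice between a read-only scrub and an inline fix can be sketched like this. The dictionary layout and the `check_rstats` function are hypothetical, and only a single recursive byte count is modelled; real rstats carry more fields.

```python
def check_rstats(directory, repair=False, notes=None):
    """Compare a directory's recorded recursive size with its children's sum.

    Children are assumed to have been validated already.
    Returns True if the directory was consistent when checked.
    """
    expected = sum(child["rbytes"] for child in directory["children"])
    if directory["rbytes"] == expected:
        return True
    if notes is not None:
        # Always leave a note, whether or not we repair.
        notes.append(
            f"{directory['name']}: rbytes {directory['rbytes']} != {expected}"
        )
    if repair:
        directory["rbytes"] = expected
    return False
```

With `repair=False` the function only reports, which matches the wish for a scrub mode that never makes changes; with `repair=True` it fixes the parent and still records that something was wrong.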

C

I'm not sure you actually want to make loud notes about that. But okay. I mean, we want to be able to find that, but I don't know that it's useful for an administrator to see. It's just like, we believe...

B

I want to know there are bugs.

D

Yeah, I would want them to tell support at some point, like, just the next time they happen to meet them.

B

You want it at warning or error or above, so that teuthology properly flags it as an error and the run fails.

B

That's the useful criteria here.

C

Okay, don't she.

B

Has anybody reviewed the forward scrub stuff yet? You started looking at it, John, yeah?

D

Okay, I've played with it, but I'm not claiming to have done a comprehensive job, and I think he should look at it too.

B

Yeah, okay, that's the plan.

C

Okay, I think that was the stuff I wanted to talk about for forward scrub, because, you know, we already designed it. This is the third or fourth CDS we've had it in, so there's just not much new to comment on, and it's the backward stuff that's more interesting.

A

So that's everything for that one.

B

So we can take, like, a...

A

I'd say this is a good time for a 10-minute break before we have to start the next session, so it's a good time to get up and stretch and stay awake. I will see you guys back here in about 10 minutes.

C

No, awake a little longer.