From YouTube: 2019-03-07 :: Ceph Developer Meeting
Description
Every month the Ceph Developer Community meets to discuss current work in the Ceph codebase, and coordinate efforts to minimize collisions and issues.
This monthly Ceph Developer Meeting will occur on the first Wed of every month via our BlueJeans teleconferencing system. Each month we alternate meeting times to ensure that all time zones have the opportunity to participate.
Meeting planning:
https://tracker.ceph.com/projects/ceph/wiki/Planning
A
A
B
C
C
B
B
So right now we're keeping it very simple. If we had wanted to, since it's in the OSD, it would have been neat to have a bitmask of the error flags, and maybe you could have a restricted set of flags that says we're only going to auto-repair if it's these things and not other things. But that would also impact the way it builds the list of inconsistent objects that are then going to be repaired.
B
So even if we did something like that, it would be to say we don't try to repair every object. We only repair, like you were saying, obvious media errors like an EIO, or, you know, subject to the flags that we keep, and we don't actually have a separate I/O flag. We just have "we got a read error", yeah.
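A minimal sketch of the bitmask idea being discussed, with hypothetical flag names (this is not the actual OSD scrub code): a full set of error flags plus a restricted mask of the ones that would be eligible for auto-repair.

```cpp
// Illustrative only: a bitmask of scrub error flags plus a restricted
// "auto-repairable" subset, so only obvious media errors (a read error /
// EIO) qualify. These names are hypothetical, not the actual OSD code.
#include <cstdint>

enum scrub_error_t : uint32_t {
  SCRUB_ERR_READ        = 1u << 0,  // the read itself returned EIO
  SCRUB_ERR_DATA_DIGEST = 1u << 1,  // data checksum mismatch
  SCRUB_ERR_OMAP_DIGEST = 1u << 2,  // omap checksum mismatch
  SCRUB_ERR_MISSING     = 1u << 3,  // object missing on a shard
  SCRUB_ERR_SIZE        = 1u << 4,  // size mismatch between shards
};

// The restricted set: only these errors are eligible for auto-repair.
constexpr uint32_t AUTO_REPAIRABLE = SCRUB_ERR_READ;

bool can_auto_repair(uint32_t flags) {
  // Repair automatically only if at least one error was recorded and
  // every recorded error falls inside the allowed set.
  return flags != 0 && (flags & ~AUTO_REPAIRABLE) == 0;
}
```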
C
A
That's an unrecoverable inconsistency or whatever, and you might want to raise a health alert only if you try to fix it and couldn't. Oh, the proposal is to add a new PG state flag for a failed repair, right, and maybe a different health warning level. I think "inconsistent"... is that an error or a warning right now? Is this a warning now?
B
C
B
As long as we're trying to repair everything, then at the end, when we're deciding, if the inconsistent state isn't going off after the repair, then we know we should set failed-repair. If we're not trying to repair everything, then yeah, you'd have to figure out which ones you did and did not repair.
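As a rough sketch of that end-of-repair check, with hypothetical state-bit names rather than the real PG state flags:

```cpp
// Illustrative only: if the inconsistent state does not clear once repair
// has run, set a hypothetical "failed repair" PG state bit so a different
// (higher-severity) health warning can be raised.
#include <cstdint>

constexpr uint64_t PG_STATE_INCONSISTENT  = 1ULL << 0;  // hypothetical values
constexpr uint64_t PG_STATE_FAILED_REPAIR = 1ULL << 1;

uint64_t finish_repair(uint64_t pg_state, unsigned errors_remaining) {
  if (errors_remaining == 0) {
    // Everything repaired: both flags can come off.
    pg_state &= ~(PG_STATE_INCONSISTENT | PG_STATE_FAILED_REPAIR);
  } else {
    // Repair ran but the inconsistency did not go away.
    pg_state |= PG_STATE_FAILED_REPAIR;
  }
  return pg_state;
}
```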
A
Then we want to just maintain counts of repairs that we do, so we know if it happened a zillion times on a PG or on an OSD. All right, let's see, on the OSD we don't have a place right now where we persist OSD-level statistics, so we'd have to add a field to the superblock, or add another object, just a little counter or something in the meta collection, to keep track of those.
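A sketch of the superblock variant of that idea, assuming made-up field and function names; the meta-collection-object variant would look similar but write a small dedicated object instead:

```cpp
// Illustrative only: persisting an OSD-level repair count.
#include <cstdint>

struct osd_superblock_sketch {
  // ... existing superblock fields would live here ...
  uint64_t lifetime_repairs = 0;   // objects this OSD has repaired, ever
};

void note_repairs(osd_superblock_sketch &sb, unsigned repaired_now) {
  sb.lifetime_repairs += repaired_now;
  // In the real OSD the updated superblock (or meta-collection object)
  // would be re-encoded and written back as part of a store transaction.
}
```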
B
Right, getting the repair count might be a little tricky, because again, that's part of recovery, and knowing that you did or didn't repair any specific thing. So if you just did the counting the way we output it now, where we say you have ten errors and we're trying to repair them, we assume we fix them when we start recovery. So if we tally it that way and then you had failed repairs, theoretically your repair count would go up like every week.
A
A
B
A
When that inconsistent flag is set, doesn't it have a count for how many objects are inconsistent, how many inconsistencies there are? Yes, so it could be that when we clear that flag, and we reset that inconsistency count, we just use that number as the number that we cleared, right? Actually...
B
A
A
A
A
Yeah, and the current idea for how this would work would be, if we set the default balancer mode to be the crush-compat mode, which means that there's a compat weight-set, a sort of adjusted set of crush weights that are independent from the normal crush weights you set based on the size. So when you add a new OSD, its real crush weight would be set to the size of the device, but the weight-set value for it is separate.
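A sketch of the two weights in play here, with hypothetical names; the gradual ramp-up of a new OSD's weight-set value is the part the balancer would manage:

```cpp
// Illustrative only: each OSD has its "real" CRUSH weight (set from the
// device size) and a separate compat weight-set value that placement
// actually uses, which a balancer could ramp up gradually for a freshly
// added OSD.
#include <algorithm>

struct osd_weights_sketch {
  double crush_weight;       // reflects the device size, set at creation
  double compat_weight_set;  // adjusted independently by the balancer
};

// Walk the weight-set value toward the real weight one step at a time.
void ramp_up(osd_weights_sketch &w, double step) {
  w.compat_weight_set =
      std::min(w.crush_weight, w.compat_weight_set + step);
}
```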
A
D
A
A
I wonder if, eventually, that should be the default behavior; it's kind of all there. The other thing is that it sort of walks you into the crush-compat balancer mode, because then you have this sort of separate set of crush weights that are independently managed from what you expect them to be. If you're using the upmap balancer instead, there is no such analog; you don't have that gradual ramp-in.
A
A
E
A
E
A
E
E
E
You're gonna get garbage. You can dump a gcore, but you have to dump a gcore of each thread individually, so you have to list out all the threads and dump them each individually. Then you can take all of those cores and go through them one by one in gdb, in a debugging environment, but they're not going to make a lot of sense, because they weren't all captured at the same time; they were captured at different times. So I see this is a problem. We need to have some...
A
E
Yeah, because there are certain things... The reason this came up is we were trying to verify that the user was seeing a thread deadlock. Now, the only way I know of to do that is to dump out the threads, or get a core dump, or you can set up abrt on the machine and ask it to dump core, and you will end up with a core dump if you've got the right core pattern and settings in the kernel.
E
A
E
A
E
A
I wonder if we could or should make, like, an admin socket command that will just dump a stack trace for every thread. It'll just iterate across the threads and dump the stack for each one. It won't be as precise as a GDB one, 'cause you won't have all the arguments or anything, but at least we'll know where you are.
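One way such a command could collect per-thread stacks, as a Linux-only sketch (this is not the actual Ceph admin socket code, and backtrace() from a signal handler is only best-effort): signal every task listed under /proc/self/task and have the handler print its own backtrace.

```cpp
// Illustrative only: dump a backtrace for every thread in the process by
// signalling each task ID under /proc/self/task (Linux-specific).
#include <dirent.h>
#include <execinfo.h>
#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

static void dump_backtrace_handler(int) {
  void *frames[64];
  int n = backtrace(frames, 64);
  // Write symbolised frames straight to stderr; a real admin socket
  // command would stream this back to the client instead.
  backtrace_symbols_fd(frames, n, STDERR_FILENO);
}

void dump_all_thread_backtraces() {
  struct sigaction sa = {};
  sa.sa_handler = dump_backtrace_handler;
  sigaction(SIGUSR2, &sa, nullptr);

  DIR *dir = opendir("/proc/self/task");   // one entry per thread
  if (!dir)
    return;
  while (struct dirent *de = readdir(dir)) {
    if (de->d_name[0] == '.')
      continue;
    pid_t tid = atoi(de->d_name);
    fprintf(stderr, "--- thread %d ---\n", tid);
    syscall(SYS_tgkill, getpid(), tid, SIGUSR2);  // deliver to that thread
    usleep(1000);  // crude serialisation so output doesn't interleave
  }
  closedir(dir);
}
```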
F
E
A
E
F
A
G
A
A
A
A
A
Couldn't you make it so that it warns if there are any crashes in the last 10 days that haven't been, like, acknowledged, and there's a new command that says "acknowledge this", and then the health warning goes away? Could do that, maybe.
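The warning predicate being proposed might look something like this sketch, with a hypothetical crash record type and the 10-day window from the discussion:

```cpp
// Illustrative only: warn if any crash in the recent window has not been
// acknowledged. The real mechanism would live in the manager, not in a
// free function like this.
#include <chrono>
#include <vector>

struct crash_record_sketch {
  std::chrono::system_clock::time_point when;
  bool acknowledged = false;
};

bool should_warn(const std::vector<crash_record_sketch> &crashes,
                 std::chrono::hours window = std::chrono::hours(24 * 10)) {
  auto cutoff = std::chrono::system_clock::now() - window;
  for (const auto &c : crashes) {
    if (!c.acknowledged && c.when >= cutoff)
      return true;   // at least one recent crash nobody has acknowledged
  }
  return false;
}
```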
G
I mean, it's good to have that data, but it really seems to me like a thing that either gets gathered up when someone is looking into issues with the cluster, or that gets automatically phoned home, not something that admins are gonna pay any attention to in any other circumstance, you know.
G
A
So the reason I noticed this was because I was just staring at a watch on the cluster, and I saw an OSD down and then it disappeared, and I was like, I wonder what that was. There wasn't actually a crash; I just saw it. I finally went onto that machine, and it was the out-of-memory killer, because these mira machines have, like, no memory on them.
A
G
A
G
Like, that's where it might be like, okay, there's clearly something permanently wrong with this daemon. But in most of those cases, you're gonna end up in a different alert system, like, okay, now that daemon's dead because systemd eventually stopped restarting it, or there's a missing logical MDS that is never...
A
I mean, not always. Like, say every time you scrub a PG it crashes, for example. Then once a day I know an OSD's gonna go down and come back up, and then you're not going to notice, right? Unless you happen to be looking right then, and you see it and you're like, oh, the OSD went down and up. 'Cause we don't even monitor that, and I think... I don't know, you're right.
G
Like, that's a great thing to notice, but that's a very, very precise trigger condition, right? Because, you know, unless you think the crashes are so rare, which they might be, that you can just always alert on a crash and make an admin acknowledge it. But yeah, I don't think that's gonna work. It might.
G
G
G
E
G
A lot of those media errors actually aren't turning into crashes anymore, and actually I saw a ticket today where someone was complaining about that. They're like, it automatically recovered because it was erasure-coded, and now I don't know that my disk is going bad. Please tell me something.
C
A
A
There you go, that's thinking outside the box. Yes, that would be so much better. Yeah, yeah, just call exit, like a clean exit. I mean, it could even be exit(1), I guess, but we can give it an exit code that makes it an error condition where systemd doesn't try to restart it, probably.
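A sketch of that shutdown path under the assumption of a made-up exit status; a unit could then list the chosen status in systemd's RestartPreventExitStatus= so the daemon is not restarted:

```cpp
// Illustrative only: exit with a distinguished status on a fatal disk error
// so the service manager can be told not to restart the daemon.
#include <cstdio>
#include <cstdlib>

constexpr int EXIT_FATAL_DISK_ERROR = 57;   // hypothetical, just not 0 or 1

[[noreturn]] void fatal_disk_error(const char *what) {
  std::fprintf(stderr, "shutting down: unrecoverable disk error: %s\n", what);
  std::exit(EXIT_FATAL_DISK_ERROR);
}
```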
E
E
A
Yeah, that would be good, I think. The reason why we... well, yeah, we don't have to crash and core dump and all that stuff right now; we don't have to do that. I think if we were to go one step further, it would be nice to add a nice message to the log that says, by the way, I'm shutting down because of XYZ, and log it and all that stuff.
E
A
A
A
And granted, we can't go through the normal shutdown sequence, but we could log a crash report. So, like, you'll notice in the one I've pasted, if you just crash and hit a segfault, it'll just have the backtrace. But if you hit an assert, it puts all the assert metadata there, like what file and line number and what the condition was. We could do something similar for an I/O error, like a special function that logs a crash report and says...
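A sketch of such a special function, with hypothetical names and output format rather than the actual crash-report plumbing:

```cpp
// Illustrative only: record assert-style metadata for an I/O error before
// shutting down, so the crash report carries file, line, and errno rather
// than just a backtrace.
#include <cstdio>
#include <cstring>

void log_io_error_report(const char *file, int line, int err) {
  // In the real daemon this would go through the same path that writes
  // assert metadata into a crash report.
  std::fprintf(stderr,
               "{\"io_error\": true, \"file\": \"%s\", \"line\": %d, "
               "\"errno\": %d, \"strerror\": \"%s\"}\n",
               file, line, err, std::strerror(err));
}

// e.g. log_io_error_report(__FILE__, __LINE__, EIO); before exiting
```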
A
E
E
E
E
A
A
So this is what the thing that Adam worked on did: it hijacked the threads and, like, made them execute this code and then go back to what they were doing, doing some trickery. I don't know what he did, but, tell you what, you could use something like that, basically. I'll see if I can talk to Adam. Yeah, I think it's on GitHub. It was called PMP, libpmp or something, poor man's profiler, that's what it was, yeah.