From YouTube: Ceph Developer Monthly 2020-09-02
A
Yeah, so I guess we can get started. This month at CDM we actually have one topic on the agenda, which is to do a bit of brainstorming about ways we could improve our ability to diagnose, analyze, and fix failures, especially teuthology ones.

So, let's see: the first item on there, about scheduling, is actually already taken care of by our Summer of Code student this summer. Shraddha has completed rewriting the worker as a dispatcher, which allows jobs to lock machines before they start running, so they never time out waiting for machines. And when jobs finish, or when they don't finish because they time out for some reason, because they get stuck or just take too long, we now gather logs at the end for those dead jobs. That's currently being tested on the mira queue, and hopefully we can roll it out to the smithis maybe next week.
B
Exactly. And then there's the cost of trying to reproduce bugs which are hard to reproduce, not to mention things like upgrade suites, which need more than the regular two machines, more than two nodes, sometimes five nodes. So that is one problem that's going to be addressed with her work as well.
A
Yeah, that's a good point. We could potentially add jobs that take a large number of machines and not have to worry about any kind of lock contention.

So I guess where I wanted to focus today was on the analysis side of things: when there are a bunch of failures, how we can make it easier to diagnose and figure out what's wrong.

The first two things on that pad we just talked about we've actually already done: we get logs for dead jobs now, and Scrape is now integrated, thanks to another student who was applying for Google Summer of Code. And the third thing there, Sentry, is now back up and running thanks to Brad's suggestion and David Galloway, and hopefully we'll be able to upgrade it to the newest version soon, which will allow us to have links back to Pulpito, as well as the ability to create that kind of dashboard.

So you can track over time what failure rates look like in a given suite, for example.

But in general Sentry gives you a history of when a failure occurred and how often it's been occurring, and you can perhaps even see when it was introduced: it shows up initially in some testing branch, and then you later start seeing it in master.
C
Josh, I was thinking this morning that a nice-to-have might be being able to run teuthology and set interactive-on-error from the command line.

So you can specify the original config.yaml without making any changes to it.

Because my workflow is generally: copy into a local directory, modify, run. Whereas with the appropriate command-line switches, you could just specify the original file and run it with, you know, a switch for interactive-on-error. We'd also need to check that we can... actually, I know you can specify an archive directory, and I do, but I currently delete the archive directory that is specified in the yaml file, and I think I did that because the command-line archive doesn't override what's in the yaml file.

Suite path is the only other one I can think of, and that's already there as well. Like I said, the query I have is that I can't remember whether the yaml file has higher priority than the command line, and therefore whether what you specify on the command line gets ignored if it's already set in the yaml file, and those two are always set in the yaml file.
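As a rough illustration of the workflow being discussed, here is a minimal sketch of letting command-line switches win over the job yaml so the original config can be run unmodified. The interactive-on-error and archive_path keys mirror the job-config options mentioned above, but the CLI wiring itself is hypothetical, not teuthology's actual interface:

```python
# Hypothetical sketch: command-line values override the job yaml, so the
# original config.yaml can be run without editing it first.
import argparse
import yaml

def load_job_config(path, cli_args):
    with open(path) as f:
        config = yaml.safe_load(f) or {}
    # CLI takes priority over whatever the yaml already sets.
    if cli_args.interactive_on_error:
        config['interactive-on-error'] = True
    if cli_args.archive:
        config['archive_path'] = cli_args.archive
    return config

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('config', help='original job yaml, left untouched')
    parser.add_argument('--interactive-on-error', action='store_true')
    parser.add_argument('-a', '--archive', help='override the archive directory')
    args = parser.parse_args()
    print(load_job_config(args.config, args))
```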
A
Another way to approach this might be to think about some failures today that are very laborious to debug and analyze, and consider whether there are ways we could make that more automatic or simpler.
B
From my perspective, once we get Sentry in place, a lot of our problems are going to be easier to debug, especially because we will be able to track when a problem started occurring, and sometimes it could just be in a testing branch or in master, which we can at least go back to and look at what merged after that. Things like that are a problem today, because you have to go and look manually, or bisect, or do other things like that.

We do have things like cron jobs which run master or any particular branch at some frequency, but that still doesn't give us the kind of accuracy that we can get with Sentry.
A
Yeah,
that
would
certainly
help
a
lot,
especially
if
we
start
running
the
sweets
more
frequently
like
and
give
you
this
merged
pr.
This
today,
from
yuri
to
run
the
radio's
tweets,
yeah.
B
I did, I mean, for rados that's what I've been pressing for, and now the idea is to be able to run at least 300 jobs. For some reason, I don't know why, I was not able to get the subset value correct to get anything lower than 300; the teuthology-suite command just hung for me.

Maybe there's a bug somewhere there, but we can at least get 300 jobs running every night, and that will give us a better picture of when regressions get introduced. But just for everybody's interest, for people who are not aware of how Sentry looks and how Sentry works, I'm pasting an example of some failures that I have been looking at, and you can use it as a reference. I'm pretty sure the newer version of Sentry is going to have much better features and trackability.
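For anyone unfamiliar with the Sentry side of this, reporting an exception from Python generally looks like the following generic sentry_sdk sketch; this is illustrative only, not teuthology's actual integration, and the DSN is a placeholder:

```python
# Generic Sentry reporting sketch (placeholder DSN, illustrative only).
import sentry_sdk

sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")

def run_job():
    raise RuntimeError("wait_for_healthy timed out")  # stand-in for a test failure

try:
    run_job()
except Exception as exc:
    # Each captured exception is grouped by Sentry into an issue with a
    # history of occurrences, which is the trackability discussed above.
    sentry_sdk.capture_exception(exc)
    raise
```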
A
The way we're capturing the data from teuthology today, we're kind of just categorizing everything by the traceback that teuthology gets from Python, not necessarily the backtrace from the core dump. We might want to add that to teuthology, so we'd be able to correlate it with telemetry as well.
C
We discussed at one stage modifying the generic timeouts, like the generic timeout in, say, the run task in teuthology, so that it actually mentions what script it was running rather than just saying "I'm timing out."

I know we talked about that, but I don't know whether it made it into the document. It's probably something that I could work on anyway.
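The shape of that change might look something like this minimal sketch; the helper name and the example command are made up for illustration, and this is not the actual run task:

```python
# Illustrative sketch: a generic timeout that names the command it was running
# instead of just reporting "timed out".
import subprocess

def run_with_timeout(cmd, timeout_s):
    try:
        return subprocess.run(cmd, check=True, timeout=timeout_s)
    except subprocess.TimeoutExpired as e:
        # Surface the script in the failure reason so the summary is useful
        # without digging through the full teuthology.log.
        raise RuntimeError(
            f"timed out after {timeout_s}s while running: {' '.join(cmd)}") from e

# e.g. run_with_timeout(['bash', 'test_script.sh'], 3 * 3600)
```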
C
With the thrasher, quite often there'll be a job running, the thrasher is involved, something fails, and the thrasher just seems to carry on indefinitely.

Because it's waiting for the cluster status to come back to HEALTH_OK, or waiting for various inconsistencies in the status to clear, it might be a good idea to look at establishing internal timeouts within the thrasher for those sorts of things, so that it doesn't just continue on indefinitely for 12 hours and then time out, and you don't get any logs.
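A minimal sketch of that idea: give the thrasher's wait loops their own deadline rather than relying on the job's global timeout. The check function and the numbers are stand-ins, not the actual thrasher code:

```python
# Illustrative internal timeout for a wait loop inside a thrasher-like task.
import time

class WaitTimeout(Exception):
    pass

def wait_for(check, timeout_s=1800, interval_s=10, what="cluster to become healthy"):
    deadline = time.monotonic() + timeout_s
    while not check():
        if time.monotonic() > deadline:
            # Fail fast with a descriptive reason so the job ends while
            # log collection can still happen, instead of running for hours.
            raise WaitTimeout(f"gave up after {timeout_s}s waiting for {what}")
        time.sleep(interval_s)
```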
C
Yeah, that's the thing: even just loading the teuthology.log file, it's massive.
B
Yeah, but I think the low-hanging fruit that Brad described is at least making the failure messages more obvious about where things are failing, versus just saying that something timed out. I made a small effort towards this earlier, just to improve the ceph task: previously, anywhere we used to fail, it would just say "failed to recover" or something, even if we failed to recover because there was some recovery still going on or some PGs were stuck.
B
The PR that I just pasted now outputs the information about the PGs that have not reached the desired state, so you know exactly which PGs to go look for, in the logs as well, if you want some OSD logs and such. These are small improvements that we can make which can make debugging way easier than it is right now.
C
Yeah, that's a good one. Maybe even just a global timeout on the thrasher.
A
Kind of like the 12-hour timeout there is today.
B
Yeah, and I don't think we should even wait for 12 hours. We've discussed this in the past: our longest-running tests are probably six hours, or six and a half hours, so there's no point waiting 12 hours.
E
That wouldn't be accounted the same way, so that's the correct fix there: you don't account for the waiting-for-machines element. Also, you can create a tighter timeout and audit the tests that really need a longer one, and specifically whitelist them, or specifically add an annotation to those tests that says how many hours they need.
A
Are there other kinds of errors that we see today that are kind of shadowing each other, where they have the same kind of error message?
E
Well, I mean, at a coarse level, any RADOS-level bug that looks like recovery didn't happen or peering didn't complete could have any number of causes. I don't know if there are that many analogs to that kind of problem in other components, though, so that may just be a category on its own.
E
Right, but I'm saying that the top-level effect you get will be "the thing didn't complete before the timeout", but you could get that ten times with different causes, and without further investigation you wouldn't know what the cause was. So I think that might be another category, in addition to recovery and peering not completing.
B
I know cephadm is a new area, but some of the cephadm failures are really hard to understand. When you just look at the failure reason it's cryptic; you have to go through the logs and figure out what the hell is going on. So that's another area. I guess that's going to come in later, but it's something to keep in mind.
F
Another sort of failure is that a daemon could kill itself, and another task, the thrasher test for example, could be expecting a healthy and active cluster and be waiting for that daemon forever. That happens in Crimson's thrasher test, because sometimes Crimson has killed itself without being noticed, even though the process no longer exists.
A
Yeah, Patrick had added some watchdog stuff to try to address that, but I think it wasn't a 100% complete fix; it doesn't catch all the cases where that can happen.
B
Another broad category, I would say, is that a lot of tests fail with wait-for-healthy or something similar. A lot of the time that happens because there is either a health warning or a health error that is there, possibly because we did not ignore it, or it's there for some other reason.

So maybe at that point, when we fail, we could just have the ceph health printed out before failing, so that we know if there is a health error or a health warning, and you don't have to go back through the logs to see whether there was a health error or a health warning. Just simple things to do.
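A minimal sketch of that suggestion, purely illustrative rather than the actual teuthology ceph task: when a wait-for-healthy check gives up, attach the cluster's health detail to the failure message so it doesn't have to be dug out of the logs:

```python
# Illustrative only: include `ceph health detail` in the failure reason.
import subprocess

def fail_with_health(reason):
    try:
        health = subprocess.check_output(
            ['ceph', 'health', 'detail'], text=True, timeout=60)
    except Exception as e:  # the cluster may be unreachable by this point
        health = f'(could not fetch cluster health: {e})'
    raise RuntimeError(f'{reason}; cluster health at time of failure:\n{health}')
```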
A
At that point we could even try to differentiate: not just say "we failed to recover", but say there was a warning, or PGs were stuck inactive, or a daemon crashed, or something else like that.
A
Another category I was thinking of: a lot of the test cases we run are test programs that have their own test cases within them. For example, the objectstore tool tests, or a bunch of other tests that use the gtest framework or the Python unittest framework, run a bunch of different subtests, and the exception that they report to Sentry, and that we list in the failure today, just says that the test command itself failed.

So if there's one particular test that's newly failing, we won't necessarily notice that it's a different instance or a new bug without looking through the logs a bit more extensively. So that's maybe something we could try to parse out from the output.
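One way this could be approached, sketched here against standard gtest output; the function and the example test name are made up for illustration:

```python
# Hedged sketch: pull individual failing sub-tests out of gtest-style output
# so the failure reason can name them instead of just the command that failed.
import re

# Require the Suite.Case form so the "[  FAILED  ] 1 test" summary line is skipped.
FAILED_RE = re.compile(r'^\[\s+FAILED\s+\]\s+([^,\s]+\.[^,\s]+)', re.MULTILINE)

def failed_gtest_cases(output):
    # Deduplicate while preserving order; gtest prints each failure twice
    # (once inline and once in the final summary).
    seen = []
    for name in FAILED_RE.findall(output):
        if name not in seen:
            seen.append(name)
    return seen

# e.g. failed_gtest_cases(log_text) -> ['ObjectStore/StoreTest.Synthetic/2']
```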
B
Yes, the quintessential example is the objectstore tests that fail: we've had instances where master is seeing one failure and a stable branch is seeing a different failure, and we tend to categorize them as the same failure, but they are actually not.
A
We could perhaps do something similar with the cephadm failures. A number of them end up failing maybe because cephadm couldn't even start running for some reason, and there perhaps we could parse some of the data out to say more precisely why it failed, like whether it failed to fetch the image, or the disk ran out of space or became read-only.
A
Yeah, I think that's a good one: maybe there are some common keywords that we could parse out of the output to get a better idea of what actually caused the command to fail like that.
C
I was going to say, maybe the other thing we could do is identify the really generic subsequent failures, such as the example there in the chat. That failure is almost always the result of a test failing and not cleaning up its directory after itself, so maybe we should annotate that message to say "this is probably the result of a previous failure."
C
Yeah, Deepika just said it was confusing for her, and I had that discussion in mind when I was talking about it. Deepika, yeah, it is confusing: the problem with debugging teuthology failures is that you need to backtrack through all the subsequent failures until you find the original failure, which can be difficult.
G
How do we achieve that? Okay, say this is a ceph test failure: what keyword do you grep for, like "failed test", that shows that okay, there might be some test case failing in stdout or somewhere?
C
I
don't
know
whether
it
can
be
ordered,
not
automated.
I
I
suppose
it
might
be
able
to.
But
the
first
thing
I
would
do
is
look
at
the
failure
reason
and
generally
the
failure
reason
if
it's
not
a
generic
error
in
the
failure
reason,
failure
reason
is
going
to
give
you
the
name
of
the
script
that
failed
or
the
test
that
failed.
A
This brings up another item, I think, at the bottom of the pad here, which is documentation. I think we could probably write up some good notes about how to approach analyzing a teuthology failure, and in the process of doing that, maybe come up with some more ideas about improving that process too.
F
Hey guys, I've been thinking about, probably, a vim plugin to view the teuthology log, because the output from the different sources, the different hosts, is interleaved. So sometimes I need to sort or group the output by host in my head, or use another editor to do so. Do you think it's worth it?
C
I'm
I'm
aware
of
a
log
passing
application
that
that
does
that
sort
of
thing-
and
I
can't
remember
the
name
of
it,
but
I
should
be
able
to
find
it
so
this
is
used
already
right.
Are
you
sort
of
partying.
C
With the vim plug-in, but some people use emacs or whatever... no? No, okay.
C
Yeah,
let
me
see
if
I
can
find
the
name
of
that.
I
I
know
a
guy
that
is
submitted
some
patches
for
it.
So
I'll
ask
him
what
the
name
of
it
is
and
come
back
to
you.
A
Like
jeff,
who,
you
can
also
think
about
ways
to
improve
the
technology
log
itself,
instead
of
needing
more
processing
tools,.
F
We could have it show them side by side; this format of output could also be applied to a log file. For example, on the left side you could have the output from host A, and on the other side the output from another host.
C
Yeah, I think the tool I'm talking about will allow you to filter the log so that you can get the output from host A, and then filter it again so you can get the output from host B, and so on.
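Something in that spirit could be sketched as below; the hostname pattern is just an assumption for illustration, since the exact line layout in teuthology.log varies:

```python
# Hedged sketch: split an interleaved teuthology.log into per-host streams by
# matching a hostname in each line (the smithi/mira pattern is an assumption).
import re
from collections import defaultdict

HOST_RE = re.compile(r'\b((?:smithi|mira)\d+)\b')

def split_by_host(path):
    streams = defaultdict(list)
    with open(path) as f:
        for line in f:
            m = HOST_RE.search(line)
            streams[m.group(1) if m else 'unattributed'].append(line)
    return streams

# e.g. for host, lines in split_by_host('teuthology.log').items(): ...
```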
A
Yeah, like we were just saying, that makes a lot of sense, kind of treating the different aspects of the log as an event stream from different sources.
A
I think we're talking about kind of viewing the OSD logs from the same time period in different tabs, side by side.
A
In some cases, but it's helpful to see the sequence of events strictly from a top-to-bottom viewpoint, since you have more vertical space to understand what's happening on one particular OSD.
A
Sure, there's more than zero things we could improve there; at the least, there are tons of abbreviations that don't make any sense unless you already know them, yeah.
E
I'm just not sure it's the kind of topic that lends itself to one top-level directive. It's more like, when you debug something, you should try to identify any shortcomings of the logs that made it difficult for you and fix those as part of your patch; it doesn't work as well as a single line item.
A
It's just something to keep in mind: the next time you're debugging something, try to consider what information would have made it easier.
B
Debugging set to a particular level that will let you know what the hell is going on: we should always make sure that when we fix such failures, we go and add that as a default value to those suites, because a lot of the time failures are not reproducible, so in those cases having that default debug level helps. And vice versa: if there is something that doesn't need too much logging and we can get by with debug levels on just one particular subsystem, we should try to clean those up as well.
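For concreteness, the kind of per-suite default being described usually ends up as a yaml override fragment; here it is expressed as the equivalent Python structure, with the subsystems and levels chosen purely as examples:

```python
# Example (illustrative values) of the default debug levels a suite override
# might carry, dumped in yaml form as it would appear in a suite fragment.
import yaml

overrides = {
    'overrides': {
        'ceph': {
            'conf': {
                'osd': {'debug osd': 20, 'debug ms': 1},
            },
        },
    },
}
print(yaml.safe_dump(overrides, default_flow_style=False))
```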
A
Maybe that's something to think about and consider. Let me give you guys a link in the chat to a bunch of papers regarding root cause analysis.
G
I just googled it, so maybe if something is relevant we can discuss it in that new meeting, like the paper reading we discussed doing in the performance meeting later.
A
Yeah,
that
sounds
like
a
good
idea.
Okay,
add
these
to
the,
and
this
allows
this
link
to
the
performance
pad.
A
I think we've covered a lot here already; there are a lot of interesting ideas we can add to the pad. Any other topics that folks wanted to talk about, or any other ideas around this?
B
If nobody has anything, I have a question, maybe a general question, maybe for Kefu or even Sam: how is debugging Crimson teuthology failures different from debugging regular classic OSD failures? I know you guys have been debugging some recovery failures and stuff.
F
We do have a signal handler in the classic OSD which handles segfaults and other critical signals, and it prints out the backtrace, like the addr2line output, which is very helpful for diagnosing and understanding the root cause. It also dumps a unique ID, for example, to a metadata file; that's where we collect the crash information using the crash module.
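For daemons that lack that in-process symbolication, the manual step being described is roughly the following; the binary path and address are placeholders:

```python
# Rough sketch of resolving raw crash addresses with addr2line by hand.
import subprocess

def symbolize(binary, addresses):
    out = subprocess.check_output(
        ['addr2line', '-C', '-f', '-p', '-e', binary, *addresses], text=True)
    return out.splitlines()

# e.g. symbolize('/usr/bin/ceph-osd', ['0x55d1c2f3a1b0'])
```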
G
Actually, I was working on that, but I halted it because I was focusing more on Jaeger recently. Maybe I can do that this week or next, yeah, Kefu, maybe with some of your help.
F
Take
take
a
look
at
the
signal
handle
to
see
if
we
can
edit
support
in
crimson
or
even
to
sis
sister,
because
because
I
think
they
are
also
suffering
from
it
in
their
home
page
of
system,
they
are
suggesting
us
to
use
this
address
to
line
wrapper
script.
So
I
think
that
also
it's
also
their
pain
point.
G
Yeah, I was thinking of that too, just wondering how. So maybe I'll get the tracing in first, and maybe a second step would be to use it in teuthology.