From YouTube: Teuthology Training: Analyzing Test Results
Description
* Notes: https://pad.ceph.com/p/analyzing_teuthology_failures/timeslider#2039
* Ceph Developer Guide: https://docs.ceph.com/en/latest/dev/developer_guide/testing_integration_tests/tests-integration-testing-teuthology-intro/
* Ceph Teuthology Documentation: https://docs.ceph.com/projects/teuthology/en/latest/
* Ceph Teuthology project wiki page: https://tracker.ceph.com/projects/ceph/wiki/Teuthology
A: Hello, hello everyone, and thank you for making it to the session wherever you are. I know getting a common time slot that works for everybody is difficult, but I think this is the best we could do.
A: This is the second session that we are doing in terms of teuthology training, and this one will be focused on analyzing teuthology failures: particularly, what do we do after we schedule a run, or after we start running some tests. I know there is already a learning curve involved in getting set up with teuthology and having to understand what a suite means or a workunit means, or getting Sepia access and all that stuff.
A: So, as I shared in the Etherpad and the chat, the first few points that I have are, I feel, basic guidelines, basic things that everybody needs to remember when they start reviewing teuthology runs.
A: I feel it's a responsibility that every developer, or anybody trying to review teuthology runs, has, because what one person does impacts what the state of master, or whichever branch they are running tests on, is going to be the next day. So you need to be a little more careful, and I think if you remember these two things you're going to be good.
A: Cool, so let me just quickly go over this, because we'll probably be going over all of them in more detail later on. The first point is: every failure needs to be reviewed. We cannot take anything for granted, as in "we think something is something, and so we don't look at it." So, every failure: when I say failure, when you run a teuthology run there are n jobs, and m out of n have failed.
A: Why they have failed is something we need to analyze, and to do that we have ways that we will be going over later. The second thing I want to mention is: when you do a test run, a lot of times you'll see there are failed jobs and there are dead jobs. There is a minor difference between them, but both of them fall under the bucket of failed runs: something has failed, something is not right.
A: The difference, if you have to go into the details of failed versus dead: a dead job is something that does not terminate properly within the time frame, so it just keeps running until teuthology actually hits its timeout, which is currently set to 12 hours. So, if you see a job that finished close to 12 hours, it means it was a dead job. We can quickly go over an example to show you what a dead job should look like.
A: When I was talking about dead jobs, I only talked about this. So when I say it's a dead job, you see the run time here is close to 12 hours. This is what implies that this job did not terminate in time, and so it kept running until the teuthology handlers killed it. So if you see a job like this, you do need to look at what was going on and why it went dead.
A: So yeah, that's why I have it in bold letters, because a lot of times it's not clear to folks whether dead jobs need to be reviewed at all, because there's some impression that they're only caused by infrastructure issues or some other randomness, but that's not true.
A: Moving on, then, there's help along the way. When I say there's help along the way: you're not just thrown in to look at all kinds of logs and all kinds of daemon logs and figure out what's going on, or grep through thousands of things. There are tools that are already present that can make this process easier, and there are three things that I've listed here.
A
First,
one
is
sentry,
I'm
not
sure
how
many
of
you
are
aware
of
century,
but
centuries
is
a
tool
that
helps
us
gather
similar
kind
of
failures
and
show
you
a
history
of
when,
when
an
event,
when
I
say
a
failure,
a
failure
is
marked
as
an
event
in
century
and
when
an
event
started
occurring
in
pathology
and
it's
it
has.
You
know
things
like
you
know
which
branch
started
failing
or
which
distro
started
failing
when
it
started
failing
all
that
kind
of
stuff.
A
If
you
want
to
quickly
look
at
what
century
looks
like
I'm
just
going
to
pick
one
sample,
because
every
I
think,
except
for
pathology,
sorry
dead
jobs,
you
will
have
a
century
event
like
this
when
you
click.
What
did
I
do?
I
just
clicked
on
this
and
I
went
to
that
particular
job.
A
So,
as
you
can
see,
I
am
looking
at
one
particular
job
now,
so
these
are
the
stats
of
that
particular
job.
This
told
me
what
job
was
run,
what
what
test
was
actually
run,
and
here
you
see
there
is
the
details
of
the
log
and
there
is
century.
So
when
we
click
on
sentry,
what
do
we
see.
A
You
see
another
dashboard
opened
up
now
going
back
here.
We
were
interested
in
something
like
a
command
failed,
a
error
where
it
was
trying
to
look
for
something
it
was
trying
to
look
for
some
versions
and
making
sure
that
the
number
of
versions
was
one
in
that
total
entry.
That's
not
important,
but
what
is
important
here
is.
You
can
see
that
it
is
tracked
here
as
the
same
branch
name.
You'll
see
that
same
error
is
shown
here
as
the
exception
that
was
caught
by
that
was
used
by
century
to
group.
A
This
particular
failure
and
you
have
all
kinds
of
details
about
what
the
config
was
and
where
the
logs
are
and
all
kinds
of
things.
Now
I
think
what
is
the
most
important
thing
here.
Is
you
see
here
this
right
part
here
last
30
days?
It
tells
you
there
are
two
occurrences
first
seen
last
scene.
This
is
what
helps
you
figure
out
if
a
particular
failure
that
you're
seeing
is
new
or
not
especially,
for
unique
failures.
A
This
is
very
helpful,
which
I
will
talk
about
what
a
unique
failure
is
in
a
bit,
but
this
events,
part
here,
is
something
that
you
can
go
look
at
and
it
will
list
all
those
similar
failures
that
have
been
seen
in
pathology
over
time.
So,
for
example,
you
are
reviewing
a
run
and
you
see
a
failure.
You
click
on
the
century
event
and
you
just
see
one
event
like
example,
this
this
should
ring
a
bell
in
your
mind.
Why?
Because
this
means
that
this
failure
has
never
been
seen.
A
There
could
be
two
reasons.
One
is
that
this
never
appeared
in
somebody
else's
pathology
run
because
of
whatever
reason
there
could.
The
test
was
never
exercised.
It
was
a
race
condition,
blah
blah
blah.
You
can
think
of
whatever
reasons
or
the
the
most
important
thing
that
the
failure
is
related
to
your
branch
stuff
that
you're
testing
is
related.
So
this
is
a
clear
indicator
litmus
test.
I
would
say
where
you
can.
A
You
know
that
you
need
to
look
more
and
not
just
give
up
by
looking
at
the
century
event
a
lot
of
times
century
events
will
list
a
bunch
of
things.
Then
you
get
more
confidence
that
okay,
this
may
or
may
not
have
been
the
same
exact
failure,
but
you
can
go
and
look
at
the
other
failures
to
verify
whether
somebody
else
also
saw
exactly
the
same
thing
or
not.
A
So
that's
the
whole
idea
of
sentry
being
a
help
in
this
process.
Now
there
are
a
lot
of
other
things
that
you
can
do
with
century.
I
think
that
will
take
a
separate
session
on
its
own,
but
I
just
want
to
touch
upon
this
in
terms
of
what
I
was
just
talking
about.
A: Yeah, there's nothing special. If this is the first time you click on one of these Sentry links, you'll be taken to a page that lets you log in; there's a button to log in via GitHub.
A: All right, then we're going to be talking about scrape.log. This is the PR that introduced the concept of scrape.log. I'm not going to go into the details; you can look at it if you want. The idea is that there is a script that runs at the end of every run that you do, and it tries to group similar failures, to make analyzing those failures easier.
A: So here, what I did was... okay, I'm going to be covering this later, but let me just go over it. I have logged into the teuthology host, and I was initially in /ceph/teuthology-archive, where all the teuthology logs live. Next, I just used the run name of that particular teuthology run and cd'd into that directory. Then, as you see here, I can see all the 361 total jobs, and each of them has details about that particular job.
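For reference, the workflow just described looks roughly like the following. This is a minimal sketch: the host name is the usual Sepia teuthology node, and the run name is a made-up placeholder, so substitute your own.

```
# SSH to the teuthology host in the Sepia lab (assumes Sepia access is set up).
ssh teuthology.front.sepia.ceph.com

# All run archives live under the teuthology archive directory.
cd /ceph/teuthology-archive

# cd into a specific run by its run name (example placeholder below).
cd yuriw-2021-08-22_15:20:08-rados-wip-yuri-testing-distro-basic-smithi

# Each numbered directory is one job from that run; scrape.log sits alongside them.
ls
```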
A: Now, looking at scrape.log: you see here there is a file called scrape.log. It basically does a post-processing of all the failures and tries to give you a summary of how many distinct failures there were and how many dead jobs there were. As we saw, there was one dead job, so it says there's one dead job. The grouping mechanism works in a way that you'll see.
A
So
I'm
sure
it
must
have
been
covered
in
the
last
session
that
there
are
separate
facets
that
are
used
when
a
pathology
run
and
schedule,
so
this
script
also
tries
to
find
an
intersection
of
which
are
the
common
facets
that
are
present
in
a
common
failure.
So
that
is
what
is
the
idea
behind
this
so
that
you
can
figure
out
just
by
looking
at
those
if
there
is
something
that
is
directly
causing
this
particular
failure?
A
For
example,
you
know
if
there
was
some,
let's
just
use
a
simple
example,
so
this
d
balancer
on
this
means
that
the
you
know
the
balancer
module
was
on
for
this
particular
test
right
and
if
this
particular
failure
that
we
were
seeing
was
directly
related
to
the
balancer,
then
it
clearly
tells
us
that
okay,
the
balancer,
was
on.
Therefore,
this
failure
was
there.
A
If
there
are
other
examples,
we
can
see
that
you
know
there's
an
object,
store,
blue
store,
something,
and
if
we
let
us
say,
had
a
failure
that
was
not
related
to
bluester
directly.
Then
we
could,
you
know
narrow
down
or
eliminate
things
that
are
not
directly
related
to
a
particular
failure.
Just
by
looking
at
these
facets,
as
if
you
keep
going
down
in
this,
it
tries
to,
as
you
can
see,
there's
one
failure
which
so
this
means
that
it
just
found
one
job
with
this
unique
failure.
A
Here
you
can
see
there
are
two
jobs
that
have
tried
to
group
together
and
it
has
listed
down
all
the
intersection
of
those
facets,
as,
as
you
become
more
familiar
with
pathology
facets
and
what
tests
are
being
run.
This
kind
of
helps
in
analyzing
it,
and
particularly
for
me,
things
like
this
stand
out
a
lot
like
if
there
are
four
jobs
that
are
failing
in
this
same
step:
tool,
dot,
ceph,
tool,
test,
dot,
sh,
clearly,
there's
something
wrong
going
on
right.
A
If
there
is
one
random
failure
happening
somewhere,
this,
in
my
mind,
may
be
something
you
know
sporadic
or
something
which
is
not
causing
whole
bunch
of
failures.
This
is
this
clearly
tells
me
that
something
is
causing
a
whole
bunch
of
failures,
so
things
like
this
are
are
like
small
hints
that
you
can
take
from
the
scrape
dot
log
before
you
delve
further
into
analysis
of
of
logs.
A: So let's go back to the pad, and finally, we've...

A: It also makes it more obvious that there is something like scrape.log. Right now I have to do a session to explain to people that there is a scrape.log, but anyway.
A: I see it as... I'm not saying that one can replace the other, but this can just be one step closer. If somebody, let us say, does not have Sepia access for whatever reason, they can quickly look at scrape.log and say: okay, is there something obvious that I need to be worried about? Somebody looking at something on their phone, maybe, even there.
A
All
right,
okay,
any
more
questions
or
any
more
discussion
around.
A
Scrape.Log,
okay,
let's
move
on,
and
the
final
thing
that
I
have
added
here
is
a
red
mind
tracker,
so
this
is
going
to
indirectly
help
you
analyze
failures.
When
I
say
indirectly,
it's
kind
of
like
you
know,
when
scrape
wasn't
there
or
like
this
is
a
poor
man's
script.
A
If
you
see
a
particular
failure
and
you
want
to
make
sure
that
it
is
an
existing
failure
and
if
scrape
is
not
useful,
you
can
go
and
search
for
some
keywords
that
you're
seeing
in
a
particular
failure
and
see
if
there
is
an
existing
ticket
open
ticket
and
an
open
relevant
ticket,
I
mean,
like
you
know,
same
failure
could
have
been
caused
five
years
ago
and
happening
now
again,
that's
not
relevant
if
something
has
happened
a
week
ago
or
a
month
ago
and
you're
seeing
it
that
makes
it
again
clear
that
the
failure
you're
seeing
is
not
related
to
your
test
run
and
at
that
point,
what
you
do
is
you
update
the
redmi
tracker
with
the
test
run
that
you
store
that
particular
failure
in
what
this
helps
with?
A
Is
the
person
who's
going
to
be
fixing
the
bug
or
analyzing?
The
bug
will
have
more
data
points
to
go.
Look
at
that's
kind
of
the
whole
idea,
and
you
know
I
think,
everybody's
aware
of
red
mine
tracker.
There
are
separate
components
and
components
component
wise.
You
can
do
simpler,
search
as
well
so
so
I
think
those
are
the
three
things
that
I
think
are
tools
that
you
can
use.
After
that
I
want
to
just
go
over
two
points
before
we
go
into
more
examples.
A: Every failure needs to be tracked. When I say every failure needs to be tracked, it's for the same purpose that I just described: if you see a failure, you analyzed it, and you determined that it was not related to your test run, you don't want the next person to be doing the same thing. So you want a tracker where you write up the analysis and you say: okay, these XYZ events happened and therefore this failure happened. Then the next person who sees this in their run can just go look at it and say: oh, I'm seeing the same thing, and my branch has nothing to do with this failure. That saves a lot of time. So it's more like a symbiotic process; we have to try to help each other by creating these tracker tickets. Also, it gives you a real view of what failures are there in a particular branch, because if you don't track things on time and you try to track them after a month, it becomes difficult to even figure things out for some failures.
A: Okay, for some failures you can figure out where and when they started happening and things like that, or you don't even need to and you can just go fix them. But for some things, especially with Ceph and RADOS, I've seen it's very important to understand where a regression was introduced, and tracker issues help with that, especially when Sentry is not being very helpful.
A
Next
thing
is
differentiate
between
noise
and
real
failures.
When
I
went
over
all
of
these
things,
the
one
thing
that
I
did
not
cover
is
a
lot
of
times.
There
are
things
like
you
know:
some
packages
did
not
install
properly,
so
there
was
a
failed
job.
There
was
some
ssh
connection
error
or
something
like
that.
A: The test then didn't run. These, in my mind, are kind of noise, because the test actually didn't run, even though it was supposed to, but you still see those jobs as failed. You should be able to clearly identify such things just by the failure reason you see, and the best thing to do is to rerun such jobs. There is a --rerun option in teuthology-suite that goes and reruns failed and dead jobs.
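For reference, a rerun of just the failed and dead jobs from an earlier run looks roughly like this. It is a minimal sketch: the run name is a placeholder, and flag details such as --rerun-statuses and its value format should be checked against teuthology-suite --help for your version.

```
# Reschedule only the jobs that failed or went dead in a previous run.
teuthology-suite \
  --rerun my-user-2021-08-22_15:20:08-rados-wip-my-branch-distro-basic-smithi \
  --rerun-statuses fail,dead \
  --priority 100
```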
A: So you can do that, to make sure that the test that was supposed to run actually runs. But those are noise, as opposed to real failures; everything that we were discussing earlier was real failures. So then, the next point here is about failure injection.
A: A lot of tests, especially thrashing tests, inject failures. So when you see that there is an OSD that we are not able to communicate with... okay, sorry, I'm being biased towards OSDs and RADOS, but that's the example that comes to my mind. So if you see that there is an OSD that suddenly we cannot communicate with, and that just shows up as a socket closed or something like that in the teuthology log, there could be several reasons for it.
A
There
could
actually
have
been
a
crash
on
the
usd
because
of
which
it
you
know
it
died,
and
that
can
be
confirmed
by
looking
at
a
corresponding
crash
in
the
pathology
law,
and
sometimes
it
could
also
be
because
we
are
trying
to
inject
a
whole
lot
of
messenger
failures:
messenger
socket
failures.
A
In
those
conditions
it
becomes
a
little
harder
to
just
you
know,
diagnose
from
the
topology
law
as
to
why
that
happened.
Like
you
know,
some
in
some
places
you'll
see
that
we
are
waiting
for
even
for
monitors.
We
are
waiting
for
monitors
to
form
quorum
and
due
to
failure
injection.
A: And due to failure injection, there is one monitor that just gets repeatedly declared down, and therefore the test fails because the monitors did not form a quorum. So this, again, is kind of noise, but to determine that it is noise you need to go an extra step and check whether the test was actually injecting failures or not. That's one simple thing to verify. If messenger failures were being injected, then at least in some places this is the keyword you'll see, and there's a directory that we symlink everywhere.
A: That directory has different kinds of configs that we apply to tests to inject messenger failures. If you see such a thing in the test run or in its description, you know that messenger failures were being injected, and then you go do the next step and verify where that happened and whether that was the cause or not.
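One quick way to check, sketched below, is to grep the job's saved YAML for the messenger failure injection setting. The option is usually spelled "ms inject socket failures" in the rados suite's msgr-failures fragments; the file names here are the ones commonly found in a job's archive directory, but treat the exact paths as examples.

```
# From inside a job directory in the teuthology archive:
# check whether this job's config injects messenger (socket) failures.
grep -i "inject socket failures" orig.config.yaml config.yaml 2>/dev/null

# A hit such as "ms inject socket failures: 2500" means the msgr-failures
# facet was active, so "socket closed" style errors may just be injected noise.
```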
A: Now, the next part of it is sometimes complex; I'm not going to go into details, but it is something to look out for, and if you're not sure about things, it's good to discuss with other folks in the team and look at stuff together to verify. But this is also sometimes a reason for noise. I think I already went over the package install and connection failures, and sometimes failing to lock machines. The good thing is that these days we don't have such problems, because the new dispatcher is working very well, or at least it has solved that problem very well. So that's what the whole of point number five is about.
A
All
right,
I'm
just
gonna
move
on.
We
can
do
questions
at
the
end.
Maybe
so
here
what
I
have
is
kind
of
pathology
runs
so
I've
just
you
know
there
may
be
multiple
others
as
well,
but
I've
just
tried
to
put
it
into
three
broad
categories
based
on
which
your
analysis
may
vary
a
little
bit,
and
you
know
some
for
in
some
cases
you
have
to
be
more
careful
versus.
A
In
some
cases
you
have
to
look
for
outstanding
or
open
issues
so
like,
for
example,
the
first
one
here
is
a
whipped
branch
right,
whipped
branches
are
what
you
know.
We
developers
used
for
validation
of
prs
on
on
master
or
it
could
be
any
other
table
branch
as
well.
You
know
for
backwards.
A
So
these
are
usually
runs
where
you
have
you
like
at
least
example
in
rails
is
like
around
350
jobs
or
plus,
but
these
are
usually
larger
runs,
so
you
have
more
failures
to
analyze,
but
these
are
the
most
common
types
of
runs
that
are
done
in
pathology
by
by
developers,
or
you
know.
Anybody,
and
also
these
are
the
ones
you
need
to
be
more
careful
about,
because
this
is
where
new
stuff
gets
merged
into
the
code
base.
So
if
something
is
problematic
and
we
merge
it,
there,
then
on,
we
have
introduced
regression.
A: That's why I have mentioned here that reruns are encouraged. A lot of times there are lab issues: the lab is running out of space, or there are network issues, things like that. In those cases, if you have a run where, say, 50 jobs have failed, out of which 40 were because of noise, I personally would not encourage anybody to merge a particular PR in that state.
A
I
know
it's
cumbersome,
you
need
to
wait,
you
need
to
rerun,
but
I
think
it's
better
to
be
safe
than
sorry
in
these
cases,
because
that
may
affect
the
state
of
your
baseline
runs
or
baseline,
how
your
master,
baselines
or
your
you
know,
octopus
knotless
baselines
are
doing
after
that.
So
that's
why,
when
I
say
rerun
the
it's
the
same
rerun
option:
that's
there
in
topology
suite
that
is
encouraged
to
be
used.
A
Next,
here
is
what
I
call
a
baseline
run
or,
like
you
know,
you
take
the
tip
of
a
particular
branch
and
you
run
a
suite
against
it.
These
are,
you
know,
usually
done
in
our
nightly
tests,
see
what
the
state
of
branches
or
how
how
our
test
suites
are
doing.
It
is
also
used
for
qe
validation
for
our
point
releases.
A
So
as
you,
you
know,
if
you're
aware
of
the
point
release
process,
you'll
see
yuri
finalizes
one
show
and
that
we
decide
is
going
to
become
the
tip
of
the
point,
release
that
we
are
going
to
be
doing
and
that
tip
gets
frozen
and
he
runs
a
whole
bunch
of
sweets
on
that
tip.
So
in
in
those
kind
of
runs,
what
we
are
trying
to
see
is
things
that
have
not
been
caught
earlier.
A
In
our
you
know,
whip
branches
because
of
whatever
reason
there
could
be,
you
know
some
bugs
that
only
pop
up
due
to
a
race
condition,
so
it
may
not
have
popped
up
the
first
time
but
may
pop
up
this
time
or
you
know
other
things
that
are
associated
with
probabilistically,
injecting
failures.
So
if
you
injected
a
failure
with
some
probability,
you
did
not.
You
know
there
is
all
our
thrashing
tests
have
probability
associated
with.
A
You
know
how
many
osds
are
being
brought
down
or
when
they
are
being
brought
in
so
timing
issues
again
it
has
got
to
do
with
race,
but
you
know
some
things
that
may
not
happen
every
time
or
not
very
repeatably.
Failing
can
pop
up
in
these
runs,
so
these,
I
would
say,
are
generally
used
for
baselines
and
they
help
discover
bugs
that
have
not
been
discovered
in
our
web
testing.
A
That
does
not
mean
that
you
know,
if
you
see
a
failure
in
this
run,
you
don't
create
a
tracker.
You
do
create
a
track
tracker
and
try
to
figure
out
where
it
you
know,
merge
where
the
problematic
pr
merged
again.
This
is
going
to
be
more
difficult
because
you're
kind
of
retrospecting
as
to
what
could
be
the
root
cause
of
a
particular
failure,
but
in
general
sentry
and
tracker,
and
everything
are
going
gonna
help
with
this
and
finally,
third,
one
which
I
call
developer
centric
is
these
are
usually
smaller
patches.
A
So,
like
the
example,
I
didn't
go
over
this
example,
but
I
mean
it's
just
like
you
know
the
previous
one,
which
I
was
talking
about
baselines.
So
this
is
a
run
that
has
been
done
by
me.
A: A RADOS run on Pacific, on the 22nd, and this is what the state is, as you can see for yourself; a lot of the nightly jobs that are run look similar. Going over this example: this one is what I would call a developer-centric run, because a lot of times developers want to reproduce a particular bug, or verify whether a fix for a particular test is working. In that case you would not run the entire teuthology suite.
A: At that point, you will try to run one job, or a subset of tests, to see how your fix is doing or how reproducible a particular bug is. I added this example where I was trying to reproduce a particular bug; as you can see, this is where the entire job description is, and I tried to use it and run it on Pacific. My idea was: I saw a particular failure on master, let me see whether I can reproduce it on Pacific or not. And as you can see, I did reproduce it once out of 10 times. This was the same run, using the -N option, just trying to reproduce a bug. So these ones are more developer-centric; you won't see them run as often as baselines, but they are useful when you are trying to debug something.
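A developer-centric schedule of that sort looks roughly like the following. This is a minimal sketch: the branch name, filter string, and machine type are placeholders, and the exact flag spellings (--filter, --num, and so on) should be checked against teuthology-suite --help for your version.

```
# Schedule only the jobs whose description matches the filter,
# and queue each matching job 10 times to check how reproducible a bug is.
teuthology-suite \
  --machine-type smithi \
  --suite rados \
  --ceph wip-my-fix-branch \
  --filter "thrash-erasure-code" \
  --num 10 \
  --priority 100 \
  --email me@example.com
```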
A: I think those are the points I have here, yeah. The other thing to mention is that you can always use the suite repo and suite branch options. Basically, you can run your test changes against existing builds, so that you can make some tweaks to the tests, add extra debugging, or change some config, and then run those tests without having to build a new branch.
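In sketch form, pointing the scheduler at your own qa changes while reusing an existing build might look like this. The --suite-repo and --suite-branch flags are the usual ones for this, but verify them against your teuthology version; the repo URL and branch names below are placeholders.

```
# Run the qa/ suite definitions from my fork and branch,
# but against packages already built for an existing ceph branch.
teuthology-suite \
  --machine-type smithi \
  --suite rados \
  --ceph master \
  --suite-repo https://github.com/myuser/ceph.git \
  --suite-branch wip-my-qa-tweaks \
  --limit 5
```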
D: A question on this. For example, you were talking about that wip branch, right, and suppose a developer sees some failures there. Would it be a good idea to compare that against a baseline run, for example?
A: Yeah. So the easiest way to do this is to go to Pulpito. Say I'm trying to look for RADOS: I just went to Pulpito and I went to the rados suite, so now I have all the RADOS runs that have been done. Now, let us say you're interested; there are two ways to go about this. One way is to look for just the nightly runs. When I say nightly runs: we have these automated test runs that are done against master, Pacific, all the stable branches.
A
Say,
pacific
you'll
see
any
user
when
you
see
a
user
typology.
These
means
these
are
all
pathology
runs,
so
you
can
go
and
look
for
the
latest.
One
here
like
the
latest
here
I
can
see,
is
25th
and
you
can
compare
your
run
against
this
one
or,
if,
for
whatever
reason
you
don't
find
this
latest
run
or
it
didn't
run
for
whatever
reason
you
can
also
go
and
pick.
Let's
let
us
say
you
want
a
run
on
master.
You
can
see
that
sage
was
doing
something
25th
again
yesterday.
A
This
is
going
to
be
more
less
reliable
because
he
may
be
testing
some
patch
which
may
have
issues.
So
a
lot
of
those
failures
here
may
be
related
to
that,
but
this
can
give
you
a
rough
idea
of
what
you
are
seeing
is
that
you
know
if
it's
related
or
not
so,
to
compare
if
you're,
seeing,
let
us
say
an
object,
store
tool.
Failure
in
your
test
run
and
you
want
to
see
if
your
batch
is
the
only
one
that
saw
it.
You
first
go
to
century.
You
don't
find
it.
A
You
second
go
to
tracker,
you
don't
find
it!
Third,
you
go
to.
One
of
these
runs
latest
runs
that
have
been
done
and
try
to
look
for.
You
know
the
failures
and
yeah.
If
you
see
the
failures
you
expanded
and
if
you
let
us
see,
some
object,
store
rule
failure
exactly
the
same
as
what
you're
seeing
you
go
and
look
at
the
pathology
log
and
compare
it.
You
see
the
same
thing,
then
you
you
get
an
idea
of
whether
or
not
your
your
your
failure
is
related
to
your
branch.
A
All
right
any
more
questions
on
this.
A
If
not,
then
I'm
just
going
to
go
to
the
steps
involved
thing.
Now
again,
when
I
have
listed
out
some
steps
that
I
tend
to
use,
this
may
vary
for
every
individual,
but
the
high
level
idea
remains
the
same.
So
what
I've
added
here
is
an
example
of
a
pathology
run.
I
already
went
over
what
the
what
when
I
say
run
there
is
a
run
name
associated
with
it
here.
This
is
the
example
step
one.
As
I
said,
palpito
combined
with
sentry
is
your
step.
One
of
analysis.
A
We
went
over
what
palpito
looks
like
what
you
know
what
century
does
so
even
failure.
What
am
I
doing?
I'm
just
going,
I'm
just
clicking
on
this.
I'm
seeing
okay
century
event
is
present.
Now
I
go
and
look
at
the
century
event.
If
I
find
something
useful,
I'm
going
to
be
like.
Oh
this.
This
is
the
test
tool,
dot,
sh
failure.
A
Okay,
there
are
a
whole
bunch
of
failures.
This
should
make
me
like.
Okay,
this
is
I'm
not
introducing
you
know
it
gives
me
us
a
you
know
some
level
of
confidence
that
okay,
there
is
some
existing
failure
again.
This
is
not
the
end
of
the
story.
You
need
to
go
and
verify
for
yourself
what
these
exact
failures
were,
but
this
is
just
a
guideline
combined
yeah,
so
step.
One
of
analysis
is
this:
then
what
we
do
is
analyze
each
failed
dead
job.
A
There
is
some
good
documentation
that
deepika
added
to
the
existing
documentation
fairly.
Recently,
that
goes
over
details
of
what
exactly
triaging
cause
of
failure
should
look
like.
You
can
use
this
as
a
reference
future.
I'm
going
to
be
going
over
all
this,
I'm
not
going
to
spend
some
time
on
this
documentation
now.
A
So
when
I
say,
if
analyzing
failed
and
dead
jobs,
things
that
are
involved
here,
you
need
sepia
access
to
ssh
to
pathology.
That
is,
I
think,
very
important.
Everything
cannot
be
done
using
pulpit.
It's
just
a
good
step
one,
but
that's
not
where
things
end.
A
So
this
is
this
kind
of
point
two
which
I
have
touched
upon
earlier
as
well.
Now
I
think,
dig
deeper
is
the
part
that
I
want
to
focus
on
in
the
last
15
minutes
or
so
so,
when
you,
when
you
looked
at
step
one,
we
looked
at
palpito
sentry
things
are
not
clear
and
then
you
go
to
the
script.
Scrape.Log
you've
got
a
better
idea
of
things,
but
you're
still
not
sure
about
things
right.
You
saw
that.
Okay,
there
was
one
failure
that
happened
four
times,
but
what
is
that
failure?
A
What
does
the
exact
failure
look
like
for
that?
You
have
to
look
at
pathology,
dot
log
for
each
job.
I
mean
yeah
when
the
failure
reason
is
not
clear.
When
I
say
failure.
Reason
is
clear:
it
is
like
okay,
ssh
connection
lost,
so
we
know
that
that
was
you
don't
have
to
go
and
look
at
you
know
what
crash
happened
there,
particularly
or
like
something
did
not
install
something.
A
Those
messages
are
generally
clear,
so
you
don't
need
to
waste
more
time,
but
when
you
know
command
failed
or
like
there
is
a
crash
osd
something
crushed,
you
have
to
go
and
dig
deeper,
and
so
that's
why
I've
written
that,
when
the
failure
is
not
clear
from
step
one
and
two
looking
at
pathology
log
in
my
opinion,
is,
you
know
mostly
recommended
I
I
do
it
all
the
time
for
every
every
failure
that
I
see,
even
even
like
you
know,
even
though
I
can
see
things
and
you
know
judge
sometimes,
but
I
always
want
to
make
sure
I'm
not
ignoring
anything.
A: So I have an example here that I want to touch upon, where I'm going to show where steps one and two are not enough. This is a job which, you can see, was a dead job, because it ran for 12 hours. It was doing a thrashing test, and the failure reason you see is "hit max job timeout", which also indicates that it was a dead job. Now, as you can see, there is no Sentry event associated with it, so we cannot use Sentry.
A: So I have looked at this log. I know it is huge, so I'm just going to be using less; I usually tend to use vim. One of the first things I'm going to do when looking at this log is search for "Traceback".
A
At
this
point
here
to
identify
the
cause
of
a
failure,
you
need
to
use
some
of
these
keywords
and,
in
my
mind,
traceback
is
the
first
thing
that
will
give
you
at
least
50
percent
of
the
failure.
Reasons
can
be
reached
by
just
searching
for
traceback,
but
this
this
one.
I
picked
this
example
because
this
is
a
classic
example
where
you
need
to
do
everything
to
figure
out
what
is
wrong.
So
what
I
see
here
when
I
search
for
trace
back
it
just
tells
me
os
error
socket
is
closed.
Now.
A: So far we haven't found a symptom of the failure; if there had been some "FAILED" assertion, we would have found it. Now I'm going to search for "Caught".
A: And that finds something that has crashed with a bad allocation problem, and this is the entire backtrace you see. So this is a perfect example of a job gone dead because of an OSD crashing, with the teuthology job not terminating in time, and you've got 12 hours' worth of log.
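In practice, the search just described is a handful of keyword hunts over teuthology.log, done either interactively inside less or with grep. A rough sketch, run from inside the job's archive directory, using the keywords mentioned above:

```
# Open the (often very large) job log and search interactively
# with /Traceback, /FAILED, /Caught and so on.
less teuthology.log

# Or non-interactively, just show the first few hits of each keyword:
grep -n -m5 "Traceback"      teuthology.log
grep -n -m5 "FAILED"         teuthology.log
grep -n -m5 "Caught signal"  teuthology.log
```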
A: This is something I tend to use, and this is just my email. If you see, I found a first hit, which tells me that there was this tracker issue where somebody else has logged this: they had a similar bad allocation issue, there are a whole bunch of instances in the RGW suite, and there's a related ticket, which also has a bad allocation problem, that has been tracked for six days. So if I was... well, this is Sridhar's run, so this tells me the failure was not caused by my test change, or the PR that I'm testing. But if, in any of this, I was not able to find a tracker and I did not see a Sentry event, at that point I need to dig a little deeper and verify whether that crash could have been caused by the PR that I'm testing.
A
If
I'm
sure
I
go,
create
a
new
tracker
issue
for
it,
and
that's
all
I
mean
when
you
create
a
tracker
issue,
you
add
the
technology
run,
you
add
you
know.
Sometimes
we
try
to
also
add
the
description
like
if
you
there
is
a
thrashing
test,
so
we
add
the
entire
description
that
you
see
that
next
time
people
look
at
it,
it's
just
easier.
I
mean
you,
don't
have
to
do
that.
It's
not
a
must
it's
just
easier
just
by
looking
at
it.
B: How often, and when, this is occurring.
A: That is very important, because this is a clear example. You can see what we have done; this is the history: we saw it first, we updated it; somebody else saw it, and they also marked their ticket as related. So it just makes debugging easier. We know that this is spread all across, not just in the RADOS suite. Now there's another ticket that got marked, and then the analysis; and you see that the issue started maybe more than nine days ago, but we still keep updating the tracker, so that we have more data points to analyze things.
A
So
I
think
what
what
all
I
discussed
is
almost
written
here.
You
want
to
go
back
to
refer
things
that
I
also
want
to
cover.
Next
is
the
further
analysis
part,
so
I
want
to
just
quickly
do
an
ls
here.
So
when
you
are
in
in
the
run-
and
you
are
looking
at
a
particular
job-
you
have
a
topology
log
here
that
we've
gone
over.
There
are
a
few
other
things
here
which
may
be
of
use
to
you,
particularly
this
one
orange.
A: This is the original config of the test that was run, and it tells you all the configuration, all the details that were used, even the roles and everything that was used for that particular run. So this can be used to reproduce tests, even locally: you can just take this original config and use it to run against a locked machine, or even not a locked machine; it can be used as a framework to run a test.
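As a rough sketch of that idea (assuming you have the teuthology CLI set up and a test node available to you): the file name orig.config.yaml comes from the job's archive directory, the run name and job id are placeholders, and the exact flags of the teuthology command should be checked with teuthology --help before relying on them.

```
# Copy the frozen job config out of the archive...
cp /ceph/teuthology-archive/<run-name>/<job-id>/orig.config.yaml ./repro.yaml

# ...then replay exactly the same configuration as a standalone teuthology job,
# archiving the new results locally.
teuthology --archive ./repro-archive --owner "$USER" ./repro.yaml
```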
A
You
kind
of
freeze
the
configuration
because,
as
we
just
discussed,
that
there
are
a
lot
of
things
that
are
probabilistic
and
next
time,
when
you
start
a
new
run,
you
may
not
land
up
with
exactly
the
same
values.
This
ensures
that
you
have
exactly
the
same
configuration
when
we
want
to
run
exactly
the
same
test
again.
So
this
is
something
to
keep
in
mind.
A: So here we have, you can see, similar stuff; there's just one thing extra, which is this remote directory. When you have a failure, this remote directory gets created. What it has is all the logs from the corresponding daemons that were run in that test; you also have core dumps, you have crash metadata. You can just go over it quickly and see what all is there. As you can see, this job was run on smithi121, so it creates a directory of stuff.
A
That
is,
it
has
archived
from
that.
There
are
a
bunch
of
directories
here.
The
most
important
things
that
you
will
end
up
using
is
a
log
directory.
It
has
all
the
logs
zipped
logs.
The
last
few
you
can
see
are
the
the
ones
like
for
at
least
for
rails
developers.
These
are
the
ones
we
care
most
about,
so
you
can
go,
look
at
raw
logs
copy
them
somewhere.
Look
at
you
know
from
this
place
or
whatever,
but
this
is
one
useful
piece
of
information
to
know.
A: This one will have some additional debugging from just before the crash happened. That is sometimes useful, because a lot of the time developers want to see what happened, say, 100 lines above where the crash happened. We don't care about the entire osd.0 log; we only care about the last 200 lines, and this file is smaller, easier to open, easier to copy, easier to move around.
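A rough sketch of poking at those archived daemon logs from a job's archive directory; the host name, daemon id, and line count are placeholders, and the remote/&lt;host&gt;/log layout is the one described above.

```
# Daemon logs archived from the test node live under remote/<host>/log/.
ls remote/smithi121/log/

# Look at the tail of a gzipped OSD log without unpacking it on disk,
# e.g. the last 200 lines leading up to a crash.
zcat remote/smithi121/log/ceph-osd.0.log.gz | tail -n 200 | less
```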
A: Okay, going back: I think I've covered everything here. Due to a lack of time, I'm just going to quickly go over the last thing that I have: instances where Sentry is useful and where it is not so useful yet. So, for this example here, this is an instance from yesterday; there's this PR that was being tested by Yuri.
A: Now, the good part is that this is a "command failed" error and it has a unique test name, a workunit name, so you can see that it has this history of events, starting probably two days ago. If you go and click here, what Sentry helped me understand was: this is the same failure that has been seen in these wip-* branches, which are all master-based; but if you look at Pacific, this is the sole branch.
A: This is the only branch that has seen this failure on Pacific, and if we go back, this was the exact branch that Yuri was running. So this is a clear indication that something which had never been seen on Pacific is now being seen on Pacific; that is what Sentry tells us. So we need to make sure whether the PR is related or not, and it turned out that it was: this was a patch that had merged in master and was getting backported.
A: I have one more example here, but I'm not going to go over it now for lack of time. I'm just going to show you the cases where Sentry is not so useful yet. Like here: it tells you that there is a failure in the test health history test, so I would assume, just from the previous example: oh, there's a unique test name, I should be able to group everything. But there are some things that we need to fix in Sentry for it to also consider, or capture, runtime errors correctly.
A: So here, as you can see, if I go to events, it does not show me the same health history failure; it has grouped a whole bunch of things, which is probably not useful in these kinds of scenarios. You'd have to go and look for tracker issues and look at the teuthology log to see what's going on.
A: We are hoping to fix or address this part as well, but as we stand today, we should be cautious about relying on Sentry in these cases, and rely on it for things that fall into the first category, like the first example.
A
B
A
A
Sure,
absolutely
anywhere
I
mean,
we've
got
irc,
you've
got
email,
you've
got
a
thing,
feel
free
to
ask
questions,
but
I
hope
this
is
useful
and
next
time
you
analyze
something
it
should
be.
You
know
any
bit
easier.