From YouTube: Quality Team: Failure Triage Training - Part 1
Description
Beginning of discussion by quality team on how we quickly respond to failing quality pipelines and quarantine necessary tests for further investigation.
B: Right, so, yeah: end-to-end test failure triaging training session. Let's go. So I'll share my screen; I think that would be a good way to go, and I'll go through the process. I'll share my desktop... come on, share... there we go. And now I can't get to the buttons, I need... all right, I want to open that document.
B: So we have the debugging guidelines that Sanad was kind enough to create just recently, so yeah, I'll follow through that, and feel free to jump in with any questions at any point. We'll add some stuff to this as well, and we also have the Google document that we can take notes in, to make sure we don't forget anything that's important to refer to later. Alright, so starting off: what I'd usually do is check the nightly and staging pipelines.
B: For example, going back to staging today, we can see here that I'd looked at the failures and found that there were some failures that were possibly related to other ones, and going back again further, you can see that Sanad reopened an issue on a previous staging pipeline failure. So yeah, I would start off by checking Slack to see if any work has already been done, and if so, the job's already done. But let's assume that isn't the case and we're starting from scratch. So I would open the pipeline and...
A:
B: But assuming that hasn't happened, then I'd be doing the same thing that he'd done, which was: from the pipeline's link, go through and check the failures. In that, I clicked on one where it actually passed, but if you see, it failed the first time and then passed, which means it was retried. So I'm going to go through and click on just any of them, so I can bring up a list, and then scroll down and see the actual failures that occurred, and yeah.
B: Yeah, we try to... so yeah, it should be reported anyway, even if it does pass. Even if it does pass, like it did this time (it passed after retry), that just means it's a flaky test, so it should still be reported. I don't know if this one has actually been reported. So what I would do then is copy the file name there and then search for it.
B: There's a couple of ways of doing this. One quick way is just to have a look at the issues here, so having a look at the nightly issues to see if it's already open here; there isn't one. Now, what I want to do, ideally, is make sure that nobody else has opened an issue somewhere else. It's possible that an engineer had encountered the same problem running the test themselves.
B: That'll take a while, but if nothing comes up there, then I can be pretty confident nobody has seen that failure. I'm going to assume that's... yeah, okay, good, it didn't take too long, but long enough: there's nothing open. So we can create a new issue, back in the nightly project, because that's where the failure happened. I'm going to report a failure in this file there.
C:
B: Right, yeah, and that's a good point. It is a good idea to have a look at the closed issues; the same issue might have been closed pretty recently, so that's a good point. Let's see... if it was similar enough we might reopen it, but it looks like it's not. "Expected to..." I'll scroll across and see, yeah, so see what the text was that was expected.
B: This was merged a week ago, so yeah, we'll continue with reporting this one. So I'm gonna copy the failure stack trace, just up to... okay, I'm going to grab it up to the first time that the test file appears, so we know exactly which test it was. The rest is the stack trace from the framework, which is always the same, so we don't need to worry about that. And we're gonna paste the stack trace in there. I'm also going to paste... whoops, okay, I'm going to paste in the link to the job.
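As a made-up illustration of what that excerpt looks like (the failure text, spec path, and line number below are hypothetical, not taken from the actual job), you keep everything up to the first line that names the spec file and drop the framework frames below it:

    Failure/Error: expect(page).to have_content(expected_text)
      expected to find text "..." in the page
    # ./qa/specs/features/browser_ui/some_stage/some_feature_spec.rb:42:in `block (3 levels) in <top (required)>'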
C:
B: We have the conflict there happening, but we don't have "uploading LFS", so it's possible that LFS didn't actually work in this case, rather than the test just not finding the right text. So that would need some more investigation, so I'm gonna leave the issue details at that for the moment and we'll get back to the investigation. So the issue should have a link to the failing job, should have a screenshot if available, and an HTML capture if available, so then browse to the job artifacts and...
B: And that just means it's a bug, either in the test or in the application; we don't know yet, but in either case that's the same tag that's needed. And it's a p1 based on the priorities mentioned here, so just double check those: p1 is for tests that are needed to verify fundamental GitLab functionality, as opposed to p2 for tests that deal with external integrations, and this one is Git LFS functionality.
B: Enablement, right, okay, yeah. So we don't have a cross-functional team member for the Enablement stage at the moment, so I won't assign that to anyone on that one. All right, so we can submit that issue. Oh sorry, and it's a transient failure, because it did pass after retry. And so we can submit that, all right.
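Pulling that together, a rough sketch of what the submitted issue can contain (the layout and placeholders are illustrative, not a required template):

    Title: Failure in <spec file>
    Stack trace, cut at the first line that names the spec file
    Link to the failing job
    Screenshot / HTML capture, if available, from the job artifacts
    Labels: bug, a priority (p1 here, since it covers fundamental GitLab
    functionality), the stage, and a note that it was a transient failure
    (it passed after retry)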
B
So
I
don't
want
to
spend
too
much
time
on
that
now
so
I'll
just
say.
The
next
steps
here
would
be
to
do
that
replication
to
try
to
reproduce
it
locally
and
for
the
sake
of
hopefully
fixing
the
test.
If
it's
a
fellow
in
the
test,
otherwise
providing
some
more
input
to
report,
an
issue
for
a
developer,
to
an
engineer
to
look
into
fixing
an
actual
bug.
If
it's,
if
it's
an
actual
bug
in
the
application
and
then
the
next
step
for
us
would
be
to
quarantine
that
test.
A:
D:
B: Of course, but if it's a bug in the test itself, then we want to make sure that that test is running. Even if it's quarantined, we can run it and see the results, but it's slightly less likely to be on our radar if it's running in the quarantine jobs that are allowed to fail. So if I go back to the pipeline, you can see here that all of these quarantine jobs are allowed to fail.
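As a rough sketch of what quarantining a single test can look like in an RSpec-based suite like this one (the example name and comment are placeholders, and the exact metadata the framework expects may differ):

    # Quarantined: <link to the failure issue>
    # Examples tagged :quarantine run only in the quarantine CI jobs,
    # which are allowed to fail, so this failure no longer blocks the
    # pipeline while it is being investigated.
    it 'creates a new file from a template', :quarantine do
      # original test body stays unchanged
    end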
B: All right, so we know how to merge an MR anyway, so I think I'll skip that point now and fix that offline, but basically all I'm doing there is creating a merge request with that change in it, and then I would post in the quality channel asking somebody to review it. So there's probably one somewhere nearby... so here's an instance of me opening a merge request and posting in the quality channel to ask the team...
B: ...if anybody is available and able to review it. That's just so we can get the change in quickly, so that it doesn't hold up any other pipelines. And we can see here that there's just the one change, so similar to what you saw there, but for a different test: just the failure issue again and the quarantine tag, and that's it. Sanad approved it and I think he set it to merge as well. So yeah, that's quarantine. Any questions about the quarantine process?
B: What I think would be useful is some more coverage of the actual investigation. So after we've quarantined, or during the quarantine process if you want to do it side by side, we also want to debug the test failures: investigate the failure, try to reproduce it locally, and see if there are any logs or any further insights we can add to the issue, to help whoever's going to be fixing it, or to fix it ourselves if it's quick and easy enough to do so. So there was actually another failure.
B: So an element that it's trying to click is not clickable, because another element is in the way. So it looks like that dialogue is in the way, and the test is expecting to be able to click something that's behind the dialogue, and it was trying to do that at "create new file from template". So assume I've gone through the process of opening an issue, pasting in...
B: ...all of this, adding the labels, etc., and quarantining the test, and now I'm at the stage of investigating it. So I'm going to skip that for now and just get straight into the investigating. So I want to see what's going on here and what element it was trying to click, and the file is the Web IDE test, so I'm gonna go to the code this time and...
B: I'm gonna open that file... oh, I didn't actually check if that was the right one. There are two tests with the same file name, and that's the repository one; that's the wrong one, I actually want the Web IDE test. And the line that it was failing at was 70, so that should be where that "create new file from template" line is, and it's having problems in the edit page, the Web IDE edit page, and that was this line, line 55, yeah.
B: Alright, so let's see what that looks like in GitLab itself. So I'm going to open a test project; I have a test project that I just use to play around in when doing this sort of thing, and I want the Web IDE, and I'm adding... so the test is about adding the template, and so here is the dialogue that was in the way, and what the test is supposed to do is click on one of those. So let me get rid of these other files.
B: That's the edit file, and the test... back here with the test: it opens the Web IDE and then tries to create a new file from the template. That clicks the new file button, so I did that, clicked the new file button there, and then within the template list it's going to click on the file name, so we should be able to find the template list element in here.
B: qa-template-list... so that's that one, qa-template-list, and it's going to click on the file name. Now, what file name was it in this case? It doesn't really matter, because it failed on all of them.
B: So it gets the file name from this array of hashes here. You can see the file names here (.gitignore and so on); it matches each of these file names, and then, if I click on one of them, which is what the test does (the test, in the edit page here, clicks on the file name), so do that, and the dialog disappears. But in the screenshot we had, the dialog was still there, so something has gone wrong there: Capybara tried to click the file name, and that's when the failure happened.
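To make the code being walked through easier to picture, here is a simplified sketch of that kind of page object; the view path, element names, and method body are approximations, not the exact GitLab source:

    module QA
      module Page
        module Project
          module WebIDE
            class Edit < Page::Base
              view 'app/assets/javascripts/ide/components/new_dropdown/modal.vue' do
                element :template_list   # rendered on the page as qa-template-list
              end

              # Opens the new-file dialog, then picks a template by file name
              # from the template list.
              def create_new_file_from_template(file_name)
                click_element :new_file            # hypothetical element name
                within_element(:template_list) do
                  click_on file_name               # plain Capybara click, as discussed below
                end
              end
            end
          end
        end
      end
    end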
B: So what might be happening here is that there's an animation there. D, can you see that when I click the new file button, the animation drops the dialogue down slowly? This is something that's happened in a couple of cases before, where, while the animation is happening, Capybara finds the element, goes ahead and clicks it, and yeah... then...
C: Yeah, I wanted to actually bring this up: this is actually part of the dynamic element validation. One thing I completely forgot to do on the dynamic element validation is that we validate that it appears, but we could also validate that it's clickable, so something like that would fix this as well. That's exactly what's happening: it's not clickable, either because it's animating or, you know... it's weird, though, because, yeah, it's basically not ready to be clicked yet.
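As a sketch of the kind of defensive click this suggests (the helper name is made up, and this is one possible approach rather than how the framework's dynamic element validation is actually implemented), the click can simply be retried while the element is still animating or covered:

    # Retry a click a few times when Selenium reports that the element
    # cannot be clicked yet, e.g. while the dialog animation is running.
    def click_when_clickable(locator, retries: 3)
      attempts = 0
      begin
        find(locator).click
      rescue Selenium::WebDriver::Error::ElementClickInterceptedError,
             Selenium::WebDriver::Error::ElementNotInteractableError
        attempts += 1
        raise if attempts > retries
        sleep 0.5   # give the animation time to finish, then try again
        retry
      end
    end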
C:
B: Yeah, so it would be better for this to be click_element, and in general we want that anyway: we want page objects to be using the element methods rather than pure Capybara methods. So there's a couple of things we'd want to do there, and there are a couple of options for fixing this. Yeah, okay, so yeah, but basically we're getting into the weeds of fixing the test now, so we don't need to go into that in detail for this training session; basically, fix the test. So this is looking at the code.
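A sketch of the shape of that change inside the page-object method from the earlier sketch; the :template_file_link element name is hypothetical, and in practice the element would still need to be narrowed down to the right file name:

    # Before: plain Capybara click inside the template list.
    within_element(:template_list) do
      click_on file_name
    end

    # After: go through the framework's element method instead of a bare
    # Capybara call, so the page object stays in element terms.
    within_element(:template_list) do
      click_element :template_file_link   # hypothetical element name
    end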
B: We haven't got to actually running it yet, but we already know how to do that. So in that case you would run the test locally, make whatever fixes you need to, and then submit the fix and unquarantine the test if it's... so, let's assume... we know this isn't a flaky test, in fact, because it's failing every single time. So if we fix it, then we're pretty safe in assuming that it's fixed.
B: So that's a bit of an overview. We haven't actually had a chance to demonstrate running it locally in this session, so we might need to think about doing that in another session. Yeah.
A: Yeah, going through actually running it locally: I think where a lot of people, new people, are gonna get hung up is with all the different things that we have to get set up locally to get things running. That's kind of a whole different beast, so yeah, you probably don't want to go too far off the common path at this point, at least for this training, yeah.
C:
B: So if you can't reproduce it locally, but we're still seeing the failure, there's a couple of things we would do. What I would first try is to reproduce it locally using GDK. If that doesn't reproduce it, then I would try it using Docker, so yeah, in the guidelines there's some help about how to run the GitLab Docker image, and then you can run the test against that Docker image. If that doesn't reproduce it, then finally I would run the tests themselves inside Docker, which is essentially running the tests with gitlab-qa.
B: Potentially something might be happening in that interaction that doesn't happen when running via GDK, or via Docker plus the local test framework. If that doesn't reproduce it, then something really strange is going on, because that's basically how it works in CI: it's purely Docker containers communicating with each other. Yeah, another...
B: Yeah, and also, when reproducing locally... yeah, that can be tricky initially, because you might have changes that are different to what's being tested in CI. But if we follow the instructions in the guideline and actually use the SHA that corresponds to the nightly image that actually fails, then maybe...