From YouTube: Quality Team: Failure Triage Training - Part 2
Description
Follow-up discussion by the quality team on how we investigate failing quality pipelines using the original containers the tests were executed on.
C: Basically, the steps are: create an issue, quarantine the test, and then investigate the failure. So I'm assuming that you're familiar with the pipelines, familiar with the process of creating an issue and creating an MR to quarantine a test, and now we're getting to the point of investigating the failure, so that we can add some notes to the failure log for anybody who's going to be fixing it, or even move on to fixing it yourself. So in the document here I had a link to an existing issue. This is one that's still quarantined.
C: So this is a staging issue, and in this case it failed because it was unable to find a selector, the admin area link, while trying to use the menu to go to the admin area. Now, this test I was familiar with, so the troubleshooting was pretty straightforward: I knew that in staging we don't have admin access, so this isn't going to work. But when I tried to reproduce it locally...
C: ...I found another problem. There's actually another error in this test, and in this case the error that I got when I ran it locally was that the branch name master already exists. So I'm going to try to reproduce that using GDK. I'll switch over to my terminal, and I'm going to restart GDK. Okay, so that's just killed GDK. I'm in the GDK folder, you can see there.
C: So I run `gdk run` to start it up, and then I'm going to switch over to Visual Studio Code and bring up the test. The test was this one: the push over HTTP file size spec. I've got that open here, and I've just commented out the quarantine metadata tag there, because otherwise (that's another bug) it wouldn't run while it was quarantined. Commenting it out just allows me to run the test. And I've got a launch config set up here.
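The behaviour of that quarantine tag can be sketched roughly like this (a plain-Ruby illustration of the idea, not the real GitLab QA code; the metadata key name and the issue URL are assumptions):

```ruby
# Plain-Ruby sketch (NOT the actual GitLab QA implementation) of what a
# quarantine metadata tag does: a quarantined example is skipped on
# normal runs, which is why the tag has to be commented out to run the
# test locally.
def run_example(metadata, run_quarantined: false)
  return :skipped if metadata[:quarantine] && !run_quarantined

  :ran
end

# A quarantined example is skipped unless quarantined runs are requested.
run_example({ quarantine: 'https://gitlab.com/example/issues/1' })  # => :skipped
run_example({ quarantine: 'https://gitlab.com/example/issues/1' },
            run_quarantined: true)                                  # => :ran
```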
C: I've got a launch config set up here that uses the CHROME_HEADLESS=0 environment variable and QA_DEBUG=1. CHROME_HEADLESS=0 turns off headless mode and allows me to see what's going on; QA_DEBUG=1 writes logs, as you can see at the bottom here. And then the argument here: this is running the bin/qa command, so it's like running that command from the terminal. If you were going to do it from the command line, it would look like that: bin/qa, Test::Instance, and then there's the URL.
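From a terminal, the equivalent invocation would look something like the following (a hedged sketch; the scenario name and target URL are assumptions about the local setup, and the command is only echoed here rather than executed, since running it needs a GitLab QA checkout):

```shell
# Sketch of the command-line equivalent of the launch config described
# above. CHROME_HEADLESS=0 turns headless mode off so the browser is
# visible; QA_DEBUG=1 writes verbose logs. The scenario name and URL
# below are illustrative assumptions.
cmd='CHROME_HEADLESS=0 QA_DEBUG=1 bin/qa Test::Instance http://localhost:3000'
echo "$cmd"
```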
C: So anyway, I'm going to execute that, and that should run the test. I'll just drag the window over to my second monitor. Now you should be able to see it; let me know if you can't. At the moment it's loaded the list of projects. Oh, it's going to open the console, so that was hidden for a second in the back. So it's adding a personal access token, and you can see in the background the logs of it pushing the files.
C: And it's running the after block. Yeah, it's running the after block; there's an after block here that is going to undo the file size change. That's there because this is two tests, so we want to revert the change that the first one made before running the second test. But anyway, this is the failure that we were seeing in the logs, the same one here: branch name master already exists. Now, if you were triaging and you just wanted to verify that the failure is reproducible, this is far enough.
C
You
could
say
that
yes,
you've
reliably
reproduce
the
failure.
That
would
be
enough
to
provide
some
information
that
yeah
this
is,
is
definitely
a
failure,
but
it
would
be
good
to
go
a
bit
further
and
determine
if
possible,
if
this
is
failure
with
the
test
or
if
there
is
something
wrong.
We've
give
that
itself
now.
C: In this case, we can go back up to the command that was issued, and we can see that it's issuing a checkout to create a branch called master. And earlier in the test it had already checked out master and created the branch successfully. So this is a failure of logic: we know that you can't create a branch that's already been created, so there's something wrong with the test there. If you were looking into it yourself, you'd need to do a bit of troubleshooting: step through, try to reproduce it, and identify where things have gone wrong.
C: We won't go through all of that, but to cut to the solution: whatever change was made changed the logic, so we now need to tell this test not to create a new branch. So this is a failure with the test, and it's a pretty easy fix. I haven't submitted it yet, but I will now. And if you were doing the triage yourself and you'd got to this point...
C: ...if you were able to figure out pretty quickly that, okay, here's the fix for the test, then just submit it and you're done; no need to worry about anybody else having to follow up on it. So let's assume we haven't done that, and let's assume that we weren't able to reproduce the failure using GDK.
C: The next step that we had in the document was to have a look at running it in Docker. GDK is a different environment from the one the tests run in under CI: in CI we use Docker containers based on the Omnibus nightly image. So that's what we can do locally as well. We can run the Docker container, and this is the command.
C: docker run, publishing port 80 on the container mapped to port 80 on the host; naming the container; putting it on a network named test, because the Docker container that runs GitLab and the Docker container that runs the tests need to communicate, and we want the two containers on the same network (it's easier to have it as a separate network); hostname localhost; and there's the image name. The container takes a while to start up, so I'll just show you one that I prepared earlier.
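Put together, the docker run command described here would look roughly like this (a sketch; the container name and image tag are assumptions, and the command is echoed rather than executed so it can be copied and adapted):

```shell
# Sketch of the docker run invocation described above:
#   --publish 80:80       map container port 80 to host port 80
#   --name gitlab         name the GitLab container (name is an assumption)
#   --net test            put it on a user-defined network named "test",
#                         so the QA container can reach it
#   --hostname localhost  set the container hostname
# The image tag (the Omnibus nightly image) is an assumption.
cmd='docker run --publish 80:80 --name gitlab --net test --hostname localhost gitlab/gitlab-ee:nightly'
echo "$cmd"
```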
C: So if I switch back to Visual Studio Code and I change the host that I'm running the test against, I can just run the same test from the same environment. Let me bring over the browser. And now it's running the same test, but this time it's running it against the Docker container, and this is the nightly image, so it doesn't have my changes, so it should fail in the same way as it did in the logs.
C: This is going to take a while, but after it runs the GitLab image in Docker, it will run the tests, and we can come back to this and see what the output is. While it runs: I also mentioned earlier that you can set the personal access token. You can do that using GITLAB_QA_ACCESS_TOKEN if you set up a token manually, or alternatively there's now one that's built in. Let's see if I can find the source for it.
C: Personal access token, there we go. So the development seed adds the personal access token; that's the one there. So if you configure the environment variable with that access token, then when you run the tests it shouldn't need to create an access token; it just skips that first step. We'll see if that works. Hmm, something has gone wrong.
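Setting the token up front, as mentioned, would look something like this (a hedged sketch; the token value here is a placeholder, and the built-in alternative would be whatever token the development seed created):

```shell
# If you have created (or seeded) a personal access token, exporting it
# lets the test suite skip the step that creates one through the UI.
# The value below is a placeholder, not a real token.
export GITLAB_QA_ACCESS_TOKEN='<your-personal-access-token>'
echo "$GITLAB_QA_ACCESS_TOKEN"
```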
C: In the meantime, I'll check on that Docker run. Here it shows that it started GitLab in a Docker container and was waiting for it to become responsive, and now it's just issued another docker run command to start the QA container image and run Test::Instance against the GitLab container's address, running that test that we told it to. So we're waiting for that to finish. Let's go back here and see why... okay, so we're going to need to investigate that. This should work.
C: So those are the three first-line options that you would take when troubleshooting. One of these should allow you to reproduce the error, and hopefully some combination of them, or going back to GDK, will allow you to identify where things are going wrong and either fix the problem or add some notes to the issue for somebody else to follow up on. So I think that wraps up the demo. Are there any questions, or anything you'd like to see?
C: Right, yes. And if it was a flaky test, then ideally you wouldn't unquarantine it when you submitted the fix. Ideally you'd submit the change, leave it in quarantine for a few runs, say five, make sure it passes every time, and then submit another merge request to unquarantine it. But in this case it failed all the time, so we could unquarantine it immediately.
C: Yeah, refer to the front-end or back-end engineering manager of the relevant team by mentioning them in the issue comments. I think if you do know that somebody was working on a particular feature and you can identify that as the cause (for example, if you've been working with the team and you know that some merge request just went in recently), then you could go directly to an engineer; but otherwise, tag the relevant manager of the relevant team.
E: What do we constitute as a flaky test? I just realized, because, Mark, when you said, if you think it's a flaky test, let's not move it out of quarantine yet until it's proven stable: the Configure team smoke test that I had actually assigned to you was failing for a specific reason. It wasn't flaky; it was just failing. So do we treat regular failures as flaky tests?
C
No,
no
and
it
can
be
difficult
if
we
just
say
one
report
but
I'm
pretty
sure.
Now
the
the
pipelines
are
set
up
to
retry,
so
a
spec
will
retry
each
test
three
times
and
then
fail.
So
you
can
look
through
the
logs
and
see
if
it
has
retried
and
if
it's
retried
and
failed
each
time,
then
we
know
it's
not
flaky.
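The retry-based reasoning described here (retry up to three times; consistent failure means not flaky) can be sketched in plain Ruby (an illustration of the triage logic, not the actual pipeline configuration):

```ruby
# Sketch of the retry-based triage reasoning above: run a test up to
# three times, stopping early on a pass, then classify the result.
def run_with_retries(max_attempts = 3)
  results = []
  max_attempts.times do
    results << yield
    break if results.last == :pass
  end
  results
end

def classify(results)
  if results.last == :pass
    # Passed eventually; needing a retry to pass is the flaky signature.
    results.size > 1 ? :flaky : :pass
  else
    # Failed on every attempt: not flaky, a genuine failure.
    :genuine_failure
  end
end

classify(run_with_retries { :fail })  # => :genuine_failure
```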
B: I know, and, I hate to say this, but there's a term product management uses: label health. You take it at face value, but it can be overwhelming. And yeah, quality::bug is the same as quality plus bug. If you can think of a better name for me to propose, do, but yeah, I understand. It's not flaky; it's a defect in our test harness.
C: Yeah, so I have a launch config here; I'll make it public. It's just basic. Essentially it's adding the environment variables and whatever arguments, and there's this handy little ${file} variable there, which means that if you select a file in Visual Studio Code, it will substitute the path to that file and run that specific test, which can be handy. So I actually have a few different...
C: Oh, this is the wrong Visual Studio Code window, I see. So I have a few different configurations set up: if I want to run a specific test, or run the smoke tests, or run tests in a certain group, or run tests against staging, I have a few different configurations set up so that I can quickly do any of those. But "current file" is the one that I mostly use.
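A "current file" launch configuration like the one described might look roughly like this (a hedged sketch of a VS Code launch.json entry; the `type`, `program` path, scenario name, and URL are assumptions about the local setup, while `${file}` and `${workspaceFolder}` are standard VS Code substitution variables):

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "QA: current file",
      "type": "ruby",
      "request": "launch",
      "program": "${workspaceFolder}/bin/qa",
      "args": ["Test::Instance", "http://localhost:3000", "--", "${file}"],
      "env": {
        "CHROME_HEADLESS": "0",
        "QA_DEBUG": "1"
      }
    }
  ]
}
```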
C: Yeah, in my opinion you should always have that to disable headless if you're going to be running locally. And then sometimes I might run all of the tests headless in the background while I'm working on something else, just to make sure, if I've made a bunch of changes, that I haven't broken anything else. I'll just run them all, and I'll run them headless so it doesn't get in the way. Okay.
B
Have
we
noticed
any
flakiness
specifically
to
headless
and
I
think
this
came
up
with
a
discussion
with
Dan
and
me
a
while
back
where
she
was
just
running
you
a
full
full
head,
because
that's
what
our
users
are
using,
because
I
assume
right
now
we're
running
headless
correct
everywhere.
Do
you
mean
in
CI
yeah
yeah
yeah.
C
Yeah
I
I,
don't
think
so.
I
only
recall
one
instance
where
there
was
a
problem
that
was
related
to
running
headless
and
I
think
it
was
actually
just
that
a
developer
ran
in
an
environment
where
the
where
their
desktop
resolution
was
too
low,
and
so
they
ran
into
an
error
there
and
it
wasn't
reproduced
during
CI,
because
there
was
more
virtual
space
in
the
headless
CI
environment.
B: Yeah, I was about to say: until we have our own internal grid. I think once we have that, I would like to actually turn on headed mode, because that's probably a better representation of what our users are using. I mean, RSpec and mocking can be headless.
D: All right, well, we are at just over three minutes left, so that was perfect timing to get everything done. You get two gold stars, Mark.