From YouTube: Testing Group Think Big #2
Description
Today we think big about Test History: who would use it, the ideal experience for them, and how this could be a differentiator for GitLab.
A
This is the Verify Testing group's think big session for July seventh. Today we're going to take on, or try, the think big part of a think big, think small discussion, and the topic at hand is going to be historic test data for all projects. I indicated earlier in my message to the team that we would do both the think big…
A
There are some issues open on it already, but I'd like to kind of step back, forget about what we already have open, and think about the experiences we've all had in the past with CI systems and automated test data, and what a great experience could be when you're working with one of those systems, when you're working with GitLab and you want to see historic test data. How has this test performed on previous runs? What could that be?
B
For engineers, I've used Code Climate, for instance, and it has this kind of feedback for engineers. I think they do a fairly good job of putting it in front of the engineers while they're working, so it becomes a little bit of a gamified thing, where you see that you're in an area, you see where the score is, and you can see whatever the metric is, whether that was complexity or code coverage.
C
I think for a test automation engineer, it would be good to see the trends for a specific test. Like, is this a flaky test? In the last ten runs, has it passed four times and then failed six times, but not consistently? Or a way to surface the tests that are consistently failing, so that they know to go in and investigate those further, because it might be that the test is broken, or it might be that there's a bug.
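
To make that heuristic concrete, here is a minimal sketch, assuming pass/fail results for the last N runs are already available (for example from stored JUnit reports). The threshold and labels are invented for the illustration, not an existing GitLab feature.

    from collections import Counter

    def classify_test(results, flaky_threshold=0.2):
        """Classify a test from its most recent results.

        `results` is a list of "passed"/"failed" strings, oldest first,
        e.g. pulled from stored JUnit reports for the default branch.
        """
        if not results:
            return "no-data"

        counts = Counter(results)
        failure_rate = counts["failed"] / len(results)

        # Count how often consecutive runs disagree; frequent flips with a
        # middling failure rate looks flaky rather than consistently broken.
        flips = sum(1 for a, b in zip(results, results[1:]) if a != b)

        if failure_rate == 0:
            return "healthy"
        if failure_rate == 1:
            return "consistently-failing"
        if flips / (len(results) - 1) >= flaky_threshold:
            return "flaky"
        return "intermittent-failure"

    # Example: 4 passes and 6 failures, interleaved -> likely flaky.
    history = ["passed", "failed", "passed", "failed", "failed",
               "passed", "failed", "failed", "passed", "failed"]
    print(classify_test(history))  # -> "flaky"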
B
Are there any interesting ways to do meta-analysis of those tests, to be able to show somebody who's writing a test, or investigating a test, what the characteristics might be? There might be one particular characteristic of flaky tests in the suite, and at a higher level it might be really valuable to say, okay, if you want to know whether this test is really failing or just flaky, it'd be helpful to see.
B
Oh, over the past six months, a lot of our tests that were deemed flaky had this particular characteristic, they're this style of test, or they use this tool. And just to be able to kind of step back and say the way we're approaching writing these tests might be able to be improved, because this test style isn't producing good results for us, or it's producing a lot of investigation work, or something else.
C
Yeah, that's a good point. We tend to run into those kinds of things with specific failures or specific error messages. So, for instance, if you're trying to click on an element and that element has gone stale for some reason, but that only happens some of the time, that's probably a flaky test. Or if there's an infrastructure issue that pops up intermittently, that could also cause flakiness.
A
Yeah, that's what I was thinking. It would be like, all right, here are all of the tests we've identified that meet our flaky threshold, and a pass/fail percentage within those tests, and all of them are trying to call this specific API, or run this specific query, or use this specific test data. If we can start to surface data like that, it seems like it would help a lot with the investigation time.
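
A sketch of what surfacing a shared trait across flaky tests could look like, assuming each flaky test has already been tagged with attributes such as the APIs it calls or the fixtures it uses; the tag names and threshold below are made up for the example.

    from collections import Counter
    from typing import Dict, List

    def common_traits(flaky_tests: Dict[str, List[str]], min_share: float = 0.5):
        """Return attributes shared by at least `min_share` of the flaky tests.

        `flaky_tests` maps a test name to a list of attribute tags, e.g.
        {"spec/a_spec.rb": ["api:projects", "fixture:large_repo"], ...}.
        """
        total = len(flaky_tests)
        if total == 0:
            return []

        tag_counts = Counter(tag for tags in flaky_tests.values() for tag in set(tags))
        return [(tag, count / total)
                for tag, count in tag_counts.most_common()
                if count / total >= min_share]

    # Example input: three flaky tests, two of which hit the same API.
    flaky = {
        "spec/merge_request_spec.rb": ["api:projects", "fixture:large_repo"],
        "spec/pipeline_spec.rb": ["api:projects"],
        "spec/issue_spec.rb": ["fixture:large_repo", "query:slow_join"],
    }
    print(common_traits(flaky))
    # -> both "api:projects" and "fixture:large_repo" appear in 2 of 3 flaky tests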
D
It's definitely going to be different for developers. From an SET standpoint, we're looking at, all right, how do we get rid of it? Not that developers don't do this, but we're really going to focus in on how do we get rid of it, what are the things that are causing it to be flaky, and trying to identify whether that's really something in the test itself, something in the app itself, or something in the infrastructure.
B
For me, looking at it from one level higher, what I would want to look at is what is really giving my team grief, so that I can best devote some of my time, or ask someone on the team to devote some of their time, to analyze it. So what I'm thinking about is, I want a table of the top ten most failing tests, and by that I mean the top ten tests that haven't changed recently that have the most failures.
B
So, basically, if I go in and I change that test and alter it, I want it to be removed from that table, so I know that I've taken care of this one. Or at least the counter of how many failures it has is reset after I go and modify that test, so that I can keep track of what I've kind of already attacked and what remains to be done.
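
One possible reading of that table, sketched under the assumption that each test's failure timestamps and last-modified time are already tracked; the record fields here are invented for the illustration.

    from dataclasses import dataclass
    from datetime import datetime
    from typing import List

    @dataclass
    class TestRecord:
        name: str
        failure_times: List[datetime]  # timestamps of failures on the default branch
        last_modified: datetime        # last commit that touched the test file

    def top_failing_unchanged(records: List[TestRecord], limit: int = 10):
        """Rank tests by failures counted only since their last modification."""
        scored = []
        for rec in records:
            failures_since_edit = sum(1 for t in rec.failure_times if t > rec.last_modified)
            if failures_since_edit:
                scored.append((failures_since_edit, rec.name))
        # Editing a test resets its count, so tests that have already been
        # worked on drop off (or down) the table, as described above.
        scored.sort(reverse=True)
        return scored[:limit]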
B
The more immediate need for the engineer would be to make the pipeline green, but I think when you start to look at the problem holistically, like an SET or a manager would look at it, it would be more like, okay, this is causing a lot of grief, it's causing us to spend more money on pipelines because we're re-running this test over and over again, so what can we do to make this more robust?
B
Yeah, for me, I think it would involve some static analysis of the tests, like you were kind of hinting at earlier. So if you can identify that the tests that have API calls in them are the most likely to be flaky, why not extrapolate that a little bit: tests with API calls are the most likely to be slow, or tests that call this one function take ten times longer than tests that don't call this one function, and stuff like that, and kind of identify it for you, like, okay, you asked for it…
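
A minimal sketch of that kind of static check: scanning spec files for patterns that, per the discussion, might correlate with flakiness or slowness. The patterns, file glob, and labels are assumptions for the example, not an existing analysis.

    import re
    from pathlib import Path

    # Hypothetical patterns that might correlate with flaky or slow tests
    # (external HTTP calls, hard-coded sleeps, etc.).
    RISKY_PATTERNS = {
        "external-http-call": re.compile(r"\b(Net::HTTP|HTTParty|Faraday)\b"),
        "hard-coded-sleep": re.compile(r"\bsleep\s*\(?\s*\d"),
    }

    def scan_test_files(root: str):
        """Yield (file, pattern name) for each risky pattern found in spec files."""
        for path in Path(root).rglob("*_spec.rb"):
            text = path.read_text(errors="ignore")
            for label, pattern in RISKY_PATTERNS.items():
                if pattern.search(text):
                    yield str(path), label

    # Example: list every spec under ./spec that hits an external API or sleeps.
    for file, label in scan_test_files("spec"):
        print(f"{file}: {label}")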
B
Maybe you had some sort of artificial intelligence that could analyze the static analysis reports and then, based on the data it's been fed, identify trends and present them to you. So like, these tests that all call this one function are really, really slow, you need to address this function and make it better, and kind of elevate that right in front of my face from the dashboard that I'm looking at.
B
Yeah, it would be kind of neat if you could have, and again this is pie-in-the-sky, if you could surface that type of information in the IDE when you're looking at the test. Maybe you have a little exclamation point, you hover over it and it says, hey, this test is the slowest test in the whole codebase, maybe you should think twice before you put some more stuff in here. Yeah.
A
Oh, I'm slowly trying to get you to Clippy, that's what I'm really trying to do. But aside from trolling you guys, if you're looking at the file view of your repository, you can start to see, hey, I'm getting in, I'm writing new code, and I'm going to write some new tests, and I can see that in that test suite there are already three tests identified as slow, and some sort of mechanism then, like you said, of: these are probably slow, or rather these are probably flaky, not slow, because they all hit this API that is inconsistently up or down. Or we just identified that these three are flaky, you should go look at them and try not to write tests like that, and maybe fix those as well.
B
You could also then put it in the Web IDE. As you're writing a test, it could pop up and say, hey, just so you know, you're adding this API call here, and historically, when you add this line of code to a test, it becomes more likely to be flaky or slow, or what have you. Yeah, I think what's really interesting…
E
…is whether they're holistically well-written tests, because if the test has, like, good coverage, is it good, is it right? I mean, the way that I see it is that if I write code that's bad, like it's creating bugs, then the test will tell me that, right? But I feel many times that I'm in the mindset of, oh, I need to write a test just so it fits whatever I'm doing, you know, and I don't know if there's a way to visualize that. Maybe like…
E
If someone is changing a test a lot, you know, like if they change it every day, like developers are always changing this particular suite, that sounds like that's a bad test. You know, like if tests are being changed a lot. I don't think tests are meant to change that much, but I might be wrong. You know, I'm just seeing it from my perspective, I don't know, I would like to hear some thoughts about that.
E
Yeah, that's one way to see it. The other way that I was seeing it is, we run these tests every day, every time, you know, and they pass. Many of them pass without any problem, right? Which means that whatever I'm doing is not affecting that particular test. So the usefulness of the test is that if I change something that breaks the conditions being tested there, then it should tell me, hey, you broke these.
E
You know, like, now you've got to fix it, right? But I feel that many times when you're developing, you end up changing the test more than the actual code that you're working on, because you're just trying to address whatever new conditions you are adding to the code. You know, so I'm wondering if there's a way to visualize how much you're doing that type of behavior.
E
You know, like those quick-fix type of changes to the testing code, to the suites, in a way that allows anyone to see it. You know, I think it comes back to the flakiness of the tests, but yeah, it just seems like that, like the volatility of the tests as you go through time. Maybe I'm wrong, maybe someone with more expertise would know, I mean, but I…
E
…think, like, the perfect suite of tests shouldn't change that much, right? They should always be very consistent over time, and just change as you add new features. You know, I don't know if that makes sense, but that's just one perspective that I'm seeing, of something that could be measured, yeah.
D
There are lots of measurements that we can pull in. Churn, or volatility, is one. We have others that, as was brought up, can show, hey, this is how old this code is, and we can show coupling of this code with other code that always ends up getting changed. I've used some tools, Code Maat, and I think there's a service called CodeScene, but I think some of the stuff that we use right now can also give us those kinds of measurements. There's another one too, where you lose knowledge over time.
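
As an illustration of the churn or volatility measurement mentioned here, in the spirit of Code Maat but only a toy sketch rather than that tool's actual output, counting how many commits have touched each spec file can be read straight out of git history.

    import subprocess
    from collections import Counter

    def test_file_churn(repo_path=".", pathspec="spec/"):
        """Count commits touching each file under `pathspec` (higher = more volatile)."""
        # `git log --name-only` lists the files changed by every commit.
        out = subprocess.run(
            ["git", "-C", repo_path, "log", "--name-only", "--pretty=format:", "--", pathspec],
            capture_output=True, text=True, check=True,
        ).stdout
        files = [line for line in out.splitlines() if line.strip()]
        return Counter(files)

    # Example: ten most frequently changed spec files in the current repo.
    for path, commits in test_file_churn().most_common(10):
        print(f"{commits:4d}  {path}")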
E
Yeah, I think you hit the nail on the head. I think test design is, like, how can we help our customers by giving them metrics or data that allow them to create their test strategy and design in a better way? You know, it's like, perhaps we're not creating the right tests, we just need to have that, or use a different framework. You know, kind of give them more actionable…
E
…items out of the data. For instance, it seems that you have been using Karma; we can give them, like, that actionable path of why you should move from Karma to Jest, for instance. You know, and that's probably very opinionated, not something that you can just extract and show, but if you can give them those insights of why certain things might not be designed the right way, I think many customers will appreciate that, and it seems like something we can bring.
D
You're probably right, and I'm probably incorrect as well. I think QA has at one time used some cops, but it's not something I generally see running, but I kind of brought it up as an example. Yeah, that's kind of the situation: either an organization has one way to do something or they don't, and if they don't, there's not much tooling out there to help us with design.
A
Great. So it'd be interesting, maybe, as you're looking at a test history, you get pass/fail, like Ricky talked about, with: the test hasn't changed, and this is its history since its last change. Or, hey, this test failed, and here's the last time it changed, so you can see, oh, it just changed right before I started working on this code, and maybe I didn't make that change, somebody else did. So yeah, that can be interesting as well.
A
I think we've alluded a little bit to how we could differentiate with some of the uniqueness of GitLab and the potential of having both your source and your CI in one tool and one interface. What are some other ways that we could potentially differentiate from competitors, with a big pie in the sky? We have some smarts about your tests and your test history and where there might be flakiness.
B
The thing that I'm thinking about right now isn't directly related to what you're saying, sorry to be tangential, but I think a lot of points were brought up in this conversation about, like, how can I tell if my tests are good or not, basically, right? So when I think about that, the kind of bespoke industry solution is mutation testing, right? That's where you have a dynamic analysis thing for your code, and it just starts messing with your code and changing it
B
a little bit, then running your tests to see if your tests caught it messing with the code. So if it doesn't catch that, that means your tests aren't good, and you get an undetected mutation count. So every time it changes your code and your tests don't catch it, that's bad, and you get a counter, and it does that over and over and over again to kind of validate the strength of your test suites.
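
A toy illustration of the mutation-testing idea described here; real tools such as mutant for Ruby or mutmut for Python are far more thorough, and this hand-rolled sketch only flips one kind of arithmetic operator.

    import re

    def mutate(source):
        """Yield naive mutants of `source`: each one flips a single '+' to '-'."""
        for match in re.finditer(r"\+", source):
            i = match.start()
            yield source[:i] + "-" + source[i + 1:]

    def run_tests(source, tests):
        """Exec the (possibly mutated) code and return True if every test passes."""
        namespace = {}
        exec(source, namespace)  # load the code under test
        return all(test(namespace) for test in tests)

    def mutation_score(source, tests):
        """Fraction of mutants the test suite kills (higher = stronger suite)."""
        mutants = list(mutate(source))
        killed = sum(1 for m in mutants if not run_tests(m, tests))
        return killed / len(mutants) if mutants else 1.0

    # Code under test, plus a deliberately weak test that never exercises addition.
    code = "def add(a, b):\n    return a + b\n"
    weak_tests = [lambda ns: ns["add"](0, 0) == 0]    # survives the '+' -> '-' mutant
    strong_tests = [lambda ns: ns["add"](2, 3) == 5]  # kills it

    print(mutation_score(code, weak_tests))    # 0.0 -> the tests caught nothing
    print(mutation_score(code, strong_tests))  # 1.0 -> the mutant was detected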
B
That's something I'm actually super interested in as well. And then one of the other things I was thinking about while everyone was talking is that there are some people who actually advocate for the architecture of your test suites being divergent from the architecture of your codebase. So instead of having an RSpec file for every Ruby file, it would be more like you'd have a separate testing application scaffold that you would build and grow differently as the needs of your application changed and it grew over time.
B
So it's not a one-to-one mapping mirror, but more of an application that's purpose-built to test your other application, kind of thing. That's something that Robert Martin advocates for, and I think there are a couple of other books about that kind of idea out there. I haven't read them, though, so I'm a little bit ignorant there.
D
So we have both here, right? We have the end-to-end framework with QA, which is a completely separate framework, not tied to the structure of GitLab at all, and then we've got our built-in RSpec tests. Of course, if we went completely to the other realm, that would eliminate being able to use things like Drew's fail-fast template and things like that.
D
So I think they're still married one way or the other, but I think separating out the framework is definitely useful. But then you have the whole mapping issue: how do you know what you're testing? Because the tests, at least in our case, are mapped to the idea of a feature, whereas our RSpec tests are literally mapped straight to the code.
A
We are at time, or over time even. Like I said at the beginning, we were going to do just the think big part of this this week, and since we have another think big session next week, we will transition into think small, thinking about what is the smallest piece of functionality we could deliver in the next milestone to move us forward towards that big vision of a holistic view of your test history and test mutations. And a great discussion that we've had. All right.