From YouTube: Testing Group Think Big #3
Description
Today we think small about slow tests. What is the fastest way to test whether users value knowing which test is the slowest?
A
This is the Think Big for the Verify Testing group for July 14th, 2020. Today is the second part of talking about test history. So, a quick recap: we did the Think Big part of Think Big (just a little bit of a misnomer) a week ago. You can read through the notes in the agenda or view the recording, which is unlisted and available on the YouTube Unfiltered channel. Where the discussion really went was providing actionable suggestions about tests that are slow or flaky based on their past performance, and we thought that a good experience for that, for a developer, would be tips in the IDE, the Web IDE, about how to not write those kinds of tests.
A
So I added a sub-bullet in our agenda today. I think the problem that solves is that it's hard to know which tests are slowing down our pipelines the most, and impossible to not write more tests like that on a distributed team where that knowledge might be spread out. Which is a subtle way of saying there are two problems here. So the problem that I want to tackle today, in the Think Small part of Think Big, Think Small, is the first half of that.
A
We could do a research spike. We could do a survey. We could say the next thing that we need to do to validate and move forward on this is additional research that looks like this, to answer this question, and do that through, you know, customer interviews. So we're looking for a tangible issue that comes out of this, one that we can act on in the next milestone, so that we can move this forward towards that direction of, hey:
A
Now we have this great view into which tests are problematic and slow, and we're not going to write more of them as we proceed in this project. And a preemptive thank you to Ricky for taking notes of my rambling as we go. So that's the Think Small. I'm going to tee it up with: what could we do in our next milestone, 13.3 or 13.4, that moves us forward towards that vision of "hey, now we know which tests are..."
B
Slow. Good question. I guess my first question is: what is our definition of slow? Is it one minute, ten minutes, an hour? I think it's going to be subjective per group.
A
Let's focus on our persona of, like, the internal stakeholder, and really, I think, this goes to the manager. So Ricky, or even Darby or Sam: what would you say is a too-slow test as we're looking at a team?
C
I'll say something, I guess. I think maybe not, like, a definition of slow, because I think that's relative, but it could be just, like, the slowest test. So if I have the whole test suite, it'd be: what's the slowest one? And maybe the slowest is still plenty fast for my project, but maybe it isn't. So that would, you know, show me the slowest ones, and then I'll go fix the slowest one and save that much time on my build.
D
Okay. Would we be able to get some sort of average? So it was like, you know... most tests take... you know, if you've got end-to-end tests, they take a lot longer than unit tests. So you look at a suite and say: most of these tests take 10 seconds-ish, this one's taking five minutes, this is a slow test. Or, you know, some sort of deviation-from-the-mean kind of thing might be more helpful than figuring out, you know, how slow is slow. How long's a piece of string?
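
D's deviation-from-the-typical-duration idea could look something like the minimal sketch below. All test names and durations here are hypothetical; note it uses a median-based cutoff rather than the mean D mentions, since a single five-minute outlier drags the mean up and can hide itself.

```python
from statistics import median

def flag_slow_tests(durations, factor=3.0):
    """Flag tests that take much longer than the suite's typical test.

    durations: dict of test name -> duration in seconds.
    factor: multiples of the median duration that count as "slow".
    The median is used instead of the mean so a few huge outliers
    don't mask themselves by inflating the average.
    """
    typical = median(durations.values())
    slow = {name: d for name, d in durations.items() if d > factor * typical}
    # Slowest first, matching the "fix the slowest one" workflow.
    return sorted(slow.items(), key=lambda item: item[1], reverse=True)

# Most tests take ~10 seconds; the end-to-end one takes five minutes.
suite = {"test_login": 9.8, "test_signup": 11.2,
         "test_search": 10.5, "test_e2e_checkout": 300.0}
print(flag_slow_tests(suite))  # [('test_e2e_checkout', 300.0)]
```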
C
So I think that a lot of the discussion around averages might get a little bit strange in those types of situations where maybe you just have five real slow tests that test your whole application, kind of thing, so averages aren't going to be as useful there. I think we talked last time about what Darby brought up: give me a dashboard that's ordered by the slowest, and then I'll address them in order as we go.
C
I think that was one of the key points that we talked about last time, and I think that would still probably be the best way to go about this, and you could add additional metrics in there. They might not be useful for everyone, but for some people they might be; like you said, Sam, having the average and the flags, like "this one's slow." Because I think a lot of times what will end up happening, as people go down that list from slowest tests, is they'll come to a point where they're like:
C
"Oh, actually, this test needs to be slow because of X. I can't fix this." And so they'll skip that one every time and start moving down the list to the ones that are after that, and maybe after four years of doing that you'll have a whole page of tests that are the slowest ones, that you can't improve anymore. You know what I mean?
A
So let's assume that we have some sort of view for the unit tests, or the... the test history. Sorry, back up. Let's assume that we have a view to sort tests, and you can see tests by how long it took them to run. Let's limit that view to, say, the last pipeline. What could we do in our next milestone to gauge: is this helpful for our manager persona, or for the developer? I'm blanking on the persona name. Delaney?
C
Delaney.

A
Delaney, thank you for summoning it: the manager. I'm used to them being an alliteration, and that one is not necessarily an alliteration. So what can we do in... what would that look like, if we had an issue that was written up to go test this hypothesis?
C
Historically, the same way we think about test failures. So if you have, you know, one test... you know, you end up with that whole page of tests that's been slow forever, and it's going to be slow because, you know, that's the way they're going to be. If you could look at it over time, you might see when tests are becoming slower. Or, I think right now we have duration by, like, different...
C
We have duration per suite, and I think we can even get duration of individual tests, so we can get as granular as we want. But we could see not necessarily just "this test is slow," but "our tests might be slow, but this one's 50% faster than it used to be, so that's pretty good," or "all of a sudden, this test got really slow."
A
So, as we think big... we can think big, or we thought big, about: hey, here's what flaky tests look like; they're dependent on other things, or their performance changes over time. As we narrow the scope down, we can say: well, "flaky" is hard to define and it's not something that we can move on next. The next thing we could do, because we have the data, is show you the tests in order of how long they took, and start with the slow ones first. So, did that help clarify for you, Eric?
E
Nothing, yeah.

A
It took Think Small in a little bit different direction than what JJ mocked up from our discussion last week, or based on the historical test data that we have, because we started talking about not necessarily which tests have passed and failed, but more about that flakiness. And so that scoped us down into: well, let's talk about slowness first, and then we'll expand into flaky as this moves on, potentially.
C
I think there's still... I think there's still value in talking about past failures, especially with a limited scope. I'm curious to know which is more valuable to our customers. Like, would you rather know the pass/fails of the last 10 tests that we ran, or would you rather know which of your tests is the slowest one? Like... I understand that's a false dichotomy there; like, it's not one or the other, we can do both. But which one's more important right now?
F
That speed is something we always want to make sure we focus on, but pass/fail seems to be the most important thing from a historical standpoint, at least. Yeah.
F
I wonder if we could capture even just... even just comparing the last run to the next run. If we can see that there is a great variance in speed between them, that's a pretty good indicator that something's happened.
F
Of
course,
a
lot
of
tests
are
dependent
upon
infrastructure
and
I
think,
as
as
an
app
grows
over
time,
tests
might
become
slower
just
by
their
nature,
yeah
because
of
a
turn
within
the
app
itself,
but
I
think
just
understanding
when
there's
a
large
variance
between
the
the
previous,
the
previous
run
and
the
existing
run
might
tell
me:
hey.
I've
got
to
go
switch
gears
and
look
into
the
performance
of
this
particular
test
and
understand
why
I
have
such
a
big
variance
compared
to
the
last
one.
Now.
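
F's last-run-versus-this-run comparison could look like this minimal sketch. The test names are hypothetical and the ratio is an arbitrary cutoff, not anything the group settled on:

```python
def flag_duration_regressions(previous, current, ratio=2.0):
    """Compare two runs and report tests whose duration grew sharply.

    previous, current: dicts of test name -> duration in seconds.
    ratio: flag a test when its current duration exceeds ratio * previous.
    """
    regressions = []
    for name, now in current.items():
        before = previous.get(name)
        if before and now > ratio * before:
            regressions.append((name, before, now))
    # Largest relative slowdown first.
    return sorted(regressions, key=lambda r: r[2] / r[1], reverse=True)

last_run = {"test_checkout": 12.0, "test_login": 1.5}
this_run = {"test_checkout": 120.0, "test_login": 1.6}
for name, before, now in flag_duration_regressions(last_run, this_run):
    print(f"{name}: {before:.1f}s -> {now:.1f}s")  # test_checkout: 12.0s -> 120.0s
```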
A
So I just want to make sure I understand your user flow there. You had data that showed you which tests, historically, in previous runs, failed, passed, whatever. As you're looking at those, you've satisfied your questions about those tests. Now you're looking at a view of: and then, here are the tests that differ in duration, the amount of time that it took to run the test, regardless of pass or fail.
F
I'm not sure I would do those things sequentially. I would probably be looking at them in parallel; that way, if something had a huge variance and all of a sudden a test took 10 times as long as it did before, I'd need to stop and go look at that. Yeah.
A
So I just want to make sure that I'm sharing the context. What I've done, I think, internally, and not verbalized, is that I've split this problem. I'm thinking about, for Delaney the manager: they're looking at slow tests because they're thinking about "how can I help the team work more efficiently," and they're looking at a lot of tests, a bigger set of data. Versus looking historically at how this test run compares to previous ones, because I'm maybe triaging a pipeline, and that's more of our SET persona or our individual developer persona.
A
And so I think I keep tracking back to the manager problem in our discussion, and I think that we're bouncing back and forth a little bit. So I just wanted to share that context: that I'm actually thinking about these as two separate problems, and two separate solution spaces as well.
F
Yeah, no, no, you're right about that. I was extrapolating on Ricky's question there, yeah, which is a good question, Ricky; that would definitely make me change my priorities. For your question, James: I think the biggest thing right now is probably just exposure.
F
You've got to go into the logs. RSpec and any other tool will give you individual timings for tests. Just preventing a manager from having to dig down into the logs of a pipeline to be able to see those is probably the first step: just scraping that information out and then exposing it at a higher level. Yeah.
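
The "scraping that information out" step F describes could start from the JUnit-style XML reports that many test tools can emit (RSpec via formatters, among others) and that CI pipelines commonly collect as artifacts. A minimal sketch, with a hypothetical report path:

```python
import xml.etree.ElementTree as ET

def test_timings(report_path):
    """Extract per-test durations from a JUnit-style XML report.

    JUnit XML records each test as <testcase name="..." time="...">,
    where time is the duration in seconds.
    """
    root = ET.parse(report_path).getroot()
    timings = {
        case.get("name"): float(case.get("time", 0.0))
        for case in root.iter("testcase")
    }
    # Slowest first, ready to surface above the raw logs.
    return sorted(timings.items(), key=lambda item: item[1], reverse=True)

# Hypothetical path to a report artifact collected from a pipeline job.
for name, seconds in test_timings("rspec-report.xml")[:10]:
    print(f"{seconds:8.2f}s  {name}")
```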
A
Yeah, we'd do it on .com, potentially do it as a .com-only test, and dump the data into Snowplow; probably set some sort of target for it to help gauge "is this successful or not?" That's how I would... as I wrote up the issue: what are our success criteria?
E
Sorry, just a random question in my mind: if ever you focus on showing, like, slow tests... and then we have this other feature, "run modified tests first," so not necessarily... if you're, like, touching specific files on your MR, then...
A
The "give me just a download with the slowest test data" is going to be: as a manager, I want to see what's slow within the suite that my team owns, so that we can focus on improvements there for the overall efficiency of running our pipelines. So while individual changes locally you want to run really fast, we also want those other pipelines to run fast, because maybe they're charged with, and they have responsibility over, even budget and P&L for their CI systems; like, I need to pay attention to runner minutes. And so, any time we're spending...
A
You know, half an hour on a single test when it could be five: that's 25 runner minutes that I've got to pay for, and if we're running that pipeline every single day, that is a whole lot of runner minutes that I could be saving, and saving as budget for other exploratory things. From that perspective, adding efficiency to the team that way is more of what they're thinking about, versus "how do I just get that feedback loop faster?"
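
The runner-minute arithmetic above compounds quickly. A back-of-the-envelope sketch, assuming roughly one pipeline run per day:

```python
# One test: 30 minutes today, 5 minutes after a fix (the example above).
saved_per_pipeline = 30 - 5           # 25 runner minutes saved per run
pipelines_per_month = 30              # assumption: about one run per day
print(saved_per_pipeline * pipelines_per_month)  # 750 runner minutes a month
```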
C
Just had one thing I want to bring up so I don't forget. One thing I've been thinking of is that there are sometimes changes that can affect a whole swath of suite timing at once, and I think alerting on that would be a really neat feature. Like, if you committed a change that made all your tests take 10% longer, that'd be something that I'd kind of want to know about.
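
C's whole-suite alert could start as a coarse check on suite totals between two pipeline runs, rather than per-test tracking. A minimal sketch with hypothetical numbers and an arbitrary threshold:

```python
def suite_regressed(previous_total, current_total, threshold=0.10):
    """Return True when the whole suite slowed down past the threshold.

    previous_total, current_total: total suite durations in seconds.
    threshold: fractional increase to alert on (0.10 = 10% slower).
    """
    return current_total > previous_total * (1 + threshold)

# A commit that makes every test ~10% slower trips the alert.
print(suite_regressed(previous_total=600.0, current_total=665.0))  # True
```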
C
Yeah, I had something written in here from before the meeting, so I'll just... I think my answer for both of those questions is the same. It's going to be striking a balance to find how we can store just enough information to provide useful insights to the user, but not so much information that we're bogging down the application and we end up with another 10-million or 100-million-row table in every GitLab instance.
C
And conceptually, it seems like a tricky problem to me to make sure that what we show is actionable. You know, we're gonna put up all this time and data everywhere, and I think it'd be easy to build something that somebody would look at and say: "okay, so what?" So keeping it concise enough that it's obvious what somebody should do because of what we told them... I think it is tricky, yeah.
A
I think where that starts to get valuable, and how you solve that, is where you start to do that comparison over time. It doesn't matter if this test is slow and it's always been slow; but if now it's slower than it was a month ago, or even three MRs ago, then that's something you should go focus on and get back to the performance the app had before. And maybe, I mean, even if you just label it as runner minutes as opposed to minutes...
A
Maybe then people start to care, or something like that. Or you start to roll that data up into "here's the increase in runner minutes just from tests that have slowed down, and here's the increase from additional tests that have run." Like, all of a sudden that director and VP persona start to care about how much we are spending on this now. Yeah.
C
And then what do you use that file for later? Like, you're going to make your whole dashboard thing backed by a file instead of backed by a database? Then that gets a little bit weird. And then there's a whole thing about historical data: like, how much historical data is going to be useful, and are you going to track it for every single test? GitLab has, for argument's sake, let's say a million tests. Are we going to store information about all one million tests in perpetuity? And, like, how do we figure out...
C
Which of those million tests are important? And how do we figure out how to store a minimal set of data that brings value, without actually having to store a million rows, one for each test, and then n million rows, one for each test run? Like, how can we not do that and still have data that is valuable to our end user?
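
One way around the "n million rows, one for each test run" table C describes is to keep a bounded rollup per test: running aggregates plus only the last few durations. This is a sketch of that trade-off, not a claim about how GitLab stores anything:

```python
from collections import deque

class TestRollup:
    """Bounded per-test record: O(1) storage per test, not per run."""

    def __init__(self, keep_last=10):
        self.run_count = 0
        self.total_seconds = 0.0
        self.recent = deque(maxlen=keep_last)  # only the last N durations

    def record(self, duration):
        self.run_count += 1
        self.total_seconds += duration
        self.recent.append(duration)

    @property
    def average(self):
        return self.total_seconds / self.run_count if self.run_count else 0.0

rollup = TestRollup()
for d in [10.1, 10.3, 9.9, 31.0]:   # the last run slowed down sharply
    rollup.record(d)
print(f"avg {rollup.average:.1f}s, last {rollup.recent[-1]:.1f}s")
```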
A
Cool. To quickly answer your question before we wrap up: I think that we could just make that slow, because what we want to track is the button clicks. We can caveat it with a tooltip that this may take a minute, but I think that would still test our hypothesis that, yeah, this is valuable; somebody gets value out of this data if they're clicking the button to get it. And then we can iterate on that to make it performant, or store the data somewhere else so that it's viewable within the app.
C
Yeah, I think it'd still be interesting to me to see, from like a user-interview perspective: here's a wireframe of what we're thinking; do you think this would bring value to you, and why or why not? I'd love to get some feedback from customers, and maybe internal customers as well, on just what they're looking for out of that type of feature set. Sure.
A
Yeah, we're getting some interviews lined up, and I will show those back to the team as we complete them. All right, we have probably like 90 seconds left. I have a to-do to write the issue that's come out of this, and I'll share it back to the team. I'll also take a to-do of updating our direction page for code testing and coverage.
A
I think that we had a great discussion last week about where we could go with this, of showing you which tests are potentially flaky, and that's a great direction item that we should put out there and update the direction page with. So I'll make that update as well and share it back to the team, so that you know where we're going. I'll probably link to the first video in that and say we had a great discussion about the direction that we want to take identification of flaky tests, here's a reference to it, just so everybody's in the loop. And I'm always open to feedback on how this went.
A
I think this format went really well, so I'll propose that we maybe tackle this again next month: take a couple of weeks off, and then do back-to-back sessions, or try to extend the session so that we can get a full hour in, all in one day, and do the Think Big and Think Small together.