From YouTube: Deflaking Kubernetes Tests
Description
@liggitt walks us through finding and fixing test flakes
Notes: https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0
(taken from Kubernetes SIG Testing - 2020-08-25)
A
So my goal for today is to help you get into the mindset of deflaking and fixing flakes, then show you some ways to find things to fix, and give you some tools and techniques to make you more effective at fixing flakes and avoiding them in the first place. So I thought I'd start with the mindset in general.
A
A flake means that we have a problem, and the problem can be in one or more places, but a failing test is not a thing we want, and building that into our mentality as a community will help us a lot. So, thinking about where the problem is: it could be in the thing that's being tested. That's the ideal, right? If a test fails, we really want that to be a good signal that we have a thing we need to fix.
A
That's a problem in code that we ship and run. But sometimes there's a problem in the test itself: the test is making bad assumptions or is written in a fragile way. And then the third possibility is that the thing running the test has a problem, like infrastructure issues. As developers, the temptation is to assume that our code is perfect and our tests are perfect, and the problems are always in the infrastructure, and sadly, that has been more or less true at various times.
A
But we've worked really hard over the past months to improve CI infrastructure consistency, to make it a better signal when a test fails. So the goal is, in an ever-increasing way, that when a test fails, it means there's a problem in the thing being tested or in the test itself, and it really needs to be looked into. Then the next temptation is to assume that flakes are a test-only issue: if the test is timing out, well, we should just increase the timeout on the test.
A
So, some examples. If you're seeing a flake in a test, and you add a timeout or a poll or something and the flake goes away, make sure the thing that you are polling for or waiting for is actually supposed to be asynchronous.
A
That's not always the case: we've discovered times where, by making a test poll or wait, we were actually changing what the test was verifying. Another example is lengthening timeouts; I have a couple of examples here. This is an example of a test which was depending on garbage collection, and it was running in our e2e tests. In our e2e tests we run a lot of things in parallel, and new API types show up and disappear, and when new API types show up and disappear, that can actually put garbage collection into a backoff state.
A
Briefly, garbage collection says, "I need to clean up this thing, but this thing doesn't seem to exist anymore; I'm going to wait for 30 seconds and resync." So it's not unexpected that garbage collection would sometimes take 30 seconds longer than other times in an e2e test, and a test that depends on garbage collection, in our parallel e2e tests, should tolerate a delay like that. So in this case, adding a timeout was appropriate.
A
In another example, an operation that we expected to be very, very fast was actually very, very slow. By digging into the logs, we were seeing that a particular operation we expected to take on the order of a second was taking 15 to 20 seconds. If we had just blindly added a one-minute timeout toleration to that e2e test, we would have missed that we had a pretty severe performance bug. So the fix... let me see if I can find the fix for that.
A
I
think
this
was
where
he
fixed
it
yeah.
So
by
fixing
the
bug
he
reduced
the
run
time
of
this
method
from
sometimes
takes
15
seconds
to
consistently
takes
about
two
seconds.
A
So adding timeouts in the appropriate places is fine, but we want to make sure we understand the root cause before we do that. And then the last thing I'll call out: just make sure that the changes you make to the test are still testing what you expect.
So this was one we discovered recently, where there was a flaky test in the plugin watcher. By changing the test to initialize things in a different order, we could make the test run consistently, but by initializing things in a different order,
A
we actually weren't exercising reality. In reality, the kubelet starts, and it could start before or after or during plugin registration, so our test shouldn't care what order we start things in; it should be resilient to any order. Michelle did a really good job of noticing that the initial fix was actually breaking what we were supposed to be testing, and we ended up fixing a real bug, so this turned into a bug fix
A
instead of a flake fix. All right, so now that we have that mindset, how do you find flakes to fix? You would think this would be easy, as much as we complain about flakes, but sometimes it's actually kind of hard to find things that are actually important to fix.
A
Good places to start: issues that people have already reported. We have a label, kind/flake, so look for issues that have already been reported; that can help you see if someone's already working on it, or how much it's getting mentioned. If people are saying "yep, I saw this, I saw this, I saw this," that's one place to look, and you can filter these by SIG label to see flakes relevant to your SIG.
A
This actually looks better than it has in a long time, which is excellent. It used to be you would open these up and there would be like five or six really bad flakes in each job. That is less the case now, which is great, but that's one place to look.
A
Here's our GCE containerd testgrid. You can make this super small, and then you can see which tests have been failing. It looks like we have a variety; there's not one test that's repeatedly failing, except this one, which we already have an issue for. But this can be a good way to identify, once you zoom out and see several weeks' worth of runs,
A
if you see one test failing repeatedly; that could be a good place to start. You can also filter this down by anything in the test name, but SIG is a
A
good thing to filter on, so if you're looking for things specific to your SIG, that's a good way to find them. And then, lastly, there's the triage board. I love this; it just got rewritten in Go, it's much faster now, and it's one of the most powerful tools that I use. It lets you filter by SIG.
A
Now, these are the SIG titles associated with the tests, so that doesn't always show you exactly what you want, but it can be a good starting place. But then it also lets you filter on failure text, or the job name, or the test name, or any combination of those things, and then exclude specific things. So if I wanted to find something related to that kubectl
A
flake about "standard…", I can put in failure text, find how often it's happening, and then jump down and see all the specific jobs where it's failing, and then even links to specific instances.
A
So that's a good place to start. As an example, I went through some of the SIG Auth-attributed failures and found some really noisy tests that had been marked flaky and just kind of ignored for a long time, and actually cleaned those up. So the SIG Auth filter signal is much clearer now, and we'll be working on getting these cleaned up as well. All right, so what are good things to put in a flake report?
A
Let me pull this over and we can talk through some of the helpful things to include. Whether something is failing in multiple jobs: we see this especially in our end-to-end tests, where we have different variants of them (different container runtimes, different network setups). If something's failing in multiple jobs, that's helpful to know. If we're seeing something fail in only one variant, that's also helpful to know, because it might be something specific to that variant.
A
If there's more than one test that is having the same failure text, that's helpful; the triage board is great for figuring this out. Then: specific links to the testgrid queries, and the reason for the failure, so that when you search GitHub for some random text from a failure, you find it, plus links to the triage board and the specific failed examples. All of these are super helpful for helping someone who wants to dive into fixing this get context right away, and most of this is in the flake issue template.
A
But I thought maybe a specific example of what good things to put in there would be helpful. All right, great. So now we've found the test that's flaking that we want to fix. I thought I would go briefly through ways to reproduce flakes in each different kind of test.
A
One thing I really like about unit tests is that you can reproduce them locally. So I am in my kubernetes folder, and there is an open issue about a flake for this unit test. If I just run that test, sadly, it passes. Now, you notice it says that result was cached.
A
You have to watch out for the Go cache: it will cache test results. You can bypass that by passing an uncacheable argument, like how many times you want the test to run (-count=1). So that is no longer a cached result, but it still passed. So we still haven't reproduced the flake; it's passing for me.
A
So what's the next thing we can try to reproduce this flake? The race detector, and I have a link to the discussion of that. It will actually rewrite the code when it compiles it, to sort of put in delays or detect races.
A
My favorite tool this year is the stress tool, and I linked to that; I already have it installed. What it lets you do is build a binary for the test. So: go test, I want to build it with race detection enabled, and instead of telling it what test I want to run, I'm going to give it the -c argument, which tells it to compile the test, and that is going to create this test binary in my current directory.
A
So now I have a binary which I can run standalone, and I can give it this same argument. When you run it standalone, you have to give it the -test.run argument.
A
I can run it standalone and it still passes. But now, if I stress it, it's going to run a bunch of instances in parallel, over and over and over. So in about five seconds it ran almost 300 instances of that test and, as you can see, it's flaking immediately; the thing that we're observing reproduces.
A
That's super, super useful, and it's just sitting there running it 300 times, and so now we have a reproducer. Now we can start to dig in and try to figure out where the problem is. So that's what I love doing for unit tests. For integration tests, you can actually do really similar things. Most of our integration tests expect an etcd instance to be started, and you can do that pretty simply by just starting etcd in another tab.
A
And
then
can't
you
can
run
the
the
integration
tests
using
the
same
method,
either
directly
with
your
test
or
you
can
build
them
into
a
binary
and
stress
them.
The
question
was:
where
is
the
stress
binary?
I
linked
to
it
right
there
and
you
can
install
it
with
that.
Go,
get
command.
A
So for just simple flakes, where we're seeing an integration test failure, that same approach works pretty well. One interesting issue that I ran into was a deadlock or a timeout that would fail the entire package in an integration test. When that happens, it barfs out about ten trillion goroutines, and you have no idea which test even failed.
A
The question was why it gets compiled into a binary. It's actually compiling the tests for that package into a binary, and the reason is so that we can invoke it with stress. So it's not recompiling the tests every time; we build the tests once, and then we invoke them a bunch of times in parallel.
A
This is something that we've seen on a few of our packages. We don't see it super often, but I wanted to talk about how you could track it down. So this was the SIG Scheduling integration test that was deadlocking and timing out, and this is the way that we tracked it down.
A
When we just stressed the whole package like this, we could see it was completing 10 runs of the package at a time, then nine runs at a time, then eight, then seven, then five. I think the test runner defaults to eight or ten in parallel. Gradually, the test runners were getting hung up on some deadlock, and fewer and fewer were completing at a time, and then, finally, after two minutes, we got a timeout.
A
The way that I broke this down was to stress individual tests. So first I would just run one test in the package and see how long it takes normally. If it took, say, a tenth of a second normally, then I would stress that one test and give it, generously, a hundred times as long as it normally takes to complete. This way I didn't have to wait two minutes for the timeout to happen.
A
If it was going to time out, it would time out after 10 seconds. So I'm running one test, giving it a 10-second timeout, when normally it takes a tenth of a second or whatever (you have to figure that out per test), and then I stress it in parallel. Basically, I let each of these run for 20 or 30 seconds, and once one was happy for 20 or 30 seconds, I said:
A
"Well, that's probably not the culprit," and so I just went one by one by one. And this was the culprit: this one took a tenth of a second on average, but when I stressed it with a timeout of 10 seconds, after 10 seconds I was immediately getting timeout flakes. So that gave us a particular test to look at, and once we had a particular test to look at, our job is much smaller. We're just trying to break the problem down and find where the particular problem is. So, looking at that test:
A
It was timing out on this wait. So then we started adding debug logging to the places where we could return early, or see the thing that's going to unblock the wait, and once we did that, the issue was pretty quick to resolve. It turns out we weren't waiting for all of our caches to be synced before we were starting the test, and so the event we were waiting for in the test happened before we kicked off the wait.
All right. So now, everybody's favorite topic: e2e tests.
A
Okay, so remember we said the problem in a flake could be the thing that's being tested, right? Well, the bad news is, for an e2e test,
A
the thing being tested is pretty much everything. On the one hand, that's good, because we do actually want to make sure our system works when you run the whole thing; we've all seen the comic of "unit tests pass, integration tests fail," with the two drawers that each open individually but can't be opened at the same time. But the bad thing is that an e2e test can fail because of something completely unrelated.
A
So, an example we ran into today: a gluster volume subpath test. You look at the test title and you're like, "oh man, we must have a gluster problem, or a subpath problem, or a volume problem." But nope: the problem was that the namespace the test was using got deleted by something, and so the setup for the test failed because the namespace was being deleted. So just be aware that you can't just look at the title of an e2e test.
A
You
actually
have
to
dig
into
what
the
problem
is,
so
the
takeaway
is
prefer
unit
and
integration
tests.
If
those
are
sufficient
to
test
the
thing
you're
looking
at
and
then
yeah,
don't
assume
the
title
of
the
ede
test
identifies
the
problem.
So
the
steps
that
I
follow
for
deflating
an
ede
test
first
step
is
just
gathering
information
right.
So
this
link
is
gonna,
go
stale
because
we
reap
our
artifacts
from
ede
runs.
But
hopefully
you
get
the
idea.
There's
a
lot
of
things
we
capture
from
ede
runs.
A
This
artifacts
tab
is
your
friend
under
here
we
have
the
build
log
which
is
all
of
the
output
from
the
test
when
it
was
running,
but
then
under
artifacts
we
capture,
tons
and
tons
and
tons
of
logs.
So
for
the
control
plane.
We
capture
logs,
like
the
api
server
audit,
which
will
tell
you
in
detail
every
request
that
was
made
and
who
made
it
and
what
order
it
was
made
in
and
don't
forget
that
there's
archived
rotated
versions
of
them.
These
are
big.
A
But
if
you
need
to
know
what
order
things
happened
in
they're,
very
useful,
the
api
server
log,
the
controller
manager
and
the
scheduler,
those
are
the
main
logs
that
you
normally
care
about
on
the
control
plane
and
then
for
each
node
and
most
of
our
ed
tests
set
up
three
node
clusters.
We
capture
the
container
runtime
logs,
so
that's
either
a
docker
log
or
container
d.
A
We
capture
cube
proxy
and
we
capture
cubelet.
Those
are
the
main
things
you
might
care
about
for
for
most
ede
issues.
So
once
you
have
those
things
gathered,
the
next
step
is
to
filter
and
correlate
that
information.
So
if
your
first
step
is
to
kind
of
pick
likely
candidates
like
the
the
things
you
know
that
interact
around
this
issue
might
be,
the
test
is
doing
something.
So
you
care
about
the
test
log
and
the
api
server
log,
and
then
the
controller
manager
is
going
to
do
something
and
the
cube
is
going
to
do
something.
A
So
if
you
only
look
at
your
namespace
you'll
miss,
maybe
the
root
cause,
if
you're,
trying
to
if
you're,
trying
to
figure
out
how
a
particular
object
got
into
a
particular
state,
a
pod
got
into
a
particular
state
or
something
then
filtering
just
to
that
pod,
or
that
namespace
is
probably
reasonable.
A
So
you
can
filter
the
logs
for
the
relevant
things
I
like
to
keep
timestamps
at
the
beginning
of
the
logs
and
then
right
after
the
timestamp
put
something
that
identifies
the
component
and
then
merge
all
the
files
into
one
file
and
sort
by
time,
and
so
you
end
up
with
something
like
like
this.
Let's
see,
if
I
can
find
there,
we
go
so
this
was
when
we
were
trying
to
debug
a
garbage
collection
issue.
So
you
see
the
timestamp.
So
this
is
the
api
log
api
log
cube
controller
manager.
A
I
thought
I
put
cubelet
in
here.
Maybe
it
was
just
those
two,
oh
yeah
and
then
so
this
was
the
output
from
the
ede.
So
this
was
the
test
code,
that's
running
so
when
the
test
started.
Looking
for
the
thing
to
go
away,
maybe
I
put
maybe
that
was
it,
but
but
you
get
the
idea
you
take
the
logs
from
the
relevant
components.
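Mechanically, the merge-and-sort step can be as simple as tagging each line with its component right after the leading timestamp and sorting the combined lines, since RFC3339-style timestamps sort chronologically when sorted lexically. A small sketch with made-up log lines (the helper and its format are assumptions, not the tooling from the talk):

```go
package main

import (
	"fmt"
	"sort"
)

const tsLen = len("2020-08-25T17:04:05.000000Z") // 27 bytes of timestamp

// mergeLogs inserts the component name right after each line's leading
// timestamp, then sorts the combined stream; because the timestamps sort
// lexically in chronological order, the result is a single timeline.
func mergeLogs(logs map[string][]string) []string {
	var merged []string
	for component, lines := range logs {
		for _, line := range lines {
			ts, rest := line[:tsLen], line[tsLen+1:]
			merged = append(merged, fmt.Sprintf("%s %-23s %s", ts, component, rest))
		}
	}
	sort.Strings(merged)
	return merged
}

func main() {
	merged := mergeLogs(map[string][]string{
		"kube-apiserver": {"2020-08-25T17:04:06.000000Z DELETE /api/v1/namespaces/e2e-1/pods/p"},
		"kubelet":        {"2020-08-25T17:04:05.500000Z syncing pod e2e-1/p"},
	})
	for _, line := range merged {
		fmt.Println(line)
	}
}
```

The kubelet line prints first here even though it was added second, because the sort interleaves the components chronologically.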
A
So this is an example of a bug in runc, which was actually an old version of runc configured on this particular job. The clue to this was the line numbers in the message: in the error message, we see process_linux line 449 and then stuff happening, and if we look at the version of process_linux that we have in kubernetes, that line number doesn't match where that message comes from. So tracking that line number down actually pointed at the runc component,
A
just because of which log it showed up in. So matching up line numbers can be super helpful. And then, finally, if you are trying to figure out which branch is being taken, or what timing issue is happening, and we don't have log messages: feel free to add them. Adding debug logging to track down a flake is totally acceptable, so I'd like to do an example of that. All right, so now you've reproduced the flake, and you've found, maybe, sort of the area where it's happening.
A
What
are
the
types
of
things
you
can
do
to
to
look
for
and
to
sort
of
force
the
flake
to
happen?
So
the
first
thing
does
the
test
assume
that
something
that's
happening.
Asynchronously
is
happening.
Synchronously,
so
is
the
test
gonna
do
something
and
then
immediately
check
a
condition
when
really
the
thing
that's
going
to
make
that
condition
pass
might
not
run
right
away,
and
then
there
are
ways
to
stimulate
this.
A
So
if
the
test
is
kicking
off
a
go
routine
or
the
component,
that's
being
tested
is
kicking
off
a
go
routine,
put
a
sleep
at
the
top
of
the
go
routine
and
that
will
simulate
the
go
routine.
Taking
a
while
to
start-
and
you
I
mean
I
say
it
a
second,
but
it
could
be
five
seconds
or
whatever
try
that
and
see
if
that
makes
the
flake
happen
reproducibly.
A
These are all the types of places where we normally do asynchronous things. So if you have a watch event handler,
A
do things like that to see if it makes the flake reproducible. A normal pattern is to observe watch events and then queue up work to do, so try doing the same thing in the worker. If you put the sleep at the beginning of the worker, that simulates a worker that's bogged down and is going to react slowly to work; and if you put it at the end of the worker, that simulates a worker that does work but then gets distracted with other things before coming back to pick up new work.
A
This is sort of similar, but sometimes a test will do work and assume that it can complete its work, when there's a background process running that's going to do conflicting stuff. This was a good example, where the test was doing some setup:
A
It
was
creating
a
service
object
and
then
creating
an
endpoints
object
and
most
of
the
time
that
was
fine,
but
we
actually
have
a
controller
that
when
you
create
a
service
object,
we'll
create
endpoints
objects
for
you
in
the
background,
and
so
if
the
test
setup
lost
the
race,
the
test
would
get
an
already
exist
error
when
it
was
trying
to
set
up
its
endpoints
and
we
could
trigger.
B
A
A couple of rules of thumb. Tests that assume things are going to be fast: something that takes a second or less locally could take a few seconds in CI environments, for a couple of reasons. CI environments normally have more resource constraints than a powerful local dev machine, and often we run multiple tests in parallel. So maybe it happens really fast when you run just your test, but if you run 10 or 15 or 20 tests in parallel, things slow down a little bit.
A
So unless your test is specifically a performance or timing test, don't put super-tight tolerances on it. wait.ForeverTestTimeout is set to 30 seconds; that's a reasonable thing to use for, quote-unquote, "things that should not take very long." That's useful when we don't want a test to hang for 10 minutes before failing: we want it to actually fail quickly, but we don't care, for the purposes of the test, whether it takes one second or five seconds or ten seconds.
A
Another thing we see a lot of is assuming deterministic output. These are just your friendly reminders that map iteration in Go is non-deterministic, so if there is a list being compiled, or a set of steps being done, by iterating over a map, those are going to happen in non-deterministic order. So either sort and compare, or tolerate any order.
So there's a link to an example of that. This was a fun one that we found: sometimes we have things that will do random allocation.
A
We can also request a specific IP, and so we had a test that was creating one service randomly and then creating another service with a specific IP, and one out of every 256 runs, the randomly allocated IP would be the same as the static IP which we later requested, and we'd get a conflict. So just be aware if you're mixing those. In this case there was actually a bug that we could fix to improve things for everyone, so that kind of goes back to: where should we make the fix?
A
Is
it
a
test
only
issue,
or
is
it
a
a
a
real
bug
we
should
fix.
So
in
this
case
it
was
a
real
bug
we
could
fix
and
then
the
last
one
I
was
going
to
call
out
if
you're
using
a
fake
client-
and
you
have
like
an
informer
watcher
on
it,
it
can
do
a
read
list
in
a
rewatch
at
any
point,
and
so,
if
you're,
making
fake
client
calls
and
then
expecting
like
exact
actions
to
be
output,
those
can
get
interleaved
spuriously
with
the
informer.
A
In
the
background,
so
it's
better
to
look
for
the
specific
things
you
wanted
to
happen
instead
of
just
asserting
exact
matches.
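A sketch of that idea, using a plain slice of recorded action strings (hypothetical; real fake clients record richer action objects than strings):

```go
package main

import "fmt"

// containsAction reports whether want appears anywhere in got, tolerating
// extra interleaved actions such as an informer's own re-list/re-watch.
func containsAction(got []string, want string) bool {
	for _, action := range got {
		if action == want {
			return true
		}
	}
	return false
}

func main() {
	// Recorded actions: the informer's list/watch calls are interleaved
	// with the one update the code under test actually performed.
	got := []string{"list pods", "watch pods", "update pods/p1", "list pods"}

	// Fragile: asserting that got is exactly ["update pods/p1"] flakes
	// whenever a re-list lands. Robust: assert the action you care about.
	fmt.Println(containsAction(got, "update pods/p1")) // true
	fmt.Println(containsAction(got, "delete pods/p1")) // false
}
```

The same shape works for asserting ordering between two specific actions while ignoring everything in between.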
A
Once you know the tools and kind of get a workflow down for the gathering and the filtering and correlating, just that bit takes, you know, five minutes; it takes a while to get that workflow down. But once you have something correlated, it really varies: sometimes the issue will jump out at you immediately.
A
Sometimes,
like
you
saw
the
one
where
we
had
to
add
more
debug
logging,
because
we
didn't
have
enough
information
about
the
timing
stuff
there.
There
is
no
usual,
it
could
be
five
minutes.
It
could
be
a
month.
B
Right, I totally understand it's kind of a long tail for some of these things, but it's just sort of a gut check. I feel like I've seen you and some other folks go through an impressive number of these lately, so it does feel like there's a bit of a rhythm, at least as far as uncovering some of the lower-hanging fruit in unit and integration tests.
A
Yeah, the unit and integration tests are way, way easier and faster to figure out, just because of the rapid make-a-change, reproduce; make-a-change, reproduce cycle. So those you can actually normally resolve, or at least root-cause, within a couple of hours. Sometimes, once you find the root cause, the root cause is "this test is fundamentally wrong and we need to rewrite it," and that can be tricky, but root-causing unit and integration issues is quicker.
B
A
Yeah, I mean, the more realistic the setup of the test is, the better. So if you can use the same constructors to set up the controller or the component that are really being used when we run the thing in production, that's nice. Sometimes we'll see issues where the setup code was faulty, and we were sort of hacking together fake clients and artificially running goroutines and waiting for cache syncs and informers in totally different orders than happen
A
When
you
run
the
component
for
real
and
so
the
more
you
can
use
the
normal
constructors,
the
better
thinking
about
like
behavioral
testing,
like
we're
gonna
we're
gonna
trigger
some
input,
either
by
calling
a
go
function
directly
or
by
creating
some
api
object
and
waiting
for
the
component
to
observe
that
we're
going
to
trigger
some
input,
and
then
we
have
some
expectation
of
behavior,
the
more
you
can
limit
the
test
just
to
the
inputs
and
the
expected
behavior
the
better.
A
Instead
of
sort
of
this
extremely
fragile,
like
I
expect
this
call
to
be
made,
then
this
call
would
be
made.
This
call
will
be
made.
It
must
happen
in
this
order.
It
must
happen
with
this
timing
and
like
for
a
functional
test.
That's
probably
not
what
we
care
about
like
we
want
an
invariant
of.
I
create
a
thing,
and
then
this
happens
to
that
thing,
and
so
the
more
you
can
scope
the
test
to
just
those
things
better.
B
Okay, since you mentioned integration tests, just one FYI: part of the reason that the flake query you linked looked way better than it has in a while is because integration tests don't show up in it right now. It has to do with some of the prow jobs using one mechanism
B
for the repos they clone, and that means that some of the data that's consumed by our flake queries doesn't account for those. So for integration tests especially, the triage tool is the better place to go looking for which integration tests are failing most often right now, which may be a hint or a clue into the flakiest tests that we should look at addressing.
A
I'll point out, too, that the triage board and testgrid, for some jobs, do not differentiate which branch. So if you're looking at failures that you found via the triage board or via testgrid, especially for pull request flakes, always make sure that the flake was actually running on a pull request against the master branch.