From YouTube: Kubernetes SIG Testing - 2020-08-25
Description
A: Hi everybody, today is Tuesday, August 25th. You are at the Kubernetes SIG Testing bi-weekly meeting. I am your host, Aaron Crickenberger, and this meeting is being publicly recorded and will be posted to YouTube later, so you can all watch yourselves adhere to the Kubernetes code of conduct by being your very best selves and not being jerks. On the agenda for today's meeting, we have Jordan Liggitt, who's going to give us a deflaking demonstration of how he and others have deflaked many e2e and integration tests over the past couple of weeks, and if we have time after that, I thought I would update everybody on the progress we've made on implementing the Kubernetes CI policy improvements we discussed a couple of meetings back. So with that, I'm handing off to Jordan. All right.
C: All right. I threw a link to these notes in the agenda as well. I will probably clean them up afterwards and post them somewhere less ephemeral than a gist, but if you want to follow along, or if my screen is not working well, that link is there for you.
So my goal for today is to help you get into the mindset of deflaking and fixing flakes, then show you some ways to find things to fix, and give you some tools and techniques to make you more effective at fixing flakes and avoiding them in the first place. So I thought I'd start with the mindset in general.
A flake means that we have a problem, and the problem can be in one or more places. A failing test is not a thing we want, and building that into our mentality as a community will help us a lot. So, thinking about where the problem is: it could be in the thing that's being tested. That's the ideal, right? If a test fails, we really want that to be a good signal that we have a thing we need to fix in code that we ship and run. But sometimes there's a problem in the test itself: the test is making bad assumptions or is written in a fragile way. And then the third possibility is that the thing running the test has a problem, like infrastructure issues. As developers, the temptation is to assume that our code is perfect and our tests are perfect and the problems are always in the infrastructure, and sadly, that has been more or less true at various times.
But we've worked really hard over the past month to improve CI infrastructure consistency, to make it a better signal when a test fails. So the goal is, in an ever-increasing way, that when a test fails, that means there's a problem in the thing being tested or in the test itself, and it really needs to be looked into. Then the next temptation is to assume that flakes are a test-only issue: if the test is timing out, well, we should just increase the timeout on the test.
So, some examples. If you're seeing a flake in a test, and if you add a timeout or add a poll or something and the flake goes away, make sure the thing that you are polling for or waiting for is supposed to be an asynchronous thing.

That's not always the case; we've discovered times where, by adding polling or waiting to a test, we were actually changing what the test was verifying. Another example is lengthening timeouts, and I have a couple of examples here. This is an example of a test which was depending on garbage collection, and it was running in our e2e tests. In our e2e tests, when the garbage collector goes to clean up a thing and the thing doesn't seem to exist anymore, it will wait for 30 seconds and resync, so it's not unexpected that garbage collection would sometimes take 30 seconds longer than other times in an e2e test. A test that's depending on garbage collection, in our parallel e2e tests, should tolerate a delay like that. So in this case, adding a timeout was appropriate.
In another example, an operation that we expected to be very, very fast was actually very, very slow. By digging into the logs, we were seeing that a particular operation that we expected to take on the order of a second was taking 15 to 20 seconds. If we had just blindly added a one-minute timeout toleration to that e2e test, we would have missed that we had a pretty severe performance bug. And so the fix... let me see if I can find the fix for that.
I think this was where he fixed it, yeah. So by fixing the bug, he reduced the run time of this method from sometimes taking 15 seconds to consistently taking about two seconds.
So adding timeouts in the appropriate places is fine, but we want to make sure we understand the root cause before we do that. And then the last thing I'll call out is: make sure that the changes you make to the test are still testing what you expect. This was one we discovered recently where there was a flaky test in the plugin watcher, and by changing the test to initialize things in a different order, we could make the test run consistently. But by initializing things in a different order, we actually weren't exercising reality. In reality, the kubelet starts, and it could start before, after, or during the plugins, so our test shouldn't care what order we start things in; it should be resilient to any order. Michelle did a really good job of noticing that the initial fix was actually breaking what we were supposed to be testing, and we ended up fixing a real bug, so this turned into a bug fix.
Good places to start are issues that people have already reported. We have a label, kind/flake, so looking for issues that have already been reported can help you see if someone's already working on something, or how much it is getting mentioned, with people saying "yep, I saw this, I saw this, I saw this." That's one place to look, and you can filter these by SIG label to see flakes relevant to your SIG.
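One way to pull up that view is a label-filtered issue search; this is my own sketch, not something shown in the meeting, and the gh CLI invocation and the sig/auth label are just examples:

    # Open kind/flake issues for one SIG (sig/auth is only an example label)
    gh issue list --repo kubernetes/kubernetes --label kind/flake --label sig/auth --state open
    # Equivalent web search:
    # https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+is%3Aopen+label%3Akind%2Fflake+label%3Asig%2Fauth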
This actually looks better than it has in a long time, which is excellent. It used to be that you would open these up and there would be five or six really bad flakes in each job. That is less the case now, which is excellent, but that's one place to look. Of course, we have all encountered flakes in our own pull requests, so that is also a place to start.

Here's our GCE containerd testgrid. You can make this super small, and then you can see which tests have been failing. It looks like we have a variety; there's not one test that's repeatedly failing, except this one, which we already have an issue for. But this can be a good way to identify things once you zoom out and see several weeks' worth of runs.

If you see one test failing repeatedly, that could be a good place to start. You can also filter this down by anything in the test name, but SIG is a good thing to filter on, so if you're looking for things specific to your SIG, that's a good way to find them. And then, lastly, there is the triage board. I love this; it just got rewritten in Go, it's much faster now, and it is one of the most powerful tools that I use. It lets you filter by SIG.
Now, these are the SIG titles associated with the tests, so that doesn't always show you exactly what you want, but it can be a good starting place. It also lets you filter on failure text, or the job name, or the test name, or any combination of those things, and then exclude specific things. So if I wanted to find something related to kubectl, for example...

So that is a good place to start. As an example, I went through some of the sig-auth-attributed failures and found some really noisy tests that had been marked flaky and just kind of ignored for a long time, and actually cleaned those up. So the sig-auth filter signal is much clearer now, and we'll be working on getting these cleaned up as well. All right, so what are good things to put in a flake report?
Let me pull this over and we can talk through some of the helpful things to put in. If something is failing in multiple jobs, that's helpful to know. We see this especially in our end-to-end tests, where we have different variants of them: different container runtimes, different network setups. If we're seeing something fail in only one variant, that's also helpful to know, because it might be something specific to that variant.

If there's more than one test that is having the same failure text, that's helpful; the triage board is great for figuring this out. Also helpful: specific links to testgrid, the reason for the failure (so that when you search GitHub for some random text from a failure, you find it), links to the triage board, and then specific failed examples. All of these are super helpful for helping someone who wants to dive into fixing this get context right away, and most of this is in the flake issue template.

But I thought maybe a specific example of what good things to put in there would be helpful. All right, great. So now we've found a test that's flaking that we want to fix. I thought I would go briefly through ways to reproduce flakes in each different kind of test.
So one thing I really like about unit tests is that you can reproduce them locally. I am in my kubernetes folder, and there is an open issue about a flake for this unit test. If I just run that test, sadly, it passes. Now, you'll notice it says that result was cached.

You have to watch out for the Go cache: it will cache test results. You can bypass that by giving it some uncacheable argument, like how many times you want it to run. So that is no longer a cached result, but it still passed. That means we've definitely got a flake; it's passing for me here.
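A sketch of that cache-busting step; the package path and test name here are placeholders, not the actual test from the demo:

    go test -run TestBar ./pkg/foo/             # a repeat run may just report "(cached)"
    go test -run TestBar -count=1 ./pkg/foo/    # -count is not cacheable, so the test really re-runs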
So what's the next thing we can try to reproduce this flake? The race detector, and I have a link to the discussion of that, will actually rewrite the code when it compiles it, to put in delays and detect races.
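Turning that on is just a flag on go test; a minimal sketch with the same placeholder names:

    go test -race -count=5 -run TestBar ./pkg/foo/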
My favorite tool this year is the stress tool, and I linked to that; I already have it installed. What that lets you do is build a binary for the test: so go test, I want to build it with race detection enabled, and instead of telling it what test I want to run, I'm going to give it the -c argument, which tells it to compile the test, and that is going to create this binary in my current directory.

So now I have a binary which I can run standalone, and I can give it this same argument. When you run it standalone, you have to give it the -test.run argument.
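Reconstructed roughly from the demo; the package path, binary name, and test name are placeholders:

    go test -race -c ./pkg/foo/    # -c compiles the package's tests into ./foo.test instead of running them
    ./foo.test -test.run TestBar   # run the compiled test binary standalone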
I can run it standalone and it still passes. But now, if I stress it, this is going to run a bunch of instances in parallel, over and over and over. In about five seconds it ran almost 300 instances of that test and, as you can see, it's flaking immediately; the thing that we're observing reproduces.
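The stress invocation looks roughly like this (same placeholder names):

    stress ./foo.test -test.run TestBar
    # runs many copies in parallel, periodically reports runs-so-far and failure counts,
    # and saves the output of each failing run to a temp file for inspection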
That's super useful, and it's just sitting there running it 300 times at a go, so now we have a reproducer. Now we can start to dig in and try to figure out where the problem is. That's what I love doing for unit tests. For integration tests, you can actually do really similar things. Most of our integration tests expect an etcd instance to be started, and you can do that pretty simply by just starting etcd in another tab.

And then you can run the integration tests using the same method, either directly with go test, or you can build them into a binary and stress them.
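A minimal sketch of that setup, assuming etcd is on your PATH (hack/install-etcd.sh in the kubernetes repo can provide one) and that the tests find it on etcd's default client port; the package path is a placeholder:

    etcd                                          # tab 1: a local etcd for the tests to talk to
    go test -count=1 ./test/integration/foo/...   # tab 2: run the integration tests directly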
The question was: where is the stress binary? I linked to it right there, and you can install it with that go get command.
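That install command, with the equivalent for newer Go toolchains commented:

    go get golang.org/x/tools/cmd/stress
    # on Go 1.17 and later: go install golang.org/x/tools/cmd/stress@latest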
So for simple flakes, where we're seeing an integration test failure, that same approach works pretty well. One interesting issue that I ran into was a deadlock or a timeout that would fail the entire package of an integration test, and when that happens it barfs out about ten trillion goroutines and you have no idea which test even failed.

The question was why it could be compiled into a binary. It's actually compiling the tests for that package into a binary, and the reason is so that we can invoke that with stress. It's not recompiling the tests every time; we build the tests once and then we invoke them a bunch of times in parallel.

So this is something that we've seen on a few of our packages. We don't see it super often, but I wanted to talk about how you could track it down. This was the sig-scheduling integration test that was deadlocking and timing out, and here is the way that we tracked it down.

When we just stressed the whole package like this, we could see it was completing ten runs of the package at a time, then nine runs at a time, then eight, then seven, then five. (I think the test runner defaults to eight or ten in parallel.) Gradually the test runners were getting hung up on some deadlock, fewer and fewer were completing at a time, and then, finally, after two minutes, we got a timeout.
So the way that I broke this down was to stress individual tests. First I would just run one test in the package and see how long that test takes normally. If it took, say, a tenth of a second normally, then I would stress that one test and give it a generous timeout, like a hundred times as long as it normally takes to complete, and this way I didn't have to wait two minutes for the timeout to happen.
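A sketch of that narrowing-down loop; the test name, package path, and the 10-second budget are illustrative, not the actual flake from the demo:

    go test -c ./test/integration/scheduler/
    ./scheduler.test -test.run TestFoo -test.count=1        # time one normal run first
    stress ./scheduler.test -test.run TestFoo -test.timeout 10s
    # -test.timeout makes a hung run panic after 10s and dump goroutine stacks,
    # instead of waiting out the full package-level timeout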
When I stressed it with a timeout of ten seconds, after ten seconds I was immediately getting timeout failures, and that gave us a particular test to look at. Once we had a particular test to look at, our job was much smaller; we're just trying to break the problem down and find where the particular problem is. Looking at that test, it was timing out on this wait. So then we started adding debug logging to the places where we could return early or see the thing that's going to unblock this, and once we did that, the issue was pretty quick to resolve. It turns out we weren't waiting for all of our caches to be synced before we were starting the test, and so the event we were waiting for in the test happened before we kicked off the wait. All right, so now, everybody's favorite topic: e2e tests.
Okay, so remember we said the problem in a flake could be the thing that's being tested, right? Well, the bad news is, for an e2e test, the thing being tested is pretty much everything. On the one hand, that's good, because we do actually want to make sure our system works when you run the whole thing; we've all seen the comic of "unit tests pass, integration tests fail" with the two drawers that each open individually but can't be opened at the same time.

But the bad thing is that an e2e test can fail because of something completely unrelated. An example we ran into today: a GlusterFS volume subpath test. You look at the test title and you think, oh man, we must have a Gluster problem, or a subpath problem, or a volume problem. But nope.

The problem was that the namespace the test was using got deleted by something, and so the setup for the test failed because the namespace was being deleted. So just be aware that you can't just look at the title of an e2e test; you actually have to dig into what the problem is. The takeaway is: prefer unit and integration tests if those are sufficient to test the thing
you're looking at, and then, yeah, don't assume the title of the e2e test identifies the problem. So, the steps that I follow for deflaking an e2e test. The first step is just gathering information. This link is going to go stale, because we reap our artifacts from e2e runs, but hopefully you get the idea: there's a lot of stuff we capture from e2e runs.

If you are used to seeing a screen like this with some random failure, and you click on it and think, "well, now what, this is clearly not enough information to figure out the problem," this Artifacts tab is your friend. Under here we have the build log, which is all of the output from the test when it was running, but then under artifacts we capture tons and tons of logs. For the control plane,
we capture logs like the API server audit log, which will tell you in detail every request that was made, who made it, and what order it was made in (and don't forget that there are archived, rotated versions of them; these are big, but if you need to know what order things happened in, they're very useful), the API server log, the controller manager, and the scheduler. Those are the main logs that you normally care about on the control plane. And then for each node (most of our e2e tests set up three-node clusters),

we capture the container runtime log, so that's either the docker log or containerd, we capture kube-proxy, and we capture the kubelet. Those are the main things you might care about for most e2e issues.
Once you have those things gathered, the next step is to filter and correlate that information. Your first step is to pick likely candidates, the things you know interact around this issue: the test is doing something, so you care about the test log and the API server log, and then the controller manager is going to do something and the kubelet is going to do something.

If you only look at your namespace, you may miss the root cause; but if you're trying to figure out how a particular object got into a particular state, a pod got into a particular state or something, then filtering just to that pod or that namespace is probably reasonable.
So you can filter the logs for the relevant things. I like to keep timestamps at the beginning of the lines, and right after the timestamp put something that identifies the component, then merge all the files into one file and sort by time. You end up with something like this; let's see if I can find it... there we go. This was when we were trying to debug a garbage collection issue, so you see the timestamp, then this is the API server log, API server log, kube-controller-manager.
I thought I put the kubelet in here; maybe it was just those two. Oh yeah, and then this was the output from the e2e test, the test code that's running, so this is when the test started looking for the thing to go away. But you get the idea: you take the logs from the relevant components.
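A rough sketch of that merge-and-sort step, assuming klog-formatted lines ("I0825 17:03:12.123456 ...") and hypothetical local file names:

    for f in kube-apiserver kube-controller-manager kubelet; do
      # prefix each line with "MMDD HH:MM:SS.ffffff [component]" so a plain sort is chronological
      awk -v c="$f" '{print substr($1,2), $2, "[" c "]", $0}' "$f.log"
    done | sort -k1,2 > merged.log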
So this is an example of a bug in runc, which was actually an old version of runc configured on this particular job. The clue to this was the line numbers in the message: in the error message we see process_linux.go line 449 and then stuff happening, and if we look at the version of process_linux.go that we have in kubernetes, that line number doesn't match where that message comes from. Tracking that line number down actually pointed at the runc component.

So don't make any assumptions about where the problem originated just because of which log it showed up in; matching up line numbers can be super helpful. And then, finally, if you are trying to figure out which branch is being taken or what timing issue is happening, and we don't have log messages, feel free to add them. Adding debug logging to track down a flake is totally acceptable.
So, the first thing: does the test assume that something that's happening asynchronously is happening synchronously? Is the test going to do something and then immediately check a condition, when really the thing that's going to make that condition pass might not run right away? There are ways to simulate this: if the test is kicking off a goroutine, or the component that's being tested is kicking off a goroutine, put a sleep at the top of the goroutine, and that will simulate the goroutine taking a while to start. I say a second, but it could be five seconds or whatever; try that and see if it makes the flake happen reproducibly. These are all the types of places where we normally do asynchronous things, so if you have a watch event handler,

do things like that to see if it makes the flake reproducible. A normal pattern is to observe watch events and then queue up work to do, so try doing the same thing in the worker: if you put the sleep at the beginning of the worker, that simulates a worker that's bogged down and is going to react slowly to work, and if you put it at the end of the worker, that simulates a worker that does its work but then gets distracted with other things before coming back to pick up new work.
This is sort of similar, but sometimes a test will do work and assume that it can complete its work, when there's a background process running that's going to do conflicting stuff. This was a good example, where the test was doing some setup: it was creating a Service object and then creating an Endpoints object, and most of the time that was fine. But we actually have a controller that, when you create a Service object, will create Endpoints objects for you in the background, and so if the test setup lost the race, the test would get an "already exists" error when it was trying to set up its Endpoints, and we could trigger that.
A couple of rules of thumb. Tests that assume things are going to be fast: something that takes a second or less locally could take a few seconds in CI environments, for a couple of reasons. CI environments normally have more resource constraints than a local, powerful dev machine, and often we run multiple tests in parallel. So maybe it happens really fast when you run just your test, but if you run 10 or 15 or 20 tests in parallel, things slow down a little bit.

So unless your test is specifically a performance or timing test, don't put super-tight tolerances in. wait.ForeverTestTimeout is set to 30 seconds; that's a reasonable thing to use for, quote-unquote, "things that should not take very long." It's useful when we don't want the test to hang for ten minutes before failing; we want it to actually fail quickly, but we don't care, for the purposes of the test, whether it takes one second or five seconds or ten seconds.
Another thing we see a lot of is assuming deterministic output. These are just your friendly reminders that map iteration in Go is non-deterministic, so if there is a list being compiled, or a set of steps being done, by iterating over a map, those are going to happen in non-deterministic order: either sort and compare, or tolerate any order. There's a link to an example of that. This was a fun one that we found: sometimes we have things that will do random allocation.

So just be aware if you're mixing those. In this case there was actually a bug that we could fix to improve things for everyone, so that kind of goes back to: where should we make the fix? Is it a test-only issue, or is it a real bug we should fix?
In this case it was a real bug we could fix. And then the last one I was going to call out: if you're using a fake client and you have an informer watching it, it can do a re-list and a re-watch at any point, and so if you're making fake client calls and then expecting an exact set of actions as output, those can get interleaved spuriously with the informer in the background. It's better to look for the specific things you wanted to happen instead of asserting exact matches.
Once you know the tools and get a workflow to where you can do the gathering and the filtering and correlating, just that bit, once you get it down, takes, you know, five minutes. It takes a while to get that workflow down, but once you have something correlated, it really varies. Sometimes the issue will jump out at you immediately.

Sometimes, like you saw with the one where we had to add more debug logging because we didn't have enough information about the timing, it takes longer. There is no "usual": it could be five minutes, it could be a month.
A: Right, I totally understand it's kind of a long tail for some of these things; it's just a gut check. I feel like I've seen you and some other folks go through an impressive number of these lately, so it does feel like there's a bit of a rhythm, at least as far as uncovering some of the lower-hanging fruit in unit and integration tests.
C: Yeah, the unit and integration tests are way, way easier and faster to figure out, just because of the rapid "make a change, reproduce; make a change, reproduce" cycle. Those you can normally resolve, or at least root-cause, within a couple of hours. Sometimes, once you find the root cause, the root cause is that this test is fundamentally wrong and we need to rewrite it, and that can be tricky, but root-causing unit and integration issues is the quick part.
A: Okay. So if folks are looking at a test and they're trying to figure out what is actually being tested, whether the test is right or the test is incorrect, what advice would you give folks for trying to find the appropriate expertise there?
C: Yeah, I mean, the more realistic the setup of the test is, the better. If you can use the same constructors to set up the controller or the component that are really being used when we run the thing in production, that's nice. Sometimes we'll see issues where the setup code was faulty and we were hacking together fake clients and artificially running goroutines and waiting for cache syncs and informers in totally different orders than happen

when you run the component for real, so the more you can use the normal constructors, the better. Think about behavioral testing: we're going to trigger some input, either by calling a Go function directly or by creating some API object and waiting for the component to observe it, and then we have some expectation of behavior. The more you can limit the test to just the inputs and the expected behavior, the better,

instead of this extremely fragile "I expect this call to be made, then this call will be made, then this call will be made; it must happen in this order; it must happen with this timing." For a functional test, that's probably not what we care about. We want an invariant of: I create a thing, and then this happens to that thing. So the more you can scope the test to just those things, the better.
A: Okay. Since you mentioned integration tests, just one FYI: part of the reason that the flake query that you linked looked way better than it has in a while is because integration tests don't show up on that right now. It has to do with which mechanism some of the prow jobs use.
C: I'll point out, too, that the triage board and testgrid, for some jobs, do not differentiate which branch, so if you're looking at failures that you found via the triage board or via testgrid, especially for pull request flakes, always make sure that the flake was actually running on a pull request against the master branch.
A: So I will share my screen for this. Maybe not all of this is cleanly documented in linked places; I will work on following up and landing this stuff, but I've been using an umbrella issue called "kubernetes CI policy" to find my way to all of the appropriate work streams. Lori Apple has also put together a CI policy improvement project board, which is another way of keeping track of what work is in progress and what we are monitoring.

The thing we decided was most important first was to make sure that all of the critical jobs, all the release-blocking and merge-blocking jobs, run with guaranteed pod quality of service, which meant they had to declare resource requests and matching resource limits for CPU and memory. To the best of my knowledge, we've basically gotten that done, and there's a test in place that will prevent you from adding any new release-blocking jobs unless they have those limits set.
I created a tool to generate a CSV which can then be imported into Google Sheets, and I did some conditional formatting and filtering here. What you see is that everything on here is in a dashboard that has the word "blocking" on it, and I've specifically, manually excluded the release-informing dashboard and a few other things. This should cover all the critical jobs.

The green jobs are those that have it (I have a column specifically for "QoS guaranteed: true"), and pretty much everything is either green or yellow.

If I get rid of that dashboard filter (let's see if I can remember how to do this), this is all 1700-and-something jobs that are defined to run via prow, and you can see a lot of them are red because they don't declare any resource limits.
This is part of why step one was to define the resource limits, but step two was to take all of those green or yellow jobs and move them to their own cluster, because right now, if they stay in the google.com cluster, they're still competing against all of these jobs that aren't declaring their resources, but could be, if we declared resources for everything.

One of the ways we were thinking about declaring victory for this: we have discussed how jobs failed to schedule during the bad times around July 9th, when we had a lot of jobs end up in error state. This usually seemed to happen because jobs were unable to successfully clone the repo, or they were unable to schedule on a cluster due to lack of resources, so we figured that if we got resources set up correctly, then we would have fewer jobs end up in error state. And you can see that over time (I'm looking up here at the orange),
these are all the presubmit jobs that are scheduled, and both lines go down at roughly the same rate. But around the time that we decided we were doing this and started implementing it and declaring resources, the number of jobs that have ended up in error state has not gone directly to zero, but it's significantly flatter and less correlated to PR traffic than before. So I feel like you can use this to say: yes, we have made a difference.

This step has had an impact, and I was going to go ahead and declare this closed, unless anybody has any objection.
B: There's still a few items open. I was also out last week, because many of us were at KubeCon, or moving in my case. I think Rob, Mike, Rob, and Dan might have the best sense of what's going on with the remaining items related to this, if any one of them wants to speak up.
D: I can go through those today, yeah, if you want to.
A: So the next item was about migrating all release-blocking jobs; these are all the CI jobs that run on submit or periodically. The status on that, as of 17 days ago (I don't think this has changed): these are the jobs that have yet to migrate over to the cluster.

Some of them are related to building kubernetes, and they involve pushing it to a Google Cloud bucket that only google.com clusters have access to. There's an open issue where I'm working with the release engineering team to try and migrate to a different bucket that the community-owned infrastructure can write to, but there's no progress being made on that until 1.19 is out the door. There's also the bazel test job, which probably can be moved over, but I didn't want to change the behavior of how bazel runs its unit tests while Jordan was deflaking.
For slightly more context, many of the bazel jobs that run against kubernetes today use what's called remote build execution. It's an alpha feature that Google Cloud offered for a little while, but I'm not sure it is publicly available anymore. Bazel would do some gathering of resources and whatnot locally, but then ship them all off to some remote execution environment somewhere, and that's where all of the running of tests or building of kubernetes would actually happen.

So we can move on that once we feel like Jordan's fought off all the unit test flakes that he can. That is, as far as I know, the story with these merge-blocking jobs. There are a whole bunch of issues related to migrating merge-blocking jobs; just click through to the one for verify, for an example. Most of these have been taken by people, as far as I know. I tried to describe what to do,
how to do it, and then different dashboards or tools to look at, and questions to ask yourself as you look at those dashboards to see if the job has become healthier or not. I haven't swept through to see how people feel about all of these, but in aggregate I'm starting to have concerns that we are bumping into quota issues.

Pictured here is a graph of the number of VMs in our community-owned build cluster, which autoscales as we have more jobs declaring their resource requests. I can see that traffic has gone down quite a bit, but even at this reduced level of traffic we were peaking at around 60 VMs, and we were starting to bump into related quotas.
So I've been trying to bump up those quotas, without much success. In the interim, there are a number of suggestions I put here on ways we could work around this. I would really love to just do the simple thing, raise the quotas, and forget about this, but in the event that we can't, I have concerns that when we open the floodgates and allow PR traffic for v1.20 in kubernetes, we might start to bump into these quota issues again. So I feel like we need to take mitigating steps.

I kind of didn't want to do that just right now, while we do have the capacity, so that we can get more data about how these jobs are behaving, because if we move them back, we lose visibility into what the jobs are doing and why they're failing. The cluster that lives in google.com was stood up many years ago and uses different monitoring and different metrics, so the dashboards that some of you have been using to keep track of how the jobs are doing...
You don't have equivalents to that, even as Googlers, so it is more difficult for us to troubleshoot what's going on. As an example, one of the things we tried to do to maybe mitigate these quota issues was: instead of having more, smaller VMs, what if we had fewer, larger VMs? Larger VMs get more IO, so maybe that would mitigate the effect of stacking more jobs onto larger VMs.

We migrated from the node pool with more, smaller nodes to one with fewer, larger nodes over the weekend, and it became pretty apparent pretty quickly that the verify and integration jobs were not doing well.

You can see that we started having a lot of throttled read operations when we were running on those fewer, larger nodes, so we've migrated back, hooray. I'm still keeping an eye on our node pool size, because we can't get too far over 60 VMs at our current limit, and I'm trying to correlate what's causing these things to spike up.
We've got more metrics available for things like the pods and jobs that are running on the various nodes. This graph, for example, shows that a lot of presubmits (sorry, this should be highlighting where I'm moving my cursor) started spiking around this time, which correlates pretty closely to the number of VMs spiking as well. So why were there so many presubmits running? Down here I can see the presubmits aggregated by which pull request was triggering them.

Here it looks like pull request 94115 probably had a bunch of commits pushed to it, or maybe somebody was spamming retest or test-all a bunch, and so in a one-hour interval about 34 presubmits were all scheduled.
So I just feel like, if I look at the graph that showed presubmit traffic going down over time (I lost the graph), even at this reduced level of traffic we are already maxing out our VMs and hitting quota, so we may need to consider mitigating that. But otherwise, I'm hoping these graphs have been useful to the CI signal team and everybody who has joined the prow viewers group to better understand what's going on. And I will stop talking.
A: So, like I said, I tried to describe a bunch of potential workarounds in the linked comment, which is linked off of "what's stopping merge-blocking jobs from moving forward," and my bandwidth has been limited to move on these or to further spell them out so that other people might move on them.

The longer-term solution would be to get our quota raised, which I am attempting to escalate both internally and externally, and if anybody here has suggestions or ideas on how to do that, I would appreciate the help. An even longer-term solution would be to re-architect how we do our build cluster.
This is something we haven't had to deal with yet, so there would be new engineering effort that we haven't had to do historically. Thus far, I've been trying to keep us as apples-to-apples as possible as we migrate jobs from inside of Google to outside of Google.

A: Yeah, I think continuing to figure out how we can identify and measure the pain that we are experiencing would be most helpful. I still feel like we're doing an awful lot of "maybe it's this, or maybe it's that" back and forth, which can make it scary to change things. I would like to figure out how we can change things with confidence, rather than doing big, slow, very cautious moves, but that's just a stylistic choice.

Okay, thank you, everybody, for your time, and with that I will stop recording and see you all in two weeks. Happy Tuesday.