From YouTube: Kubernetes SIG Testing - 2021-06-29

B
I'll just introduce myself, since there's nothing else on the agenda yet. My name is Trent; I'm from Salesforce, where I work on the CI/CD team. I'm just joining to listen in and see where I can jump in and help. I've got a lot of background in ops and Golang, and it looks like you all work on a lot of code things. So any suggestions on where I can jump in would be great, but I assume there's a lot on the board here.
A
Yeah, welcome, Trent. There's definitely a lot of need. One of the biggest things we've been dealing with is flakes — just tons of intermittent flaky tests — and a large part of that comes from the nature of a distributed testing environment, so it's a lot of timeouts and such. We're coming up on code freeze right now, but hopefully that will give way to drilling down into more reliability. Oh, there's Ben, hooray.

D
I guess I know Aaron was probably not going to make it, impacted by the Pacific Northwest heat wave this year; I would imagine the same is also likely for Steve. So welcome, everyone. Sorry I'm super late. This is the Kubernetes SIG Testing meeting for June 29th, 2021.
D
This meeting is under the CNCF code of conduct — essentially, be excellent to each other. This meeting is recorded and will be posted to YouTube at a future date.

D
So usually we have a hard cutoff for agenda items ahead of time, the day before — there's a calendar entry for that — and we would announce that the meeting was cancelled if there were no items. But we didn't have the template entry in the notes this time, so I thought we ought to just meet anyhow; given that there wasn't a good place to add items, people may have been confused.

D
So if you have any topics, please go ahead and add them to the doc now, so that we can at least go through them in some sort of order.
E
Thanks, Ben. Really, I just came in to see what was going on and what I should be involved in.
D
Great — yeah, like I said, we don't have an agenda today, so I can discuss a bit of that, but depending on what everyone else is here for or may be interested in, it might be better to follow up offline.

D
I'll just give a moment for any other topics.
D
Okay, so sure, let's start on that. If you're looking to get started in the SIG, definitely come to our Slack — I think it's by far the most active part of the SIG. We have the mailing list and we have the Zoom meetings, but you'll find most activity happening in Slack or GitHub. So join our Slack: it's #sig-testing in the Kubernetes Slack. And from there, I'd say we have the sig-testing repo on github.com.
D
This repo is where we're starting to document some of these things. We've been experimenting with GitHub Discussions — I'd say that experiment is not going well, but so you're aware, they're there — and we have some issue tracking for SIG-level issues that aren't really repo-specific. One of those is actually about how we need to surface the project boards better.

D
We have a project board in the kubernetes org and a project board in the kubernetes-sigs org. kubernetes-sigs is where SIG subprojects go; what's still in the kubernetes org is more or less legacy stuff, and pretty much all new endeavors are in kubernetes-sigs.
D
Unfortunately, GitHub projects can't track issues across orgs, so we have one board for each org. You'll find in our project boards that we have a "help wanted" column, and we attempt to make sure that any issues tagged help wanted that are relevant to the SIG are added there — I wouldn't say it's 100%. Another thing you can do is go to our repos; those are documented in—

D
If you go to the github.com/kubernetes/community repo — the community repo in the kubernetes org — there's a doc listing all of the SIGs, and there are subfolders for the SIGs with a doc there that lists all the subprojects. The subprojects that we own, like test-infra, will all use the "help wanted" and "good first issue" labels as appropriate, and then we are currently manually sweeping through those and making sure they're in the project board as well.
D
There's another one in the kubernetes org. Let's see — I think I linked back to the Slack discussion where we had it, so you can grab it from there, but it's also just called "SIG Testing". It's pretty much the same thing, but it's in the other org — oh.

D
Yeah, I did not.

D
Yeah, because those numbers are just an auto-increment on adding projects, so there's no good way to — it's 11 in the other org.
D
So here we go. Like I said, at the moment that's Aaron and I going through and more or less manually adding things to this board. There is a little bit of automation — for example, when an issue finishes and is closed, it moves through the board — but the help wanted items there should be pretty good, in that we've gone back over and identified them again as "yes, this looks like a help wanted issue." If you're looking for more, there are definitely issues that haven't made it into the board since our last pass-through, and that is also something else we're looking to do in the sig-testing repo.
D
I have another issue filed about improving issue triage and how we might do that. One thing that comes to mind: there's a tool, Triage Party, that basically hosts a custom web page for doing triage workflows — sort of multiplayer — and has a bot enact the actual changes on GitHub. Some parts of the project have experimented with it; it originated from Google's minikube team. So ideally we don't just want to explore this for SIG Testing — we want to explore, with WG K8s Infra, whether, if this tool works well, we can make it easy to stamp out for other SIGs, or just go ahead and host some more instances, and so on. What's that called? Triage Party — and the tracking issue is here; I'll link it in the doc.
D
Issue number seven in the sig-testing repo. That also came out of the discussion I've been having about a bot that automatically labels stale issues and closes them — we own the source code, but the policy is owned by the Contributor Experience SIG, which is also, generally, how the dynamic works in Kubernetes. I made the mistake of lightly complaining a bit about this and started a large discussion about it recently.
D
So one of the more productive things we can do is try to improve issue triage so that we can keep up with incoming issues better; rolling out tooling for that sort of thing is in the scope of the SIG. We pretty much do project infra tooling, particularly as it relates to CI and GitHub. Contributor Experience, though, is sort of responsible for the workflows at this point — how we should be behaving, as processes and things around GitHub. And then it's also worth knowing there's a working group, WG K8s Infra, that this spun out of. A lot of the infrastructure came out of a team at Google that I was on — though it started before me — that set up all sorts of things for the project early on, like CI infrastructure and image publishing. Because that was all just kind of done real quick, in a convenient way to get the project going, it was all in projects under Google's cloud organization, so they're not really accessible to the community — they're Google-internal. So maybe two or three years ago, one of the Googlers arranged a large credit donation to the CNCF.
D
So instead of billing it to Google through Google projects, we basically have credits with the CNCF: the CNCF hands them to Steering, Steering hands them to WG K8s Infra, and then we set up some GCP projects in a kubernetes.io organization. If you want access to any resources, that's all controlled with YAML in GitHub, and Terraform, and Bash, and we are migrating things there.

D
So whenever we're standing up new infra in particular, we are looking to collaborate with that working group and make sure we're doing it with their resources, so that it will be a sustainable thing that the community can collectively manage, right.
D
So yes, the bulk of it is in the kubernetes/k8s.io repo. Originally that just had some things like an nginx instance that handles the k8s.io shortcut URLs that go to different places, but now it has more. We use Google Group management — Kubernetes has a G Suite instance, so we have Google Groups that we use for things like "these people are the owners of this thing" — and then we're able to do things like synchronize that to RBAC rules within clusters that we run for the project, or to access controls in Google Cloud. Most of it is in Google Cloud because of that large credit donation, but there's also some management—
D
There
are
some
things
like
there's
some
aws
accounts
that
are
managed
for
some
end-to-end
testing,
in
particular
a
couple
of
the
sub-projects
use
like
k-ops,
and
so
there's
a
little
bit
of
the
code
for
that
there.
Some
of
the
other
code
for
like
the
tools
themselves
are
in
repos
like
test
infrared,
but
pretty
much.
All
of
the
like
definition
of
like
deploying
the
infrastructure
is
now
in
the
category,
along
with
things
like
like
images.
D
If you want to publish an image to our k8s.gcr.io, then you PR a manifest there. That's also the pattern we use for pretty much everything, and something SIG Testing has been pretty big on pushing: to the extent possible, everything is managed declaratively, just like you would in Kubernetes, where you send a pull request to a YAML file that says "this person should be a member of the GitHub org," "this person should be in this group."
D
"This image should be published." "This infrastructure should be deployed." "This domain name should exist." We have kind of gone all-in: pretty much everything is declarative YAML to the extent possible. Some things are still, you know, some Bash scripts that haven't been replaced yet, or some Terraform.
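To make the flavor concrete, a membership change in that style might look like the following hypothetical snippet — the file layout and field names here are illustrative, not the project's actual schema:

```yaml
# Hypothetical sketch of declarative membership: a contributor opens a PR
# against a file like this, and a bot reconciles GitHub/Google Groups to match.
groups:
  - name: sig-testing-leads
    members:
      - example-user-1
      - example-user-2   # adding a member is just another reviewed PR
```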
D
We have a number of custom things. For example, domain names started with OctoDNS and a Bash script, but now we have something that reads a simple YAML spec we came up with and maps it into that. So there are various tools that are sort of ad hoc reconciling these things, some that are doing it continuously, and some of it is just off-the-shelf stuff like Kubernetes or Terraform.
D
Yeah, it depends on the tooling. For example, for all the GitHub management we needed to build a tool, because most people do this manually — which is pretty easy and low-friction if you have some admins — but we've strived to have just a very few folks who actually have administrative permission over everything, with pretty much everything else delegated through robot accounts and publicly auditable. You can go see, like, okay—
D
Yeah, I mean, a number of the leaders in the community really want this — we want these tools doing what the project says. Okay, declarative management makes sense, so we should declare how we manage everything. I'd say the project's doing pretty well at that generally: if we stand up some infra tools, there is some declarative spec for what they're doing.
D
Anyway, the gist is: we had a Jenkins — another one of those things that Kubernetes just sort of set up — and over time they started containerizing workloads on it, for things like more easily setting up the correct environment, and they had some issues with the GitHub integration for triggering and so on.

D
So someone built a little webhook handler that would watch for GitHub webhooks and kick off Jenkins jobs, and then we had a YAML spec for that. Then somebody realized: okay, we're writing YAML to create containers — why don't we just run these on a Kubernetes cluster? Since there was a Google team doing most of this, we just set up a GKE cluster — no more managing clusters, just schedule pods — and now pretty much all of the CI looks like that.
D
Now you sort of write a pod spec, and then some additional containers are injected to do things like code checkout or uploading the results. We have some YAML that describes "this is a presubmit job, it runs on PRs" or "this is a periodic, it runs on a schedule," and then we have some web dashboards for viewing results. Pretty much all of that works by uploading things to a GCS bucket.
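A job definition in the scheme just described boils down to a trigger type plus a pod spec; this hypothetical entry shows the shape (the field names are illustrative, not the exact schema):

```yaml
# Hypothetical sketch: CI jobs as YAML — a trigger plus an (augmented) pod spec.
presubmits:
  - name: pull-example-unit-test     # runs on pull requests
    spec:
      containers:
        - image: golang:1.16
          command: ["go", "test", "./..."]
periodics:
  - name: ci-example-e2e
    interval: 4h                     # runs on a schedule
    spec:
      containers:
        - image: example.registry/e2e-runner:latest
```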
D
Where possible, we encourage people to produce JUnit XML, so we have structured results that various tools can read. For example, we have a tool that goes back through all of these results, does some BigQuery queries and some further processing, and produces a dashboard of the most common error messages we're actually seeing across all of our jobs, so that you can start to identify, "okay, we started having this failure mode spike."
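The core idea behind that dashboard — normalizing error messages so that instances of one failure mode group together across jobs — can be sketched in a few lines. This is a toy illustration; the normalization rules and messages are made up, not the project's actual ones:

```python
import re
from collections import Counter

def normalize(msg: str) -> str:
    """Collapse run-specific details so one failure mode maps to one key."""
    msg = re.sub(r"\b\d+(\.\d+)?(ms|s|m)?\b", "N", msg)  # numbers and durations
    msg = re.sub(r"pod-[a-z0-9-]+", "pod-X", msg)        # generated pod names
    return msg

failures = [
    "timed out after 300s waiting for pod-abc123 to be Ready",
    "timed out after 312s waiting for pod-def456 to be Ready",
    "connection refused dialing 10.0.0.7:6443",
]

counts = Counter(normalize(f) for f in failures)
for pattern, n in counts.most_common():
    print(n, pattern)  # the two timeouts collapse into one bucket
```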
D
At this point, we had a bad change go in — awesome. That's in the chat.
D
So almost everything revolves around this: we use containers and Kubernetes clusters to run things, and then we produce pretty much all of our results as files on GCS. Those are either JUnit — which was sort of a de facto standard — or some JSON files that contain metadata like when we started, which job it was, that sort of thing. There's something of a spec for that in test-infra, and then a bunch of different tools are able to consume those results and process them.
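JUnit XML is simple enough that consuming it takes only a few lines; this is a generic illustration of reading that format, not one of the project's actual tools:

```python
import xml.etree.ElementTree as ET

# A tiny JUnit-style report, inlined for the example.
JUNIT = """<testsuite tests="3" failures="1">
  <testcase name="pod starts" time="1.2"/>
  <testcase name="pod becomes ready" time="300.1">
    <failure message="timed out waiting for the condition"/>
  </testcase>
  <testcase name="pod is deleted" time="0.4"/>
</testsuite>"""

root = ET.fromstring(JUNIT)
failed = [tc.get("name") for tc in root.iter("testcase")
          if tc.find("failure") is not None]
print(f"{len(failed)} of {root.get('tests')} failed: {failed}")
```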
D
Each job has its results over time, with all the individual structured results as rows, and that's used pretty heavily by the project. Releases pretty much work as "there's a dashboard of these jobs that must be green to release," run periodically against the branch we're developing, and then there's a team that monitors those and says, "okay, this important unit test is not passing — oh no."

D
We can't release — and they file issues and follow up throughout that, so we also work quite a bit with SIG Release on that sort of thing. Yeah, so, like—
D
Similarly,
we
provide
most
of
the
tooling
for
this,
and
I
think
we
provide
quite
a
bit
of
direction
on
like
what
those
jobs
are,
but
at
the
end
of
the
day
like
if
you
want
something
to
be
released
blocking
we
some
testing
faults,
wrote
the
policy,
but
it's
owned
by
sig
release
and
you
need
to
go
talk
to
sick
release,
so,
like
kind
of
similar
to
contrabex
might
be
building
stuff.
But
like
we're,
you
know
we're
working
directly
with
other
cigs
for,
like
the
ownership
model.
All
these.
D
Given the investment in it, it would be harder at this point to sort of rename the SIG. A lot of people come to us and assume we just write all the tests, yeah.
D
I think that came from one of the early things, which was the end-to-end test framework, but at this point there are really not that many folks working on that directly. It's one of those things where you improve it a bit when you need to use it, but it's no one's day job or main focus; people are mostly working on the infrastructure rather than on that.
D
So we have a lot of overlap with WG K8s Infra. They kind of own the credits and sort of the broader infrastructure that isn't remotely testing-oriented, like the domain names and such. But again, if you look at the leadership between the groups, there's a lot of overlap — Aaron is a lead of both SIG Testing and WG K8s Infra and doing a lot of things there, or—
H
Actually, hi, Mariana here. I want to be a contributor; I'm just lurking at this moment, and I was late to the meeting. I saw the notes and I was wondering — what did you say were the tags to look at for new contributors or beginners? Was it just "help wanted," or — I remember there was a tag for new people.
D
Yeah, so throughout Kubernetes this is pretty standardized: there's a "help wanted" tag, and then there's an additional "good first issue" tag, yeah.
D
There are strong rules expected by the project around what makes a "help wanted" issue: if you're asking for help, there should be some pointers to what actually needs to be done, and it should be an issue that folks active in the project have agreed on — there will be support for actually taking this action; it's not still up for debate. "Good first issue" is a bit looser, but it's applied by folks where they think, "this is probably a good starter issue." Those are also labels that GitHub itself considers standard, so I think there's actually a page on GitHub — a little obscure — for repos that will point you to these issues.
H
Cool. But if there's some sort of offline forum for sharing more, I want to subscribe to that.
D
Oh yeah, sorry — I'll pull up some of those resources and pass them along. I'll do that in the SIG Slack, I guess, and try to poke the right folks there. I would share them here, but I probably need a little time to dig those up.
D
Pretty massive — the infrastructure itself, I'd say, is a little bit of madness. We run a lot of stuff.
H
Yeah, thanks. I also have Eddie's talk; I want to look at it at some point — didn't get to it yet.

H
I also lurked on the CLI channel, and you have a thing there where you talk about contributing as well.
D
Thank you all for contributing to the notes as well. Usually we have someone else leading the meeting while I'm taking notes, but today we're a little short-staffed, so I appreciate it.
D
Okay, well, like I said, I'll post some more links where you can learn more about the SIG and the infrastructure later today in the SIG channel, and we can also see if we can get more of these things centrally documented. We've only had the sig-testing repo since maybe the end of last year; hopefully we can flesh out some of the documentation there and have the README be what we point people towards, right.
D
So next up we have Eddie's point about simple pod flakes with network timeouts. Do you want to talk about that, Eddie?
A
Yeah — so I know you've been keeping tabs on this one a bit too. As we're hitting code freeze, I just want to know what work we need to scope out for these types of flakes. The original fix was to give it more of a grace period on the timeout, but we obviously don't want to just keep increasing timeouts for things. And what I've noticed is that when a lot of these flakes still happen, they happen in batches.
D
I don't think we do. The only thing I can think of is that we have some suspicion we're still having issues with disk throttling. Like I said, we run most of our infrastructure on GCP, because we have the big credit donation for that and that's where the infrastructure has been historically. When you run workloads on GCE, on Kubernetes, throttling is done at the VM level.
D
So
if
you
have
like
multiple
workloads
running
on
a
node
in
a
cluster-
and
they
are
really
really
I
o
heavy,
you
can
run
into
like
a
special
sort
of
contention
where,
like
the
infrastructure,
is
turning
around
and
saying,
okay
you're
using
too
much
like
iops,
your
like,
I
o
operations
are
going
to
get
throttled
and
we've
had
that
in
the
past,
where,
like
we
weren't
setting
resource
requests
on
things
in
particular
that
we
still
aren't
necessarily
everywhere,
it's
a
bit
hard
to
roll
that
out
to
all
the
ci.
D
We have some clusters where we do guarantee that, but there's still no way to request I/O. So if you make the mistake of scheduling multiple I/O-heavy workloads to one machine, they will thrash pretty hard — or even one workload that's just really I/O-heavy, like a cluster with multiple nodes, with etcd, running a bunch of tests, the way we run our local clusters. So one issue we do have tracked under k8s.io is that we'd like to make a new CI pool with local SSDs — where there's actually an SSD physically on the machine your VM runs on that is dedicated to you.
D
So there's no throttling — you can write to the SSD as fast as you can — whereas right now, at best, we're using pd-ssd, which is network-backed virtual disks. We're not certain that will solve it, so we want to make a small pool, taint it, let a couple of jobs opt into running there, and see if that improves things.
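The taint-and-opt-in mechanism described is standard Kubernetes scheduling; a sketch of how it might look (the taint key, labels, and image here are hypothetical):

```yaml
# Nodes in the experimental local-SSD pool would carry a taint, e.g.:
#   kubectl taint nodes <node> dedicated=local-ssd:NoSchedule
# Only jobs that explicitly tolerate it (and select the pool) land there.
apiVersion: v1
kind: Pod
metadata:
  name: io-heavy-e2e
spec:
  tolerations:
    - key: dedicated
      value: local-ssd
      effect: NoSchedule
  nodeSelector:
    disk: local-ssd
  containers:
    - name: e2e
      image: example.registry/e2e-runner:latest
```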
D
Otherwise, we have just tried to make sure that all of the critical CI is on the newer clusters, where we have a presubmit that enforces that your configuration must set requests and limits. That helps, but it still doesn't guarantee anything — we could still be making small requests on something that should probably be making larger ones. In particular, for the kind cluster testing we've had some debate over how many resources you really need, because if it weren't for disk, it really shouldn't need many cores to run some end-to-end tests.
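For reference, the requests-and-limits requirement that presubmit check enforces is just ordinary Kubernetes resource stanzas; the values below are illustrative, and note that, as said above, disk I/O is not among the resources you can request:

```yaml
# Sketch of a job container with explicit resources; values are made up.
containers:
  - name: e2e-test
    image: example.registry/kind-e2e:latest
    resources:
      requests:        # what the scheduler reserves on the node
        cpu: "2"
        memory: 4Gi
      limits:          # hard caps enforced at runtime
        cpu: "2"
        memory: 4Gi
      # note: there is no analogous field for disk IOPS or throughput
```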
D
But
I
have
some
suspicion
that
we're
running
back
into
the
queer
issue
and
it's
it's
not
that
easy
to
find,
because
you're
gonna
have
to
like
correlate
what
we
were
running
on
the
vm
and
then
look
at
the
like
vms,
like
disk,
throttling
monitoring
or
something
and
not
many
people
have
like
have
access
to
that.
You
can
get
access
by
a
pull
request
for
like
read-only
information
and
gcp
pretty
easily.
But
it's
one
of
those
things
that's
like
kind
of
obscure
and
not
that
many
people
have
done.
D
Yeah, so I'm also suspicious that what's happening is we're just hammering etcd. Well — in this case, though, that only applies if we're running kind clusters, or a local integration test, or builds, stuff like that. In this case, because it's a GCE end-to-end job, the only thing actually running on those pd-ssd nodes is: we grabbed the binaries, we stood up a cluster, and then we're running the e2e test binary.
D
The actual tests run against a cluster that was stood up for the duration of the job, using some Bash, and those VMs are not running anything else but the tests and should be large enough. Also, one thing that can happen — if you saw this spike more over time, we might have had a regression.
D
Yeah — and let's just say it's "timed out waiting for the condition," and you're like, well, what was the condition? It's one of those things where we have some generic matchers, and a lot of the tests don't really tell you, so it takes some time just to find out what might have timed out.
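That failure mode — a bare "timed out waiting for the condition" — is what a generic poller produces when nothing describes what it was waiting for. A small sketch of the fix (the helper and names are hypothetical, not the project's actual wait utilities):

```python
import time

def wait_for(condition, timeout, interval=0.01, describe="the condition"):
    """Poll `condition` until it returns True or `timeout` elapses.
    On timeout, name WHAT was being waited for, not just that a wait failed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(interval)
    raise TimeoutError(f"timed out waiting for {describe}")

try:
    wait_for(lambda: False, timeout=0.05,
             describe="pod 'busybox' to be observed Ready")
except TimeoutError as e:
    print(e)  # names the actual condition instead of a generic message
```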
D
Yeah — I mean, it might be something in the container runtime, or — what are we running in this?

D
We're seeing that a deadline was exceeded waiting for the pod to be observed ready — the test was waiting.

D
I think our tests are expected to clean up after themselves, but they can't necessarily guarantee it, and we're running concurrently — we have n threads running through the test cases.
B
But
yeah,
if
you
leave
those
jobs
laying
around
even
after
they're
done,
it
can
really
it
can
bog
down
kubernetes,
at
least
that's
what
we've
seen
and
we're
running
an
earlier
version
that
we're
running
116.
yeah.
We
have
like
a
operator
that
we
have
running
that
just
removes
just
anything
out
lower
than
24
hours,
just
delete.
D
The pods — for the infrastructure itself, the persistent parts — not the test clusters we bring up, but what we tend to call build clusters, the workload pool we use to execute the tests and things like that on the CI — there's a component called sinker that does pretty much that. It has some awareness of the job abstractions, so it can choose to keep around maybe one instance—
D
The
most
recent
run
or
something
like
that,
but
it
pretty
much
goes
through
and
completely
pods,
because
with
our
jobs
we
were
running
into,
they
tend
to
write
to
like
just
ephemeral,
pod
storage
and
building
kubernetes
for
even
just
a
single
architecture.
If
you
build
everything
in
the
repo
could
be
like
30
gigs,
and
if
you
have
a
whole
bunch
of
jobs
doing
that
pretty
soon,
you
just
run
out
of
disk
on
the
ci
machines,
since
those
aren't
released
until
the
pods
actually
deleted.
D
We
had
pretty
pretty
excessive
issues
with
that
but
norm.
That
would
not
be
super
expected
during
an
end
to
end
test.
I
would
say
that,
generally,
these
close
end-to-end
test
clusters
are
like
a
little
over
provision
and
that's
one
of
the
reasons
we
were
doing
kindness
like
well.
We
can
probably
test
most
of
the
stuff
just
with
some
little
tiny
simulated
ones.
We
don't
really
need
like
a
full-blown
cluster
to
run
like
can
you
use
cube
cuddle
to
run
a
pod
like
that
he's
super
cheap.
D
Yeah, we have some stuff like that, but most places are still using whatever the config defaults for creating a test cluster were years ago, so for most tests they're probably a little more than we need — and that's actually helpful, because it gives clearer signals. So when we're having all these timeouts, the first thing I look for is: did the job itself time out — did we reach, just like—
D
So you can see them — if you scroll through the whole lot, you can see them timing out throughout, yeah. So it seems to me more like there's something systemically wrong with just starting pods in these clusters. Right, right.

D
Because in the past we might have blamed it on, "oh well, this is one of the jobs where we switched to containerd or something, and there's some bug in this version because it's an early version." But this one's running Docker; we don't upgrade it super often, but it should be pretty stable.
D
It looks like we're running 19.03, so it should be a well-patched build of a pretty stable, pretty commonly deployed Docker version, and it probably hasn't changed, other than patches, for some time.
H
And are the other things significant in any way — like "should not be passed when pod info mount equals null," or "flaky kubectl explain ... resource name as built-in object"?
D
So the flaky tag — that's one that's manually added when we've decided, "this is known to flake and we don't have a solution." The main purpose of that is, for example, that the tests we run on your pull requests to Kubernetes exclude any tests tagged flaky, so that we're not wasting everyone's time with something where, okay, we know this is flaky, we don't have a fix, and maybe we don't think it's the most critical test — so we just tag it with that.
D
We can let it keep running in a periodic CI job over here until someone fixes it. So the fact that a flaky test failed is not surprising, but we have all these other tests that aren't tagged flaky that are timing out, and like Eddie said, for some of these, if you dig through, you can see they're just waiting for a pod to start. I mean, you can imagine—
D
If, in your production cluster, you tried to schedule a busybox pod and you waited a couple of minutes and the busybox pod wasn't ready, you would be concerned, right? It's kind of the same thing here: even though this is a weird test cluster, on some test version of Kubernetes, running only test workloads, you would still expect that some super cheap pod would just come up. One of the things we can also do — there's a link at the top, "artifacts."
D
We populate an environment variable when we're running these jobs that says, "put things here if you want them preserved." The tests will then do things like dump all the cluster logs under that directory, and at the end of the run, the CI will upload all of it to GCS.
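The convention amounts to a test writing anything it wants kept into a directory named by an environment variable; a small sketch (the variable name ARTIFACTS and the file names are assumptions for illustration):

```python
import os
import pathlib

# The CI is assumed to export this variable; fall back to a temp path locally.
artifacts = pathlib.Path(os.environ.get("ARTIFACTS", "/tmp/artifacts"))
artifacts.mkdir(parents=True, exist_ok=True)

# Anything dropped here (logs, junit.xml, cluster dumps) survives the run,
# because the CI uploads the whole directory at the end.
(artifacts / "e2e.log").write_text("dumping cluster logs\n")

print(sorted(p.name for p in artifacts.iterdir()))
```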
D
So if you click through to "artifacts," the first thing you see is the actual metadata that the CI uploaded: the build log, the log of the test, a JSON dump of the pod we were actually running to run the test, that sort of thing. And then in the artifacts directory — viewed with another little tool the project built for browsing our results — you can see a couple of directories that look suspiciously like node names, because they are, and those have all of the system logs.
D
So we can look through this. That's how I would tell, say, "oh, this test cluster is running Docker; we're not doing anything exciting." Maybe it's one of the jobs where we run a super recent version of containerd, because we're trying to qualify the latest runc against Kubernetes, collaboratively, before that project's release — we're not doing anything like that here, and you can see that, because we have all the system logs.
D
Here you can see the logs for the control plane node. A small note there: this is using the super old Bash tooling that no one in the org really wants to own, called kube-up, that's still in the kubernetes repo, so it has some pretty archaic things, like naming nodes "master" and "minion" — both terms that are not used in the project anymore, except in this thing. Even though it runs most of CI, pretty much everyone just wants to pretend this stuff doesn't exist anymore and delete it. No.
D
So if you wanted to add something like that to the infrastructure — you know, all of this is just managed through GitOps anyhow. But in this case, this job should not really be using disk on the shared infrastructure: it's creating a cluster just for the purposes of testing, and there are no other workloads scheduled there, only the test workload.
D
So even if they were writing to etcd heavily, there's nothing else happening there — it's just running the cluster's etcd, so it should be able to keep up. Whereas we have other tests running kind clusters — that's one of our SIG subprojects, where Docker containers are the nodes of the cluster — and those run on the shared infrastructure.
D
So that's what I originally thought you were referring to, and something that is tracked is that we should look into what happens if we just throw more I/O at this problem. But in this case, we've been running this sort of thing for a long time; it runs on its own ephemeral, dedicated infrastructure, and there's no reason it should be having disk I/O trouble scheduling a pod or something like that.
D
This is a totally dedicated machine — nothing else is running on these VMs, just Kubernetes itself. But we have had things in the past that showed up more heavily on those kind clusters, where they run on a shared, possibly under-provisioned VM. We had things like, at one point—
D
There was a sidecar in the csi testing that monitored all nodes, all pods, and a number of the core resources; it was, like, watching all of them pretty aggressively, and that increased the load on the cluster, and then that caused tests to flake, and we couldn't relate them to anything else. And so that's where, like, the triage dashboard comes in, and we can say, okay, we can see that we're just having a bunch of random failures increase in these jobs.
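The grouping the triage dashboard does can be sketched roughly as: normalize away the volatile parts of failure messages so similar failures cluster together. A toy version (the failure messages below are made up, not from real jobs):

```python
import re
from collections import Counter

def normalize(msg):
    """Collapse volatile details (hex ids, numbers) so similar failures
    group together, roughly the clustering idea behind triage."""
    msg = re.sub(r"0x[0-9a-f]+", "UNIQ", msg)
    msg = re.sub(r"\d+", "N", msg)
    return msg

# Hypothetical failure messages pulled from job logs.
failures = [
    "timed out waiting for pod-42 after 300s",
    "timed out waiting for pod-17 after 300s",
    "connection refused to 10.0.0.3:6443",
]

clusters = Counter(normalize(m) for m in failures)
# The two timeout failures collapse into a single cluster of size 2.
print(clusters.most_common(1))
```

Seeing one big cluster spanning many unrelated jobs is the signal that the cause is environmental rather than in any single test.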
D
No, there shouldn't be any, because these are running on ephemeral clusters that are gcp vms, so they have their own disks that the cloud provider is guaranteeing, like, here's your disk. In reality those are abstracted disks that are over the network and stuff, and there's more things going on.
D
But we haven't observed any issue like that. Even if it turns out that part of how that is ultimately backed is shared, there are other systems between us and the disk that make sure we get the I/O we're supposed to, like the throttling of someone else's workload.
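If disk throttling were suspected anyway, a first sanity check is just measuring write throughput on the machine in question. A crude sketch (a real investigation would use a proper tool like fio, which controls block size, queue depth, and caching):

```python
import os
import tempfile
import time

def rough_write_throughput(size_mb=8):
    """Write size_mb of zeros with an fsync and return approximate MB/s.
    Crude on purpose: page cache and small sizes skew the number."""
    chunk = b"\0" * (1024 * 1024)
    with tempfile.NamedTemporaryFile(delete=False) as f:
        start = time.monotonic()
        for _ in range(size_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())
        elapsed = time.monotonic() - start
    os.unlink(f.name)
    return size_mb / elapsed

print(f"{rough_write_throughput(4):.1f} MB/s")
```

Comparing this number on a quiet vm versus one running a full test job would show whether the workload itself is saturating the disk.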
D
Okay. So what we were seeing is: when we're running these kind clusters, or builds, or some workloads like that (or there's some integration tests that spin up just a couple of the components directly and then run tests against them), those are running in the ci cluster. And then in the ci cluster we may be, like, ourselves co-scheduling workloads to one of those virtual disks.
D
So before we share, I also want to find a doc and link it to you all, since we've kind of wound up on the general topic of flaky test debugging. We actually have a doc.
D
I think it's in the community repo; there's a doc that our members wrote that's sort of just about, like, how to do flaky test debugging in this project, and it's quite good. There's also, yeah, there's a video from jordan liggitt, who came to our meetings and gave sort of a little talk about this, since he does quite a bit of it, and you can also watch that talk. It's linked in the
D
doc. We have a lot of tools that can be used here, that this goes into more detail on. But for all that, I'll say that a problem like this is still kind of the hardest, where it doesn't obviously seem related to anything, it's across multiple different tests, and they don't appear to be doing anything crazy.
D
The infrastructure is, like, boring old stuff we've been running for a long time, other than the kubernetes version. So that makes me say that probably we want to look in the direction of: did this get worse over time, and can we pinpoint when that was, and look at what changes happened in that time? But you know, going through all of those motions will probably first require getting a little bit up to speed on, like, okay?
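The "did this get worse over time, and when" question can be framed as a crude changepoint search over a job's failure rate per day. A sketch with invented numbers:

```python
def changepoint(rates):
    """Return the index splitting the series where the mean failure rate
    before vs. after differs most; a crude stand-in for eyeballing the
    triage dashboard's time series."""
    best_i, best_delta = None, 0.0
    for i in range(1, len(rates)):
        before = sum(rates[:i]) / i
        after = sum(rates[i:]) / (len(rates) - i)
        delta = abs(after - before)
        if delta > best_delta:
            best_i, best_delta = i, delta
    return best_i

# Hypothetical daily failure rates for one job.
daily = [0.02, 0.01, 0.03, 0.02, 0.15, 0.18, 0.16]
print(changepoint(daily))  # prints 4: start digging through changes merged around then
```

Once a date range is pinned down, the merged PRs and infra changes in that window become the suspect list.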
D
Like, the triage dashboard itself is a very powerful tool, but it takes a little bit of getting used to, and, you know, it doesn't have much in the way of docs or explanation about things, or, like, tooltips, and it ties back
D
you know, pretty heavily to, like, our concepts around what is a job, and what is all the metadata we have around it, that sort of thing. Yeah, it's pretty much a power-user tool that, like, a handful of people depend on pretty heavily, but it hasn't been optimized for, like, bringing new folks on board to it. So I'd say this talk and this doc are, like, the best resources for getting up to speed on that.
D
The test doc, that's in the community repo. I think that, dave, it's not that, I think that was, like, his one. The current doc is here. marina, I think marina gave us a break. I
F
I threw a link to these notes in the agenda as well. I will probably clean them up afterwards and post them somewhere less ephemeral than this, but if you want to follow along, or if my screen is not working well, that link is there for you. So, didn't realize you're here, hi.
D
Oh, this is, that was.
B
D
Well, thank you all for coming. I will try to find some time soon, I need to follow up on a couple of things at work, but I'll try to find some time soon to, like, surface some more of these resources. And if you are looking for anything, feel free to reach out. For the most part,
D
at the moment I'm probably gonna point you to the project boards to start, but if you don't find anything, follow back up. And just in general, like, ask questions in our slack; everyone's super friendly and helpful, and you know, there's probably someone else that wants to ask the same question but hasn't yet. So.