From YouTube: Kubernetes SIG Testing 2018-08-21
A
Okay, hi everybody. Today is Tuesday, August 21st. My name is Aaron Crickenberger. This is the Kubernetes SIG Testing weekly meeting. It is being recorded publicly and will be posted to YouTube later today, assuming I remember to do that. On the agenda today we've got Steve, who wants to talk about disruptive tests on AWS with kops, and Timothy St. Clair, who wants to talk about issue routing and tags.
B
Yeah, so basically, some feedback that I got from some engineers is that it's basically impossible to write any disruptive tests on AWS, because the way the kops setup works, all the nodes get put into auto scaling groups. So basically, if you want to disrupt the cluster and take a node down, the actual AWS auto scaling group will terminate that instance and start a new one for you, very handily, and I wanted to know what the larger direction there was.
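For context, here is a minimal sketch of the auto scaling group behavior being described, written against the plain AWS SDK for Go rather than anything in kops; the instance ID and the terminateNode helper are hypothetical.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// With ShouldDecrementDesiredCapacity set to false, the ASG keeps its
// desired capacity and immediately launches a replacement, so a "node
// down" disruption heals before a test can observe it. A disruptive test
// would have to pass true (or detach the instance from the group) to
// keep the node gone.
func terminateNode(instanceID string, keepNodeGone bool) error {
	svc := autoscaling.New(session.Must(session.NewSession()))
	_, err := svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
		InstanceId:                     aws.String(instanceID),
		ShouldDecrementDesiredCapacity: aws.Bool(keepNodeGone),
	})
	return err
}

func main() {
	// Terminate a hypothetical worker node; the ASG will replace it.
	fmt.Println(terminateNode("i-0123456789abcdef0", false))
}
```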
B
Because I did see those tests. I think there was a fairly large PR that merged, and it was merged with a bunch of hand-wringing about the fact that they couldn't actually test the functionality that was going in. Is that something that's just never going to change in kops? Do we have any plans on allowing this, or on disruptive tests?
C
As someone from SIG Cluster Lifecycle: for kops you'd have to talk with Justin, he's pretty much the main maintainer for kops going forwards, and it's behind the times. I've long been frustrated by the fact that it is in the blocking PR jobs and that it is not periodic, because it is out of date, right?
C
So there's this weird conundrum. I'd much rather put kubeadm, which lives on master and actually is released with master and all the other artifacts in the mainline, in place to do mainline testing for a lot of these other things, and push that into AWS, along with the cluster API implementation, which folks are working on. It makes a ton more sense for the long haul, for supportability, or if you want a blocking PR job, than that. But that's kind of been my opinion there.
A
It takes forever, and so we want to actually move in a direction where no clouds are involved in blocking presubmit tests, and they are all postsubmit tests. But where we are today is that those two give us sufficient coverage of e2e tests, and we're working on a solution that gives us comparable coverage of the e2e tests, but we're not there yet. So I agree with your opinion that the state of today is bad, but I'm not sure that we really have any appreciable place to go right now.
A
That's probably a larger discussion, and I really hate being the blocker for it. I have an interest in trying to write the document that describes the path: how do you submit your results? How do you submit postsubmit results? How can we verify that you're generating good signal? How can we make sure that that's worth blocking on? Things of that nature, and we're not quite there yet.
C
It should also be said: all over the code there are GCP-specific tests inside of the end-to-end test suite, and you could simply wrap that portion with an issue and reference that issue, so that it can be fixed over time as folks tread into cluster API. Whether they have a certain incantation that uses ASGs or not, that's an implementation choice, not necessarily a limitation of the cloud provider itself, but a limitation of how kops is using the provider, yeah.
A
It could be that there are a lot of tests that were shoddily written, that are skipped unless the provider is GCE, because the person who originally wrote the test only had a GCE cluster to test it on, and nobody with access to an AWS cluster has taken the time to make sure that the test is written well and meaningfully, such that it triggers an appropriate fault on the AWS cluster and that recovery happens similarly. So as for what the plan for disruptive tests is, I'm not sure where I would go.
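To make the pattern concrete, here is a minimal sketch of the provider gate being described, using the k8s.io/kubernetes test/e2e framework roughly as it looked at the time; the test itself is hypothetical.

```go
package e2e

import (
	. "github.com/onsi/ginkgo"

	"k8s.io/kubernetes/test/e2e/framework"
)

// Tests written this way silently skip on AWS, even when the behavior
// they exercise is provider-agnostic.
var _ = framework.KubeDescribe("Node disruption [Disruptive]", func() {
	f := framework.NewDefaultFramework("node-disruption")

	It("should recover after a node is taken down", func() {
		// Only runs when the suite is invoked with --provider=gce or gke;
		// it is skipped everywhere else.
		framework.SkipUnlessProviderIs("gce", "gke")
		_ = f // ... delete a node and assert the cluster recovers ...
	})
})
```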
C
The problem with that is that some of the suites are stood up for a given provider, and when a test fails, there should be a canonical ordering for who to route the tests to, and this occurs for a bunch of different things. We haven't really thought through how to properly route things, so sometimes the CI signal lead, the poor soul, has to go to all the SIGs to try and find the right person and then eventually hits paydirt, but we should probably do better.
C
Yes, I do, and I can link it in the meeting notes. This specific one was the cluster upgrade tests that have been failing for GCE for a long time, and it has nothing to do with SIG Cluster Lifecycle. In fact, it doesn't even have anything to do with GCE; it's deep down somewhere in the storage code. So it's orthogonal.
C
It
almost
belongs
in
six
storage,
so
the
that
routing
is
basically
the
poor
CI
signal
person
was
poking
us
forever
and
we're
like
we
kind
of
ignored
them
for
a
while,
because
it
we
weren't
seeing
in
other
Suites
and
then
they
came
to
the
cig
meeting
and
we
dug
into
it
and
were
like
no.
This
has
nothing
to
do
with
us
at
all,
so
that
routing
process
which
probably
be
refined
and
thought
about
so.
A
To provide some context on that: I helped write the routing process for CI signal, and the way that I wrote it, there are two SIGs. In fact, I talked about this at the community meeting last week, but just to rehash it: there should be two SIGs that they go contact. First, take a look at the SIG that owns the job, who's in charge of the health of the job as a whole, and then the one in charge of the health of the individual test case.
A
That said, cluster lifecycle is put in this really unfortunate position, because upgrade tests are almost always where all of the failures are with release-related stuff. But I'm more interested in figuring out how we can work through that quicker, as opposed to having people be ignored with no real reasons given.
C
I think it was just a matter of people not having enough bandwidth. People looked at it, they gave feedback, then it wasn't progressing the way they wanted it to, so they showed up to the SIG meeting, and that's when progress was made. So I think the problem in general is that there's a ton of signal inside of the system; on average I get upwards of 150 emails a day, easily, so getting signal out of the system is the challenge.
C
In the OWNERS file, SIG Cluster Lifecycle is not the owner, and the alias shouldn't be the owner, because those folks don't maintain it, right? To me it's all about ownership and maintainership of code. By definition, SIG Cluster Lifecycle does not maintain or own that code. We maintain and own all of the stuff around kubeadm and cluster API and that stuff, but no one from the SIG actually owns and maintains that code, so none of the leads do. Only Robert Bailey is left as a maintainer from Google. So to route things properly, again: because there are two control plane paths that are totally parallel, the tests themselves are useful, but the jobs exercise the stand-up completely differently, and because of that we should probably route via job first and then test second.
A
Okay, I think that's fair, and I think you should probably go propose this through SIG Release, because, generally speaking, we've fallen back on this wonderful situation where we're leaning on a human to do the triage. I've had a dream for a while of automating and scripting that away, but I think that is how the process would work best.
A
But I agree: I'm not super happy that right now we're living in this kind of in-between state where I'm not sure it's clear anymore even which SIG owns which job. So in that sense, perhaps we're due for another sorting of jobs amongst SIGs. And I think it's a perfectly reasonable argument to say that I'm not sure the entire cluster directory is owned by Google, but I certainly think the cluster/gce directory is owned by Google. I'm not sure about things like juju or kubernetes-anywhere, but yeah.
A
So, next up. Yeah, you know what, I'll go ahead and share my screen just so I can walk along here. I wanted to give you a heads up on some of the automate-all-the-things work I've been doing. I gave a heads up about this at contributor experience and again at the community meeting. Basically I'm trying to take another push on this issue here: where we're at with merge automation for every single kubernetes repo in every single actively managed kubernetes org.
A
It
gets
merged
and
then
finally,
are
any
of
the
branches
protected
automatically
by
our
branch
protector,
which
runs
daily,
so
I've
been
rolling
through
on
all
of
this,
where
we're
at
right
now
is
I.
A
lot
of
our
automation
depends
upon
certain
labels.
A
couple
of
the
plugins
are
written
in
not
the
greatest
fashion
where,
if
the
label
doesn't
exist,
it
just
goes
ahead
and
creates
it.
So
I
am
pretty
creating
all
the
labels
in
all
the
orgs
I
have
a
pull
request
out
there.
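Roughly, pre-creating labels boils down to the GitHub API call sketched below; the real tooling in kubernetes/test-infra is label_sync, and the org, repo, label names, and colors here are only examples.

```go
package main

import (
	"context"
	"fmt"

	"github.com/google/go-github/github"
	"golang.org/x/oauth2"
)

// ensureLabel creates a label in one repo; label_sync does this across
// every repo in every managed org, reconciling against a config file.
func ensureLabel(ctx context.Context, gh *github.Client, org, repo, name, color string) error {
	_, _, err := gh.Issues.CreateLabel(ctx, org, repo, &github.Label{
		Name:  github.String(name),
		Color: github.String(color),
	})
	return err
}

func main() {
	ctx := context.Background()
	tc := oauth2.NewClient(ctx, oauth2.StaticTokenSource(&oauth2.Token{AccessToken: "<token>"}))
	gh := github.NewClient(tc)
	// A few of the labels the merge automation depends on:
	for _, name := range []string{"lgtm", "approved", "cncf-cla: yes"} {
		if err := ensureLabel(ctx, gh, "kubernetes", "test-infra", name, "00ff00"); err != nil {
			fmt.Println("create label:", err)
		}
	}
}
```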
A
I kind of need to go through and update this, but there's one repo to be deleted, contrib, which I have proposed gets owned by SIG Architecture, under the guise that they can retire this repo that really shouldn't have existed for years and years at this point. It's one of those things where we all love to say that, oh, contrib's dead, yeah, but pull requests will keep showing up in it. So I think that if SIG Architecture is the SIG in charge of making sure the project's code is well organized, they are the SIG for it.
A
Generally, no, although it can get somewhat spammy on some PRs. I don't know, I haven't taken the time to go collect data on that. Okay, I've got so much less contingency. Yep, next thing: I already pasted about this in the sig-testing channel. There's a pull request I would love to see get merged today, where somebody, hippy, has gone ahead and modified the e2e framework so that it includes the test name in the user agent string.
A
I know the example here shows file name and line number, but I was asking that perhaps we just trim it down to the test name, complete with those tags that Tim was talking about earlier. With this, the user agent gets dumped into the audit log; we can scrape the audit log and generate API coverage information, so we can see exactly which tests are hitting which API endpoints. So if folks from this SIG can take a look at this and help push it through, I would greatly appreciate that.
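To make the mechanism concrete, here is a minimal sketch, assuming client-go, of how a per-test user agent can be attached to API traffic so it shows up in the apiserver audit log; the clientForTest helper and the name format are hypothetical, not the shape of the actual PR.

```go
package e2e

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// clientForTest returns a clientset whose User-Agent carries the test
// name, so audit log entries can be attributed to individual tests.
func clientForTest(base *rest.Config, testName string) (*kubernetes.Clientset, error) {
	cfg := rest.CopyConfig(base)
	cfg.UserAgent = fmt.Sprintf("%s -- %s", rest.DefaultKubernetesUserAgent(), testName)
	return kubernetes.NewForConfig(cfg)
}
```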
A
The other thing to discuss: in chatting with Tim last night, I kind of want to try and be lazy about conformance and reuse the fact that we have all these conformance jobs that are continuously running and posting results every six hours. I feel like I really ought to be able to click on one of these jobs, take these files, and hand them over to the CNCF in order to make sure that these jobs are proof of certification, or, you know, of passing conformance tests.
A
The thing we discovered last night is that there were a bunch of tests that were getting run that Sonobuoy doesn't run, and it turns out that Sonobuoy has a bunch of additional regular expressions that it uses to skip tests. It skips all the tests that have "Kubectl client" in their name; it skips any tests that are flaky or disruptive, have a feature tag in them, or are alpha. I get it, these are all fantastic criteria.
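Paraphrased as code, the filtering being described works roughly like the sketch below; the exact regexes live in the Sonobuoy repo, and these patterns are reconstructed from the discussion, not copied from it.

```go
package main

import (
	"fmt"
	"regexp"
)

// Focus on conformance-tagged tests, then skip the extra categories
// mentioned above: kubectl client tests, flaky, disruptive, feature-gated,
// and alpha tests.
var (
	focus = regexp.MustCompile(`\[Conformance\]`)
	skip  = regexp.MustCompile(`Kubectl client|\[Flaky\]|\[Disruptive\]|\[Feature:.+\]|\[Alpha\]`)
)

func shouldRun(testName string) bool {
	return focus.MatchString(testName) && !skip.MatchString(testName)
}

func main() {
	fmt.Println(shouldRun("[sig-cli] Kubectl client should support exec [Conformance]")) // false
	fmt.Println(shouldRun("[sig-network] DNS should resolve services [Conformance]"))    // true
}
```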
A
Honestly, if there's a test that's tagged as conformance but it's flaky, it shouldn't have been tagged as conformance in the first place. My concern is that this has allowed us to end up in a situation where the list of tests that Kubernetes defines as conformance is generated by code here in the kubernetes repo, and that code goes through and walks through all the source and looks for anything with the conformance tag. That's basically about it.
A
It
doesn't
bother
to
exclude
any
tests
that
have
flaky
or
any
tests
that
have
slow,
I'd
love
to
demonstrate
that
to
you
here,
but
it
actually
strips
out
all
tags
when
it
dumps
out
the
test
names.
But
so
there
are
tests
in
this
list
that
we
treat
is
the
authoritative
list
of
conformance
tests
that
sound
boy
isn't
running,
and
this
is
an
issue
because
sauna
boy
is
listed
as
the
way
to
run
conformance
tests.
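As a rough illustration of the walk-the-code approach just described: the real generator in the kubernetes repo parses the sources properly, while this simplified grep-style version only sketches the idea, but it makes the gap obvious, since nothing in it excludes flaky or slow tests.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"regexp"
	"strings"
)

// Collect every It(...) name carrying the [Conformance] tag. Note there is
// deliberately no check here for [Flaky] or [Slow]; that is the problem
// being raised above.
var itRE = regexp.MustCompile(`It\("([^"]*\[Conformance\][^"]*)"`)

func main() {
	_ = filepath.Walk("test/e2e", func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".go") {
			return err
		}
		src, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		for _, m := range itRE.FindAllStringSubmatch(string(src), -1) {
			fmt.Println(m[1])
		}
		return nil
	})
}
```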
C
Yeah, we should do that. The question on the kubectl stuff is whether it will affect OpenShift's ability to run it, respectively. There are some issues that we've uncovered along the way, and we've already fixed a bunch of them. I know dims is on, but one we fixed a long time ago: you couldn't even introspectively run the tests at all. That's been fixed because it uses the in-cluster clients; that was around the 1.9 time frame. So this is just legacy that should be fixed, and I think it's just that no one has gone through and fixed it. I know that mml was originally the person from Google who was working on this stuff, and we had been conversing about fixing some of the things, and we fixed some of them; we just haven't gone through and updated again. So I think I'd be happy with making it jibe. Ideally, though, this isn't really Sonobuoy.
C
This
is
actually
the
coop
conformance
container
and
I
also
talked
with
Matt
a
while
ago,
and
I
would
like
to
have
that
published
as
part
of
the
build
artifacts
of
upstream
so
everybody's,
using
the
same
coop
conformance
container
with
the
same
reg
X's
and
the
same
relative
process
right.
So
anyone
could
take
that
particular
container.
They
couldn't
take
the
Coons
container.
They
never
could
because
it
had
issues,
it's
very
Google
specific
to
the
test
and
fro.
D
So I had a question about this. I learned last week, or this week, I don't remember any more, when a colleague of mine was running conformance tests on some VMware stuff and on master, that Sonobuoy apparently doesn't work against master; it only works on specific tagged releases. If things get switched over to an image that uses Sonobuoy, is that not going to work with this?
C
Again,
this
all
real
life
isn't
necessarily
so
nobly.
So
nobly
is
a
raptor
runner
right.
It's
a
basically
executor
framework
for
how
to
run
many
get
the
artifacts.
This
execute
the
coop
conformance
container.
Ideally,
if
folks
want
to
use
it
against
master-
and
that
was
also
part
of
animals,
mats
and
I
objective
was
to
get
that
container
into
what's
called
into
upstream.
So
that
way
upstream
publishes
this
the
container
as
part
of
the
release
process
itself,
and
then
son
of
we
can
execute
for
anything
right.
A
Yeah, so it's all wrappers upon wrappers upon wrappers. Ultimately there's an e2e test binary that also needs a kubectl binary, and I think there's ginkgo in there too. So there's ginkgo that calls the e2e test binary, but then there's a shell script that calls ginkgo to call the e2e test binary, and so on.
F
Yeah, so with kubetest, yes, it's possible to do what we want. It's not that bad right now. I pasted a gist which essentially starts local-up-cluster, but instead of running local-up-cluster, you can just run the tests, just by changing the command-line parameters to kubetest. It's self-contained and it's easy to run; it's just that we probably have to streamline the UX a little bit more and publish how exactly to do it.
D
Extending kubetest to be able to use an image to stand up the infrastructure: I've poked around at that. I haven't had the bandwidth to really work on it, but maybe the same process can be leveraged, or, if there's going to be work on this, at least be aware of it. So maybe it could be extended to also be used to stand up the infrastructure.
C
The only thing is, I would prefer to be looped in on any canonical container that gets published there. There are fixes that we have inside of the kube conformance container that allow it to clean itself up afterwards, because of the cleanup: right now part of the testing infrastructure destroys your provisioning, and if you're doing this on someone's premises, being a good steward is super important. We have that auto fix in there, and that's important, yeah.
D
I mean, I would figure that would be a two-part thing, right: one is standing up infrastructure, and the other is running and cleaning up tests. It could all be a single image, but I would think it would be two images. You'd have the infrastructure image and then the test image, or multiple infrastructure images; ideally they could be interchangeable, yeah.
A
Finally, we're over time, but I did want to make sure we got to Patrick's things. Who is the keeper of the clusters, specifically secrets related to getting access to other clouds? That would, at the moment, be the test-infra folks, specifically the GKE EngProd team here at Google. I have longer-term dreams, codified in issues, of making on-call a thing that everybody can do.
A
The
way
we
get
to
that
is
to
have
these
clusters
stood
up
in
a
Google
cloud
account
that
is
managed
by
the
CN
CF,
so
that
non
google.com
people
can
be
added
to
that
Google
cloud
account,
but
in
the
meantime,
I
think.
If
you
get
in
touch
with
us
and
make
sure
we
are
made
aware
that
these
are
important
issues
to
get
merged,
we
should
be
able
to
work
with
you
on
that.
A
We
typically
have
one
person
who
is
on
call
for
testing
for
us,
I
think
it's
you
can
find
out
who
it
is
and
we
find
out.
If
this
is
right.
Now
it's
not
on
call.
Maybe
it's
on
call
yeah.
So
today
this
week
you
can
go
to
that
URI
and
you
can
go
find
out
that
you
need
to
poke
Erik
theta
has
the
point
of
contact,
San,
Liu
or
then
the
elder
may
also
be
of
use.