From YouTube: Kubernetes SIG Testing 2019-08-27
I: Okay, I can give a brief update for today for Testing Commons. For those of you that are not familiar with the project, in Testing Commons we are currently working on making the end-to-end testing framework more user-friendly, that is, easier to consume outside of Kubernetes.
C: And we changed Tide, so when a batch or a single job fails, only the failing jobs are retested on the next attempt; that just makes it more efficient. We're also moving forward with the proposal to have in-repo config. What's changing right now is sort of a lot of preamble work to change how presubmits are fetched in the plugins that use them, but that's moving along.
C: And I think, critically — I'm not sure if Eric is here — we deleted the vendor directory from test-infra and now we're using Go modules entirely. So if you're importing test-infra downstream and you weren't using Go modules already, you're going to have to now, but it should be fairly seamless, since the go.mod files and whatnot are already checked in.
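For anyone consuming test-infra downstream, a minimal sketch of what that might look like in a go.mod — the module name and pseudo-version here are made up; in practice you would run go get k8s.io/test-infra@<commit> and let the Go tooling pin the real version:

    // go.mod for a hypothetical downstream consumer of test-infra
    module example.com/my-ci-tools

    go 1.12

    require (
        // illustrative pseudo-version only; go get fills in the real one
        k8s.io/test-infra v0.0.0-20190827000000-0123456789ab
    )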
C: We've had a couple of interesting performance blips that I've noticed looking at metrics. One of the ones that's sort of unresolved, and that we still need to think about — I don't see, I think, the relevant people on the call. But the quick summary is that the branch protection job sends an enormous number of requests to actually make it happen, and that's possibly triggering abuse detection mechanisms on GitHub's side. So we probably need to think about an outbound connection throttle on our GitHub proxy; otherwise we're just going to hit 403s.
C: So, unfortunately, I don't think we have a lot of people on the call that can answer some of these questions. The general question here is: we've got the monitoring stack that was set up at a k8s.io domain.
C: That's the Grafana instance that's showing nice, pretty graphs for people. There are two other main components of the stack: Prometheus, which actually collects the data from applications, and Alertmanager, which allows you to configure rules for alerting and then send the alerts out to Slack channels or whatever.
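As a rough illustration of the kind of rule Alertmanager would route, a Prometheus alerting rule might look something like the following — the metric name, threshold, and labels are hypothetical, not necessarily what this stack actually uses:

    # Hypothetical Prometheus alerting rule; Alertmanager then routes the
    # firing alert to Slack (or wherever) based on its own routing config.
    groups:
      - name: prow-alerts
        rules:
          - alert: GitHubRequestErrorRateHigh            # made-up alert name
            expr: rate(github_request_errors_total[5m]) > 1   # made-up metric
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: GitHub API requests are failing at an elevated rate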
C: If people are just looking at metrics, the current website is totally fine, but as soon as automated alerts start coming in to the Prow channels, not having an externally available web UI for these two other services stops admins from being able to do what they need to do when an alert fires. And there are potential data leaks and other weird API things; these aren't necessarily supposed to be exposed to the whole internet. So I had asked here whether we might want to put an OAuth proxy in front of them, but I'd like to hear some thoughts on whether we want to expose them at all, since the decision was made not to originally — it does make responding to alerts really challenging. We don't have Katherine here, and we're missing a lot of other people that I would like to hear opinions from as well, but yeah.
I: Yeah, so just to provide a little overview: import-boss is a tool that was created a while back, and the idea is that it's actually used — I presume as part of the build for kubernetes/kubernetes — to make sure that certain places in Kubernetes only import certain allowed packages.
I: The continuation of this work is actually owned by the code organization project within SIG Architecture, and lately we found a couple of problems: the import rules didn't actually take test files into account. When we tried to use import-boss for Testing Commons, to restrict certain packages from the e2e test framework, we discovered some issues, and at some point we decided to actually go in and take a look.
I: Okay, so we went back into import-boss to work on it and improve it, and I brought it up at an API Machinery call, and it seems to be a little bit out of scope: they wrote it initially, and it's under the code-generator package, which is something that API Machinery uses a lot, but it doesn't feel right. So we contacted the code organization folks.
I: So, sorry — they actually don't have any particular thoughts on this. I only asked Dims, and he proposed that we should ask SIG Testing first. The only reasoning that I can give is that part of import-boss is running in CI, but this is completely up for discussion. If you think SIG Testing would be the best owner, then... can we?
I: So right now we're really looking for some advice on this whole thing. So far it has just been people from code organization, which technically is part of SIG Architecture — do you all think that code organization and SIG Architecture would be the best owners for import-boss?
C: I mean, at first blush it does seem like this is a tool that's meant to dictate the organization of code, and it was written by the code organization people, so... I guess I would also ask the question: when they're looking for an owner for it, what are they expecting? Are they expecting a set of people that want to work on the tool, or are they looking for interesting features? What's spurring the need to move this, I guess?
I: Right now we are retroactively looking for a place — looking for someone to say "we own it." Currently the people that work on import-boss are a combination of Testing Commons folks and the kubeadm maintainers.
I: They have been using it for kubeadm, plus the code organization people, because — as we said — we're using it to restrict imports throughout the code base. So technically there's already a group of people that work on it, and there's a group of people who own it. Right now the OWNERS file specifies that this is an API Machinery alias, because someone from API Machinery wrote it. I think they're thinking about maintenance in the future, and I think it will be a little bit awkward if people from some other group try to push a PR to import-boss and we have to look for an appropriate approver in API Machinery who has no idea what's going on, because they're not actually involved in any way with the tool.
A: It could be — to your point, who do you think would be in the best position to make decisions like that, to add a new dependency or to bar it? I think that might be the best-owner piece. It sounds like, from what I hear, the problem you're having is that you don't know how to decide, but then that's going to be true for every SIG. If a dependency is coming in, do we even have a framework to decide whether or not we accept that dependency? Unless I'm not fully getting what you're saying — yeah.
I: Yeah, that's actually a really good question. So import-boss is essentially just — it really is just a Go tool. The way that you use import-boss is that you have to write a configuration file. Let's say, for example, that you want to restrict imports in kube-proxy; then you write a configuration file, you put it in kubernetes/cmd/kube-proxy, and then it's just going to do its thing.
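For illustration, a minimal sketch of what such a configuration might look like — at the time this was a JSON .import-restrictions file placed next to the package, and the selector and package prefixes below are invented, so treat the exact schema as approximate:

    {
      "Rules": [
        {
          "SelectorRegexp": "k8s[.]io",
          "AllowedPrefixes": [
            "k8s.io/kubernetes/pkg/proxy",
            "k8s.io/apimachinery"
          ],
          "ForbiddenPrefixes": [
            "k8s.io/kubernetes/pkg/kubelet"
          ]
        }
      ]
    }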
I: So in terms of actually using it, the SIGs — whoever wants to use it — have complete freedom to use it in whatever way they wish. Right now, I guess, the problem is just with the ownership of the tool that implements that functionality.
C: So I think, if you're having issues with the current owners being listed as an alias for the API Machinery maintainers, I think it should be okay to just switch those over to the people that are most actively contributing to the tool. If you think that there needs to be a more formal effort, it might be appropriate to create a working group under this SIG for it, and then the members of the working group can be the approvers there. But I think, in general, adding a set of people from SIG Testing who haven't really touched it...
E: Yeah, so for a little bit of context: I work on CRI-O, which is an implementation of Kubernetes' Container Runtime Interface — meaning we're an alternative to Docker. The context here is that our container runtime solely works for Kubernetes, and we run against all of the Kubernetes end-to-end tests on each of our PRs to make sure that we're conformant with it. The issue, and the reason that we're coming here, is that it doesn't go in the other direction.
E: Kubernetes doesn't check that nothing breaks in CRI-O when they make changes. We often find that changes in upstream Kubernetes break us, and it's not even from code incompatibilities — it's on the harder side, like some setup that we didn't do. Or, right now, there's an issue in kubetest where it's not being fetched with go get, and that's failing our stuff. There are all these different things that we've been fighting against for the past month or so.
C: Yeah, I will note that there's a pretty commonplace practice of creating specific jobs that test some subset, some type of functionality, and I think there are quite a number of dashboards in Testgrid that sort of look at these jobs. We also have a really long tail of jobs that have failed for months or years and that no one's ever actually looked at. So I think adding jobs that give you that sort of signal is totally reasonable, and creating a Testgrid dashboard you can look at is, again, totally reasonable. I wouldn't necessarily expect, if you add a new dashboard, that everyone's looking at it, just because there are so many of these sorts of dashboards already, and that's the sort of point at which looking at this document is useful, where you can start to think about how you get someone to be looking at the state of this job all the time. And then, you know, if it's something that the community thinks is valuable and your jobs are stable or have a good—
F: But your team will be responsible for the changes to this kind of job, right? That means that your team will be responsible for all the changes to this job — you will configure it and it's going to be owned by your own team — or is that also what you are asking?
B: So the explanation is probably going to be longer than the actual demo itself, but let me go ahead. Okay, so hi, I'm Michelle; I work on Testgrid, for those who haven't talked to me before. Mary was my intern over the summer, and she was working on clustered failures for Testgrid, in order to show more information about the kinds of failures that happen on Testgrid. The interesting part is that this is working for both internal and external, but external failures are pretty simple, which means that this demo is going to be interesting. But yeah — can people see my screen? Yeah? Cool. Okay, so clustered failures: there is this button up here to display the clustered-failures list.
B: When you click it, there is a small table that shows the types of failures that appear on this dashboard. You'll notice that these are all whited out, and if you click on this, you can now see all of the failures under this type highlighted. You'll notice that there is only one type of failure here, because it is either "failed with no message" or not. One of the interesting things we have on our side, which we could be adding later, is — okay, that's interesting.
B: Right now, because external Prow has a pretty simplistic thing for showing whether something is failing or not, we only get the one row on that clustered-failures list. On the other hand, with some future changes we'll be doing through this year, we should be able to get more nuanced thinking into Testgrid on why something is failing. So, for instance, there's a difference between a test failing, versus a build failure, versus "a test failed and we kind of know why, so we can show an error message," versus "this test was flaky."
B: If we have multiple runs of a test within a single job run, clustered failures actually does differentiate those. So, namely, this feature does work; we just need to get a little bit more nuance into the test-failure stuff. This code is still not open source; we're working on that now, so I will see if there is a way to get that code into open source so other people can play around with it — like, hey, when we report tests, can we report more nuanced stuff? Can we report specific errors?
B: Basically, which ones happened that are the same type of failure. In internal, for instance, there is a much wider spectrum of things that might happen. So, let's just say, you can get failures, you can get failures with a message, you can get something that indicates a build failed, or, we'll say, something that indicates a tool failed instead of a build...
B: But we could be doing a bit more. For instance, I think we actually do have error messages; it's just that we don't really have them in most cases — I think the code does parse those right now. Similarly, there are problems like: triage would be really cool here, but triage operates completely asynchronously, and the Testgrid code usually relies on things being reported during the tests, by the time the test is done.
B: So there are some things that we could do, too — like, if it's reported in the test results by Prow or something else that feeds into them, then we could get some extra stuff in there. Yeah, you'll note that Testgrid has a — let me see if I can get it again — no, no. Testgrid has an info bar that pops up at the top here, where it says "failed at" the build number. So that's normally where we'd expect an extra failure message to pop up.
B: They are a lot more useful if they are not extremely long. So, for instance, some people will report a particular error that they recognize — say, "failed ssh" or something — as a short message, but that requires a little bit more hand-picked work than, say, trawling the entire error log or something. Triage output would be great here too, maybe a little bit shorter to be more useful, but yeah.
B: So yeah, I think I'm excited about it. I think there are some things we can do in external to make that feature a little bit more snazzy, we'll say.
B: Yeah, exactly — the entire error log is usually very good, but you want the clustered-failures view as a short indicator of why things happened, and then you dig into the ones that you need to look at later. Yeah, I'm doing a bad job of explaining why this was difficult and why it took a whole quarter, but Mary did a lot of really good base work that I am planning to build off of, so hopefully we'll be seeing more of this in the future.
A: Okay, we've covered the items in the agenda so far. If anyone has anything that they'd like to chat about in the next meeting, please add it to the agenda for next meeting. Thank you, everybody, for being here today and bearing with me — I think this was my first time running one of these, and I don't know whether I did a good job or a bad job, but I did it.