From YouTube: Kubernetes SIG Testing - 2019-06-25
A: Hi everybody, I am Aaron, a SIG lead, and you are at the Kubernetes SIG Testing weekly meeting for Tuesday, June 25th. We'll all be adhering to the Kubernetes code of conduct by basically not being jerks. On today's agenda, we're going to have Steve walk us through the new monitoring stack that's been put together for Prow, and then we have some discussion about configuration and related things that we're interested in for Prow. So I will hand off to Steve to show us Prow monitoring.
B: Yes — you can see that? Okay, awesome. So this is the monitoring for the Prow cluster that I own, and we have added a number of dashboards that we've found useful and interesting. Some of these already existed in Velodrome and are currently just being calculated using the— let me take one step back. The architecture of this monitoring stack is a Prometheus instance, a Grafana instance, and an Alertmanager instance that are standalone. All of the manifests to deploy those are in the repo, and they don't require, like, any other external services, other than a PV if you want your data to stick around for a long time — whereas the Velodrome instance requires some other databases, and it's complicated and much harder to set up. So this data here is for the Tide dashboard; it's pretty much the same thing that was exposed in Velodrome, it's just being exposed from Prometheus itself and not using the push gateway.
B: So these are some of the pre-existing dashboards — they're still here — and then there have been a couple of new dashboards added, like the actions of Tide, and you can look at, you know, interesting hysteresis over a week. The metrics underneath this sort of thing can also be used, for instance, to calculate, like, HTTP request and response latency for Deck, and there's an HTTP section here.
B: The metrics underneath all of these plots are also hopefully going to be used to create alerts. So, for instance, one of the alerts that is set up — though it's a little bit finicky — is: if hook hasn't ingested a webhook from GitHub for five minutes during what we've arbitrarily called, you know, business working hours, then the alert fires. And, you know, some other alerts that might be interesting could be if the ratio of 500s or 400s coming from Deck is larger than expected.
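[Editor's note: a Prometheus alerting rule in the spirit of the webhook-staleness alert described here might look roughly like the sketch below. The metric name `prow_webhook_counter` and the exact expression are assumptions, not the rule actually deployed; the real rule also gates on business hours, which is omitted here.]

```yaml
# Hypothetical sketch of the "hook stopped ingesting webhooks" alert.
groups:
- name: prow-hook
  rules:
  - alert: WebhookIngestionStalled
    # Fires when hook has recorded no GitHub webhooks over five minutes.
    expr: sum(increase(prow_webhook_counter[5m])) == 0
    labels:
      severity: warning
    annotations:
      message: hook has not received a GitHub webhook in 5 minutes
```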
B: We also show the remaining tokens for GitHub, alongside, like, what exactly the calls were making, stuff like that. And we're hoping to add more dashboards as we realize which parts of the system are useful to monitor. So, if you're looking to get involved with this, all of the manifests for this can be found in test-infra; they are under the prow directory.
B: Actually, that's US working hours — because, right, so the version of that that's running on the Prow cluster I personally administer has working hours defined as any working hour where we have somebody who can be on call. Obviously somewhat suboptimal, but it's at least better; we had a fire on one of the U.S. holidays because nobody was working. But, you know, it's going to be a fragile alert — it should have only a few false positives, but it's admittedly fragile.
A: I'm hugely excited by this. I haven't really been paying too much attention, because I didn't want to watch a pot boil, but I just did a quick glance and it looks like you have all the Grafana dashboards stored inside of ConfigMaps — so we can use the same, like, GitOps-driven model that we use to update jobs and stuff to update our Grafana dashboards?

B: Yep, that's right.
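[Editor's note: the pattern described — dashboards checked in as ConfigMaps so changes flow through the same review process as job config — could be sketched as below. The names, labels, and dashboard JSON are illustrative, not the actual manifests in test-infra.]

```yaml
# Sketch: a Grafana dashboard stored as a ConfigMap, managed via GitOps.
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-hook
  labels:
    grafana_dashboard: "true"   # hypothetical label a provisioner watches for
data:
  hook.json: |
    {"title": "Hook", "panels": []}
```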
B: Or something — so this is currently, like, kind of an open question. The approach that OpenShift has been using is: we have a staging namespace with all of this stuff deployed, and the Grafana instance there is just using the production data as a backend, so you can log into the staging Grafana, noodle around manually, and then create something that you like — see what it will actually look like with production data. Cole wasn't super interested in having to deploy twice as many things, so right now the Kubernetes Grafana has an admin login that you can use, and the thought was: you would create the dashboards there, and then delete them, and then check them in. Yeah — but I think that's still kind of up in the air; Katharine was talking about, you know, maybe having a staging anyway, so yeah.
A: I mean, I personally am interested in it from the perspective of: we've had a number of people — like on the release team, or just community members in general — who are interested in, like, making dashboards to show things, since we've talked about test health and flakiness and stuff like that. Yeah, I'd love to give them, like, a canonical place to do that.
B: Yeah, and I think that, like, part of that is a question of: are there essentially security problems with that? I don't know if there are. I just want to pull up super quickly one of the dashboards that we've got that I think we haven't migrated yet, 'cause it's just specific to what we're doing, but it would be useful.
A: The one thing that I don't see in this dashboard, that I know we use Velodrome for — but I think we just use it as, like, an InfluxDB sink — is the jobs that run queries against BigQuery and then transform the results using jq into things that can be dumped into Influx, so that we can get statistics about — yeah, the BigQuery metrics dashboard, the one that shows sort of everything that's going on. Yes.
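[Editor's note: the pipeline described — query BigQuery, transform each result row with jq, dump the output into InfluxDB — could be sketched roughly as below, with Python standing in for the jq step. Every name, field, and value here is invented for illustration; the real jobs live in test-infra.]

```python
# Hypothetical sketch: turn query-result rows into InfluxDB line protocol,
# the text format Influx accepts for writes (measurement,tags fields timestamp).

def row_to_line_protocol(row, measurement="job_results"):
    """Render one result row as an InfluxDB line-protocol string."""
    tags = ",".join(f"{k}={v}" for k, v in sorted(row["tags"].items()))
    fields = ",".join(f"{k}={v}" for k, v in sorted(row["fields"].items()))
    return f"{measurement},{tags} {fields} {row['timestamp_ns']}"

# Example rows as a query job might emit them (made-up data).
rows = [
    {"tags": {"job": "pull-kubernetes-e2e", "state": "failure"},
     "fields": {"runs": 3}, "timestamp_ns": 1561420800000000000},
]
lines = [row_to_line_protocol(r) for r in rows]
# Each line could then be POSTed to Influx's write endpoint.
```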
B: At the moment it's not using any Influx at all, and that's where it's, like, significantly less complicated — it's not using it as a backing store at all; it's just using Prometheus entirely.

A: Got you. What's the retention like on that?

B: It just depends on how large you make the PV. I don't know off the top of my head what it's configured for; for OpenShift, I think we have 100 gigabytes and we have it set to two months.

A: Okay.
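[Editor's note: the two knobs mentioned — PV size and time-based retention — correspond to a Prometheus flag and a PVC spec, roughly as sketched below; the values mirror the OpenShift figures quoted (about 100 gigabytes, two months), and the names are illustrative.]

```yaml
# Prometheus retention is set via a startup flag, e.g.:
#   --storage.tsdb.retention.time=60d
# The backing store is just a PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
```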
B: Absolutely. So yeah, when you're creating your Grafana dashboards — like, configuring which data source they draw from — if we continue to have the pieces that ingest GitHub data and digest that into Influx, as well as having the BigQuery stuff also, you know, publishing to that database, then as long as that database is around, all of that can just be transparent.
A: This is partially inspired by the awesome docs that Daniel showed us from the Kyma project — though I still haven't even begun to plumb the depths of what we could use from that. The other thing is all this stuff around job management. If I click on this, it will take me to a README inside of the config/jobs directory, which tries to spell things out, hopefully, a little bit more. I still don't think it is quite as ideally step-by-step as what I have seen from the Kyma project and other people, but we try to talk a little bit more now about, like: hey, what even are presubmits and postsubmits, when we use this language? What are sort of our best practices for the images you're supposed to use for jobs? What even are these preset things that you see everywhere?
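[Editor's note: for readers unfamiliar with the terms above, a presubmit entry of the kind the config/jobs README documents looks roughly like this; the job name, image, command, and preset label below are made up for illustration.]

```yaml
# Hypothetical presubmit: runs on PRs before merge.
presubmits:
  kubernetes/test-infra:
  - name: pull-test-infra-example
    decorate: true
    labels:
      preset-service-account: "true"   # a "preset" injects shared env/volumes
    spec:
      containers:
      - image: golang:1.12
        command: ["go", "test", "./..."]
```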
B
Eric's
knowledge,
no,
he
is
not.
Yes,
it's
just
for
everyone
else
that
might
be
interested,
definitely
jump
into
this
right
here
and
hopefully
you're
not
hearing
the
wine
jump
into
the
thread
and
put
your
thoughts
in
there.
The
basic
thing
is
I
think
you
can
so
right
now
we
were
kind
of
in
like
a
weird
halfway
State
in
Prague,
where
you
can
either
configure,
but
you
can
configure
proud
to
operate
inside
of
a
namespace
by
changing
config
mode.
You
can
do
it
by
changing
the
credentials
that
are
given.
B: On top of that, you're also able to, like, direct — like, I think the build controller and the pipeline controller currently also honor a specific field on the ProwJob that tells them which namespace to use to put things, where the things they create are not necessarily pods. And now, on top of that, I believe with those two controllers as well, you're able to change how that works by giving cluster credentials with a default namespace. So—
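[Editor's note: the config-side half of this — pointing Prow at namespaces through its config file rather than through credentials — involves fields along these lines; the values are illustrative, not the settings on any real cluster.]

```yaml
# Sketch of the namespace-related fields in Prow's config:
prowjob_namespace: default   # where ProwJob custom resources are created
pod_namespace: test-pods     # where the controllers schedule build pods
```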
A: I'd say, personally — like, I think I can understand the idea that we want to, like, have sort of sane prescriptive defaults, but we also want to allow enough flexibility for people to sort of meet their use cases. So I'm wondering if this is a case of, like, there being too many options that can conflict with each other too much, or if we just haven't, like, documented sort of the hierarchy of which options should override which, so that we can sort of— it's—
A: That's fair. I have noticed a trend a lot where we're also not really sure what the best practices are. So our tools sort of have, like, really sharp edges that can allow you to cut yourself, but then we try to sort of figure out what sane practices are, and then lint or test those away via presubmits. So—
A: Yeah, this is just kind of, like, off the agenda, but I tried moving the job over to the pod utilities as part of that docs thing, and I find that there's still a little bit of weird, unexpected behavior there that is difficult to test locally. Like, I would prefer to live in a world where we can encourage other people to migrate their jobs over, and the best way to empower them to do that is to give them, like, a, you know, develop-and-test cycle before they open up a PR. Right.
B: With these, like, sub-sub-items in it, it was like: the self-service management story for ProwJobs is, like, pretty poor without layering on extra tools — which I think... but I think, as I mentioned to you on Slack, part of that has always been, like — I think there's a higher level of trust in other deployments of Prow, and so then, you know, running job changes as a presubmit for PRs that are changing config makes sense. Yeah, there's definitely a middle ground.
A: So I kind of feel like I don't really have the right people here, or enough time, to talk about the last thing I have on the agenda, which was breaking Prow into its own subproject and understanding the mechanics of moving Prow to — sort of, the code lives over here and the configuration lives over here. That's what I want to get to, and I want to understand what technical or social limitations we have in between here and there. I'm also aware that next meeting, next week, would be July 2nd. I am planning on being around for that, but it is near — it is adjacent to — a holiday in the US, which may mean some people are not available. So I will be sending out a notice to the mailing list asking if we want to hold it, but I think it would be a great idea to sort of walk through the Prow epics doc next week and really kind of hash some of these things out. I also am aware that we have a sort of GMT-friendly, or European-friendly, meeting happening on Friday, Steve.