From YouTube: Kubernetes SIG Testing - 2020-10-20
A: Hi everybody, today is Tuesday, October 20th, and you are at the Kubernetes SIG Testing bi-weekly meeting. I am your host today, Aaron Crickenberger, also known as spiffxp on GitHub, Slack, and all the places. We're all going to adhere to the Kubernetes code of conduct during this meeting. If you have any questions or concerns with my behavior or the behavior of others, you're welcome to reach out to me or conduct@kubernetes.io.
A: So I put a couple things to just kind of revisit on today's agenda based on our discussion last week, so I guess I can share my screen to kick us off, if that's cool with folks. Let's see... cool, I think you're looking at the meeting notes.
A: So first up, I wanted to talk about the upcoming Docker Hub changes that are rolling out November 1st. The TL;DR is in the issue description here: basically, as of November 1st, Docker is going to rate limit pulls of images from Docker Hub.
A: The concern here is that this rate limit applies across all images. So if we focus on the images that are pulled most frequently, as I'll show below, I think we're okay, but the concern may be that the long tail of jobs that run across the build cluster may end up causing random nodes to start getting rate limited by their IP address.
A
My
my
quick
look
at
this
is
that
it's
gonna
be
a
lot
of
plumbing
to
do
so,
but
that
may
be
an
option
worth
investigating.
A: Another idea folks suggested was implementing some kind of pull-through cache, which again we'd have to hook up to all of the build nodes as well as all of the clusters under test. For mitigation, we decided we should sort of audit all of our tests and image builds and see just how bad the situation is, and we should generally advise people to not use Docker Hub and instead use k8s.gcr.io, which is the image registry, or repository, that the project provides for everybody.
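For reference, a pull-through cache along the lines discussed here is usually an upstream registry running in proxy mode. The snippet below is only a sketch of that idea (not something the meeting settled on), using the open source Docker registry's documented proxy configuration:

```yaml
# Sketch of a pull-through cache: an instance of registry:2 configured to
# proxy Docker Hub. Build nodes and clusters under test would then need to
# be pointed at this registry as a mirror, which is the plumbing concern
# raised above.
version: 0.1
storage:
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: :5000
proxy:
  remoteurl: https://registry-1.docker.io
```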
A: So Antonio rightly honed in on all of the images used under Kubernetes, which works out to these. I sort of found the same thing by going and looking at the kubelet logs from a cluster under test, and saw that these were, at least for the default job, all of the images that were pulled from Docker Hub on a given cluster under test.
A
And
then
I
tried
to
survey
the
default
build
cluster
within
google.com,
but
did
not
have
the
appropriate
credentials
or
time
to
survey
that
so
then
we
started
look
so
then
I
sort
of
surveyed
all
of
our
job
configs,
to
see
what
are
all
of
the
images
that
are
used
that
are
not
hosted
by
gcr
to
io.
You
can
see
a
lot
of
golang
and
node
and
python,
and
alpine
and
stuff.
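A survey like that can be approximated by grepping the prow job configs for image references that are not on a gcr.io registry; a rough sketch, where the paths and patterns are assumptions rather than the exact commands used:

```sh
# Rough sketch: list images referenced in job configs that do not come
# from a *.gcr.io registry (paths and patterns are illustrative).
grep -rhoE 'image: *[^ ]+' config/jobs/ \
  | sed 's/image: *//' \
  | grep -v 'gcr\.io' \
  | sort | uniq -c | sort -rn
```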
A: It's not something we need to explicitly specify in our job configs, we're just going to get that for free, and it includes pretty much all of the images that Antonio laid out here, including alpine and nginx and perl and redis, which I think are kind of verified later down here in the comment stream. So, Claudio, you raised an interesting point that mirror.gcr.io may not necessarily mirror multi-arch requests, and he also reminded us that there are dedicated plans for open source projects, which I had started investigating right after last meeting; I needed to gather some more data on exactly how many images we pulled, where, and why.
B: Is there somebody here who knows a bit more about the multi-arch use case? Because that part I didn't quite get. It seems to me that for the common things mirror.gcr.io would work, and so I'm not sure why it wouldn't for a multi-arch solution.
C: From what I've checked, there aren't a lot of jobs which are currently using the other architecture types. So for most scenarios we should be fine, but I think there are still too many of them, which will require some solution.
B: Yeah, I think it makes sense that by multi-arch you mean different architectures. That makes a lot of sense now. Thank you.
A: Yeah, sorry about that, I used a buzzword there. Please do call me out when I do that.
C: I've actually listed here the jobs that are using other architecture types. Most of them are... there are three periodic jobs, a couple of presubmits, and a couple of node jobs that I'm not sure if they're periodic or presubmits, or something like that.
D: I think most of the stuff that's not amd64 is not even in our CI at all. I think we should also step back and acknowledge that mirror.gcr.io is a mitigation. It may help us avoid pain, but it's certainly not a solution, because even if we didn't have the multi-arch problem, it may not necessarily mirror the images we use.
D: It's an additional cache that happens to be fast, that we can very trivially roll out to CI to help mitigate some of what's going on in our CI. It doesn't help, for example, end users that need to pull in their own environments. I think we still ultimately either need to get our Docker Hub usage to where it's not rate limited, if that's possible, for any of the images we use, or, you know, move off of it.
C: That's a good question. From Docker Hub we only care about five images, which are the ones most commonly used in conformance tests, and those are in the Docker Hub library. As I mentioned, those are the busybox image, two versions of httpd, and two versions of nginx.
C: So I was wondering if I could make it use nginx images instead, so that would get us from five down to three images.
D: My concern is that, and I don't know that anybody has confirmed this to me yet, but unless I read it wrong, my initial impression was that the rate limiting is per client. So if we get an account that's not rate limited, that could be used by CI, but it doesn't solve the problem that we have tools that are built around rate-limited images, so then all of our users will need to do the same thing.
D: And also, more specifically, someone from Docker commented on a thread a while back and mentioned contacting the CNCF related to that sort of account. I don't know beyond that.
A: So my concern, just personally speaking, is I want to make sure that all of our PR traffic against kubernetes/kubernetes does not grind to a halt on November 1st. So I'm principally concerned with making sure that all of the CI jobs that are merge-blocking and release-blocking for kubernetes/kubernetes work, then expanding out from there. We can talk about CI jobs that involve Kubernetes but otherwise, you know, live on other boards; here I'm talking about Cluster API jobs, I'm talking about kops jobs.
A: You know, the aks-engine jobs; alternate means of provisioning a cluster may require alternate means of plumbing the credentials through. Expanding out from there...
A: You can talk about CI for all of our hundred-and-something sub-projects, and there I feel like we may have most of the common cases captured, and we should advise people on how to move to k8s.gcr.io or something else, and just let them know that they may be experiencing some bumps in the interim. And then, and only then, do I personally start caring about the experience of our developers and making sure that we don't have anything that's causing a developer's machine to get rate limited, because I feel like developers can authenticate and at least get the 200 pulls per 6 hours instead of 100 per 6 hours.
A: So those are other concerns I have about stuff running in CI, which I tried to list here in the notes. Ben, maybe you want to articulate some of this. You know, a lot of the things I believe we have covered are all of the jobs that end up using cluster/kube-up.sh, but not all of the jobs; there are other jobs that use docker-in-docker.
D: So docker-in-docker should be trivial, if we haven't done it already; I think I actually did already, I just need to confirm before this comes up, to enable mirror.gcr.io. It's a really simple configuration change, and we control that. For kind, it's a bit of a layering violation to be using that, because it doesn't necessarily make sense for all users and it's not clear how it would inherit that from the host, so we may have to think about how we do that.
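The "really simple configuration change" for docker-in-docker presumably amounts to pointing the Docker daemon at mirror.gcr.io as a registry mirror, roughly like the sketch below; this is the standard daemon setting, not necessarily the exact change that was made:

```sh
# Sketch: configure the Docker daemon to try mirror.gcr.io before
# falling back to Docker Hub (standard registry-mirrors setting in
# /etc/docker/daemon.json), then restart the daemon.
cat > /etc/docker/daemon.json <<'EOF'
{
  "registry-mirrors": ["https://mirror.gcr.io"]
}
EOF
systemctl restart docker   # or however the daemon is restarted in that image
```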
D: I think for the e2e tests, that's probably the main thing. So all of the image builds and whatnot for core Kubernetes don't use it; all of the things we run in presubmit or postsubmit don't use it. The only things that do are, say, the e2e image or something, which is built much more infrequently. But we have the images that we use for e2e pods that Claudio was mentioning; for those, I think we probably want to get them onto k8s.gcr.io.
D: So another thing that hasn't come up yet is we might look at extending the image promoter system to allow promoting images that we don't control into some registry, at least a staging one or something along those lines, that we can use for things like e2e, so that we don't actually have to host our own nginx image built from source.
C: So similar to what I suggested last week for the busybox image, right?
D: Yeah, I think so. I apologize, I'm drowning.
A: I'll stop sharing real quick just so we can see faces while we're talking. So my impression is the image promoter right now doesn't support promoting from arbitrary registries; I think it only supports promoting from gcr.io.
A: I would see nothing wrong with setting up jobs that, you know, have a very basic Dockerfile that's FROM the Docker Hub hosted image and ends up landing in a staging repo. My thought would be the e2e test images staging repo that we're already using for things like agnhost and all of the other images that are used by our e2e tests.
D: Fair suggestion. I actually think it would be relatively straightforward; I mean, even without a Dockerfile and without a docker image, we could just write some cloud build jobs that just push to staging, like pull, tag, push, or something along those lines. Or even someone that has credentials could do it manually; I think we allow manual pushes to the staging registries. And then promote those, or serve out of staging.
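The manual pull/tag/push flow mentioned would look roughly like this; the image tag and the staging repo name below are placeholders, not the real destinations:

```sh
# Sketch of "pull, tag, push" into a staging registry that jobs (or the
# image promoter) can then consume. Names and tags here are placeholders.
docker pull docker.io/library/busybox:1.29
docker tag  docker.io/library/busybox:1.29 gcr.io/k8s-staging-example/busybox:1.29
docker push gcr.io/k8s-staging-example/busybox:1.29
```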
D: Just as possibly an option, if we have some concerns about why k8s.gcr.io is hosting busybox or something, it might make sense to just leave it in the e2e staging images, which are a little bit less visible.
A: Okay, so it sounds like one concrete task that I will own is reaching out to Docker Hub and the CNCF to look at either an open source set of accounts or paid accounts.
D: All right, I think I saw a hand from Ben. Yeah, I probably know that best. I'd be curious to know up front if there's any possibility of setting up registries that aren't limited on that end, or if it does have to be the client. If it has to be the client, then yeah, we'll need to plumb that through.
D: That's my read as well, but if we do wind up talking to Docker Hub I'd like to confirm that, because that's not as nice for users versus something like GCR, where we can just pay on the hosting side and not make all of our users worry about it. Right.
D: If we do drop that, can you send me a ping periodically on Slack or something? GitHub notifications have been a bit of a tire fire for me.
D: I think most projects moved to alternative free hosts, like possibly GitHub's package registry. I think we should look at those for some usage, but we may not be as inclined to migrate to that versus, say, GCR, which we kind of already have tooling around and whatnot.
A: Okay, and then I feel like the only open question I have is: what do we need to do for kind?
D: Yeah, that one's been a little bit more of a sticking point, I would say. For most of our upstream stuff, like the e2e images, it makes sense to host them alongside other things in k8s.gcr.io. But for kind, we risk causing issues with our Chinese contributor and user base, which so far has been able to rely on being able to create a Kubernetes cluster by pulling one image from Docker Hub, which is actually available behind the great firewall.
D: So just moving to k8s.gcr.io is not a great option there. The other thing is we actually have been using mutable tags for the moment, which is not something the image promoter allows.
D: It's a bit TBD what we should do for that one. And then there's also, if we were going to rely on mirror.gcr.io, we may need to look at in what way it would make sense to plumb that through in CI, or any special docker pull credentials or whatever, both of which should be doable; I'm just not sure how we want to do it.
D: I think the next thing I want to do is investigate the GitHub registry, because I think that's also a potentially attractive option for our smaller projects that don't necessarily need to be hosted in k8s.gcr.io or have all the promotion and whatnot. There are some projects that are, you know, using GitHub Actions; it may be nice if they can just do that, have all the credentials handled for them, and no fuss.
D: I've heard it's supposed to be available, but I need to confirm more about how well it works. I'm not sure if that's an option we should move forward on, but I want to investigate it.
A: That sounds worthwhile. My knowledge is probably a couple months outdated, but I feel like RelEng, the release engineering team, was messing around with this for a little while, and they found that GitHub's package registry was too flaky when it came to pulling artifacts reliably.
D: Interesting. Well, I also believe they've had a major rewrite since, which is part of what I've waited for: they didn't let you pull by digest, which was a non-starter, both for my own concerns around pinning images and for containerd, which always derefs to the digest and pulls by digest.
A: Yeah, okay, I kind of want to move us forward in the agenda. Do we have any other open questions on Docker Hub? Okay, so this brings me to my next open question. I sent out an email thread about this a little while ago: we are going to not have our regularly scheduled meeting on November 3rd, just because that coincides with the U.S. election.
A: A number of other SIGs have also sort of canceled or deferred their meetings. When I started looking at scheduling the next meeting after that, as regularly scheduled it falls during KubeCon, and I imagine people might be elsewhere or otherwise occupied.
A: So my question to folks is: we could have no meetings in November, and our next regularly scheduled meeting would fall on December 1st. My thought was, given everything we've just discussed about Docker Hub, it might be worthwhile to have a meeting on November 10th, just to kind of see where we're at as far as mitigating the Docker Hub stuff. The other option is, if we're shifting one meeting off by a week, we have another meeting two weeks from then, November 24th. Do people have any strong opinions on how many meetings we should have in the month of November and when we should have them?
D: I do think we should at least have one, and I actually think it's a good idea to move them around these events, which we're otherwise going to have understandable attendance issues with. I don't know if we need all of them; I think the end of the year is usually a slow time anyhow, but we actually seem to have a number of things to discuss at the moment.
D: In particular, the Docker Hub changes roll out between now and the next possible one, so it'd probably be good to have a meeting after that to just circle back with everyone on this.
A: So, all right, I will reschedule the November 3rd meeting to November 10th and delete the other meeting. Okay, next thing on the agenda, and I'll share my screen for this again, is just to kind of check back in on where we are at with CI policy updates.
A: So I tried to lay out what the next steps are supposed to be for all the build jobs. To recap, there are two Google Cloud buckets, kubernetes-release-dev and kubernetes-release-pull, which...
A: So I proposed the replacement buckets, k8s-release-dev and k8s-release-pull, and what we need to do now is create jobs that write to those new buckets and then start changing over jobs to consume the builds that are placed in those buckets. I think Carlos from the release engineering team created a canary job for the build-fast job, and Arno has opened up a PR for the build job, just for release-blocking, right, and so now the next step is to have jobs consume those artifacts.
A: So, once... I was curious what the holdup was on implementing that, and when I went looking, I discovered that kubetest had a bunch of hard-coded locations in it. So I added this pull request to add flags to tell kubetest where to extract its release artifacts from, and my plan was to change over a job this afternoon, and if that looks good, I think we should change all the other jobs over.
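Conceptually, the switch-over means each job's kubetest invocation stops extracting from the hard-coded kubernetes-release-dev bucket and points at k8s-release-dev instead. The sketch below is only an illustration of that shape; the flag name is a placeholder for whatever the pull request actually adds:

```yaml
# Sketch only: a job's kubetest args extracting CI builds from the new
# bucket. The bucket name comes from the proposal above; the flag name
# is a placeholder, not necessarily what the PR adds.
args:
- --extract=ci/latest
- --extract-ci-bucket=k8s-release-dev   # placeholder flag name
```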
A: The complication with the build job is that we discovered it doesn't just write its artifacts to a Google Cloud bucket; it also has this flag where it publishes images to a GCR repo called kubernetes-ci-images. I have no idea what project owns that repo, so it's unclear to me whether we would be allowed to continue to write to that or if we would need to create a new staging repo to write to.
A: At the time I tried to go look to see what these images are and what they are used by; code search was down at the time, but I think I grepped around in the Kubernetes code base and it looks like kubeadm makes some references to it, so I'm assuming that means most of the Cluster API providers probably have that repo hard-coded somewhere in them, and they may need to be updated.
A: Assuming we take care of that, I think we are good. And just to be clear, this is something that does not require any google.com access to do; it just requires somebody to make the changes to the jobs, send them out, and make sure that they look okay.
A: So it is my hope that this is something the release engineering team, maybe the CI signal team, could help out with, and I think we'd then be able to finally close out the big three from our original policy issue, which I'm trying to find right now. Okay, so I'm just trying to navigate back up to the umbrella issue.
A: Rob, I hate to put you on the spot, but I feel like this is something that had been discussed by the CI signal folks over in SIG Release. Do you have any updates on that?
A: Yes, I had... sorry. So if we finish out the build jobs, we'll have finished out the top three, right? And then the other things that could happen right now too, in fact, are mandating that all the prow jobs have contact info (so this issue, we both still have to look at this, yeah), and then sort of removing any jobs that are just egregiously failing forever; like, that maybe is the signal that nobody's actually maintaining them or watching them.
A: So that's sort of the band-aid or the stop-gap, and then how would we carry this policy forward? How would we declare: look, if your job has been failing for more than N days or N weeks, we're going to remove it. So figuring out what that policy should be, crafting it, and rolling it out.
G: I suppose, from our point of view, I agree with the policy, and I suppose I'd just be concerned about doing that in such a way that we explain the policy to the community. And, I suppose, there are multiple ways of going about this; I think I'd have to take guidance on what the best way to publish the policy would be and how to communicate that, but I presume it's just sending out your formulation of the policy here, which I think is reasonably straightforward. We can put some numbers on how long a job has to continuously fail before we go, okay, right, this is suspect; publish a policy, then just announce it and then implement it. Is there anything here that's really fundamentally hard?
A: To recap, we've got this query that is run daily, and it shows which jobs have been failing the longest. These would probably be the egregiously failing jobs; some of these have been failing for coming up on 900 days. And we could probably have something that just looks at this periodically and could generate a report or send an email out or something, in terms of order of operations.
A: So generally we blast stuff out to kubernetes-dev when we're announcing policies, looking for comments on them, you know, allowing enough time for lazy consensus. But another thing that could be helpful to do here is to figure out who owns these jobs.
A: We could maybe establish who owns the jobs and then not just send to kubernetes-dev, but also send to those SIGs or those people specifically, as maybe a way to raise signal. So it could be that we want to finish this issue first, yeah. And my rough guess at how I would assign ownership would be to just take a look at where the jobs are in testgrid right now.
A: Whoops, that's not going to work. Sort of when we were working on, you know, measuring how we were doing on CSI policy, or sorry, CI policy, I had created the prow job report tool in test-infra under the experiments directory, which outputs a bunch of stuff to CSV, and then I imported that into Google Sheets, and so that report lists all of the dashboards that a job is on.
A: It takes a good guess at which dashboard that job should belong to. So for cases where jobs have multiple dashboards assigned to them, the most important dashboard is whether or not it's on sig-release-master-blocking, right, but then secondarily, which other SIG name or working group name does that job belong to. So the evil, noisy approach we could take is to just assume that if it's on your dashboard, you own it, and then we put your SIG's mailing list as the alert address, and then we say: you're going to get alerts for these jobs, do you want to continue to get alerts for them?
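Concretely, the dashboard and the alert address live as TestGrid annotations on each prow job, so "put your SIG's mailing list as the alert address" would look roughly like the sketch below; the dashboard name and mailing list are made-up examples, not real config:

```yaml
# Sketch: TestGrid annotations on a prow job. The dashboard name and
# mailing list are made-up examples, not real config.
annotations:
  testgrid-dashboards: sig-example-misc
  testgrid-alert-email: sig-example-test-failures@googlegroups.com
```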
G: Let me take a look at this. I backed away from this for a time to break the back of the report, and this week I'm hopefully going to finish out the status report, and then I'll get stuck back into this and figure some things out.
A: Sure, yeah, my thought is this could be... oh sorry, I see stuff in chat too. Arno, were you saying you wanted to work on this or something else? I missed that.
A: Okay, but, like, this... I think CI signal has a bunch of shadows too, so it could be a great thing to point them at.
G: A lot of subs on the bench, as I said before, yeah. Yeah, we could perhaps talk about this offline, if that's okay, and then I'll raise an army to work through this, I think.
A: Okay, any other questions, comments, or concerns on the CI policy stuff?
A: All right, I super appreciate everybody's help in rolling this out. I think it's pretty clear CI has gotten a lot better in the project since then. I know I locally have this Grafana instance that I'm running; eventually, one day, maybe I'll figure out how to actually get it up in the cloud and make sure the credentials aren't broken.
H: The release-blocking job, we originally ran that out of our cncf.io prow and it was able to function. It's in the agenda; I think the first two links there are where it succeeds on our prow and it fails on the Kubernetes prow, and we tried three different variations on it just to make sure, but there's something slightly different. I suspect it's in the sidecar, maybe; I think there are decorators, and we weren't able to get decorators working on our prow config.
H: Yeah, and one of the decorator things is the entrypoint, so possibly there's some different behavior on the entrypoint. I think I also included a link to our docker image, or the code underlying it.
H: It seems to work on all variations on our side, and it's just been difficult to understand what's happening when it's running there.
D: Go ahead. Are you saying that you are running it without the pod utilities in your cluster, and you're running it with the pod utilities in our cluster?
H: That's correct, yeah.
D: So I would guess, then, that it is... so there's the entrypoint mechanism in a Dockerfile that says run this as the entrypoint, and then you can have arguments to it dynamically.
A: I guarantee that what this output is here is the entrypoint container provided by pod utils. So, first off, the purpose of the entrypoint container is to be able to automatically wrap the command of the test container that you're providing, and then it will take standard output from that and dump it to the build log, which gets uploaded to GCS, and it's also responsible for automatically uploading anything that lands in the artifacts directory to GCS.
A: So if we don't use the entrypoint container, you don't get any logs. The entrypoint container almost undoubtedly runs as root. So I think what Ben and I are trying to say is, one suggestion is to try using pod utils on your CNCF prow; that way you could debug it there, where you maybe have more access to the nodes and the cluster, and you could maybe have the command that you use for the container...
A: ...be the script that tries switching to the postgres user before it runs stuff. What I don't know, and this is my lack of Kubernetes knowledge showing, is whether a security context is going to yell at you.
D: For doing this? I don't think we have jobs doing that currently, but you also probably can just set the user ID on the container in the pod spec in the prow job; you'll just need to find out what UID that is.
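Setting the user ID in the prow job's pod spec would look roughly like the sketch below; the image name and UID are placeholders (the real UID is whatever the postgres user maps to in that image):

```yaml
# Sketch: run the test container as a specific UID via the pod spec
# embedded in the prow job definition. Image and UID are placeholders.
spec:
  containers:
  - image: example.test/apisnoop:latest
    securityContext:
      runAsUser: 999   # placeholder for the postgres user's UID
```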
H: Also, in our entrypoint script, if you want to bring that up, there's a point where it compares the arguments provided and the UID, right there in the same place, to decide whether to set up postgres, and that's probably where we're running into this difference in arguments. So, to close things up real quick: we'll try and get pod utils up and running, and we'll see if we can't modify the args to just have a simple entrypoint and likely just run as root, because that allows us, when we run as root, to chown, and then, if our first argument on line 51 is postgres, it will do the initdb that we were currently lacking.
H: So either we weren't running as root, or the db wasn't getting set up, and we probably need both those if statements to trigger, including the one on 30... looks like line 32. Okay, thank you for that help.
H: The one other issue that we have is related... did it get on there? Yes, advance.
H: It is so much fun. In order for us to correctly measure the amount of conformance coverage that we have, we need to have the audit policy not remove stuff on us. However, the recommended... or at least, it's not documented anywhere, but there is a variable that says please set this to override the audit policy, and it's all deep in that ticket. I have gone through and run that manually, like this script, and generated the audit policy.
H: It is a valid audit policy, but when you look at the logs it says it's missing values or it's invalid YAML. And also, on our prow cluster, I have not been able to get up and running with running prow jobs on kind, the ability to run the Google jobs like this, in order for me to be able to debug it; plus the scripts are pretty hairy.
H: Let's modify the kind conformance job so we don't have to run it in Google and I can run it locally, but we need to modify it to have an audit policy so that it logs and creates the audit log in the same way that the other jobs do upstream, because we consume that audit log and bring it into the APISnoop db to see the user agents that are passed to the API server, to calculate coverage.
D: I mean, finding out why it's not valid seems like something we ought to be able to do. I don't know if that's the correct path forward, though; I'd probably want someone else's opinion on which jobs we should use, given my inherent bias.
A: Let's see, kind has its conformance job, so we could use kind. It's valid... what's happening here... What I would do for the next step of debugging this job is, I would recommend you use cluster/kube-up.sh, or you could even run this on your prow instance, where you're going to have, you know, more access to SSH to things. You could probably set it up so that instead of automatically tearing the cluster down, you modify the job so it just leaves the cluster up, and then you could go SSH to the node in question and see for yourself what the YAML actually is.
H: On that node, I sort of did that. I ran this job, but I set the entrypoint to sleep and the args to infinity, ran the generation of the YAML file, pulled that out and used it with kind, and the policy was fine. So if something else is happening, then, again, we probably need to figure out how to get pod utils up in such a way, in prow, which is running on AWS, as to set up the Google credentials correctly, because without those pod utils we're not going to get the artifacts.
A: Yeah, so if this is just about capturing API audit logs, you can always do that with kind, that's true. What APISnoop has been doing in the past is looking at the default jobs' audit logs versus the conformance jobs, so you get coverage of what has been tested versus what is tested with conformance tests. And the reason we said no, don't turn on... don't put events in the audit logs, is because Kubernetes generates a lot of events and the audit logs get huge, and there are performance implications.
H: I have a link: it's actually in the cncf/apisnoop repo. There's a kind folder, and that has the kind config and the audit policies. If we could just find a way to update the existing conformance job for kind to do this... and we just need to figure out where the audit logs go, because I'm not sure, but...
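For reference, enabling apiserver audit logging in a kind cluster is usually done with a config along these lines. This is a generic sketch with illustrative paths, not the actual files; the real ones are in the cncf/apisnoop kind folder mentioned above:

```yaml
# Sketch: kind cluster config that mounts an audit policy into the
# control-plane node and turns on apiserver audit logging. Paths are
# illustrative; the audit log ends up on the node under /var/log/kubernetes.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
  - hostPath: ./audit-policy.yaml
    containerPath: /etc/kubernetes/policies/audit-policy.yaml
    readOnly: true
  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    apiServer:
      extraArgs:
        audit-policy-file: /etc/kubernetes/policies/audit-policy.yaml
        audit-log-path: /var/log/kubernetes/kube-apiserver-audit.log
      extraVolumes:
      - name: audit-policies
        hostPath: /etc/kubernetes/policies
        mountPath: /etc/kubernetes/policies
        readOnly: true
        pathType: DirectoryOrCreate
      - name: audit-logs
        hostPath: /var/log/kubernetes
        mountPath: /var/log/kubernetes
        readOnly: false
        pathType: DirectoryOrCreate
```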
A: Yeah, thanks for that. So I think I would say look at using kind for this; that seems like a faster approach. Wading into debugging cluster/kube-up seems like the other option, if you want to continue using that.
A: Okay, Andrew, you asked to create a new repo called e2e-framework in kubernetes-sigs. I'm not gonna lie, the only reason I haven't slapped my plus-one on this is because I'm wondering if we should use the existing repo we have in kubernetes-sigs called testing_frameworks.
A
Which,
apparently,
is
retired,
never
mind
so
yeah
we
should
probably
create
a
new
repo.
Do
you
want
to
speak
to
it
a
little
more?
I
know
we've
sort
of
we've
chatted
about
this
in
in
group
chats,
but
I
don't
know
if
it's
chatted
to
more
people.
That's.
J: Yeah, okay, so just to add more context: there was some work that was being done by Alejandro and Tim and I sometime last year, and the main motivation from my end... So the work we were trying to do was to take the internal end-to-end testing framework in kubernetes/kubernetes and move it to a place where it's more easily consumable for cloud providers, CSI drivers, and kind of the entire ecosystem that has been writing e2e tests based on that, and so the main motivation was...
J
Let's
move
that
the
first
motivation
was:
let's
move
that
thing
to
staging
so
that
things
can
depend
on
it
without
pulling
in
all
the
go
depths
from
kubernetes
kubernetes,
and
then
I
know
you
know.
Stick
testing
folks
here
had
concerns
around
moving
the
staging
moving,
the
current
e3
framework
to
somewhere
that's
more
publicly
consumable,
and
so
the
alternative
were
that
we
kind
of
landed
on
was
okay.
J
Let's
just
build
a
really
minimal
end-to-end
testing
framework
from
the
ground
up
that
is
going
to
address
cloud
providers,
csi
drivers
and
other
projects
that
need
some
sort
of
like
base
end-to-end
testing
framework,
but
may
not
necessarily
want
to
pull
in
the
giant
framework.
That's
in
kubernetes
careers
today,
and
so.
Hence
I'm
making
this
request
to
put
something
in
community
six.
A: Yeah, the only other thing that I feel like came up in our discussion is you wanted to make it clear that this was experimental. You were looking for some place to kind of experiment and iterate before sort of declaring, okay, this is the way, let's all use this.
J: I think that's TBD, yeah. So I think the starting point is getting the test runners up and running, like wiring up kubeconfig and all that stuff, and then we probably do want some sort of pluggable interface for doing cloud provider operations to test certain behaviors, but yeah, I think that's TBD at the moment.
D: This addresses the fact that all the cloud providers need to move out but have been using the e2e testing framework; while we're maybe moving the providers themselves out, we're not necessarily moving the test binary framework out. I super appreciate taking this approach.
D
I
think,
if
nothing
else,
I'm
really
hopeful
that
this
proves
out
some
as
like
a
testing
ground
for
ideas,
even
if
it
winds
up
being
cloud
provider,
testing
specific,
we
may
know
what
we
want
to
do
to
provide
something
for
any
other
use
cases
based
on
how
this
goes
like
like
we
don't
necessarily
have
to
make
this
one
completely
general.
This
is
probably
my
only
comment
is
maybe
if
we
are
focused
on
the
cloud
providers,
we
might
even
want
to
tweak
the
name.
D: Yeah, I'm hopeful it won't be, but given the, you know, already big ask of trying something different here, I wouldn't want to block on getting too obsessed with being general.
A: Okay, well, I slapped my plus-one on there as a chair. I want to be super respectful of everybody's time, and thank you all for staying three minutes late. Thank you all for your time. I look forward to seeing you all on November 10th. Happy Tuesday.