From YouTube: Flake Finder Fridays #000
Description
Rob Kielty and Dan Mangum kick off Flake Finder Fridays, a new Kubernetes community livestream where we explore building, testing, CI and all other aspects of delivering Kubernetes artifacts to end users in a consistent and reliable manner. In this first episode, Rob and Dan are going to look at recent failures in a Kubernetes build job and chat a little bit about why it was failing, what tooling is used to build Kubernetes, and the infrastructure underlying all Kubernetes CI jobs.
A
Hello, hello, and welcome to the first ever Flake Finder Fridays, with myself, Robert Kielty, and Dan Mangum. How are you doing, Dan?

B
Doing pretty good. Are you?
A
I'm not too bad, not too bad. Excited to go through what we have. So what we're going to do on this livestream is go through an issue that we've...
A
Okay, sorry about that. Yeah, I was wondering who was interrupting me, and it was me. Okay, so let's start over again. My name is Robert Kielty. I have worked as a CI Signal team lead and as a CI Signal shadow, and what we do on CI Signal for the Kubernetes project is actively monitor the state of CI as we go through a release cycle for Kubernetes.
A
So, this livestream: we're planning to do this once a month over the coming year, and it was born out of two previous videos, two previous livestreams. One was from Jordan Liggitt, who described some principles on how to deal with, troubleshoot, and fix test flakes on the Kubernetes project; and then myself and Dan, before Christmas at the online contributor summit for 2020, did a tour of CI.
A
What
motivated
us
to
do
this
talk
was
that
it's
great
to
get
nuggets
of
information
in
relation
to
principles
and
how
to
deal
with
test
maintenance
and
how
to
fix
test
maintenance
related
problems,
but
it
usually
leaves
you
hungry
for
more
to
see
how
those
principles
are
applied
and
how
to
and
what
we're
going
to
do
in
this
live
stream
is
basically
go
through
issues
that
happened
over
the
previous
four
weeks
and
how
we
figured
out
and
how
we
figured
out
a
specific
issue
and
the
whole
process
of
doing
triage
root,
cause
analysis,
figuring
out
the
fixes
and-
and
so
I
think,
yeah
that
that's
the
raison
d'etre
for
this
for
this
live
stream.
B
No, not really, that was a great intro. One of the things that I did want to add, which we were talking about yesterday: Rob and I have done quite a bit of flake hunting and fixing failures and that sort of thing, but we're always continuously learning, and one of the things we want to do with this stream is definitely encourage folks who are in the chat, or who want to reach out afterwards in Slack or something like that, to feel free to reach out to us.
A
Absolutely. So, I'm just trying to think: should we just head into the issue that we're going to look at? We have a link to a HackMD that I think I shared on Twitter, so I'm at Rob Kielty on Twitter if you want to pick that up, and I'm trying to think what the best way to share that with the audience is.
A
Excellent, so I can see that. So yeah, should we just hop straight into it? I'll mute myself and see if I can drop links into the chat, and I might chat with Jeff, who's supporting us on this. Thank you, Jeff. And yeah, let's go ahead.
B
Awesome, sounds good. All right, so today we're going to be looking at an issue that came up this past week, and it's specifically related to some of the work that I do as a SIG Release tech lead and a release manager. I think it's really interesting because it touches a number of different areas of the Kubernetes infrastructure and a number of different teams, and so it allows us to explore and connect some different pieces together.
B
So
typically-
and-
and
we
want
to
show
this
as
kind
of
like
the
actual
process
that
we
go
through
and
that
other
folks
go
through
when
they
get
ping
done
on
something
flaking
or
failing,
but
typically
the
ci
signal
team,
which
is
a
release
cycle
team
right.
So
every
release
cycle,
there's
new
ci
signal,
team
members
and
there's
a
team
lead
in
shadows.
B
They
are
responsible
for
monitoring
the
test
grid
dashboard,
so
test
grid
is
basically
where
we
have
a
view
of
all
of
the
different
jobs
that
are
running,
so
each
job
is
made
up
of
different
tests.
B
This
is
not
a
great
example,
because
there's
only
one
test
running
which
is
building
kubernetes,
but
maybe
we'll
look
at
this
skew
cluster
latest
and
you'll
see
all
of
these
tests,
and
these
are
coming
from
the
kubernetes
code
base,
and
you
know,
through
this
stream
and
some
other
ones,
we'll
definitely
talk
about
how
you
know
these
get
selected
and
how
these
dashboards
get
created
and
that
sort
of
thing
and
definitely
feel
free
to
ask
questions
in
the
chat.
B
If
you
want
to
know
specific
things
so,
like
I
said
we're
specifically
looking
at
a
build
job
right,
so
things
that
jobs
that
build
kubernetes
and
then
you
know,
push
up
the
artifacts
for
folks
to
be
able
to
consume,
and
these
are
ci
build
jobs
right.
So
these
are
incremental,
builds,
not
you
know
major
releases
or
something
like
that,
so
the
first
two
that
you'll
notice
are
build
master
and
build
master
fast.
These
are
kind
of
our
two
primary
build
jobs,
but
then
over
on
our
informing
board.
B
Here
we
also
have
this
build
master
canary,
and
this
is
going
to
be
the
subject
of
our
stream
today
and
it's
going
to
talk
about
kind
of
some
of
the
work
that's
been
happening
over
the
last
year
or
so
or
maybe
a
little
longer,
and
it's
also
going
to
touch
on
how
the
release
tooling
connects
to
the
test.
Infrastructure
connects
to
the
kubernetes
code
base
and
more
so
it's
going
to
be
a
fun
one.
B
I
also
see
a
bunch
of
folks
who
are
in
sig
release
in
the
chat.
I
wanted
to
welcome
all
you
all
and
everyone
else
as
well.
We
appreciate
y'all
tuning
in
it's
exciting,
to
see
folks
here
and
rob
definitely
feel
free
to
jump
in
here
and
and
interrupt
me
as
you
see
fit
and
we
can
go
from
there.
B
Awesome, sounds good. All right, so the first thing we're going to do is look at the issue that we were tagged in on. Joyce opened this issue; Joyce is the CI Signal lead for the 1.21 cycle, and they are awesome and very active at opening issues. This one was opened pretty quickly after this job started failing. Actually, the difference between failing and flaking: failing is consistent failures on continuous jobs, flaking is sometimes failing, sometimes passing.
B
This
is
a
failing
test
and
joyce
has
denoted
it
as
such
and
you'll
see
that
joyce
went
in
and
got
us
some
information
about
what
was
happening
from
the
build
logs
linked
to
the
issue
and
gave
also
some
some
helpful
context
right
for
us
to
be
able
to
troubleshoot
this
and
for
anyone
else
who
came
along
and
wanted
to
help
out
with
fixing
this
to
do
that.
So
the
ci
signal
team
is
really
great
in
that
regard,
and
this
is
an
awesome
demonstration
of
that
right
here.
B
So
you
know
this
is
labeled
with
sig
release,
so
we
know
kind
of
the
area
in
which
this
is
happening
and
it's
sig
release,
because
it's
a
build
job
right,
and
so
we
were
able
to
troubleshoot
this
and
kind
of
triage
it
between
sig
release,
members
and
joyce
actually
did
another
great
thing
here
and
pinged
us
directly
in
slack
and
said:
hey,
this
is
failing
release
managers.
B
Can
you
take
a
look
at
this,
which
is
how
sasha
and
I
showed
up
here
and
we
had
some
additional
context
around
some
changes
that
were
recently
made
that
could
have
happened
so
you'll
see
this
kind
of
triaging
here
where
we
assigned
ourselves
and
said
we'll,
take
a
look
at
this
all
right,
so
we
already
took
a
look
at
that
job
that
was
running
and
to
understand
what
the
purpose
of
this
job
is.
It's
important
to
understand
some
context
around
some
changes
that
have
been
made
with
release
engineering
over
the
past
year.
B
So,
as
I
mentioned
earlier,
basically,
we
used
to
have
in
the
k
release
repo,
quite
a
few
bash
scripts
and
one
of
them
a
very
large
one,
called
inago
was
used
to
do
a
lot
of
the
the
release,
machinery
and
that's
actually
been
replaced
and
and
other
operations
have
been
replaced
as
well
by
a
tool
called
krell.
B
So
okay
release
right
crawl
and
it's
a
lot
easier
to
maintain,
because
you
know
it's
an
actual
binary,
that's
built,
and
it's
not
just
bash
scripts
that
can
be
kind
of
hard
to
troubleshoot
and
can
run
differently
in
different
environments
and
that
sort
of
thing
so
there's
some
great
work
specifically
led
by
sasha,
who
I
see
is
in
the
chat
today
to
replace
some
of
that.
B
There's also the test-infra repo, which, if you're familiar with Kubernetes test infrastructure, you've probably been to before. This is where we configure all of the dashboards, saying what tests should run, what jobs should be present on what dashboards, and so on. All of that is a result of the artifacts that are in test-infra, and there's also some different tooling, as well as the images that run jobs and specify how things get invoked.
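For readers who want to poke at this themselves, here is a rough sketch of browsing those configs locally; the directory paths are approximations of the test-infra layout, not verified against the current repo:

```bash
# Clone kubernetes/test-infra and look around the job and dashboard configuration.
git clone https://github.com/kubernetes/test-infra.git
cd test-infra

# Prow job definitions live under config/jobs/, grouped by repo and SIG
# (the sig-release path here is an assumption about the layout).
ls config/jobs/kubernetes/sig-release/

# Find where a particular job, e.g. the build jobs discussed here, is configured.
grep -r "kubernetes-build" config/jobs/ | head
```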
B
All right, so once again, let's get back to the actual issue at hand. There was an error showing up in the build job, and Joyce gave us a great link here to a Spyglass view of an example failure. Spyglass is something you'll frequently hear about, and we mentioned it in the ContribEx talk, I remember. You'll hear Spyglass referred to as Prow, you'll hear everything referred to as Prow. You can refer to me as Prow.
B
It doesn't matter, everything becomes Prow. But this page is actually called Spyglass, and there are a number of different tools here. Rob is probably better at picking out all the different tools you can use to troubleshoot things; frequently, when I'm going through how I've done things, Rob gives me a little more insight on how I could have done it better.
A
The way I describe this is that Prow is the job runner: what Prow does is launch the CI jobs for Kubernetes. What we're looking at here on this page is called Spyglass. The only weird thing about it being called Spyglass, and the reason not a lot of people know this, is that there's no reference to Spyglass on the UI, but Spyglass is the component that renders this UI for us. And I suppose one of my Spyglass tips is to click on all of the things.
A
All of the information is present here. There are a lot of links at the top; Artifacts points to all of the job-run artifacts that are created when a CI job is run, and a lot of those artifacts are bubbled up into this view that we see here.
A
Clicking on a line in the log output makes an individual log line shareable when you're logging an issue, and that's a really useful feature that you kind of have to hunt around and discover. The final thing about Spyglass is that there's a lot of runtime information at the very bottom of the page about the containers and the pods that were spun up in order to run the job in question, and from a troubleshooting point of view, all of that information at the bottom is useful too.
B
Awesome, yeah, that's super cool. So this is looking at one of those failures that was happening, and as I mentioned, Joyce had already put a little snippet in the issue that said "unknown flag: platform" and showed the output. If we go down to the bottom, we should see that the previous docker build failed and that caused the job to fail; it just cleans up from there, and our artifacts were not getting released. So, hopping back over here.
B
Two things changed around the same time. One of them was the actual dashboard configuration, and the other was the code path that was getting executed, which in this case is just a build, but a code path nonetheless. So I want to look at... let me see if I can find this test-infra issue, or this PR. All right.
B
As I mentioned, we've been trying to move from bash things, or complicated, hacky things, to more standardized tooling. That's no offense to anyone who has worked on some of those hacky things, because they've gotten us to where we're at today, and it's a joint effort where people are aligned and moving forward with new tooling. But this is where we moved this krel job into the build canary job.
B
If we look at what actually changed here: we used to have a different build canary, and it was using the bootstrap script. Where's a good example of this... probably just the main build job, I believe, uses it.
B
If you look in any of these jobs, there's going to be a big disclaimer at the top... that is not what I want, hold on one second... it's going to say at the top: please do not use bootstrap.py. And we all ignore that and continue using it. So don't be like us; bootstrap.py is deprecated.
B
We
do
not
support
it.
It's
building
kubernetes,
though
right.
So
it
is
implicitly
supported,
unfortunately,
but
please
do
not
use
it
with
jobs
going
forward,
and
so
what
we
want
to
do
is
move
off
using
things
like
bootstrap.pi,
and
so
we
had
a
canary
job,
which
the
reason
for
this
build
canary
is
another
effort.
B
So
google
is
very
generous
and
contributing
compute
resources
to
kubernetes
to
be
able
to
run
all
these
different
jobs,
but
it
places
a
lot
of
burden
on
folks
at
google
to
be
able
to
maintain
these
things
and
to
have
to
fix
everything
when
something
comes
up
right
and
we
don't
want
that-
we
want
it
to
be
a
general
thing
that
community
members
can
help
maintain
so
we're
trying
to
move
off
using
a
google
infrastructure
to
moving
to
kubernetes
community
infrastructure
and
that's
what
the
purpose
of
this
build
canary
was,
which
is
different
from
build
master
here,
which
is
this.
B
Did
I
have
this
up
over
here?
Let's
see
build
master
should
be
using
something
like
kubernetes
ci
images,
which
is
the
old
kubernetes
project,
and
then
the
newer
one
here
is
likely
using
something
like
kate,
staging
ci
images
right,
so
we're
moving
over
to
new
infrastructure
to
make
it
more
maintainable
from
a
kubernetes
community
perspective.
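As a hedged illustration of those two locations (registry names here are written as spoken on the stream; verify them before depending on this), one way to peek at what each registry hosts:

```bash
# Old, Google-owned project vs. the newer community-owned staging project.
gcloud container images list --repository=gcr.io/kubernetes-ci-images
gcloud container images list --repository=gcr.io/k8s-staging-ci-images
```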
B
So
there's
there's
that
aspect
of
it
and
then
there's
also
the
I'm
losing
my
tabs
now,
but
so
there's
that
aspect
of
it,
which
that's
what
the
bootstrap
job
was
serving
its
purpose
of
and
then
there's
also
moving
off
the
bash
right.
So
this
is
a
bootstrap
job
that
was
pushing
to
the
new
infrastructure.
B
This
was
a
krell
job
that
was
pushing
to
the
new
infrastructure,
and
what
we're
doing
here
with
this
pr
is
we're
making
the
krell
job
the
canary
job
so
we're
eliminating
that
bootstrap
job,
we're
saying
this
crow
job
has
been
running
really
well,
it's
handling
pushing
the
the
ci
images
to
the
new
infrastructure,
so
we're
going
to
drop
this
bootstrap
job
and
replace
it
with
the
krell
job.
B
So
that
was
a
really
great
thing
and
we
are
excited
to
do
that.
You'll
actually
see
noted
in
here
that
the
no
bootstrap
job,
which
is
the
the
krell
job,
has
been
running
successfully,
but
there
was
also
some
recent
failures
which
we
didn't
think
was
an
issue
at
this
time,
but
it
turned
out
around
the
same
time
as
this
change
was
being
made.
Let's
see
if
I
can
find
the
pr
up
here,
we
switched
to
using
build
x
for
all
of
our
image
builds,
and
I
have
some
links
down
here.
B
Sascha has been very instrumental in leading us towards this. This PR actually updated us to using buildx; you'll see we were already using buildx for some alternative architectures, but this switched to using it for everything. So we switched over, and what happened was the krel job started failing, and then we promoted it to the canary job, which made this a little bit difficult to troubleshoot. One of the things that Rob and I would probably like to improve about TestGrid is that you only get what you're presented with right now.
B
You
know
what
you're
presented
with
right
now,
so
this
build
master
canary
was
doing
one
thing
and
then
it
was
updated.
But
all
we
see
is
one
continuous
run
right
of
the
same
job,
and
maybe
you
know
we
could
do
things
like
you
know,
use
a
different
name
or
something
like
that.
But
what
you'll
actually
see
if
we
go
back
in
the
failures
is
that
some
of
the
failures
which
we
had
quite
a
few
here,
we're
initially
still
running
the
bootstrap.pi,
I
believe.
B
Well,
actually,
we
had
some
failures
that
were
related
to
infrastructure,
things,
which
this
is
another
good
thing
that
rob
point
out
earlier.
These
are
running
in
pods
in
a
kubernetes
cluster,
so
you
can
have
issues
with
the
infrastructure
itself,
but
anyway,
so
this
started
failing
and
then
was
promoted.
So
then
the
build
master
canary
starts
failing,
and
this
manifested
in
some
interesting
ways
which
we
can
show
an
example
of.
Let
me
actually
just
pull
over
this
slack
conversation
here.
B
So
this
is
pretty
common
with
ci
systems
right
when
you
build
a
new
version,
we
want
to,
you
know,
put
whatever
the
latest
is
so
that
folks
know
what
version
to
pull
down
and
use
if
they
want
the
latest
ci
build.
So
if
we
look
at
this
right
here,
it's
just
showing
the
version
of
the
latest
build
so
that
what
the
issue
that
was
happening
was
this
version.
B
Marker
was
getting
updated,
but
the
ci
images
were
not
available
because
our
job
was
failing
right
in
cube
admin
like
very
good
kubernetes.
Citizens
have
already
changed
over
to
consuming
the
community
infrastructure
right.
So
these
images
and
artifacts
were
available
on
the
google
owned
infrastructure
because
that's
what
our
build
master
was
doing,
but
because
our
canary
job
started
failing
the
the
version
marker
no
longer
had
artifacts
that
were
present
on
the
new
infrastructure,
which
cube
admin
is
rightly
switched
over
to
consuming.
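A rough sketch of the consumption path being described; the bucket and marker path below are illustrative assumptions, not confirmed locations:

```bash
# A consumer like kubeadm resolves "latest CI build" by reading a small
# version-marker file, then fetching artifacts for that exact version.
gsutil cat gs://k8s-release-dev/ci/latest.txt
# -> something like v1.21.0-beta.1.123+abcdef0123456
# If the marker advances but the job that pushes the matching images failed,
# consumers of the new infrastructure see exactly the breakage described above.
```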
B
So
that's
an
example
of
how
this
affected
users,
even
though
it
was
kind
of
like
intermittent,
builds
right.
These
are
not
actual
releases,
and
I
actually
just
saw
a
question
pop
up
in
here:
do
the
canary
ci
workflows
run
on
prs
also,
or
how
can
we
validate
that
that
the
fix
works
so
we're
going
to
look
at
a
couple
different
ways
how
we
troubleshooted
this?
B
All right, so that job was promoted, it was failing, and it was confusing, because the krel job and the build-master job, which is still using bootstrap, were running the same exact commands; one of them, the build-master one, was passing, and the krel one was not. Even though they have kind of different infrastructure around them, they all essentially end up running the same thing in the kubernetes repo. Actually, we can just look at that PR from Sascha.
B
They
all
end
up
running
kind
of
the
same
commands
right,
but
the
the
image
it
runs
in
as
well
as
some
of
the
setup
that
happens
beforehand
is
different.
So
if
we
look
at
the
output
of
the
logs,
let's
see
this
is
a
a
crell
job
here
we
can
actually
take
a
look
at
the
actually,
let's
look
at
the
prow
job,
and
we
can
see
that
the
image
we're
using
is
kate
staging
relinj,
which
stands
for
release
engineering
kate's
ci
builder.
B
If we go into releng's k8s-ci-builder directory, these are the files we use to construct that image. And then the bootstrap jobs, which... let's see if I can find where I have that open. Let's see if this is one of them.
B
Here we go. If you look at bootstrap, we're not going to get into all the different things that get set up here, but basically these different images set up the ability to run Docker in Docker. We need Docker present to build these images and push them, but we're running within a Docker container as well. So the difference here is that one of them was failing and one of them wasn't, and they're running the exact same command. How could that be?
B
The
most
likely
situation
is
that
we
have
different
versions
of
docker
running
right.
Our
environment
has
to
be
different
somehow
so
that
specific
flag
with
the
dash
dash
platform
not
being
recognized
that
has
to
do
with
how
build
x
works,
and
we
can
actually
take
these
different
images
right
and
we
can
run
them
locally
and
kind
of
see
what's
inside
of
them,
there's
other
ways
to
troubleshoot
as
well.
I
particularly
like
this
because
it
allows
me
to
play
around
in
it
a
little
bit.
B
So
this
is
the
bootstrap
image
that
I
mentioned,
and
I'm
gonna
run
this
locally
and
let's
just
check
out
kind
of
what's
inside
of
here,
all
right,
so
we're
inside
and
let's
see
what
docker
version
we
have
all
right.
We
have
docker
version
20.10.2
all
right.
Let's
now
take
a
look
at
the
ci
builder
image
and
check.
What's
inside
of
there
docker
version
all
right.
We
have
19.0
1903
13.,
so
obviously
an
older
version
right
so
right
away.
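Roughly, that local reproduction looks like this; the image references are written from memory of the stream, and the tags are assumptions:

```bash
# Shell into the image used by the bootstrap-based job and check Docker.
docker run -it --rm --entrypoint /bin/bash gcr.io/k8s-testimages/bootstrap:latest
#   then, inside the container:
#   docker version        -> 20.10.2 on the stream

# Shell into the k8s-ci-builder image used by the krel-based job and compare.
docker run -it --rm --entrypoint /bin/bash gcr.io/k8s-staging-releng/k8s-ci-builder:latest
#   then, inside the container:
#   docker version        -> 19.03.13, an older Docker
```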
B
So right away we could start to see that there might be some issues, and I'll show you that: if we try to run buildx here, it doesn't really matter what platform we use, because we're not actually going to run this build.
B
It's not a real platform, but we see the same exact issue that we're seeing in the jobs: this "unknown flag: platform". To me this is a bit of an uninformative error, because what's really happening is that it's not recognizing buildx, but the main error it gives us is that it doesn't recognize the platform flag. If we ran it without --platform... I think, docker buildx... yeah, "buildx is not a docker command", so that would have been maybe a little more helpful.
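The failing invocation inside the older image looks roughly like this (the platform value is deliberately fake, since the build is never actually run, and the error strings are paraphrased from the stream):

```bash
# With the older Docker CLI, buildx is hidden behind the experimental gate,
# so the subcommand is not understood and the flag parsing fails first.
docker buildx build --platform not/real .
#   unknown flag: --platform

# Dropping --platform makes the real problem more obvious:
docker buildx
#   docker: 'buildx' is not a docker command.
```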
B
But
essentially
buildex
is
a
newer
part
of
the
docker
cli.
It's
basically
a
different
engine
that
you
can
use,
and
so
we
don't
have
that
enabled
by
default
and
how
you
can
enable
that
by
default
or
not
by
default,
but
how
you
can
enable
that
is
by
setting
docker,
cli
experimental
to
enabled-
and
let's
see
here
if
we
actually
have
that
set
and
if
you're
watching
from
home
you're,
probably
like
no
dan,
we
obviously
don't
have
that
set,
or
else
this
would
be
working
and
you're
right.
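A minimal sketch of checking and enabling that, assuming an older Docker CLI where buildx is still gated behind the experimental flag:

```bash
# Empty (or unset) means experimental CLI features, including buildx, are off.
echo "${DOCKER_CLI_EXPERIMENTAL:-unset}"

# Turn it on for this shell and confirm buildx is now recognized.
export DOCKER_CLI_EXPERIMENTAL=enabled
docker buildx version
```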
B
We
don't
and
in
the
other
docker
version
right,
docker
cli
is,
is
no
or
build
x
is
no
longer
experimental,
so
it
was
working
by
default.
So
when
we
went
ahead
and
switched
over
to
using
buildex
in
one
environment,
it
was
saying
okay,
I
recognize
build
x
and
platform
and
all
the
things
that
go
with
it,
and
so
that's
going
to
work
and
in
this
other
one
it's
not
so
the
there's
a
number
of
different
solutions.
B
We could also just make sure, in that k8s-ci-builder environment, that DOCKER_CLI_EXPERIMENTAL is set to enabled, and that would solve the issue. The last option, and this is what we actually ended up doing, is setting DOCKER_CLI_EXPERIMENTAL=enabled in the actual command that is run. This is the PR for that: basically, in the image building, and in the conformance image building, we're just setting this before we run the command. Why this was viewed as the optimal short-term solution, at least, is because we'd like these scripts to work for as many people as possible.
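The shape of that fix, as a simplified sketch rather than the literal diff that was merged (the variable names here are placeholders):

```bash
# Prefix the buildx invocation in the release script with the experimental flag,
# so it also works on older Docker versions that gate buildx behind it.
DOCKER_CLI_EXPERIMENTAL=enabled docker buildx build \
  --platform "${platforms}" \
  --tag "${image}" \
  "${context_dir}"
```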
B
You
know
older
versions
of
docker.
We
would
like
for
this
to
work
out
of
the
box
and
there's
some
build
x
setup
that
I'm
not
going
into
that
could
also
make
this
complicated,
but
this
was
the
kind
of
the
the
best
solution
for
the
greatest
number
of
community
members,
and
it
was
a
short-term
fix
right.
So
we
could
get
that
in.
We
actually
didn't
have
to
build
new
images
or
anything
like
that,
so
that
was
nice
and
we
thought
all
right
good.
We
solved
the
problem.
B
We
are
great
kubernetes
superheroes.
Now,
unfortunately,
we
did
not
solve
the
problem.
B
But
let's
look
at
a
later
failure,
particularly
the
one
right
before
it,
it
flipped
to
passing
and
see
what
we're
getting
now
inside
of
here,
all
right,
so
we're
building
all
of
our
go
targets.
Okay,
so
that
looks
good
before
what
was
happening
is
we
were
having
an
issue
with
docker
build?
Okay,
it
looks
like
that.
B
Our
docker
builds
are
running
here,
at
least
some
of
them,
and
this
this
is
kind
of
the
the
output
you
get
when
you're
running
build
x,
but
we
see
that
there
are
some
issues
here
with
quiet
currently
not
implemented.
That
kind
of
just
sounds
like
a
warning
to
me,
but
you
know
could
be
something:
sasha
noticed
it
as
well
and
sasha
said
we
can
clean
that
up
and
also
we
want
the
build
logs
to
be
verbose
if
possible.
B
So
while
sasha
was
kind
of
working
on
this,
I
was
taking
a
look
and
trying
to
fix
it
and
we're
like.
Oh,
this
will
probably
help
right
we're
going
to
get
more
verbose
build
logs
here.
Maybe
dan
will
be
able
to
fix
it
and
we'll
be
able
to
move
forward,
but
we
didn't
really
think
this
would
have
a
big
impact
on
it.
So
this
got
merged,
and
this
was
a
few
days
later,
as
you
can
see,
there's
lots
of
failures
and
all
of
a
sudden
it
turned
green.
B
And
at
this
point
I
was
pleased
or
we
were
pleased,
but
but
we
didn't
know
why
right
and
that's
that's
an
issue
as
well.
That's
almost
as
bad
as
something
not
working
right,
not
knowing
why
it
is
working.
B
We
can
see
that
the
quiet
currently
not
implemented
is
no
longer
present
there
right
and
what
was
happening
was
and
once
again
I
want
to
reiterate
that
so
we
added
the
docker
cli
experimental,
which
should
have
solved
the
difference
between
the
bootstrap
image
and
the
kate
ci
builder
image,
but
it
still
had
they
both
had
the
quiet
flag
right.
So
why
was
the
the
build
master
job
still
passing?
B
Well,
I
ran
this
locally
with
my
own
docker
installed,
which
I
believe
I
have
1903,
probably
which
a
lot
of
folks
are
still
using,
and
I
also
got
the
error.
So,
let's
just
try
that
with
docker
build
x,
build
there's
nothing
actually
here,
but
quiet
currently
not
implemented,
and
you
see
it
didn't
even
try
to
build.
And
if
I
look
at
what
the
output
is,
we
get
exit
1
right,
so
that
is
causing
an
issue
because
in
our
release
script
over
here.
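That local check, roughly (again a sketch; the output text is paraphrased from the stream):

```bash
# On an older buildx, the quiet flag is rejected outright and nothing is built.
docker buildx build -q .
#   error: quiet currently not implemented

echo $?
#   1   <- the non-zero exit is what the release script's job handling trips over
```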
B
And
actually
I'll
do
this
in
the
terminal,
because
rob
gave
me
some
good
feedback
about
that.
It's
a
little
bit
easier
to
follow
in
an
editor,
so
I'll
try
to
do
that
and
if
folks,
viewing
also
have
more
feedback
around,
you
know
how
this
can
be
easier
to
follow
along
definitely
give
us
that
that
feedback
as
well.
But
basically
this
is
the
the
script
that
sasha
was
modifying
right
with
the
build
x
commands
and
then
I
later
added
the
cli
experimental
flag.
B
I'm here now, so... okay. Right, like I said earlier, Rob always does things the most efficient way and I do them the painful way. But anyway, so we have this -q here present, and they're both executing it.
B
So why is one of them passing and one of them failing? You'll also notice, if we go down a bit further, that these builds are all executing in parallel, which is why we're seeing some successful output, and this wait-for-jobs step is waiting for all of those docker builds to finish and making sure they all exited with a zero status code. Some of them, it appears, are running; specifically, the conformance image was building, because we didn't have the quiet flag on that one, but the other ones are failing, which is why we're seeing this. The reason for that, and I'm going to jump back into these images again to show the difference, is that buildx itself is versioned differently than Docker is. So let's once again look at the bootstrap image; we can say docker buildx version, and it looks like we have v0.5.1.
B
If we look in the k8s-ci-builder image, with DOCKER_CLI_EXPERIMENTAL enabled, and run docker buildx version, we see that we also have an older version of docker buildx in this image. And, I think I actually did this prior to this little exercise, I wanted to search and see if we could find where anyone else was seeing this "quiet not implemented", and here is the PR to buildx.
B
So in the newer buildx it was just warning about that quiet flag, whereas in the k8s-ci-builder job we were actually erroring on it, which was causing our job to fail and causing folks to not be able to consume from the new community infrastructure. Actually, the way I think we ended up catching this was that Sascha had a newer version of buildx on his local machine, so it was kind of happenstance that it was caught, but we were chatting about it in Slack and that's how we got to it.
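To recap the failure mode in code: the release script kicks off the per-image docker builds in parallel and then waits on them, so any single build that exits non-zero fails the whole job. A minimal sketch of that pattern (not the actual k/release code; the image names are illustrative):

```bash
pids=()
for img in kube-apiserver kube-controller-manager kube-scheduler kube-proxy; do
  # Each build runs in the background; with the older buildx, the -q flag alone
  # makes this exit 1 before doing any work.
  DOCKER_CLI_EXPERIMENTAL=enabled docker buildx build -q -t "example/${img}:ci" "build/${img}" &
  pids+=("$!")
done

rc=0
for pid in "${pids[@]}"; do
  wait "${pid}" || rc=1
done
exit "${rc}"   # one silent per-image failure is enough to fail the whole build job
```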
A
Right
so
this
is
kind
of
yeah
there's
a
kind
of
a
whole
inception
vibe,
as
is
often
the
case
in
in
container
runtime
stacks.
I
suppose,
if
you
were
coming
to
this
cold,
this
would
be
quite
intimidating,
be
fair
to
say,
you're,
not
coming
one
of
the
things
or
even,
if
you're
not
going
to
a
cold
yeah,
but
but
in
terms
of
in
terms
of
in
terms
of
scaling
in
terms
of
scaling
the
wall
yeah,
it's
not
a
learning
curve.
A
It
is
a
climbing
wall
with
an
overhang,
and
you
know
a
couple
of
holes
taken
out.
It
can
seem
like
that,
but
but
it's
important
to
note
that
that
that,
as
a
team
and
as
a
community
that
there's
a
lot
of
leaning
on
each
other
in
terms
of
in
terms
of
getting
to
the
root
cause
of
of
these
issues
would
be
fair
to
say.
I
think.
B
Absolutely
absolutely
and
and
lots
of
that
happens
in
the
slack
channel,
so
it
can
be
really
helpful
just
to
jump
in
there
and
ask
questions
and
that
sort
of
thing
I
know
when
I
was
looking
through
this
failure
here
I
was
asking
stephen
questions.
I
was
asking
ben
questions
and
they're
all
providing
really
helpful
context,
including
looking
at
how
we
enabled
build
x
in
k,
release,
which
is
with
a
script
that
we
actually
borrowed
from
kind
which
obviously
ben
does
a
lot
of
work
on.
B
But
we
have
this
init
buildex
script,
which
helps
me
be
able
to
identify
right
how
we
are
actually
setting
up
the
virtualization
right
to
be
able
build
for
these
different.
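As a rough idea of what an init-buildx style script typically does (an assumption about the general pattern, not the exact script borrowed from kind):

```bash
# Register QEMU binfmt handlers so non-native architectures can be emulated,
# then create and select a buildx builder instance to use for the builds.
docker run --privileged --rm tonistiigi/binfmt --install all
docker buildx create --name img-builder --use
docker buildx inspect --bootstrap
```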
B
Go
ahead
and
note
this
real
quickly,
but
one
of
the
things
that
happened
in
the
meantime
is
steven
noticed
that
those
cube
adm
folks
were
having
trouble
consuming
the
latest.
So
he
went
in
and
went
ahead
and
re-enabled
that
no
bootstrap
or
that
bootstrap
job
excuse
me
and
just
named
it
build
kate's
infra,
and
this
allowed
us
to
get
those
images
available
again
right
until
we
could
determine
what
the
issue
was
and
provide
a
fix.
B
And
then,
after
that
we
just
reverted
it
and
that's
why
we
only
have
to
build
canary
now.
A
So
one
of
the
things
that
I
was
going
to
say
down
just
if
you're,
if
you're
coming
to
this
completely
cold,
that
that
the
kubernetes
as
an
application
resides
in
the
kubernetes
kubernetes
repo
on
github,
the
we
have
a
separate
intro.
We
have
a
separate
repo
called
test
infra
and
that
and
contains
all
of
the
tooling
that
supports
and
underpins
ci
in
the
kubernetes
project
and
then,
as
a
separate
as
another
repo.
A
We
have
sig
release
and
and
a
lot
of
the
tooling
that
we've
been
talking
about
today
is
resides
there
and
is
tracked
there
and
that
tooling
is
used
to,
I
suppose,
build
and
build
kubernetes
releases
to
be
released
to
the
the
public
and
and
one
of
the.
If
you're
coming
to
the
project.
I
knew
one
of
the
things
that
you
kind
of
have
to
sort
of
feel
your
way
around
and
learn
is,
is
what
repos
do?
B
Absolutely
so
yeah,
that's
that
is
most
of
the
the
failure
we
actually
wanted
to
show
today.
I
don't
see
any
major
questions
here
that
I
don't
think
that
we
haven't
covered
in
the
chat,
but
I
know
we've
been
going
for
40
minutes.
We
have
about
10
more
minutes
that
we
could
fill
here.
Rob
was
there
anything
you
wanted
to
specifically
look
deeper
into
that.
I
was
walking
through
there.
A
I'm
trying
to
think
like
I
mean
I
think,
for
somebody
to
go
through
it
all
those
links
are
there.
The
main
thing
I
just
want
to
reiterate
is
is
that
everybody
doesn't
know
everything
and
that
if
you
keep
up,
if
you
keep
on
asking
questions
and
tackling
issues
and
ask
those
questions
in
in
the
slack
channels
and
that
that
you'll
get
the
support
you
need
and
any
contribution
that
you
want
to
make
will
be,
will
be.
Welcome.
A
We're
always
looking
for
more
people
to
work
on
on
ci,
cigna
and,
and
the
project
really
does
need
more
people
to
get
into
this
and
get
into
test
maintenance.
It
is
challenging,
but
there's
a
big
community
there
of
people
who
can
help
you
out
with
this
work.
You
know.
B
Absolutely
absolutely
well
said:
well,
I
think
we
can
wrap
it
up
a
little
bit
early
today.
Definitely
this
was
our
our
first
stream
right.
So
we're
learning
what's
helpful
for
folks,
just
as
as
you
all
are
learning
alongside
us
about
kubernetes
infrastructure.
B
I think the last thing we mentioned was putting something in the SIG Release repo for folks to be able to go through, run some of these commands themselves, and get familiar with the different test infrastructure. But yeah, please let us know what's helpful and what's not.
A
Yeah
thanks
so
much
for
that
dan.
Like
I
mean
that's,
that
is
a
proper,
deep
dive.
I
think
we,
if
we
do
this
once
a
month,
we
should
all
learn
lots
of
stuff
that'd
be
great.
B
Yeah,
absolutely
and
maybe
maybe
even
more
than
once
a
month
right.
I
think
we
were
talking
that
there's
there's
plenty
of
failures
and
flakes
to
to
fill
lots
of
hours
of
content.
Absolutely.