From YouTube: CAPI e2e Deep Dive - 2023-05-11
Cluster API e2e test deep dive session discussing CAPI issue 8641
A: Okay, it's the 11th of May 2023. This is a Cluster API e2e test deep dive session. We're going by the Kubernetes code of conduct, which comes down to: really, be nice to each other, yeah. Today we're just going to go through a couple of tests that are flaking on the Cluster API CI, and see if we can get to the bottom of why.
A: So first, I was just looking at some of the issues we have here. The first one: GitHub rate limiting.
A: I think we've spoken about this one before, but we're hitting the rate limit at GitHub, because clusterctl calls GitHub when it's trying to find out what versions there are and to download manifests to use in our tests. I have an issue open over at test-infra; it's not linked there. Maybe I need to check, yeah. Let me check to link it back. Basically, we're trying to get a GitHub token created that will hopefully allow us to not hit that rate limit in clusterctl.

A: If we can't get a GitHub token, I think our other option is cert-manager; that's mostly what causes us to hit that limit. So, yeah, we may need to vendor that YAML, but that's not ideal for a couple of reasons. That's this one.
A: This one we did a deep dive on last week. Christian has, I think, probably a fix, yep, and he's probably waiting for me to review it. So if anybody else wants to review this, please do, and, yeah, hopefully we can get this merged soon; that'll help us with flakiness. And then we come to what I want to talk about today. I just created this issue earlier.
A: We see this issue fairly regularly. It's probably the most common single error in our flake triage right now, with some intermittent spikes of the rate limits, which I guess are correlated with each other, because when one happens, a bunch of them probably do in the same time period. So: "No Control Plane machines came into existence."
A: This means we're failing at creating a cluster at all. No control plane machines means absolutely no cluster: no API server and so on. So let's look at it in, I think, the triage dashboard.
A: Oh, here are two of them from triage. Let's look at the overall triage first. So the two failures today look like "No Control Plane machines came into existence."
A: These are the jobs it's been failing on. It's failed in six jobs in the last... I'm not sure... oh, since the 4th, or since the 27th of April, I think. These jobs are all on 1.4, but they're split between the e2e tests and the main e2e tests. And then, if we scroll down here, we have them.
So this is, sorry, another very similar error. I'm not sure exactly how this triage works, but it makes some of these errors, which look very similar, sometimes be split up, yeah.
A: This is obviously the same type of error, and it's also happening here on 1.3 and on main. So this is dispersed across all of our end-to-end jobs.
If you look at the actual tests it's failing on, it's also failing all over the place: a couple of upgrades, node drain timeout, upgrading a workload cluster, testing on self-hosted clusters. This is just happening in general across all of our tests.
A: We need to look at the commonalities between the tests. Some of these tests, such as the upgrade tests, create multiple clusters during their run, but most of the tests only create one cluster. And, by the way, the way I have my screen share set up, I cannot see the chat and I can't see if people are putting their hands up. So please just interrupt me if you have any questions or anything to say, or ideas, or anything to contribute at all.
A: So we need to look at the commonalities. We use a single function, ApplyClusterTemplateAndWait, across nearly all of our tests, probably all of our tests, to create the initial cluster we're going to test. So it could be at the code level, in that function, or it could be something to do with the test infrastructure itself, Prow and GCE, or it could be part of CAPD: we use CAPD, the Docker infrastructure provider, underlying all these tests.
A: So this issue of the control plane machine not coming up could be an issue in CAPD, or it could be some combination of the above. I just picked out four of these, or five of them; these correspond to the first five here. And I think the first thing I wanted to show was just how to actually download the artifacts.
A: This pulls the artifacts from the kubernetes-jenkins bucket. I think I already have it here. So basically, this uses gsutil, the Google Cloud Storage utility. You can look it up; I'll also paste this into Slack. We're just copying from the Google Cloud bucket, which is of this form, into a temporary directory. I've already done this, so I'm going to skip it now. Or... what are we seeing? Is it working? It's not working.
A: I don't think so. The Kubernetes stuff is all available on the web; I don't think I've logged into Google Cloud. But if there were any authentication on the bucket, you'd use the same approach as with S3 or, I guess, any other object storage: you can authenticate with your account. But I think the Kubernetes stuff is freely available.
B: Sorry, these buckets are world-readable, read-only readable, so you don't need authentication for that. It would be super helpful to get that command pasted to the Slack, yeah.
A: We should also probably put that command into the book somewhere, because, yeah, I had a hard time putting it together as a snippet, because I haven't done that in a while and I didn't have it noted anywhere obvious.
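A concrete sketch of that copy command (the job name and build ID below are placeholders, not the actual run from this session; substitute the real ones from the Prow link):

```shell
# Placeholder job name and build ID, for illustration only.
BUCKET="gs://kubernetes-jenkins/logs"
JOB="periodic-cluster-api-e2e-main"
BUILD="1857600000000000000"
DEST="/tmp/${BUILD}"
mkdir -p "${DEST}"
# -m parallelizes the copy, -r recurses into the artifacts tree.
# Printed rather than executed here; drop the echo to run it for real.
echo gsutil -m cp -r "${BUCKET}/${JOB}/${BUILD}/artifacts" "${DEST}"
```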
A: Yeah, whatever was going on there, where it was chunking the file or something. I'm just going to cancel this, because it can take a while, depending on how fast the Google bucket serves you.
A: So I put it under... there's a tmp directory in the CAPI repo, which is a very useful place if there are YAMLs and stuff you're working with, because it's in the gitignore and everything. So it's a place to just dump stuff you're using while you're in the repo, but you don't necessarily want git to know about it.
A: So I've just gone to the artifacts for one of them. This folder is 18576.
A: Yes, that corresponds to this run, and we want to find "No Control Plane machines came into existence." So the first thing here is the build log, which is what comes up here.
So this is the build log itself. Search: "No Control Plane machines came into existence." Let me make this bigger for a second. Is the text okay for everybody, or do you want me to make it bigger?
A: So the build log is here: "No Control Plane machines came into existence." That's what we're looking for. What we want to find, in order to drill down on this, is what test it happened in and what the name of the cluster is. So here's our namespace, here's our cluster name.
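Once the artifacts are downloaded, that drill-down can be done locally with grep. The log file below is a fabricated two-line sample standing in for the real build-log.txt, and the cluster and namespace names in it are made up:

```shell
# Fabricated sample standing in for the downloaded build-log.txt.
cat > /tmp/build-log-sample.txt <<'EOF'
INFO: creating cluster "md-scale-abc12" in namespace "md-scale-xyz34"
[FAILED] No Control Plane machines came into existence
EOF
# -n prints line numbers; -B1 keeps one line of leading context,
# which is where the namespace and cluster name usually show up.
grep -n -B1 "No Control Plane machines came into existence" /tmp/build-log-sample.txt
```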
A: Let me explain this. We have this artifacts folder. "clusters" here contains... let me just open another one. "clusters" contains items from the actual clusters, particularly logs from the machines. So from the machine pool test we have machines; we only have one machine that we logged from. There's containerd log information, stuff like the containerd config, the journalctl log. I'm not sure what current.log corresponds to.
A: And the kubelet log, which can be super useful, and again kubelet version stuff, and a lot of the normal artifacts we know. That's what we usually have here, but we couldn't get anything, because we got no control plane machines. Our CAPI objects are in the bootstrap cluster.
A: So this "zero zero cube hook" one; that's the cluster template we applied.
A: Sorry, this is the namespace. So in our bootstrap cluster, the resources, which is where our CAPI resources are... they're under this namespace, and there we have our cluster, "zero zero cube hook." This is the cluster state that we take sometime towards the end of the test.
A: Okay, if anybody has a question... okay, we'll continue. This cluster state is picked up towards the end of the test. It's not always the exact final object, and sometimes, we saw this last week, it's actually scraped after the test fails, so you end up with a slightly misrepresentative state here. But the first thing we check here is maybe the status conditions: scaling up control plane to one replica.
A: Yeah, there's not much else to look at there, to be honest. Presumably we did get machine deployments, we did get machine sets, we did get a control plane. Let's have a look at the control plane and see what's going on there. Same message, right? It's just scaling up to one replica, ready: false, no other action there. Maybe look in the machine itself.
A: So at this point we could start going into the logs here, which is kind of what we were doing last week, and, let's say, search for the cluster name.
A: But we do have this super cool tooling in Tilt. I have it running here in the background; I set it up sometime last year, when we worked on logging, so I might link that issue in the thread later on.
A: We still have open issues around logging; we're still trying to improve it across release cycles, and we have a bunch of tasks up there to kind of improve our logging. But one thing that Stefan, I think, did last year was allow you to take a link like this. So this is the Prow job that we failed on, the run ending in 576, and I went to Tilt, so let me just show that for a second, sorry.
A: Right, yeah, I'll make sure to link that as well. We can deploy observability in Tilt; we also have, like, Prometheus and stuff, but I just need Grafana and, locally, Loki. And then, once we do, we can put a link to any GCS file result there. You can also link, I think, directly to the artifacts page and press "import logs."
A: So I'm running on a local cluster, running tilt up. This is what I run, like, every morning: delete clusters, then hack/kind-install-for-capd.sh, which is a helper script in the CAPI repo, then tilt up. In my tilt-settings file I have grafana and loki under the observability section, and I think the addendum here is you want to also run promtail if you want to import logs. There is a section of the book that we can link in the thread, but yeah, that's all built into the CAPI repo.
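For reference, the relevant fragment of a tilt-settings file might look like this (a sketch based on the CAPI book's Tilt docs; check the book for the current option names):

```yaml
# tilt-settings.yaml (fragment)
deploy_observability:
- promtail   # ships container logs into Loki
- loki
- grafana
```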
A: We get the controller logs. Well, let's do a query. It's slow. The name of our cluster was...
A: Wait, here it is. I think they keep kind of improving this and changing it, but these are all the label/value pairs in our logs from klog. So there's app, controller, node name, stuff like that, and I was just looking for the raw name of the cluster. But, sorry, the logs are decorated. So this is a single log line in JSON format: controller, controllerGroup, controllerKind.
A: So this, for example, has the cluster. Most of our logs should be decorated with the cluster they relate to. So normally you look at whatever controller; you can also look across controllers, and you can query Cluster.name and put in your cluster name. You can also look at the namespace. That's pretty useful for the e2e tests, because, for the most part, there's only one cluster per test, so the namespace is roughly equivalent.
A: The only different type is the upgrade tests, which create multiple clusters, because they create, yeah, a second bootstrap cluster. So, just to show this.
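The kind of query being described might look like this in LogQL (the app label and cluster name are assumptions for illustration; Loki's `json` stage flattens nested keys with underscores, so `Cluster.name` becomes `Cluster_name`):

```logql
{app="capd-controller-manager"}
  | json
  | Cluster_name = "md-scale-abc12"
  | line_format "{{.msg}} {{.err}}"
```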
Some of these are erroring out; I'm not sure why. These are JSON logs, so this red here means that they're incorrectly formatted as JSON. Let's just click into one of them.
So you can do this in the query box, or I can do it down here through the UI. I don't care about most of that stuff; what I want is the message.
A: Perfect: message and error. So now, for the lines that go down here, I only have the error: "failed to preload images into the docker machine." Okay, there's our first clue: "the container is not running."
A: I appreciate this might be pretty small. Are people able to follow along here, or should I go back to maybe the IDE window, where things might be a bit bigger? I don't think we're going to get further than this, so let's go back to the IDE window.
We don't have Docker machines here, but possibly they get cleaned up, because...
A: I think we've got a machine, but yeah, I guess KCP... if this fails... this is on 1.4, so the KCP remediation thing should be there, right? What am I not seeing? But previous to the KCP remediation feature, if the first control plane machine failed, then KCP just doesn't retry; it just says, well, your cluster bootstrap failed, and it doesn't try to bring the control plane up again.
A: And we can look for... okay.
A: Okay, that's actually useful, so let me go back to this one. Thanks! I never looked at that. I didn't realize those loggers were there.
A: Okay, so from the server reconcile... I'm watching DockerMachines, which should be created by the machine... sorry, by KCP. In this case, KCP should look at the template that's in the KCP object's machine template and create the DockerMachine out of that, for the DockerMachine controller to pick up. It'll add owners, which is fine... this is just the owners for the log, actually, and then it has owner references, probably.
A: Then the unfortunately named docker.Machine, which I think is just an abstraction that we use inside the Docker infrastructure provider's DockerMachine controller to pass information around. So it has the cluster name, the machine name, the IP family, the container, which again I think is just a helper... yeah, it's just another helper, and the node creator, which is an interface to actually create the node from the machine.
We just do that here, and if that's not equal to nil, we return early. So after everything is set up, we normally return here; in our case, with this error, we're not properly set up.
A
We
got
down
to
if
not
external
machine
don't
exists.
This
is
probably
the
only
time
this
happens
so
once
we
create
our
first
control
plane
machine,
we
clear
our
first
control
plane,
Docker
machine.
We
check.
If
it
exists,
it
doesn't
exist.
We
call
our
external
machine,
which
again
is
this
Docker
dot
machine,
which
is
our
abstraction
we
do
create
and
we
get
failure.
A
B: But this is picked by kind; this is not picked inside Docker, right? We just set it: if it's zero, it just gets determined here by net.Listen.
B: Just to recap, there are two ways to go down now: one is to check why this overlap happens; the other way would be to try to find a workaround, if we can't get a better fix, right?
A: Okay, but we definitely have a place to drill down on. That's actually much more than seemed likely from the "No Control Plane machines" message, and it's likely an issue somewhere between CAPD and the OS, not actually in Cluster API code or in our testing code, which is nice.
A
Because,
like
it's
still
as
flake
as
it
is,
but
at
least
we
don't
believe
that
people
who
are
actually
running
this
in
production
are
suffering
from
the
issue.
E: I have a question. In the first issue that you opened, can you open the Prow job? I want to ask something. Are these errors just, like... does it recover itself, or is it an actual error?
A: So it is an actual error. Let me show you, one second. Yes, just on Testgrid; I'm gonna look at some of this stuff.
A: We've enabled... Ginkgo, which is our test runner, has fail-fast on, which means that if it detects one failure, it stops all the other jobs. We've disabled this on, for example, the CAPI e2e main job, so we get that coverage, because before, if one test failed, we got much less coverage of what was actually flaky. And actually, I'm planning to make it so that all of our jobs disable fail-fast, but I just haven't finished that yet.
A: Thanks for reminding me. But what's happening here is that we're, like, randomly deleting all of our Kubernetes resources while tests are still running, so you're getting a bunch of these issues: "failed to list pods," "certificate signed by unknown authority." But they're basically because the API server might be down; these are just noise, yeah.
B: So in our tests we also have some informers open, which try to... or things which try to, for example, stream logs, and these will also cause error messages if, yeah, the resources are not there anymore or something like that. In this case, it seems like the API server got deleted or something like that.
A: I need to dive into the API of net.Listen and Listener.Close to find out if there's a way to get a signal that something has actually been released, if we can wait for that, or if we can just wrap it in a retry and see if that's okay. It's not an ideal solution to just accept that we can't ever get exclusive access to a port, but maybe there's some check we can run: is the port free, or something. That's this one.
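The "wrap it in a retry" idea could be sketched like this in shell; the real fix would live in the Go code around net.Listen, so this is just the shape of it, and the port number is an arbitrary example:

```shell
# Retry until nothing is listening on the port, up to a retry budget.
wait_port_free() {
  port=$1
  retries=$2
  i=0
  while [ "$i" -lt "$retries" ]; do
    # A failed connection attempt means nothing is listening,
    # i.e. the port is (probably) free.
    if ! (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

wait_port_free 54321 3 && echo "port appears free"
```

This is only a probe, not exclusive access: another process can still grab the port between the check and the use, which is exactly the race being discussed.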