A
Okay, hi to our CI hacking session today. I'm going to try to run this, and what we're going to look at is, first of all, this issue. If we have time at the end, we could maybe pick another one from triage and take a look, yeah. So, let's take a look at this issue. Korean did open this issue around a month ago, and let's take a look at what it is about.
A
I did already do some research there, so let's maybe take a look. First, for triage, I have a filter here which just shows us: okay, yes, this flake is still relevant. It happened two times in the last day in normal CI. So it wasn't...
A
It still makes sense to do it — okay, yeah.
A
What we have is: okay, we have this log message, and lots of jobs and artifacts where we could take a look. What I also did already is I tried to reproduce the issue, so we can maybe actually try and fix it, or find ways to dig further into this issue. So what I did for reproducing it is:
A
Let me open that in the editor. I just put a for loop around the test, so it gets run more often in CI, and this helped to reproduce the issue. So all I did was go into this test, which is the cluster upgrade runtime SDK test, which installs a cluster and upgrades it, and this all includes a runtime SDK extension which records — or tries to simulate that, or, well, it doesn't simulate:
A
It actually uses a test extension which then records into a config map that the hooks got executed, and these tests make sure that the hooks got called and everything got done properly. What I did was just put a for loop around the interesting part of the test, which is where the cluster gets created, upgraded and so on. Yeah.
C
Just for people who aren't aware: runtime SDK is a new feature that was added in kind of the 1.3 cycle, but the test extension is a completely external component that runs alongside the other management components in the cluster, and basically our controllers can just call out to the test extension during the cluster's lifecycle. This is just testing that that works, and the way it does so — which I guess was just mentioned — is to record state in a config map which the test then checks.
A
Yeah,
basically,
that's
the
it's
also
documented
in
a
book
here
things
clear.
There
are
some
hooks.
We
can
execute
like
the
before
class
to
create
hook
the
after
control
plane,
initialize
took
before
cluster
upgrade
and
so
on.
So
there
are
multiple
places
where
the
runtime
SDK
or
you
could
Define
a
hook
which
should
get
executed
for
your
clusters
during
life
cycle
yep
to
jump
back.
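For reference, a handler for one of these hooks is just a function the extension serves. Below is a rough sketch of a BeforeClusterUpgrade handler; the import path and field names are written from memory and may differ between releases, so treat it as an illustration rather than the exact API — the runtime SDK section of the book is the authoritative reference.

```go
package lifecycle

import (
	"context"

	runtimehooksv1 "sigs.k8s.io/cluster-api/exp/runtime/hooks/api/v1alpha1"
)

// DoBeforeClusterUpgrade sketches a BeforeClusterUpgrade handler: the
// extension fills in the response, and a non-zero RetryAfterSeconds would
// block the upgrade until a later retry.
func DoBeforeClusterUpgrade(ctx context.Context, req *runtimehooksv1.BeforeClusterUpgradeRequest, resp *runtimehooksv1.BeforeClusterUpgradeResponse) {
	// Allow the upgrade to proceed immediately.
	resp.Status = runtimehooksv1.ResponseStatusSuccess
	resp.RetryAfterSeconds = 0
}
```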
A
So all I did was add this for loop here, which closes somewhere here at the end, and I had to work around some stuff so that it basically cleans up again and the next iteration of the loop also works. What I also did here is adjust the call to dump and delete the cluster, to put the artifacts into a different directory; otherwise the cluster would always have the same name and the artifacts would just get overwritten.
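For illustration, a minimal sketch of that approach (not the actual PR — the two callbacks are hypothetical stand-ins for the real steps of the runtime SDK upgrade test):

```go
package e2e

import (
	"context"
	"fmt"
	"path/filepath"
)

// repeatUpgradeFlow repeats the interesting part of the spec several times
// within a single CI run to make a rare flake reproducible.
func repeatUpgradeFlow(
	ctx context.Context,
	iterations int,
	artifactFolder string,
	runUpgradeFlow func(ctx context.Context),
	dumpAndDeleteCluster func(ctx context.Context, artifactFolder string),
) {
	for i := 0; i < iterations; i++ {
		// Create the cluster, upgrade it, and verify the lifecycle hooks were called.
		runUpgradeFlow(ctx)

		// Clean up so the next iteration starts from scratch, and write this
		// iteration's artifacts into their own folder so they are not overwritten.
		dumpAndDeleteCluster(ctx, filepath.Join(artifactFolder, fmt.Sprintf("iteration-%d", i)))
	}
}
```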
A
I implemented that; it took lots of trial and error until I reached that state, because there were different ways you could run that test multiple times, but this one — the current state of the PR — is the one that I think reached a working state. And I did run it via the /test pull cluster API e2e command on main, which executes this test.
A
And
finally,
when
I
take
a
look
at
the
latest,
one
I
started
so
yesterday
it
just
started
the
the
test
again
and
if
we
look
into
it
we
should
have
exactly
hit
the
issue
again.
So
we
can
see
okay
here
the
issue
got
hit
and
yeah,
so
we're
able
to
reproduce
it
in
CI.
Using
this,
this
pull
request,
but
I
also
tried
after
knowing
that
we
can
reproduce
it
using
this
pull
request.
A
I also already did a little bit of analysis of failed jobs. Those were jobs from April, about a month ago already, and I was able to triage them a bit from what we have in the artifacts of these tests.
A
So
what
what
I
was
able
to
see
is
there
were
three
times
I
did
see
that
at
the
cluster
object,
the
condition
for
topology
reconcile
was
said
to
fail
with
the
message
Handler
has
not
been
registered
and
the
other
times
were
a
different
case.
So
I
did
more
of
it.
Inspection
to
this
one
and
to
see
okay
did
some
analysis
and
try
to
see
if
we
got
further
information
there,
but
maybe
we
just
go
down
that
road
with
an
example
to
see
where
we
are
yeah.
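For anyone who wants to do the same check programmatically, here is a rough sketch of reading that condition off the Cluster object using the cluster-api conditions util; exact constant and helper names may vary between releases:

```go
package inspect

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// printTopologyReconciled fetches a Cluster and prints its TopologyReconciled
// condition, which is where the "Handler has not been registered" message
// showed up in the failed runs.
func printTopologyReconciled(ctx context.Context, c client.Client, key client.ObjectKey) error {
	cluster := &clusterv1.Cluster{}
	if err := c.Get(ctx, key, cluster); err != nil {
		return err
	}
	cond := conditions.Get(cluster, clusterv1.TopologyReconciledCondition)
	if cond == nil {
		fmt.Println("TopologyReconciled condition not set")
		return nil
	}
	if cond.Status == corev1.ConditionFalse {
		fmt.Printf("TopologyReconciled is false: %s\n", cond.Message)
	}
	return nil
}
```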
A
So I did check the config map, but that looked totally fine. So the problem must be at a different place. So maybe we just take a look and dig into one example, to see all the data.
A
This whole directory — so we can browse it using the IDE, which is way easier. So I just get that whole... oops, set that variable to that URL, and I already have that command somewhere here — or not, I'll have to complete it from scratch.
D
Yeah, so one question, just to get this right in my head: the runtime SDK is another controller — like, Christian, you know — a CAPI extension controller that runs on the Kubernetes management cluster, the CAPI management cluster, right?
A
Yeah, it's a deployment. In our case it's a deployment which is there — oh, so we also have it in here for our tests.
C
Sometimes setting the limits in deployments as well — per deployment — has an adverse impact, because previously it's not guaranteed, and we don't set the limits for other controllers in CAPI for these tests. So...
C
We don't have any limits on any of our deployments, okay. So setting the limits on the test extension — even if it's very low, I mean, above 10 millicores or something — could actually have the effect of it being more stable and having access to more CPU, if it's the case that another controller is actually eating up cycles, yeah.
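As an illustration of what that would look like — the concrete values and the way the Deployment is patched here are purely illustrative, not something decided in the discussion:

```go
package e2e

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// setTestExtensionResources gives the test extension container explicit CPU and
// memory requests/limits so it cannot be starved by other controllers.
func setTestExtensionResources(d *appsv1.Deployment) {
	d.Spec.Template.Spec.Containers[0].Resources = corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("100m"),
			corev1.ResourceMemory: resource.MustParse("64Mi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("500m"),
			corev1.ResourceMemory: resource.MustParse("256Mi"),
		},
	}
}
```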
A
Yeah, so the test extension itself lives here in test/extension, and it's basically a reference implementation of some basic hooks, which do nothing more than write the response into a config map — responding with a predefined response from the config map.
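Roughly, the recording idea works like the sketch below. This is not the real test extension code (that lives under test/extension); the namespace, ConfigMap name, and key suffixes are illustrative:

```go
package lifecycle

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// recordHookCall sketches the idea described above: the handler reads the
// preloaded response for a hook from a ConfigMap and records that the hook was
// actually called by writing a "<hook>-actualResponseStatus" entry back.
func recordHookCall(ctx context.Context, c client.Client, hookName string) error {
	cm := &corev1.ConfigMap{}
	key := client.ObjectKey{Namespace: "test-ns", Name: "test-extension-hookresponses"}
	if err := c.Get(ctx, key, cm); err != nil {
		return err
	}

	// The preloaded response tells the handler what to answer with.
	_ = cm.Data[hookName+"-preloadedResponse"]

	// Record that the hook got executed so the test can assert on it later.
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data[hookName+"-actualResponseStatus"] = "Success"
	return c.Update(ctx, cm)
}
```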
A
I hope the size is okay for inspection. So, taking another look at the log here where the issue happens: okay, that's actually exactly the message — the call is not recorded in a config map which is in this namespace and is called like this. So maybe let's first take a look at that config map, how it looks now. We have to find where it is: it's in our bootstrap cluster, because the test extension — all of this — happens in the management cluster, which is the bootstrap cluster.
A
So let's take a look at the bootstrap cluster resources. Inside our namespace there's the config map, and let's take a look at the hook responses here. The interesting part should be the data section, which has the preloaded responses — and we can actually ignore the preloaded responses, if I'm right, because that's just the answer which gets returned — right? — if I'm right, yeah.
A
And so if we take a look: the BeforeClusterUpgrade one didn't work — but actually, it has recorded this correctly. So...
C
So
the
reason
so,
if
we
go,
can
you
go
to
the
code
where
the
test
is
actually
failing?
I
think
it's
four
seven,
six
line,
four,
seven
six
indeed,
runtime
SDK
upgrade
test.
A
We have some local changes here, so I'll just stash them — but I should be on the branch, though. All right, so it's in here where it fails, or where this timeout gets hit, which is why...
C
So
one
issue
we
have
I
think
so.
Can
you
just
check
the
stack
of
the
error
again,
just
to
make
sure
that
this
is
where
the
air
is
getting
returned
from
our
Omega
might
modify
that
the
way
that's
annoying
376.
C
Yeah, just because I know that error gets called in two places.
C
I guess because they need to be unmarshalled — there's probably a bug there in how we're marshalling and unmarshalling it. But that should be fine in this case.
A
Yeah, so — but it's also matched to, like, BeforeClusterCreate, yeah, which also has a status of success, and this — and this is the actual response, so...
A
We just get the whole config map and loop over it — yeah, ignore this whole printing stuff here. It just returns the ConfigMap.Data and then extracts the data, or checks whether the data is in there: there just needs to be a "hook name"-actualResponseStatus entry; it doesn't matter what the value is.
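In other words, the assertion boils down to something like the following sketch; the helper name and ConfigMap key are illustrative, and the real check additionally prints the data for debugging:

```go
package e2e

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// hooksWereCalled fetches the whole ConfigMap and makes sure there is a
// "<hook>-actualResponseStatus" entry for every hook we expect to have been
// called; the value itself does not matter here.
func hooksWereCalled(ctx context.Context, c client.Client, key client.ObjectKey, hooks []string) error {
	cm := &corev1.ConfigMap{}
	if err := c.Get(ctx, key, cm); err != nil {
		return err
	}
	for _, hook := range hooks {
		if _, ok := cm.Data[hook+"-actualResponseStatus"]; !ok {
			return fmt.Errorf("hook %q was not recorded as called", hook)
		}
	}
	return nil
}
```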
A
It's just unfortunate that it gets printed before all the test output, but basically the last one should be the last one — the last time it entered that place. So yeah, it prints exactly that out.
C
I mean, yeah — so can you go back to your logs and check the times, yeah. So one issue that we run into constantly with flakes is that controller-runtime uses a cache for objects. I'm not sure about caching config maps, possibly not — do you know, Christian? Sorry, again: are we caching config maps in controller-runtime? Probably, right; we don't cache Secrets, probably.
C
So that means that, because we're caching them — this is to prevent us from constantly storming the API server for information — everything we have is always slightly out of date. Most of the time there's no change between what we have and what the API server has, but there is always a risk of us getting something that is 100 milliseconds old, or whatever it is, like what I said earlier.
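For context, this is the difference between the cached client and a direct read in controller-runtime. A small sketch (the object key is illustrative) of how one could compare the two when chasing a stale-read theory:

```go
package inspect

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// compareCachedAndDirect reads the same ConfigMap through the manager's cached
// client and through the uncached API reader. The cached read may lag slightly
// behind the API server; the direct read reflects its current state.
func compareCachedAndDirect(ctx context.Context, mgr ctrl.Manager, key client.ObjectKey) error {
	cachedCM := &corev1.ConfigMap{}
	if err := mgr.GetClient().Get(ctx, key, cachedCM); err != nil { // served from the informer cache
		return err
	}

	directCM := &corev1.ConfigMap{}
	if err := mgr.GetAPIReader().Get(ctx, key, directCM); err != nil { // goes straight to the API server
		return err
	}

	// Comparing resourceVersions shows whether the cache was behind at this moment.
	_ = cachedCM.ResourceVersion == directCM.ResourceVersion
	return nil
}
```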
C
CPU constraints can make that really important, because the cache might not update as quickly if we have constraints. So maybe it's the cache. So — sorry, question — what I'm looking for is: were you in the full logs?
A
So that's what I copied over here now into this file, so I confirmed it, and there are also the managed fields in here — and the managed fields can tell us which parts of the object got updated the last time. And this here says: okay, these fields — which do not include the BeforeClusterUpgrade actual response — were last updated at 17:51 and 34 seconds, which basically is before the time the test failed; or rather, the last time it was updated was before the test failed.
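The managed fields can also be read programmatically; a small, purely illustrative sketch of dumping them for the ConfigMap:

```go
package inspect

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// printManagedFields dumps the managedFields entries of a ConfigMap: each
// entry names the manager that owns a set of fields, the operation, and the
// time of its last update, which is how the "last updated at 17:51:34"
// observation above can be made.
func printManagedFields(cm *corev1.ConfigMap) {
	for _, mf := range cm.GetManagedFields() {
		fmt.Printf("manager=%s operation=%s time=%v\n", mf.Manager, mf.Operation, mf.Time)
		// mf.FieldsV1 lists the owned fields, e.g. whether
		// data["BeforeClusterUpgrade-actualResponseStatus"] is among them.
	}
}
```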
C
Yeah, can we look at the CAPI controller manager logs, just to see if there's anything odd happening? Because the reason I'm very interested in this: we probably have a timeout — maybe if we just set the wait period in our test to five minutes, this flake will never happen again, which would be a win for us — but if we can patch an issue that's going on in the controller manager, the core CAPI controller manager...
A
I just want to place it somewhere nice, because that's JSON logging — it doesn't have the perfect timestamp in there. So...
C
Oh awesome — can you post that query somewhere, please?
C
When you just dump the thing without your jq query — just to see if they're the same.
A
Okay, you may want to filter them some more.
A
Sorry for the fuzzy things — so, any of...
C
Where are we exiting? That seems to show up pretty often, I guess: waiting for the Kubernetes node on the machine to report ready state.
C
Those are both positive at this point, and then there is a BeforeClusterUpgrade hook — I think that's the order in which they're called. So what's happening here, I guess, critically, is that the machine pools seem to still be rolling out.
A
Everything is still on the pretty old version.
C
But
sort
of
what
I
was
interested
in
was
a
PR.
It
was
from
like
a
month
ago.
Okay,
which
is
in
machine
pools,
they
added
a
node
water
to
machine
pools
yeah.
So
this
should
be
fine
for
that
issue.
A
Right, but I had other cases where this was not the state of the cluster at that point. I also had some...
C
Yeah, but — because if it's the case, if all of our cases look like what we're looking at now, we try to... run the registration.
C
That's probably the case in this one, yeah. It could also be the other case, of course, because we don't have health checks and stuff on our... no — but "has not been registered" means that the extension config doesn't exist, basically, or hasn't been read yet. In which case we probably already called it earlier in the test, right? Because we're failing at the upgrade, so we would have to have had it in the case we're looking at.
C
So
it
this
definitely
happens
before
we
have
any
rollouts,
but
our
machine
goes
aren't
rolled
out
yet.
C
Yeah, exactly — so it's kind of not free yet for the control plane to be upgraded. Yeah, yeah.
A
Maybe we should take a look at whatever came before that timestamp.
C
We won't upgrade if the machines aren't ready — okay, that's definitely true for the control plane. But yeah, I'm going to have to drop at the hour, but...
A
But
I
think
we
don't
have
enough
information
in
the
logs
here
it
just
it
just
says:
okay,
we're
waiting
for
the
node
on
the
machine
to
report
ready
state
but
yeah.
The
theory
currently
is
the
machine,
the
node.
Still
it
doesn't
get
ready
or
do
not
get
ready,
which
is
why
it
doesn't
reach
the
stadia
to
call
the
hook.
C
Yeah, so I guess the number one thing here is: can we figure out where the — if it's the topology reconciler that is exiting because it doesn't want to upgrade before the machines are ready, why isn't that being logged somewhere? Or maybe it is being logged somewhere and we're just missing it in all of the logs. But if it isn't being logged, it should be — that's a key kind of decision that we're making.
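If it turns out that decision isn't logged today, the fix would be a log line along these lines in the reconciler — a sketch with illustrative function and key/value names, using the logr-style logging that controller-runtime provides:

```go
package topology

import (
	"context"

	"k8s.io/klog/v2"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
)

// logUpgradeDeferred sketches logging the decision to hold back the control
// plane upgrade while machines are not ready yet, so the reasoning shows up in
// the controller manager logs.
func logUpgradeDeferred(ctx context.Context, cluster *clusterv1.Cluster, readyMachines, desiredMachines int) {
	log := ctrl.LoggerFrom(ctx)
	log.Info("Waiting for machines to become ready before starting the control plane upgrade",
		"cluster", klog.KObj(cluster),
		"readyMachines", readyMachines,
		"desiredMachines", desiredMachines,
	)
}
```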
C
Okay, I'm going to drop, but thanks a lot for doing this, Christian. This was, I think, super useful, and I feel like we could solve this right now by just adding longer timeouts — because of what we've seen, which we actually had in the test and I pulled them down at some point — but I think maybe we could actually get to the root of a more serious problem here if we keep looking, yeah.
C
Yeah, I'd be happy with that as a to-do — to do both. If you want to open a pull request — I'm not sure what that timeout is right now. Is it five minutes? It's 30 seconds; it's only 30 seconds for that entire handler.
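The quick-fix side of that to-do would just be giving the check a longer window. A sketch with Gomega, assuming the usual dot-imports in the spec; the helper name and the five-minute value are illustrative, not what the test uses today:

```go
// Raise the window for the "hook responses were recorded" check.
// checkLifecycleHookResponses is a hypothetical helper wrapping the
// ConfigMap check shown earlier.
Eventually(func() error {
	return checkLifecycleHookResponses(ctx)
}, 5*time.Minute, 10*time.Second).Should(Succeed())
```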
C
As long as we record it separately — well, we can continue on that issue, and maybe change the title of the issue or whatever. But...