From YouTube: 20201021 Cluster API Office Hours
A
Okay, so welcome, everyone. Today is Wednesday, October 21st, and this is the Cluster API office hours. Cluster API is a subproject of SIG Cluster Lifecycle.
A
During this meeting, please make sure you follow the CNCF code of conduct, which you can find linked at the top of this document if you haven't read it. Please be sure to raise your hand if you'd like to speak, and add any agenda items to the agenda right here. So, to kick it off, I think Vince has a PSA, so I'll hand it off to Vince.
B
Sure, thank you. This is something I just wanted to point out: v1alpha4 already has a lot of changes merged in, and during the alpha2-to-alpha3 cycle we wrote a migration document, linked here, which was extremely useful. I also gathered some feedback, so here is probably what we want to do going forward.
B
We should, one, backfill the changes that have already been merged into a similar document, and two, require that all new breaking changes are documented in this document. So this is more of a call-out, both for reviewers, to watch out for breaking changes, and for provider implementers, because the changes that are coming in could be extensive, and probably will be. So take a look at that document, and if you have any questions, reach out.
A
Thanks, Vince. Yes, and you have a question?
C
Yeah, it's more of a note. For our provider, we also started documenting these breaking changes, or at least the ones we're planning, and the goal is to make it a living document for anyone consuming the provider. So it might be something good to adopt across providers as well, so that end users know what is going to change.
A
Yeah, great point. And the one in CAPI, I assume, is more for providers? Who's the target audience, I mean: is it more the users, or the providers who are going to be adopting these changes?
B
The document I wanted to point out is probably more for providers. For users, hopefully clusterctl upgrade will take care of most of those things in terms of just upgrading, and if there are API changes, those should definitely be documented.
A
Yeah. And is there an existing living doc for v1alpha4, or is that to be created?
B
It's to be created. We have merged a few PRs in, and I have to collect the changes that have already been merged, but I'll probably start at the end of next week or the week after that.
A
Okay, maybe it would be a good idea to just create a blank document, even if it doesn't have all the changes so far, just so people can add their stuff in the next few weeks as they create changes.
D
Okay, so I have a cluster with three control plane machine nodes, and basically what I'm trying to do in this demo is first make one of these nodes fail and show how the remediation works out.
D
In order to do so, and make it visible, I have a slightly modified version of the code in my PR, and basically I'm supporting two annotations. One annotation blocks the remediation: the machine can enter the remediation path, but it blocks before the remediation is issued, which will allow us to see that step. And the second annotation blocks the machine from being deleted, so we will see the remediation being started but not completed. This will allow us to follow along a little bit. I'm applying those annotations to one of the three machines, and now, in order to kick off the remediation, I'm applying a label, so this machine will be checked by an MHC which is running in the cluster.
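As a rough picture of that setup, the targeted control plane Machine might look something like the sketch below. The annotation keys are hypothetical stand-ins for the demo-only blocking annotations in the PR, and the extra label is just an example of something the MHC selector could match; required spec fields are omitted for brevity.

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: Machine
metadata:
  name: my-cluster-control-plane-abc12                 # illustrative machine name
  labels:
    cluster.x-k8s.io/control-plane: ""                 # standard control plane machine label
    demo/mhc-target: "true"                            # hypothetical label matched by the MHC selector
  annotations:
    demo.cluster.x-k8s.io/block-remediation: "true"    # hypothetical: pause before remediation is issued
    demo.cluster.x-k8s.io/block-deletion: "true"       # hypothetical: pause before the machine is deleted
spec:
  clusterName: my-cluster                              # bootstrap and infrastructureRef omitted for brevity
```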
D
So now that I've applied this, the MHC, which is configured to consider the machine unhealthy immediately, kicks in. Basically what happened is that the MHC applied a condition to the machine saying that the node is unhealthy, because the condition has been reporting False for more than 15 minutes.
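For reference, a MachineHealthCheck targeting control plane machines with an aggressive timeout, roughly in the spirit of the one used in this demo, might look like the following; the cluster name, selector, and timeout values here are illustrative rather than the exact ones from the demo.

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  name: control-plane-mhc
  namespace: default
spec:
  clusterName: my-cluster                    # illustrative cluster name
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""     # target the control plane machines
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 10s                             # very short timeout, for demo purposes only
  - type: Ready
    status: Unknown
    timeout: 10s
```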
D
Once it is unblocked, KCP will start taking care of the remediation, and now KCP is taking care of it. Basically what happened is that KCP started processing the remediation, so the remediation is in progress, and in fact the remediation in KCP is deleting the node. As you can see, the node is being deleted, and the infrastructure machine is also being deleted at this stage. And so, if I now unblock deletion, I let the process continue.
D
Basically, the machine is being deleted, right, okay. And then, when the machine is deleted, the normal scale-up process kicks in, and so we have KCP restoring the third control plane node. I'll pause here, because I know it is not easy to follow; so I'll stop here if there are questions, and then I will show another use case where remediation cannot happen because there are failing etcd members.
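Roughly speaking, the handoff you can watch for on the unhealthy Machine looks like the condition snippet below. The condition types and reasons shown are approximate and only meant to illustrate the flow from the MHC flagging the machine to KCP owning the remediation; the exact names depend on the Cluster API version and the PR being demoed.

```yaml
status:
  conditions:
  - type: HealthCheckSucceeded      # set to False by the MachineHealthCheck
    status: "False"
    reason: UnhealthyNode           # illustrative reason
  - type: OwnerRemediated           # stays False until the owner (KCP here) finishes remediation
    status: "False"
    reason: WaitingForRemediation   # illustrative reason
```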
A
Very cool. I don't see any hands raised right now. Anyone have any questions?
D
Okay, so I'll move on to the second example of the remediation. Now, basically what I'm doing is creating a critical situation: I'm going into this control plane machine and making etcd fail, and then I will try to remediate another machine. But this will not be possible, because if I do remediate that machine, I will lose quorum.
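The reasoning here is just etcd's majority requirement; a sketch of the arithmetic, assuming a three-member etcd cluster as in this demo:

```latex
% etcd requires a majority (quorum) of members to stay available:
\[
  \mathrm{quorum}(n) = \left\lfloor \frac{n}{2} \right\rfloor + 1,
  \qquad
  \mathrm{quorum}(3) = 2 .
\]
% With one of the three members already failing, only two healthy members remain,
% which is exactly quorum. Deleting (remediating) a second machine would leave a
% single member, below quorum, so KCP refuses to remediate.
```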
Okay,
there
is
a
pr
out
that
now,
as
you
can
see,
it
is
not
visible
in
condition,
but
there
is
a
pr
out
from
cedar
that
will
make
basically
the
tcd
member
feeling
visible
in
condition
as
well,
but
now
I'm
going
to
so
this
one
is
the
machine
with
this,
defending
I'm
trying
to
remediate
this
one,
so
I'm
basically
forcing
this
machine
being
being
remediated
and
what
what
happened.
It
happens
that.
A
This is awesome. I think we have a question from...
A
Joe, I don't know if you're muted, Joe, but we can't hear you.
A
Oh, and sorry, I thought you were raising your hand.
A
Something... okay, maybe having technical difficulties. Yeah, this is really cool. I had a question: did you install a MachineHealthCheck on your cluster in order to be able to do this?
D
Yes, I started a MachineHealthCheck which is configured to make a node be marked as failed immediately, but this is only for testing. You can have your checks targeting the control plane machines and configure them as you do for the nodes.
A
Okay, sounds good, cool. And also, a very nice UI, or presentation of the overview of the cluster; I'm sure I'm not the only one who can't wait to be able to use this as well. Very polished. All right, any other questions?
F
Hello, can you hear me?
F
Awesome. I had a question: is this quorum check pluggable? Like, for example, if I had a Rook cluster and I wanted to make sure that I didn't lose quorum of my monitors?
F
So, you know, I could host a Rook cluster on Kubernetes nodes, right? And so how would I use the same process there?
A
Jason is saying something in the chat: you could use the upcoming external remediation if your Rook cluster is managed by a MachineDeployment.
A
But yeah, I think the proposal for that is merged, or I think so. Yeah, Andy, did you have something?
E
Yeah, so KCP is the kubeadm control plane, and it specifically uses kubeadm to manage machines that represent a Kubernetes control plane. So I would second what Jason suggested in chat: if you have Rook deployed on your cluster, that is managed totally separately from the Kubernetes control plane, so you would need external remediation or some other way to deal with that, sure.
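For context, the KubeadmControlPlane object that KCP reconciles looks roughly like the sketch below; the names, version, replica count, and infrastructure template kind are illustrative.

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: my-cluster-control-plane           # illustrative name
spec:
  replicas: 3                              # three control plane machines, as in the demo
  version: v1.19.3                         # illustrative Kubernetes version
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: DockerMachineTemplate            # provider-specific template; Docker used as an example
    name: my-cluster-control-plane
  kubeadmConfigSpec:
    clusterConfiguration: {}
    initConfiguration: {}
    joinConfiguration: {}
```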
E
Yes, so your question, then, is: I have multiple things running on my control plane, and I want to consider not only etcd health, I want to consider other things before remediating. I think pod disruption budgets would probably be useful there, to prevent draining a node until you've got the right number of replicas elsewhere. I don't really think this is a kubeadm control plane problem.
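As a sketch of that suggestion, a PodDisruptionBudget protecting Rook's monitors might look like the following; the namespace and pod labels follow common Rook conventions but are assumptions that should be checked against the actual deployment.

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: rook-ceph-mon-pdb
  namespace: rook-ceph              # assumes the default Rook namespace
spec:
  maxUnavailable: 1                 # never allow more than one monitor to be drained at a time
  selector:
    matchLabels:
      app: rook-ceph-mon            # assumed label on the Rook monitor pods
```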
F
It just seems like it's a common pattern of wanting to check a quorum at the higher application layer before remediating.
A
So, for what it's worth, the CAPZ conformance periodic jobs are still passing, if that helps. But I think it would be a good idea to have either a periodic triage or maybe alerting in some way, so we can pay more attention to those failures.
G
For those specific jobs there is alerting; we are getting the message in the release email. And especially for CAPG, they are quite different from the other providers: they build the conformance testing using the Kubernetes k/k repo, using Bazel and a lot of other stuff behind the scenes, and it looks like the way to build the conformance tests changed a little bit and the script was out of date. There's a PR in place to fix CAPG.
G
As for CAPA, I made a comment in this channel; I'm just waiting, maybe for Andy, if he can comment on that. There's something missing, a missing template; that's why it's failing.
H
Yeah, I was just going to say, with Fabrizio's work to create a more unified dashboard of the various conformance jobs that we have out there, that would probably be a good tool that we can use to triage any issues with those tests during this meeting on a regular basis.
A
Yes, definitely. And then I think we've also talked about using those as release-informing for CAPI itself at some point, which would also be a natural step. And then we have a suggestion in the chat to add the SIG Cluster Lifecycle mailing list to the alerts, if they don't have it yet. That's a good point, because if it's just alerting the release list, the CAPI maintainers might not be aware of it.
A
Thank you. Any other comments, thoughts, or concerns on this topic?
A
Okay, thanks again, Carlos, for stepping up and investigating, and let us know if you need any help on that front. All right, Jason, you have the next one.
H
Yeah, so, following up from the meeting last week, I'm trying to schedule a kickoff meeting for anybody who's interested in working on the load balancer provider proposal.
H
Yeah, so the basic idea is that we currently have the idea of a load balancer provider within the vSphere provider, and there are other kinds of provider implementations that can't rely on a default cloud-managed load balancer. So the idea is to bring the load balancer up to being a first-class provider within Cluster API itself, and then that gives other providers, similar to vSphere, that don't have a default built-in load balancer, the ability to share common implementations.
H
It also will open up the ability to more easily swap out load balancers, even for cloud providers. So, for example, in AWS right now we use a classic ELB; it would potentially give us the ability to create an NLB equivalent that could be swapped out relatively easily. So yeah, that's that.
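Purely as an illustration of the idea, and not the proposal's actual API, a first-class load balancer object that a cluster could reference, and later swap for a different implementation, might be sketched like this:

```yaml
# Hypothetical sketch only; no such resource exists in Cluster API today.
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
kind: AWSLoadBalancer
metadata:
  name: my-cluster-apiserver-lb
spec:
  type: nlb            # e.g. swapped from "classic" to "nlb"
  # A Cluster or control plane object could then reference this by name, for
  # example through a hypothetical loadBalancerRef field, and swapping
  # implementations would mean pointing that reference at a different object.
```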
I
Yes, last week I brought this up, that I opened the Google Doc for this feature maybe three weeks ago. This is more of a question, because, like I said, I asked people to comment on the Google Doc last week, and it isn't collecting that many comments anymore. So what is the current policy: how long should we keep this open, and what are we expecting from these Google Docs? That's what I'm kind of asking.
A
Yes, so I'm not familiar with the proposal itself, but is it targeting v1alpha4? Yes? Okay, great. So there is a proposal process documented in the project, and I think the general guidance is to leave it open as a Google Doc for a while, unless there are any big blocking comments, and then to open it as a PR; I think Vince just shared it. And it should be in the implementable state, but other than that...
I
Yeah, I thought so. So, if we don't have any other thing on the agenda, I would like to quickly ask, because I got a comment about in-place upgrades.
I
Obviously it is a different thing, but have you ever discussed anything about whether we could have an in-place option while upgrading? For example, in bare metal we might have certain disks attached to the servers that we might want to reuse during the upgrade, so that we won't take a new, fresh server; we probably want this new machine to be the same node as the earlier one, but upgraded.
A
Yes, so it is a topic that comes back every once in a while. If you search through the doc, I just found the notes from last time; there was a pretty extensive discussion. I think you can find the recording; it was from September 2nd. But I think Jason has a comment, so I'll let him speak.
H
That said, I would expect right now it would take another control plane implementation to be able to implement in-place upgrades. There shouldn't be anything that would prevent that, but it hasn't been something that we've talked about supporting, because of various reasons that I'm sure are linked in that discussion Cecile mentioned.
I
Yeah, yeah, that's actually good to know. So we probably need to find another way to implement this, maybe in Metal3, in how we select the nodes. And yeah, this is good to know, even if it's not even being thought of, so I'm not going to wait for the discussion to go further, at least right now. Thanks, Jason.
A
All right. So, unless anyone has any topics, I think this is the end of the agenda. Just a reminder that if you have any current proposals that are open, please make sure they're in this list if they're still in Google Doc form, so we can track them. And if you are looking at this list and something is interesting to you, please make sure you review it. And yeah, until next time. Oh, Andy, did you want to add something?
E
Yes, one quick follow-up on the alerting around the test failures. We do have a Google group that's linked in the notes right now, and there are a few of us who have ownership rights on that. So if you are interested in receiving alerts any time there are failures from Prow related to Cluster API jobs, please feel free to reach out to me; basically, I need your name and your email address, and I'll be happy to add you.
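For anyone curious how that kind of alerting is usually wired up, TestGrid alert emails are configured through annotations on the Prow job definitions in kubernetes/test-infra. A rough sketch, assuming a periodic Cluster API job; the job name, image, command, and alert address are placeholders and should be checked against the real job configs.

```yaml
periodics:
- name: periodic-cluster-api-e2e                            # placeholder job name
  interval: 2h
  annotations:
    testgrid-dashboards: sig-cluster-lifecycle-cluster-api  # TestGrid dashboard for CAPI jobs
    testgrid-alert-email: capi-alerts@googlegroups.com      # e.g. the Google group from the notes
    testgrid-num-failures-to-alert: "2"                     # alert after two consecutive failures
  spec:
    containers:
    - image: gcr.io/k8s-staging-cluster-api/capi-e2e:latest # placeholder image
      command: ["./scripts/ci-e2e.sh"]                      # placeholder entrypoint
```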