From YouTube: SIG Cluster Lifecycle - Cluster API 21-09-01
A
Hello folks, and welcome to the September 1st Cluster API office hours meeting. Just as a reminder, we have a meeting etiquette: if you would like to speak up or add some comments or questions, feel free to use the raise hand feature; it's under Reactions in Zoom.
A
This is the agenda doc. If you don't have editing rights, you have to join the SIG Cluster Lifecycle mailing list. It might take a bit to get those permissions in place, but feel free to join, and you should also get the meeting invite as well. Before we start, is there anybody new here who would like to introduce themselves?
B
Yes, I'm Prosper, in Nigeria. I've been following this community for a while now, and I thought I could actually get involved with the community on Slack and just join the meetings and learn more. At the moment I'm still learning, mostly from the meetings that we've been having so far.
C
On Monday I posted a link to a first demo of ClusterClass and managed topologies. The feature is still work in progress, but I think we can start seeing the value that this feature can bring to the project, and so the demo is there.
C
Thank you to everyone in the community who gave their feedback, super valuable feedback, during the work so far, and we are of course looking forward to new feedback about the demo. Feel free to reach out on Slack, comment on the issues that we still have open, or review the PRs; everything is labeled with area/topologies. And let me also give kudos to the entire team.
A
Awesome, thanks. Does anybody else have any questions on ClusterClass before we move on? Definitely check out the demo; there is a link here.
A
And it's definitely amazing to see this come together in a little over a month, I think, not counting the proposal, but that takes time. Cool, so let's jump on to open proposals. This one has merged; I think we can probably remove it from here.
A
At this point: the spot instances proposal update with the termination handler design. Do we have any updates on this? It has been on the list for a while; there are a lot of comments open here.
A
Let's go on to the other one. Mike, do we have any updates on the opt-in scale from zero?
E
Yeah, I've updated this, I think, with all the comments, so I think this is in good shape; I think I've captured everything that we've talked about. I could just use a few more reviews on this. I think we're probably really close to being able to merge it, and then I've got some PRs ready to go on the cluster autoscaler side, so once we merge it on this side...
E
...we could probably, within a few weeks, maybe in cluster autoscaler 1.23, be using this, basically.
A
Awesome, I'll assign myself to take a look, and can we maybe target next week to potentially do a lazy consensus and merge it?
A
Seeing plus ones. All right, so lazy consensus starting, I guess, Monday's off technically, so the 7th.
E
Yeah, thanks, thanks Vincent, really appreciate that. I don't know if you saw it or not, but Joel mentioned about the previous one, the spot instances one, that he's gonna poke Alex just to see if he's got any updates for it. Okay, thanks Joel, I didn't see the chat, so thanks for coming back on that.
A
Awesome. Okay, so is Cecile here? No. Let's see here. So, something that we have talked about is beta one and how we get there. I do see that there is a proposal here for 0.5, but if we do keep minor releases on alpha 4, we need to make sure that new proposals are actually backward compatible.
A
If there are any breaking changes, please bring them up at the next community meetings, so we can talk about why we need a breaking change and how we can make it happen, maybe in a backward compatible way.
A
So, something that we have discussed in a past meeting, for folks that were not here: to get to beta one, what we would like to do is focus on the stability of our types today, and then jump and convert the alpha 4 types to beta 1 with a minimal amount of changes. Spreading out the breaking changes in code over time would be beneficial to all infrastructure providers, so that they can adopt breaking changes in smaller chunks, but also in a quicker fashion.
D
Yeah, just, there are, I guess, two potential classes of breaking changes. One is an API change, such that you would need to rev the API, and one could be a behavioral change, which has come up in CAPO, where we've made the architectural decision to make some sort of behavioral change, and that's going to result in an increment of the minor version release whilst we still keep the same API revision. Is that going to be the case in CAPI core, or is it going to be stricter?
A
That's something we need to discuss, right. I think we should definitely document in the CAEP template whether a CAEP is introducing either API or behavioral changes; behavioral changes are sometimes considered part of the contract. So if we're talking about the Cluster API contract: right now we are using the API types version, rather than the Cluster API version, to define the contract.
A
So maybe we should switch that to something else, maybe the major and minor release in the future; we'll have to talk about it. I don't have a good idea yet, I'm just putting this out there. If this comes up, we should probably discuss it in the next few months to come up with a solution.
A
Behavioral changes, though, in terms of how the controller behaves after you set a field, for example: I think those are fair game with a good amount of motivation. They're usually, I guess, improving things rather than necessarily breaking clients, so I'd say it depends.
A
It depends, on a case by case basis, but I do like the idea that was proposed: let's work on the CAEP template once we decide what kind of changes we want to allow before beta one. And I think Cecile also brought up that if we do set a freeze date, let's stick to it, and I do like that idea, right: if we do want to put more stuff on the roadmap, we need to stick to it. So how do we do that, and what are other things that we want to add to this roadmap?
A
Let's move on to discussion topics. Stefan, you have the ClusterClass patch amendment.
G
Let's just move me back a slot; I'm gonna try to figure out what's going on with it.
A
Okay, so next up is Naadir.
D
Yeah, I just want to give a bit of an update. There is a draft document that's based on a number of previous assessments done by the CNCF. We had our first sort of kickoff meeting on Monday about what we're going to do, and as a prerequisite for doing any of the threat modeling, we need to document the data flows. Ankita Swamy from VMware has volunteered to start doing that, so I'll be assisting, basically getting the view of...
D
...where data is moving about. We're going to look at core Cluster API and, because of who was on the call, we're doing AWS as a provider. I know there was some interest from the Azure folks as well previously; it would be good to do another provider too. So it's basically a call out: if you want to be involved and have your provider be part of an assessment, or a fast follow-on, then that's what's happening.
D
Our aim is to get the initial set of data flows documented for our next meeting on October 4th, and then we have some probing questions around how does this happen, etc. That's been organized by Pushkar from SIG Security, so I'll probably start a Slack thread if other people want to be involved in that, but yeah, I just want to let people know what's happening.
A
All right, Alex.
H
Yes, just to remind everyone: last week we discussed creating a new provider for KubeVirt, and we do have an issue created for it and got a couple of upvotes for it, so hopefully the repository will be created soon. Last week we discussed two alternatives; one ended up being pretty long, and this one is a faster track. So hopefully, once the repository is created, we'll just merge the code that we've had so far, via several PRs, as a contribution to this new repository.
H
So that's our current course of action, and yeah, looking forward to collaborating with folks, whoever is interested. We've already been pinged by Microsoft and Red Hat, so looking forward to collaborating with the team, yeah.
A
Alex, I don't know how long it's going to take after this to get the repo; it might take a few days, but yeah, I think we should be good to go on that, so definitely feel free to reach out in Slack. After the repo is created, we could probably also create a new Slack channel.
I
Yeah, hey, so I wanted to bring this up, I guess, for discussion and to get feedback, but also just for awareness for everybody. As of, I think, the v0.4.1 release, we changed the CAPD provider to actually use Docker API calls, Docker Engine calls to be specific, to get away from the CLI.
I
In my opinion, at least, the CLI isn't a great API, but really one of the big benefits of doing that work is the next step, which I haven't gotten to yet: it enables us to easily mock out that interface, so that we can have unit tests to exercise some of this code, which we couldn't do when we were actually shelling out to the Docker CLI.
I
One user realized that this means you can't just alias the podman executable in those environments where that had previously worked. And the more recent one is the change in the Docker Desktop licensing for end users: it's still available for non-commercial use, but if you're working for a large organization, or using it for commercial software, then, where it had been free, now there is a fee; really not an outrageous fee, but it's no longer a free product.
I
So at least on those platforms it does add an extra step. I guess this change doesn't introduce that, you still have the Docker issue either way, but there were easier ways to drop in alternatives by aliasing the CLI that you can't use now that we're actually tied to making the actual API calls.
I
So I wanted to throw it out there to see if anybody else had any strong opinions on that, any ideas of the best way the community should support these other users, or these other environments, and just collect feedback.
C
I think it is a good thing, because we are already seeing better error messages; as you explained, we now have an abstraction layer behind an interface, and we can use mocking and stuff like that. So from the CAPD point of view I'm really satisfied by the work done so far. However, we have the same, let me say, library; we are using the same library for the end-to-end tests.
C
And this is impacting users who are starting the end-to-end tests using podman, and this will also...
E
But I had talked to Antonio a few months ago, and I thought there was still a problem with using nested podman with the privileged options that we needed. Are you hearing from people that that's actually working for them?
E
Okay, cool, yeah, thank you very much. I mean, I think this sounds great; my only addition would be that I thought we had kind of talked about doing a container abstraction around this stuff. I think it would be cool if we could still work towards that.
I
Yeah, I should have brought that up: the way it's structured now, it should be easy to plug in different container engines, so that is another option, for podman or whatever else we want to support.
J
I wanted to clarify, on the Docker Desktop change of licenses for those operating systems: this will not affect personal use, educational use, and non-commercial open source projects. So it's debatable: we obviously use these products as the basis of commercial software, but the project itself is not commercial.
J
So, I mean, this should probably be run by a legal team, but I don't think the usage of Docker Desktop is impacted in this case.
A
Yeah, there's definitely a gray area there, right, that we need to understand a little bit more. I had a question for you, though: you mentioned Docker Desktop, but this doesn't impact CAPD, right?
C
Yeah, I think so, because basically CAPD only requires a Docker socket, and it does not require you to shell out or stuff like that. The original problem, from the user having podman, is that they were trying to run the end-to-end tests, and during the end-to-end tests, basically, we spin up a kind cluster and then we use...
D
Yeah, so if you're on Linux, for example, you don't really have Docker installed; if you're like me and Michael, you use Fedora, so we actually have the Moby engine installed if we're using Docker, the thing that is called docker in the CLI. Docker Desktop wraps up all the VM stuff that happens on macOS and all that, so there isn't any issue there. So it's a good target to then implement a CRI interface, and then that would work with containerd, CRI-O, possibly like minikube.
A
Yes, that was actually where I was kind of going with this when I asked the question. We definitely probably don't want to go back to the CLI, even though I thought about it, because there's also another project called nerdctl, for Mac, that uses Lima, which is Linux machines on Mac, to create a VM and then pretty much do what Docker Desktop does, but with many extra steps. But it's only compatible with the Docker CLI as an interface; I don't see the Docker REST APIs being in there as well, which is the main thing that we need, or we'd need to test that out, but yeah.
A
My main concern with all these changes is that CAPD is, you know, how people will test Cluster API for the first time in the quick start, and so it's a pretty big impediment if we cannot make it work seamlessly, especially if it might require a license over time. So if we can support multiple runtimes, that would be the best way, and then we can have docs on how to install other runtimes as well, not just Docker for Mac or for Windows; I guess Linux is not impacted.
G
Stefan here. Yeah, just another dimension, I'm not sure if it was already explicitly mentioned: we're using kind to create our first bootstrap cluster, and as far as I know, kind itself is only compatible with Docker. So I think we also have a transitive dependency, through kind, on Docker.
E
Yeah, that's the biggest thing. Sorry to go out of order, but I've been using podman with kind a lot recently, and current versions of kind will automatically detect podman from the command line when you use it. The problem comes in when you try to... if you create the management cluster in the podman kind, that works fine.
E
But when you try to start creating workload clusters, that's where you start running into problems, because it wants to run podman inside the first podman that you're running, and that's where I kept running into issues where it couldn't talk to the outside host to actually do the podman work that it wanted to. So that's where I kind of ran aground.
J
I've subscribed to the kind repository, so I see what's happening there, and from my observations the support for podman is less mature compared to Docker. As for the general idea of supporting multiple container runtimes on the host where the management cluster is created: like Stefan said, you have to depend on what kind supports, really.
J
Honestly, I saw the PR from Sean about moving from the CLI to the Docker API, but I'm trying to understand why this is happening. Were we not happy with the CLI abstraction, that we basically stopped abstracting the CLI and instead started using APIs for the underlying container runtime?
I
I think really one of the main drivers was testability: calling out to the CLI made it very difficult to write tests, so I'd consider that the first one. Also, just using a CLI as an interface, especially with the changes that have been coming out of Docker, I was a little concerned that changes would be introduced that would somehow cause issues; but maybe that's not a really valid concern.
J
I mean, it's a pretty valid concern. I think another concern, however, is that if you want to have the abstraction, and the API of these other container runtimes is not sufficiently mature, you end up in a situation where you have to support an abstraction that is a mixture between CLI and API to support all of them, which might be harder to maintain; I'm not sure, but yeah. I may eventually dig more into this topic; for now I'm just trying to familiarize myself with the ideas.
G
I just forgot, okay.
A
All right, so let's keep an eye on this. You know, you could say kind is just for development, which is true, but we just need to be mindful of the quick start, to be honest, which is the main entry point for lots and lots of new users, and it's super easy, if you don't have an infrastructure provider, to just try it out with kind.
K
Hey, I had a question regarding the CAPI upgrade logic. I think, usually for Kubernetes upgrades, the order is such that etcd nodes undergo the upgrade first, followed by the control plane and then the workers, but I've seen that if the user applies the KCP and MachineDeployment updated specs at the same time, then the rollout of all machines starts around the same time. So I just wanted to know: why is that?
K
I also had a question: when the KCP upgrade starts, I can see that the controller updates the kubelet ConfigMap. Could there ever be an issue such that the worker machine rollout starts before the kubelet ConfigMap gets updated, so it does not have the required changes?
A
Yeah, so that's correct, and that's what we suggest as well: right now it's a manual process, right. If you do change those versions, they will just be rolled out at the same exact time, so you wouldn't get the precedence of making sure that the control plane is rolled out first.
C
Yeah, sure. So, with ClusterClass there is a PR out right now, being reviewed, where basically in ClusterClass we would have a single point where you define the version of your cluster, and then, when you change the version in that single point, the controller will take care of upgrading KCP first and then start upgrading the machine deployments in a well-defined, predictable order, which is the machine deployment order by name. This will be, let me say, the first upgrade strategy that applies to the entire cluster.
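As a rough sketch, the single version point being described lives on the Cluster's managed topology; field names here follow the ClusterClass proposal and may differ from the final API:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster
spec:
  topology:
    class: my-cluster-class   # the ClusterClass this cluster is built from
    version: v1.22.0          # single point of truth: bumping this upgrades
                              # KCP first, then the machine deployments in a
                              # predictable order (alphabetical by name)
    workers:
      machineDeployments:
      - class: default-worker
        name: md-0
        replicas: 3
```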
F
Yeah, so one quick question: if someone wants to upgrade their workers in a specific order, their machine deployments in a specific order, I guess they have to name them, for now, in alphabetical order, so they get picked up in that order?
C
So, let me say, this is the first time that we are trying to automate this process, and we are finding a lot of corner cases, like what happens if a user starts an upgrade while another upgrade is still completing, or stuff like that. So, while implementing this, we needed...
C
...a way to identify a predictable order, so that it is easier to test and the user can reason about it, and stuff like that; so we chose the alphabetical order of the names, because it is something that is already there.
C
If you have a requirement around this, please open an issue; we will try to consider it, but first we have to, let me say, really try to understand the problem space and all the possible corner cases that we have to face.
K
Yeah, so I get that with ClusterClass that order will be used by default, but once ClusterClass is in, will users still be able to create clusters without using the ClusterClass API, and what will happen with the upgrade order in that case?
A
So, once ClusterClass is in: if you create a managed cluster topology, attach it to a ClusterClass, and change the version field on the cluster, what happens is that there are contracts in place for control plane providers and for the machine deployments. From control plane providers we expect a version field to be both in the spec and in the status; there's a contract out for that.
A
So what we do is set that spec field and then wait for the control plane to actually finish the upgrade process, so the status field should reflect the new version that has been rolled out, and once that's done we proceed with the machine deployments. On the machine deployment side, I think, right now...
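To illustrate that contract, a sketch using KubeadmControlPlane as the control plane provider; treat the exact field paths as an assumption for illustration:

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: my-cluster-control-plane
spec:
  version: v1.22.0   # desired version, set by the topology controller
status:
  version: v1.21.2   # observed version; machine deployment rollouts wait
                     # until status catches up with spec
```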
K
Yeah, so I think this was for ClusterClass, right? But what I'm saying is, if I create a cluster without using the ClusterClass API, this issue, this change in order, would still exist, right? Like there is no... okay.
A
No, no, there's no reason to force that; it's just that our controllers are separate, and they were not able to talk to each other before; now we're trying to make that better and improve it. Okay.
A
All right, Mike: the CAPI cluster autoscaler GPU functionality.
E
Yeah, so this is kind of a question to the community about information sharing between the cluster autoscaler and our CAPI providers. There is an edge case that exists in the cluster autoscaler where, if a user is using GPU-related machine deployments, machine deployments that have GPUs in them, it's possible for the user to deploy a workload and then too many GPUs, or too many nodes, are created. And this is based around kind of a timing issue that takes place...
E
...you know, between the time that a node says it's ready in Kubernetes and when the GPU driver has actually made the GPU allocatable on that node. And so what can happen is the cluster autoscaler can become confused: it sees a ready node that it thinks a pod would be schedulable on, but Kubernetes says the pod is still pending, and so that forces the autoscaler to create another node, and, depending on the timing, you can create several extra nodes before the job is actually scheduled.
E
So there is a mechanism in the cluster autoscaler that allows providers to put a label on a node when they expect that a driver will land there for an accelerator-type resource like a GPU. For most providers in the autoscaler this isn't a big deal, because it's kind of a one-to-one relationship, one provider for one autoscaler integration; but for us, the Cluster API provider actually gives you access to a wide number of providers.
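For context, a sketch of how such a label could be applied to CAPI GPU nodes at join time; the label key cluster-api/accelerator is the one documented for the autoscaler's clusterapi provider, but treat the exact key and field paths here as assumptions:

```yaml
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: gpu-worker
spec:
  template:
    spec:
      joinConfiguration:
        nodeRegistration:
          kubeletExtraArgs:
            # tells the autoscaler a GPU driver is expected on this node,
            # so it won't scale up again while the driver is still landing
            node-labels: "cluster-api/accelerator=nvidia-gpu"
```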
E
Now, I've been testing this out a lot over the last couple of weeks, and I have a feeling I have a good handle on how it works and where the edge cases are. But what I wanted to bring back to this community was a CAEP, or documentation, or something, so that people who are writing providers know about this case with the GPU and know how to add it...
E
...to their providers. So I wanted to get opinions here: what is the best form for that to take? A CAEP, or should this be a documentation update, or what would look best for this community?
D
It almost sounds like it fits quite neatly with the scale from zero proposal, so if it's a small addition to that, maybe just wrap it into that; it doesn't sound like a major change to the stuff that we're already going to do for that particular thing.
D
Okay, maybe if we get that proposal in, then you could open up a PR to the documentation for the contract when you know more details.
L
All right, so I have a question to understand this better, because I'll also be working on some parts involving GPUs. We are talking about, essentially, the NVIDIA GPU driver not coming up soon enough, and installing the required drivers, right? So is there a way to mark the node ready only when that is up? Because the boot-up of a node involves many things, and just the network being up does not mean that the node is ready to serve as a node within Kubernetes, right?
L
There are a set of other options, so is it possible to have some sort of endpoint on the node which says that it is ready only after the drivers are downloaded and...
E
So, you know, the node becomes ready, and it is actually ready to schedule workloads; the only workloads it's not ready to schedule are GPU-related workloads. So I have a feeling that what you're talking about makes sense to me, but I think that's kind of a different conversation, more around the node and the kubelet and whatnot.
F
Yeah, just to add a little bit more on this: there was originally a proposal, led by Vish and Andrew on the node side, regarding node readiness gates, that would basically allow controllers to influence the ready condition properly, but it wasn't merged yet. So I assume that, even if you go with labels, you would somehow still see the node go ready.
A
Awesome. Any other questions on GPUs before we move on?
G
Trying again... yeah, this time it works, nice. Okay, so the topic is ClusterClass patches again. Last week I showed variables; now I've extended the amendment doc, which is linked in the meeting notes, to also include patches.
G
So it's not that trivial to use; the main idea is to move the complexity a little bit away from the users, like we do today with the cluster templates we deliver with our providers; that's usually something most end users don't have to deal with.
G
So what I would like to do is show a few examples of how it looks, and then, yeah, just call out that we have the proposal amendment in, in our opinion, good shape. We would like to collect a little bit more feedback, if possible this week, and then try to move it relatively soon to a PR, depending on how much feedback we get and how controversial it is. So, yeah, I'll show how it looks. Just a short recap: a regular ClusterClass looks like this.
G
It's very similar to our old templates: you have your ClusterClass, you have a control plane reference, an infrastructure reference, and your workers, mostly the same resources as before, but it's all in your ClusterClass. And then what you actually use to create a cluster is just a Cluster with a ClusterClass reference, and you can set some properties: you can have a version, labels, the specific machine deployments you want, and that's it; you're not able to customize it further. Then, to show how the patches and variables look...
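The shape being recapped here is roughly this (a sketch; names are illustrative and the refs are abbreviated, omitting apiVersion and the bootstrap reference):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: my-cluster-class
spec:
  controlPlane:
    ref:                              # e.g. a KubeadmControlPlaneTemplate
      kind: KubeadmControlPlaneTemplate
      name: my-control-plane
  infrastructure:
    ref:                              # e.g. an AzureClusterTemplate
      kind: AzureClusterTemplate
      name: my-azure-cluster
  workers:
    machineDeployments:
    - class: default-worker           # referenced from Cluster.spec.topology
      template:
        infrastructure:
          ref:
            kind: AzureMachineTemplate
            name: my-worker-machines
```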
G
So the ClusterClass here is mostly the same as I just showed before: you have your kubeadm control plane template, the machine template you use in your kubeadm control plane, your Azure cluster template, and your machine deployment classes, which will be used to create machine deployments.
G
Then you can define variables. Variables have a name, they can be required or not, and they have a schema. In the first iteration we only use basic types, so we only have string, boolean, integer and float; they can have default values, and that's the same as you would specify in CRDs. We plan to add more complex types in later iterations, so you can probably have objects and arrays and things like that.
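For example, a variable definition might look roughly like this (a sketch following the amendment; the variable names are made up):

```yaml
spec:
  variables:
  - name: vmSize
    required: true
    schema:
      openAPIV3Schema:
        type: string
        default: Standard_D2s_v3   # defaults behave as they do in CRDs
  - name: enableAudit
    required: false
    schema:
      openAPIV3Schema:
        type: boolean
        default: false
```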
G
So here you define your variables and the schema of the variables, and then you define patches, and those are only inline patches: the current proposal amendment only goes as far as introducing inline patches; in a later iteration we also plan to add external patches, with which we can do more. So this is just the no-code option here. So what can you do here? There are two main parts of a patch definition. The first part is the target of the patch: on which resource do you want to apply the patch? In this case...
G
...it's just the Azure cluster template. And then you define a JSON patch. A JSON patch has three parts. First, the operation you want: the current state of the proposal amendment is that we only support add, replace and remove, because those are probably all we need, and it's a little bit hard to support things like move and copy, and probably nobody needs them; so we plan to leave those out for now, and if somebody needs them later, I guess those would be future amendments. So we have add, replace, remove. Then we have a path...
G
...for what we want to change, and then we have a value. We have a few different options to define a value; the option we have here is that we just use a variable directly, so this just takes the user-provided value of your variable and uses it to patch, in this case, the name.
G
You would provide it like this; that's templated, because probably clusterctl still has some templating, but basically you can say: my cluster identity is "cluster-identity". You create your cluster, the value will be validated against the schema, the patch will be evaluated, and your Azure cluster template will be patched with this value.
G
Those are all just the same, just single replaces. The first additional thing you can do: if we have something like this, it would, in theory, patch every machine template you have in your ClusterClass, but you might only want to patch those which are part of the control plane, for example. So in this case we want to patch the control plane machine type, and that's why we only want to patch the vmSize in Azure machine templates which are part of the control plane.
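Putting those pieces together, an inline patch might look roughly like this (a sketch; the selector and valueFrom field names follow the amendment and should be treated as assumptions):

```yaml
spec:
  patches:
  - name: controlPlaneVMSize
    definitions:
    - selector:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AzureMachineTemplate
        matchResources:
          controlPlane: true         # only templates used by the control plane
      jsonPatches:
      - op: replace                  # add, replace and remove are supported
        path: /spec/template/spec/vmSize
        valueFrom:
          variable: vmSize           # user-provided value, validated against
                                     # the variable's schema
```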
G
That's the full feature set, and the only additional feature I didn't mention yet is what you can do with variables. We already saw plain variables; we also plan to introduce things like builtin variables, which, for further details, see the amendment, but we want to provide some values out of the box. So when you want to use the cluster name somewhere, you can just use builtin.cluster.name, and there are a few more.
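For instance, a patch value could reference the built-in cluster name like this (illustrative; the builtin variable names and the Azure additionalTags path are assumptions):

```yaml
jsonPatches:
- op: add
  path: /spec/template/spec/additionalTags/owningCluster
  valueFrom:
    variable: builtin.cluster.name   # provided out of the box, no user input
```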
G
So there's that, and the only other thing I don't have here is that you can also use hard-coded values, if that makes sense for some reason; you can just hard-code your JSON value, whatever you want, directly in the patch. Yeah, I think that's a rough overview; as I said, more details are in the doc.
J
Stefan, just a quick question related to the JSON patches: can you explain what library you are using for that?
G
I assume the same one we're using currently, and I'm not sure what exactly it is. We already have some JSON patch calculations, some JSON patches, in ClusterClass: when we reconcile the topology, we have to patch control planes and whatever.
J
Okay. I think it has some shortcomings, by the way, related to some of the operators; it was missing something from the original spec of this whole format of operators and paths. I forgot the name of the exact standard, but yeah, it had some shortcomings that were difficult to fix; that's not a big blocker, I guess. One related question, because in the kubeadm API we recently introduced patches...
J
We had patches, but the only way to modify something is from a file. In this case, for the ClusterClass API, we are embedding the ways for the user to modify something inside the API spec. Did you consider, at some point, allowing the user to have external files to do the same?
G
What we thought about is that the structure here is defined in a way that we can also move the patches out, so we can consume the patches, if you want, from ConfigMaps or something. That's something we considered; we didn't introduce it yet, but it's not a file, it's just another object in Kubernetes. I'm not sure if that answers the question.
J
I think one...
G
One thing why that might be okay is that the inline JSON patches are the first easy solution, just to use it a little bit, but it's not the end goal, I guess.
G
So maybe when you get into the range where you have, I don't know, one or two thousand lines of JSON patches, then the inline option is probably not the right one. Yeah, understood.
A
Okay, yeah, this is definitely, I want to say, a stopgap, but it's definitely not the end game. It's going to be a combination of these inline patches with some sort of extensions, which I opened an issue for, and if you folks are interested, folks that were here for alpha one and alpha two will probably remember something similar.
A
It's called runtime extensions; there's an RFE open for runtime extensions, and I decided to jot down some notes on what that would mean. This is not just for ClusterClass; it would probably be implemented first in ClusterClass to see how it works, but it could be extended to other things as well.
A
So we asked a bunch of infrastructure providers and other folks from the community: what is the biggest problem of Cluster API right now? It's the extensibility: it's always through providers, which, you know, are sometimes hard to implement if you just want to add some little step, maybe for a downstream requirement or a private requirement.
A
So it's meant to kind of solve that specific problem, and reduce the overhead of this inline patching at the same time.