From YouTube: Kubernetes SIG Cluster Lifecycle 20180228 - cluster api
Description
Link to doc: https://docs.google.com/document/d/16ils69KImmE94RlmzjWDrkmFZysgB2J4lGnYMRN89WM/edit#
A: Hello, and welcome to the Wednesday, February 28th edition of the Cluster API working group, a working group of SIG Cluster Lifecycle. We have a few agenda items today, and it looks like folks are just now signing into the doc. The first agenda item we have is from Robert, who is out for the next few weeks and says he won't be responding by email or Slack or otherwise be reachable. So if anybody needs him or anything else, contact us or the rest of the SIG, and we can probably help you.
B: Right now, this PR only updates to 1.9. It does not have a fix; there's a dependency on the apiserver-builder to actually have that fix. I can probably link the PR from the apiserver-builder side as well; there's a problem in how they generate strategies for merging, which doesn't seem to work. That PR needs to be in before we can actually... we'll need to do another update to get the apiserver-builder, but for now this only updates all the machinery to 1.9, I think. And yeah, that's pretty much it for this.
A: Cool, so I'll take a look at the MachineDeployment PR. Anything else on your two items?

B: No.
C: Yeah, Friday before Robert went out for a while, he posted the MachineClass prototype, and I just want to point it out. It's there for you; I've already provided some feedback on it, particularly with how we specify available capacity versus actual capacity, and how that might actually feed back into changes in the machine spec instead of the MachineClass. But I would like to get more eyes on it, and I will try to ping the autoscaler team to make sure that what we have satisfies what they need.
D: My question is: would it be better to put those parameters, the capacities and all those different things, into the provider config, the complete configuration or whatever it was called, in the parent object, and only reference the provider config raw extension as a reference? I mean, the point is, with the current APIs that Robert is proposing, you are only able to reference that cloud config. My question was about this.
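A minimal Go sketch of the two shapes being weighed in this question; the type and field names here are illustrative, not the prototype's actual API:

```go
// Illustrative only: two ways a Machine could pick up provider parameters.
package api

import "k8s.io/apimachinery/pkg/runtime"

// Option 1: parameters (capacity, instance type, etc.) embedded directly
// in the Machine as an opaque provider config raw extension.
type MachineSpecInline struct {
	ProviderConfig runtime.RawExtension `json:"providerConfig"`
}

// Option 2: parameters live in a shared parent object (a MachineClass),
// and the Machine only holds a reference to it.
type MachineSpecByRef struct {
	ClassRef MachineClassRef `json:"classRef"`
}

// MachineClassRef names the parent object carrying the parameters.
type MachineClassRef struct {
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
}
```

The trade-off under discussion: option 2 lets many Machines share one set of parameters, but it also means a Machine alone can only change which class it references, not the individual values.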
C: I think that's good feedback. If you could put that on the PR itself, we could discuss it in long form there, yeah.
A: Okay, anything else on Robert's PR? The one thing that I would want to call out about it is, if we are trying to push to a stable release of the API, whatever that means, this would implicitly need to be solved before we get to that stable release. So I'm going to flag that as well, and we can talk more about what we're going to do with this as we talk about migrating to a new repository, which is conveniently coming up, I guess.
A: My next thing is, in the new repository I updated the OWNERS file and used my almighty Git privileges to merge that in, so hopefully other people can start to review PRs as well, and they can start to work. Has anybody had a chance to look at that, or to test whether they have a merge button and the bots are working as expected?
C: Some testing, for we need to check some things in to enable the bots for certain things in the test-infra repo, and I'm not sure the auto-merge pool has been set up yet. But before we even get to that, I'd want to get the unit tests over there and the CI testing, so that we have tests before merge.
A: Okay, cool. The one thing that I've been kind of thinking about... I watched the call from last week. This is relevant to the migration, the bigger migration effort in general. One of the issues I brought up, and again I'm sorry I missed the call, was: what do we want to migrate, and what do we not want to migrate? And it looks like just moving the API definition and the common code is going to be a little more tricky than we thought.
A: And I think the big difference between what I was just proposing and what's actually written here... in the proposal it's more of a "we should only migrate these subsets of features," and what I'm saying is it might be easier just to migrate everything. I don't really have strong opinions either way; it just seems easier.
D: Just one thing from my side: during the steering committee, Brian Grant, for those of you who were not present, explained the role of sub-projects in the new project structure and how SIGs own code. So what are we going to do about that? Right now the way it's proposed, this is a project under SIG Cluster Lifecycle, and are we going to get a sub-project of our own, or some ideas on the future work, I think.
A: There are certain expectations of us as far as, like, defining a charter and our scope of work and things like that. I think it's either we go out of our way and do the items needed ourselves as a sub-project, or we wait for that mandate to come down from SIG Cluster Lifecycle.
E: Just one comment on that: it seems like right now most of this working group's work is going into the kube-deploy repo, and it's harder for people to discover that repo. It's a little bit of an odd place, it seems. So I guess what I'm thinking is that a sub-project...
A: I think it would be cool to be at the top level, personally, because I really feel like the infrastructure layers are super important to Kubernetes and often get... what's the word I'm trying to say here... overlooked. I don't know. Should we open up a proposal for that, or what are our thoughts here? I don't know how hard of an uphill battle this would be, to potentially get, you know, kubernetes/cluster-api, or machines-api, whatever.
A: Okay, this is going to be the most meta thing I say all day: I would propose that we write a proposal... a proposal that would go into the new repository.
A: The one thing I wanted to bring up is, I volunteered to raise the working-group-versus-sub-project question at next Tuesday's SIG Cluster Lifecycle meeting, but I'm going to be out next week. I added it to the agenda; does anybody want to volunteer to bring that up, track that work, and plug it back into these calls the following week? If not, it'll just get pushed back, which would be fine too.
G: Orphan machines, cases like that. So, you know, you have types defined, but we expect users to write the controller, and we also write the controller, which might have bugs inside it, right? So there is still a possibility that a MachineSet, let's say, ends up creating more machines than it should. Or, I would say, from what we have seen of the way cloud providers behave, sometimes the cloud provider SDK gives a response that it has actually created the machine, but the machine is not created.
G: In that case, the Machine object becomes an orphan; or, the other way around, if it happens, the machine itself becomes an orphan. So, have we even thought about this, or do we have any strategy for how we'll deal with these orphan machines, just to prevent a MachineSet or MachineDeployment from exploding, in terms of the number of machines, I mean?
A: I don't know if this is necessarily a Cluster API mandated concern, but it's definitely something that happens a lot. I've seen it in almost every Kubernetes deployment tool I've worked with, where some infrastructure can get orphaned somehow; usually that happens through a failed deployment or a partially completed deployment.
A: I've seen a lot of different tips and tricks. I know the Go team has something like a scraper that will go in every night and just destroy all the infrastructure in a given account. So there's a lot of different avenues here, but I guess the higher-level question for me would be whether this is something that we want to prescribe.
G: We went through this in our machine controller manager recently, and we ended up writing a safety controller. What we try to do is, for every object that we create, we put a specific label which refers to the cluster name. All providers support something like that: AWS supports tags, and Google supports that with tags, and so on, right? So we put these kinds of tags, and then, separately in parallel...
G: The safety controller will basically keep an eye on the machines with the same tags; it has a map across the actual machines and the Machine objects which were created, and if it finds that the machines are orphaned, it can easily delete them. But the interesting use case is the second part, where at any point, because of a bug or some other reason, if the MachineSet tries to create more Machine objects or more machines, then what the safety controller does is basically freeze them.
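A rough Go sketch of the orphan-detection half of that safety controller: list cloud instances carrying the cluster tag, compare against the instances that Machine objects actually claim, and delete the rest. The CloudClient interface and tag key are hypothetical stand-ins for a provider SDK.

```go
package safety

import "context"

// CloudClient is a hypothetical abstraction over a cloud provider SDK.
type CloudClient interface {
	// ListInstancesByTag returns instance IDs carrying the given tag.
	ListInstancesByTag(ctx context.Context, key, value string) ([]string, error)
	DeleteInstance(ctx context.Context, id string) error
}

// DeleteOrphans deletes cloud instances tagged for this cluster that no
// Machine object references. machineInstanceIDs holds the instance IDs
// recorded on the known Machine objects.
func DeleteOrphans(ctx context.Context, cloud CloudClient, clusterName string, machineInstanceIDs map[string]bool) error {
	ids, err := cloud.ListInstancesByTag(ctx, "cluster-name", clusterName)
	if err != nil {
		return err
	}
	for _, id := range ids {
		if !machineInstanceIDs[id] {
			// No Machine claims this instance, so treat it as an orphan.
			if err := cloud.DeleteInstance(ctx, id); err != nil {
				return err
			}
		}
	}
	return nil
}
```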
G: If you look at each MachineDeployment or MachineSet, there is some kind of sync handler inside it, right? So we basically skip that loop for as long as the freeze label is there on the machine. And that logic is actually turning out to be really useful in our case, because OpenStack and some other providers do behave weirdly sometimes; you don't know, when they respond, what they're saying. We get a response that the machine is created, but it's not there, and sometimes errors occur, and so on.
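As a sketch, that freeze check could sit at the top of the MachineSet sync handler, so a frozen object is skipped on every loop until someone intervenes. The label key and types here are made up for illustration:

```go
package controller

// MachineSet is a minimal stand-in for the real object.
type MachineSet struct {
	Labels map[string]string
}

type MachineSetController struct{}

// freezeLabel is a hypothetical label a safety controller sets when it
// detects runaway machine creation.
const freezeLabel = "safety.example.com/frozen"

func (c *MachineSetController) syncMachineSet(ms *MachineSet) error {
	// Skip reconciliation entirely while the safety controller has the
	// object frozen; this stops a buggy loop from creating more machines.
	if ms.Labels[freezeLabel] == "true" {
		return nil
	}
	// ... normal scale-up/scale-down reconciliation would follow here ...
	return nil
}
```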
G: Because, in our case, if you look at the types, they are defined, but if a user writes the controller, then the controller is the only component which is talking to the cloud provider, and this can actually cause huge problems. This is something which can actually leak resources because of minor bugs leaving orphaned VMs.
A: Yeah, I knew it was just going to be a matter of time before we brought up tagging resources in a cloud provider account. Justin's giving us the thumbs up. It's a well-known problem, especially because a lot of the cloud providers kind of follow the "it's okay to be eventually consistent" mentality. So even with a tag, you can create a tag and then there's some delta in time before you might be able to read or render that tag. So it's a hard problem to solve.
A: I think probably the best pattern that I've seen, and this is my opinion, for what it's worth, is: every time you create infrastructure, it's sort of like an atomic creation, where you hang until you're also able to recognize the infrastructure from the cloud. That would be like a controller-level primitive that would say: if you are making a mutation, you don't want to consider that mutation valid until you're also able to read back from the cloud that your mutation has succeeded.
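A sketch of that controller-level primitive, assuming a hypothetical provider client: the create only counts as done once the instance can be read back, which papers over eventual consistency. The caller is expected to bound the wait with a context timeout.

```go
package provider

import (
	"context"
	"fmt"
	"time"
)

// Client is a hypothetical cloud provider client.
type Client interface {
	CreateInstance(ctx context.Context, name string) (id string, err error)
	// GetInstance returns found=false while the instance is not yet visible.
	GetInstance(ctx context.Context, id string) (found bool, err error)
}

// CreateAndConfirm treats creation as atomic: the mutation is only
// considered to have succeeded once it can also be read back.
func CreateAndConfirm(ctx context.Context, c Client, name string) (string, error) {
	id, err := c.CreateInstance(ctx, name)
	if err != nil {
		return "", err
	}
	// Poll until the eventually-consistent API reflects the mutation.
	for {
		found, err := c.GetInstance(ctx, id)
		if err == nil && found {
			return id, nil
		}
		select {
		case <-ctx.Done():
			return "", fmt.Errorf("instance %s not visible before timeout: %w", id, ctx.Err())
		case <-time.After(5 * time.Second):
		}
	}
}
```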
G: That's correct. Ideally the controllers themselves should be mature enough to take care of this part, but it does so happen that sometimes it becomes impossible for the controllers themselves to understand at the moment, because, for example, the provider is timing out or is not reachable at all. For some moment the creation call might internally have gone through and the VM is created, but then you are not able to reach the cloud, and the controller cannot tell. So maybe some other process running in parallel can later on garbage collect what was produced.
F: I think the machine controller should be responsible for the underlying instances, and for making sure that they're all bound to a Machine, as it were. If you have an extra instance that you know you're responsible for, that you created but that isn't backed by a Machine, I think you should delete it, and hopefully that will just be part of the sort of normal reconcile loop. How would we do that? So in AWS, for example, we would tag every machine that we create, so that we know it's not someone...
F: ...you know, some other machine that just happens to be running. And then, I personally haven't done this yet, but if I'm watching instances directly, I would probably also tag each instance with the UID of the Machine object, and then I think it would be safe to delete any instances that are completely orphaned.
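A small Go sketch of that UID binding, including the adopt-or-delete decision it enables after a controller restart. Everything here, the tag key, the Cloud interface, the Machine fields, is hypothetical:

```go
package reconcile

import "context"

// Instance is a hypothetical view of a cloud instance and its tags.
type Instance struct {
	ID   string
	Tags map[string]string
}

// Cloud is a hypothetical provider abstraction.
type Cloud interface {
	TerminateInstance(ctx context.Context, id string) error
}

// Machine is a minimal stand-in for the Machine object.
type Machine struct {
	Status MachineStatus
}

type MachineStatus struct {
	InstanceID string
}

// machineUIDTag is an illustrative tag key recording which Machine object
// an instance belongs to.
const machineUIDTag = "machine-uid"

// ReconcileInstance decides what to do with one tagged instance: re-bind
// it to its Machine if that Machine still exists, otherwise treat it as
// orphaned and terminate it.
func ReconcileInstance(ctx context.Context, cloud Cloud, inst Instance, machinesByUID map[string]*Machine) error {
	if m, ok := machinesByUID[inst.Tags[machineUIDTag]]; ok {
		// The Machine still exists: record the instance ID on it so the
		// binding survives controller restarts.
		m.Status.InstanceID = inst.ID
		return nil
	}
	// No live Machine carries this UID: the instance is orphaned.
	return cloud.TerminateInstance(ctx, inst.ID)
}
```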
F: Then, even if you restart at just the wrong time, you can still reconcile it to a Machine if that Machine still exists, so you can sort of recover from most scenarios. The other case, I think, is you have a machine that's running in the infrastructure, but it doesn't join as a node. That one, I think, should be generic.
F: The logic is basically the same, where after 10 minutes, or whatever you configure, you decide that the machine is not joining and you delete the Machine, I guess, and expect that the cloud provider, or the machine provider, the machine controller, would then delete the actual infrastructure. And I saw the other day in SIG AWS, where we were talking about nodes that go NotReady: after about 13 minutes people want to delete them and have them terminated, and we don't have that. Yes, it's a similar thing where we see infrastructure.
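The generic "never joined as a node" check could be as small as the sketch below; the type, the flag, and the 10-minute default are illustrative, echoing the "or whatever you configure" caveat above.

```go
package health

import "time"

// Machine is a minimal stand-in for the cluster-api Machine object.
type Machine struct {
	Name       string
	CreatedAt  time.Time
	NodeJoined bool // whether a matching Node has registered
}

// joinTimeout mirrors the configurable join window from the discussion;
// the value here is only an example default.
const joinTimeout = 10 * time.Minute

// ShouldDelete reports whether a Machine has failed to join as a node
// within the configured window. Deleting the Machine is then expected to
// make the machine controller tear down the underlying infrastructure.
func ShouldDelete(m Machine, now time.Time) bool {
	return !m.NodeJoined && now.Sub(m.CreatedAt) > joinTimeout
}
```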
H: For example, if you want to hot-swap a VM and have it represent the same instance... separating those two concepts might help: Machine right now is playing both of those roles, the logical instance and the physical machine. Having two separate concepts may be helpful, I agree.
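One way to read that suggestion, sketched as two hypothetical Go types: the logical Machine keeps a stable identity while the physical instance behind it can be swapped out.

```go
package types

// Machine is the logical member of the cluster: its identity is stable
// even if the VM backing it is replaced.
type Machine struct {
	Name string
	// InstanceRef points at whichever physical instance currently backs
	// this Machine; hot-swapping a VM only updates this reference.
	InstanceRef string
}

// Instance is the physical VM as the cloud provider sees it.
type Instance struct {
	ID   string
	Zone string
}
```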
A: Another pattern that I've seen be pretty successful is: every time your controller goes through an iteration, for all of the infrastructure it does know about, it updates a timestamp, and that way you can define TTLs. Where you store that timestamp doesn't matter; it could be in a database or a convenient tag, and there are pros and cons to each of those. But when you get to a piece of infrastructure that hasn't had an updated, like, timestamp, you can probably say, according to some policy, that it's safe to delete as well.
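A minimal sketch of that last-seen/TTL pattern in Go, with an in-memory map standing in for whichever store (database, tag) a real controller would use:

```go
package gc

import "time"

// lastSeen maps infrastructure IDs to the last time a controller
// iteration observed them; where this actually lives (database, cloud
// tag) is a separate trade-off, as noted above.
type lastSeen map[string]time.Time

// Touch records that an iteration saw this piece of infrastructure.
func (l lastSeen) Touch(id string, now time.Time) { l[id] = now }

// Expired returns the IDs whose timestamp is older than the TTL; by
// policy these are considered safe to delete.
func (l lastSeen) Expired(ttl time.Duration, now time.Time) []string {
	var ids []string
	for id, t := range l {
		if now.Sub(t) > ttl {
			ids = append(ids, id)
		}
	}
	return ids
}
```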
G: That's correct. So I think the machine controller can take care of this part, except in the cases where, say, for example, the deletion call for some moment cannot go through, but we still want to maintain a number of machines, right? At that point we might want to create one more new machine, add it to the existing cluster, and let some other collector get rid of the older machine which we were not able to delete previously, yeah. Of course, this logic could be baked in inside the controller itself.
D: One more thing: the kubelet right now actually does garbage collection of orphaned pods. In some cases, for example, the Machine might have been forcefully deleted while the controller is down, and we need to garbage collect in that case as well: the Machine resource is deleted but the node is up, and then you need to do some magic in that case, either delete the machine if it is not, for example, bound, or try to create a new Machine object and then put it back together.