From YouTube: Kubernetes SIG Cluster Lifecycle 20190213 - Cluster API
Meeting Notes: https://docs.google.com/document/d/1Ys-DOR5UsgbMEeciuG0HOgDQc8kZsaWIWJeKJ1-UfbY/edit#heading=h.rm2b4redfsar
A
Hello, and welcome to the Wednesday, February 13th edition of the Cluster API project office hours. We have a relatively short agenda today, so if you do have any topics, please go ahead and add them to the agenda, and I will go ahead and link the agenda back in the chat again for anybody who's joined recently.
B
Sure, hi. This is a behavior change that I noticed after we moved to the CRD-based implementation, and thanks to Daniel, who helped me find this issue. I originally noticed this problem in the vSphere provider implementation a couple of weeks back, where the controller loop that is supposed to catch up
B
every X seconds was not happening. From an observation point of view, I noted that it was doing it every 10 hours, but I didn't know where that was set, and he helped me narrow that down, so thanks to him. One other thing that I noticed is that none of the providers are actually overriding these default sync values, so this behavior is going to be common for all
B
All
providers,
so
just
to
highlight
one
simple
use
case
where
this
might
be
impacting
every
provider
is,
if
from
your
interest.
So
let's
say
once
you
create
a
machine
object
and
your
actuator
actually
creates
a
underlying
VM
in
the
you
know
in
the
provided
infrastructure
and
if
somebody
were
to
go
back
directly
to
the
infrastructure
and
delete
that
we
am
now
usually
unless
your
actuator
is
activated
again
by
the
sync
call.
B
Your
actuator
will
never
know
and
realize
that
their
infrastructure
VM
is
gone
and
will
never
recreate
that
now
before
C
are
deeply
CIB.
This
time.
What
I
believe
I'm,
not
I,
don't
know
what
was
the
exact
time,
duration,
but
I
believe
it
was
in
some
seconds
or
maybe
a
minute
or
so
kind
of
a
duration,
so
every
minute
or
that
smaller
duration,
the
actuator,
will
come
in
and
will
detect
that
or
that
instance
is
gone
from
the
back
end
and
the
provider
would
recreate
it.
B
But
now
that
behavior
will
not
happen
because
the
default
sync
is
now
actually
10
hours,
so
I
thought
I
just
bring
it
for
awareness
for
every
provider
as
well,
and
you
know
in
the
VR
or
in
the
issue.
I
also
noted,
where
to
set
that
particular
time
out
how
to
override
that
but
I'm.
But
just
a
note,
I'm
still
actually
verifying
that
I
haven't
really
verified
that
time
sing,
but
looking
at
the
code
seems
like
that
will
be
the
right
place
to
set
and
yeah
I
mean
I.
B
Think
maybe
this
is
something
that
we
should
document
so
that
any
new
provider
who
is
going
to
come
up
in
the
future
should
be
aware
of
this
fact:
I'm,
not
sure.
Where
will
be
the
right
place?
Maybe
in
the
get
book
somewhere,
you
know
say:
I
mean
I
yeah,
that's
about
it.
I
mean
any
suggestions.
Welcome
to
what
will
be
the
watch.
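
As a concrete illustration of the override B mentions, here is a minimal sketch, assuming the provider builds its controller manager with controller-runtime; the ten-minute value is purely illustrative, not a recommendation from the meeting:

```go
package main

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	// controller-runtime's default resync period is roughly 10 hours; an
	// explicit SyncPeriod overrides it for every informer the manager's
	// cache creates, re-triggering reconciliation of all watched objects
	// at that interval.
	syncPeriod := 10 * time.Minute // illustrative value

	cfg, err := config.GetConfig()
	if err != nil {
		panic(err)
	}

	mgr, err := manager.New(cfg, manager.Options{SyncPeriod: &syncPeriod})
	if err != nil {
		panic(err)
	}
	_ = mgr // provider controllers would be registered with mgr here
}
```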
D
Historically, there were two reasons I was told for why the resync period would be set. One was to paper over potential bugs, so that if events were missed, the resync period would ensure that eventually they would be noticed. But there's another reason, which is what I think this issue is about, and that is that if you want the actuator to respond to external events, it needs to wake up periodically and check to see if the external state still matches the internal state.
D
So
because
of
those
two
reasons,
I
speculate
that
two
builder
made
this
longer,
because
we
believe
that
there
are
no
longer
any
race
conditions
which
cause
events
to
be
missed,
but
I
think
by
setting
it
longer,
we
lose
the
we
lose
the
ability
for
the
actuator
to
be
able
to
verify
changes
to
external
state
which
may
occur
independent
of
anything
the
actuators
doing
so,
I'll
stop
there.
That's
my
understanding
of
why
it
might
be
set
that
way.
E
I
believe
that
is
spot-on
and
I.
Think
I
think.
The
other
aspect
is
that
there
is
a
trade-off
here,
which
is
you
know
why
don't
we
set
it
to
one
second
right
and
it's
because
the
you
would
then
have
all
these
sort
of
spurious
or
clean
up
resyncs
and,
for
example,
on
AWS.
You
might
well
hit
a
rate
limit
on
on
touching
your.
How
often
you're
allowed
to
make
calls
against
your
cloud.
E
I would be very happy to set a much lower limit; it feels like that is the correct thing to do. And then, when you start seeing AWS rate limits, there are other strategies, for example polling AWS, or polling your cloud, every minute or whatever it is, and keeping a cache, rather than polling it on demand. We sort of started to inch towards those in the cloud provider, but we never got all the way.
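
A sketch of the poll-and-cache strategy E describes; the Instance type and listInstances function are hypothetical stand-ins for a real cloud SDK call:

```go
package cloudcache

import (
	"sync"
	"time"
)

// Instance is a hypothetical, minimal view of a cloud VM.
type Instance struct{ ID, State string }

// listInstances stands in for one bulk call against the cloud API.
func listInstances() (map[string]Instance, error) {
	return map[string]Instance{}, nil // stub
}

// InstanceCache is refreshed on a fixed interval so that reconciles read
// from memory instead of hitting cloud rate limits on demand.
type InstanceCache struct {
	mu        sync.RWMutex
	instances map[string]Instance
}

// Run polls the cloud once per interval until stop is closed.
func (c *InstanceCache) Run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if latest, err := listInstances(); err == nil {
			c.mu.Lock()
			c.instances = latest
			c.mu.Unlock()
		}
		select {
		case <-ticker.C:
		case <-stop:
			return
		}
	}
}

// Get returns the cached view of an instance, which may be up to one
// polling interval stale.
func (c *InstanceCache) Get(id string) (Instance, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	inst, ok := c.instances[id]
	return inst, ok
}
```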
A
Yeah, I just remember we had an issue with the AWS provider where the exponential back-off for retries wasn't taking effect; it wasn't happening. So we would try 10 times really fast in succession, and then, I can't remember whether it was 5 or 10 minutes later, we would kind of reawaken. So...
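
For reference, a controller-runtime reconciler can approximate the back-off behavior A describes by returning RequeueAfter with a growing delay; the attempt counter and five-minute cap here are illustrative assumptions:

```go
package backoff

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

const maxBackoff = 5 * time.Minute // illustrative cap

// requeueWithBackoff returns a Result asking the controller to retry after
// 1s, 2s, 4s, ... for successive attempts, instead of retrying immediately
// in a tight loop.
func requeueWithBackoff(attempt uint) reconcile.Result {
	delay := time.Duration(1<<attempt) * time.Second
	if delay > maxBackoff {
		delay = maxBackoff
	}
	return reconcile.Result{Requeue: true, RequeueAfter: delay}
}
```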
G
So there are two different sync parameters, I guess. One is what was just discussed by Siddharth: you want to continuously make sure that all the objects are re-queued after a certain time period, right? I didn't check the issue in detail, but it seems it is related to the other one, which is actually about rebuilding the entire cache from the apiserver, mirroring the API server in your local cache. That could, I guess, be slightly longer.
G
You
don't
have
to
continuously
keep
doing
it,
but
for
the
other
part
where
you
want
to
keep
adding
new
objects
in
the
queue,
that's
probably
a
smaller,
much
smaller
time
period.
So,
for
example,
what
we
do
is
that
we
make
sure
that
every
five
seconds
or
10
seconds
we
continuously
keep
adding
the
objects
and
see
that
in
case.
For
any
reason,
if
you
missed
the
update
event
or
certain
event
person
with
a
particular
machine
object
or
machine
set
or
deployment,
we
can
pursue
it
again.
So
maybe.
G
So that's the first point, about when objects are re-queued. The second point is about the use case which was discussed, right, about the actuator continuously checking whether the desired state matches from the VM side. It seems the better solution there could be this: we know that, at the moment, if someone goes to the cloud provider and deletes the VM, then what should ideally happen is that, on your machine object,
G
We
should
be
able
to
see
that
Cuba
desktop
this
morning
were
node
conditions
and
the
Machine
controller
should
be
written
in
a
way
that
that
mission
controller
identifies
that,
because
the
node
condition
is
saying
that
it
is
not
responding
from
Los
X
minutes
now
it's
the
time.
I
should
actually
replace
that
particular
mission
object
other
than
actuator
after
visiting
period,
we
invoked
actuator
and
actuator
goes
and
check
whether
vmx
I'm
just
thinking
out
loud.
So
if
you
go
the
approach
I
just
described,
then
we
can
also
consider
more
conditions
in
picture,
for
example
qubit.
G
Not
that
is
just
one
one
condition
then.
Similarly,
you
can
also
consider
this
equation,
so
let
G
ket
of
educators.
I
guess,
is
that
if
I'm,
not
thirty
minutes
at
this
pressure
inside
the
same
control,
loop
will
delete
the
machining.
We
create
the
machine
because
the
disclosure
is
I,
then
you
can
basically
not
use
the
Machine
anymore
only
in
workload.
So
just
wanted
to
clarify
on
this
that
on
the
first
part,
there
could
be
two
different
kinds
of
times
that
time
or
periods
and
the
second
part
the
problem
could
be
solved
differently.
B
Actually, that's a pretty good suggestion. Yeah, I think you rightly identified that there are two things: one is the cache side, which is obvious, and the second is, actually, since this is going to be a common issue for all the providers, one idea that comes to my mind here is: in the core controller loop, should there be some sort of periodic, like an independent periodic requeue, in that?
B
Might
that
should
happen
in
a
generic
way
for
every
provider
like
it
for
some
same
time
right
that
the
actuators,
regardless
of
the
provider
implementation,
the
actual
gets
called
for?
You
know
verifying
all
the
objects
every
X
seconds,
or
you
know
maybe
a
minute
or
two
regardless
right
to
the
point
of
the
node.
You
know
node
stopped
responding
and
you
know
you
know
detecting
it.
The
other
way
around
where,
from
the
cubelets
point
of
view,
I
think
that's
good.
B
But
one
of
the
challenge
that
I
see
is
you
know,
for
example,
if
there
is
a
you
know,
how
do
you
trade
off
like
how
much
time
do
you
need
for,
for,
let's
say,
kubrick,
to
be
ready
to
begin
with?
Let's
say
after
the
vm
has
been
created
and
powered
on
depending
on
what
kind
of
things
are
you
trying
to
do
on
that
vm
to
bring
it
up,
so
I
mean
maybe
once
the
vm
is
ready,
after
that
there
could
be
different
strategy,
but
maybe
in
the
Michel
process,
very
very
we're
actually
just
creating.
G
That's a perfect point, so we had to divide the timeout differently. We had the same timeout for VM creation and for a VM being unhealthy and then being deleted, so we basically split the timeout into the VM creation time and the rest; we prefer to give slightly more time for VM creation, and then a VM being unhealthy is considered a different situation. Well, it's actually a general problem, in a way, that if the kubelet stops responding, then there is no good way to actually identify what could have gone wrong.
G
It
could
be
because
of
the
network
issue.
Well,
it
could
be
actually
vm
is
somehow
deleted
and
so
on,
but
yeah
I
agree
with
the
point
that
there
could
be
two
different
announcer
and
we
can
handle
it
that
way
and
it
could
be
some
do
still
decide
and
the
actual
parameter
could
be
same
across
all
the
cloud
providers.
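
A sketch of the split G describes, with creation and unhealthiness given separate budgets rather than one shared timeout; the field names and default values are illustrative assumptions:

```go
package timeouts

import "time"

// MachineTimeouts separates the phases that the discussion notes should not
// share a single value: bringing a new VM up can legitimately take a while,
// whereas an existing node going unhealthy should be acted on sooner.
type MachineTimeouts struct {
	Creation  time.Duration // budget for a new VM to boot and join the cluster
	Unhealthy time.Duration // how long Ready may be lost before replacement
}

// DefaultMachineTimeouts are illustrative defaults, not values from the meeting.
var DefaultMachineTimeouts = MachineTimeouts{
	Creation:  20 * time.Minute,
	Unhealthy: 10 * time.Minute,
}
```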
I
When we looked at the code yesterday with Siddharth, we found that where this resync period is set is in the controller-runtime manager, and that is instantiated in some code that is not generated. So every provider hand-wrote that code, although it looks relatively the same everywhere. So that's just something that we'll have to figure out.
E
Yeah
I
agree:
I
would
also
suggest
that
we
make
it
short
enough
that
developers
are
more
likely
to
hit
it.
I
got
10
hours
is
a
long
time
to
be
running
a
single
process
as
a
developer,
and
so,
if
we
make
it
five
minutes,
people
will
see
this
and
at
least
go
like
five
minutes.
Is
that
too
often
or
not,
I
wait
to
make
sure
it
is
still
over
rideable
four
different
scenarios.
H
I mean, my previous PR touched the documentation system, and that is tricky, because sometimes you don't have all the tooling locally to generate the document, to be sure that what you put there is exactly what is going to appear in the final document. So I don't know how to test documentation; I mean, I don't even know who is familiar with it.
D
So, a lot of the GitBook work was copied from the kubebuilder work, and kubebuilder uses something called Firebase to do their releases. Firebase requires somebody to create an account, it costs some nominal amount of money, and it also requires someone to do manual releases of the GitBook each time you build it. Now, gh-pages also requires manual building, but the difference is that it's free and we can PR the changes.
D
What
I
found
is
I'm,
finding
it
difficult
to
update
the
gh-pages
branch
while
maintaining
history,
and
so
this
really
needs
to
be
resolved
and
we've
got
at
least
three
different
get
book
changes
that
have
been
made
that
I
have
not
been
able
to
updates
they're,
not
lied
yet
and
I'd
like
to
talk
to
Robbie
throughout
the
week
or
maybe
during
the
next
meeting,
and
consider
switching
to
firebase
I.
Think
that
will
resolve
the
release
process.
A
So
one
thing
to
keep
in
mind:
I
heard
rumblings
of
potential
for
upstream
supported
project
documentation,
sites,
I,
don't
know
the
actual
state
of
that.
So
if
anybody
else
is
more
aware
of
those
conversations,
please
provide
more
context,
but
I
think
if
that
does
come
to
fruition,
we'd
want
to
jump
on
that
versus
having
kind
of
a
peaceful
kind
of
Doc's
process.
I
can.
J
Tell
you
what
so
bend
the
other
did
in
the
kind
project,
so
he
created
a
metal
if
I
account
nitrifiers
a
portal
for
dogs
and
the
he
created
an
account
and
basically
hosted
the
documentation
in
a
folder
under
master
the
master
branch.
So
you,
basically,
when
you
push
you
push
the
metal
if
I
and
he
has
a
subdomain,
for
instance,
the
subdomain
could
be
question.
Api
talk,
neatly
Phi
dot,
something
else,
and
we
can
I
mean
we
can
essentially
acquire
as
a
subdomain,
very
specific
to
question
API
or
like
a
better,
better
domain
name.
J
The
problem
is
that
to
man,
the
city,
the
problem
is
the
push
and
also
this
request
to
move
away
from
get
booked
to
something
like
Hugo.
That's
a
that's
a
rather
big
change
at
this
point,
but
so
I,
like
David
I,
didn't
understand
like
once
a
problem
with
the
keeping
history
in
the
branch.
Can
you
explain
this.
D
So
the
way
that
they
get
the
gh-pages
branch
was
created
was
by
pushing
a
prefix
directory
into
that
branch
to
populate
it
and
the
difficulty
I'm.
Having
is
that
I
can
force
push
additional
changes
there,
but
then
someone
with
administrative
access
has
to
then
force
push
those
changes
up
to
gh-pages
the
upstream
branch.
What
we'd
really
like
to
do
is
to
merge
those
changes
into
the
gh
branch
so
that
PR
can
be
put
up.
I
put
up
a
PR
which
attempts
to
do
that
earlier
this
morning,
but.
D
Oh
so
so
I
went
ahead
and
was
able
to
merge
the
changes
into
the
gh-pages
branch
and
push
it
up,
and
then
the
next
thing
you
notice
that
all
PR
checks
failed,
because
the
gh-pages
branch
is
really
just
documentation.
It
doesn't
include
any
code,
so
the
whole
thing
just
kind
of
ends
up
being
hokey
I.
D
GitHub provides three ways that you can serve GitHub Pages. One is by using a gh-pages branch. The other is by using the well-known docs directory, which we currently have additional documentation in, so we would have to take the non-GitBook documentation out of that directory and put it somewhere else. And then there's a third way, which escapes me at the moment, but...
J
I've been using Firebase in all related projects for docs, and you definitely have to pay for Firebase; it brings in the politics of who is going to pay for the accounts and so on. So that's the only blocker on my side, I guess. I take it that for an open-source project you can definitely avoid the usage of Firebase, but if we think that this is the best and most cost-efficient solution, definitely go for it.
E
Quick
question
there:
the
do
we
think
that
would
come
under
the
CN
CF
funds
because
I
it
feels
like
something
if
it
if
it
comes
out
of
that
bucket.
It
feels
like
that
would
not
be
hard
to
get
approval
for
that
I'm
happy
to
take
it
up
a
bit
with
the
infra
group.
That's
looking
at
moving
stuff
to
the
cnc
of
infrastructure
and
see
if
it
would
fall
in
that
bucket.
E
Doesn't
feel
crazy
and
might
not
be
too
politically.
We
have
a
big
bucket,
a
big
chunk
of
funds
on
GCP
and
we
have
to
make
sure
that
we
account
for
them,
but
it
this
is
the
sort
of
thing
that
this
group
is.
This
working
group
is
supposed
to
do.
It's
called
WG,
Kate,
infra
or
whe
infra
and
I'm
happy
to
relay
it
to
them,
and
we
can
see
where
we
just
get
firebase
counts.
H
I was looking at the docs folder already. Besides the book, there are only two other directories. One is samples, which is referenced elsewhere for consumption, and the other one, the only one that may have external references, is proposals, which currently only has one proposal. So I think that moving the content which isn't related to the book out of it shouldn't be a huge problem.
E
The next meeting is on the 21st, which is just over a week away. I could open an issue, and then I'll tag you, Lubomir, and anyone else that wants to be on it, and we can explain the needs better there. That might actually be faster, because ideally we'd then be able to create some accounts and figure out permissions. It would likely take on the order of, I guess, a month, but less than a month, I'd say.
A
Alright, do we want to take the rest of this offline, async? I guess Lubomir and David can coordinate on potential changes, and then make sure to sync back with Justin prior to the next meeting, whenever we find out it is, to potentially request the Firebase account or not. Alright, great. Any other topics?
A
We
can
go
ahead
and
do
that.
I
can't
actually
share
my
screen
to
do
that.
If
somebody
else
is
interested
in
doing
that,
my
zoom
is
on
a
different
computer
than
the
rest
of
my
work.
Right
now.
A
No, we were trending in the right direction as far as issues go. I haven't seen any issues that haven't been triaged yet, so I think we're on a good path right now. If you're looking for any issues and they're already assigned to somebody, feel free to reach out to them; if it's not marked as lifecycle active, feel free to contribute.