From YouTube: Kubernetes SIG Apps 20190408
A
So, no co-host today. I'm primarily going to discuss some workload stuff, the KEPs that are in play and that we're looking at presently, and then discuss some things about mentoring and what we might want to do with our OWNERS files with respect to reviewers and approvers. I'm going to start with the latter topic, actually. One of the things we looked at after talking to Paris was what kind of housekeeping we need to do, aside from cleaning up the SIG Apps main page in kubernetes/community.
A
We could also probably do better with how we deal with our approvers and reviewers. For the base OWNERS file, we have no owners alias for SIG Apps approvers; we have kind of a makeshift collection of approvers, and the reviewers list is also kind of makeshift. I don't think the sig-apps-reviewers owners alias is actually used in the workloads API files, which probably isn't very fair to reviewers of the various components of the API, and it certainly doesn't make things easier.
A
Cleaning up those OWNERS files, I think, is something we can do immediately to improve that. My thought is to keep the people who are obviously concerned with a particular portion of the workloads API, and who have been contributing to it, as at least reviewers, and in some cases approvers, across that surface. For example, Mike Denis has contributed to DaemonSet, and Klaus is also very interested in DaemonSet because of the scheduling logic and the implications of the move to the default scheduler. He may not be as involved at this point, because that move is complete and we've already worked our way through most of the fallout from it, but given the nature of that change, and the fact that DaemonSet has to deal with certain scheduling predicates, it might be good to keep him there as a reviewer or an approver, even though I don't think he's going to be contributing as actively now.
A
Others in the contribution history have enough contributions to warrant at least reviewership, and from there, if people are interested in becoming reviewers and approvers and so forth, we can probably set up something a little more active: you ask to be assigned reviews, and one of the reviewers will go through and find things for you to look at. There are plenty of open items we could triage, and spreading that work out a little more would be a good way to help people who are interested grow.
B
I'm okay being a global one, or if you are; I don't see any problems with that. Okay, and approvership, sure. Also, one question: when you put together the PR moving and reshuffling owners, approvers, and reviewers, can you send it out to the SIG Apps mailing list? The reason is that it will give the change wider distribution, and it might be easier for everyone to be aware that this is a change we are making and to help.
A
We're going to need somebody with root approvals to do it anyway, because I want to add the new owners aliases to the root OWNERS file. But talking to the folks from SIG Architecture and the steering committee, they seem fine with the idea; they're actually encouraging other SIGs to do something similar, yeah.
A
I think one thing we can do is put these as owners aliases and primarily use the aliases to control who's reviewing and approving the particular files in the workloads API. At this point, with v1, I think the right thing to do is to grow more global reviewership, as opposed to people who are point specialists on particular areas of the surface.
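As a concrete sketch of that idea (the alias names and usernames here are hypothetical, not the actual kubernetes/kubernetes entries), a single OWNERS_ALIASES file at the repository root would define the groups, and the per-directory OWNERS files in the workloads API would just reference them:

```yaml
# OWNERS_ALIASES at the repo root (hypothetical names)
aliases:
  sig-apps-approvers:
    - alice
    - bob
  sig-apps-reviewers:
    - carol
    - dave
```

```yaml
# pkg/controller/daemon/OWNERS (sketch): reference the aliases instead of
# maintaining a makeshift per-file list
approvers:
  - sig-apps-approvers
reviewers:
  - sig-apps-reviewers
labels:
  - sig/apps
```

Adding or removing someone then becomes a one-line change in a single file, which is the visibility point made next.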
A
If we manage it from a global root OWNERS file with approvers and reviewers, it's easier to add people, and it's easier for it to be clearly visible who the reviewers are and who the approvers are; I think that helps visibility. Actually doing a review of who's in each file right now, it's really not out of whack; it's okay, it's fine. It's just that there are a few people who have either moved on or aren't contributing actively anymore.
A
Batch, with CronJob and Job, yeah, and historically StatefulSet was kind of its own thing. But looking at it now, looking at the PRs and looking at the reviews, it seems like people have broadened out already; that's just taking its natural progression, and building that expertise across the various controllers is something I think should be a goal.
B
Especially since, well, maybe this is a slight exaggeration, but if you know one controller, you should feel pretty confident jumping between controllers. Of course, each and every one of them will have some bits that are specific to that particular controller, but I'm hoping those places are usually either covered by tests or have a decent comment explaining: yes, we did it this way because of this and that.
A
That's my great hope. Our test coverage as a SIG is actually pretty good, and our e2e coverage is actually pretty good; our flakiness is much lower than it was a couple of years ago, so we're okay there. I agree with what you're saying. The part about understanding what a controller does and how it's implemented used to be a lot more important than understanding the base infrastructure, but I feel like that has changed with the introduction of the shared informer, as the shared infrastructure has evolved.
A
As that has evolved a lot, I think the understanding of how a controller works, and all the intricacies around what you can do with the shared informer cache, what happens when you mutate objects and why you should never, ever do that, and why pointers are used instead of immutable copies (not even references, just copied structs), is the understanding that should be global. If you understand that much, then, like you're saying, you should be comfortable doing basic reviews, and hopefully the idea would be that you reach out to people who are more expert if necessary, if you're not confident with the review. Yeah.
B
Also, the wide adoption of CRDs, which in the majority of cases ends up with people writing their own controllers for their own specific resources, gives people a broader knowledge of how a controller works. So the basic knowledge is there; you just need...
A
And then the other thing we might get out of this is hopefully finding more commonalities, places where we might be able to refactor so we share a little more code across the controllers. That would be helpful too. There might be some code cleanup available to be done in the not-too-distant future. That's all I really had to say on it. Does anybody else have anything to add before I talk about more?
D
Yeah, I like that we'll clean up the aliases and such, because we also have some shared tools, like, I think, the controller manager; that's a separate tool which is shared across the controllers, so that ownership, which is different, could use a bit of a cleanup too. I think the end-to-end tests have different owners as well, right? Yeah.
A
But it seems to me like everybody is a global e2e approver across the entire suite. We need to... I'll try to take a look at that too and figure out what we can do there. Now that we've broken out the SIG Apps labeled tests that we're responsible for, I think we can make that ownership a little bit better as well, but the ownership mechanism there is a little bit different.
A
Okay, cool. The other thing I want to talk about is current work in flight. The way we've been trying to grow contributorship, for people who are interested in committing, is we kind of start with, okay...
A
The current process is a KEP, which we review and give feedback on, and then when you open PRs we'll review them, approve them, and so forth. That's a great way to get started contributing for people who are just interested in contributing to the project because they're really interested in Kubernetes, and the workloads API in particular, but don't have something they need to merge for their own use. The other way, in terms of reviewing PRs and helping triage...
A
That's another way to grow contributorship, but the primary way we've been doing it is through the KEP process. This is where I kind of go over where various things stand and see if anybody else wants to reach out and add comments. The StatefulSet maxUnavailable KEP is merged currently, but I think we still need to add some... you know, let me go ahead and bring some of these up on my screen.
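For context, the KEP being referenced proposes a maxUnavailable field on the StatefulSet rolling update strategy. A rough sketch of what a spec using it might look like; the rollingUpdate.maxUnavailable field is the KEP's proposal, not part of the API at the time of this meeting, and all names are made up:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web            # hypothetical
spec:
  replicas: 6
  serviceName: web
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.15
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
      maxUnavailable: 2   # proposed: take down up to 2 pods at once on rollout
```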
A
I mean, that should be a desirable goal for new things that are architected. I think the reason people are okay with implementing this to support service mesh technologies is that Kubernetes took an approach that's kind of different from some of the other container orchestration systems, in terms of what we wanted it to be.
A
We wanted legacy applications that were containerized to work well on top of Kubernetes, and because of that there are a lot of applications that can't actually tolerate the sidecar not being present. They need that lifecycle orchestration to work for them in order to be able to run on Kubernetes without being completely re-architected. So the feedback from the folks on the Istio and Envoy side was: we should do this, we need to do this. And yeah, there is additional complexity, but there was also some feedback from Tim.
A
This is fundamentally becoming more and more popular to do, and if you look at native integration with Istio, if you want to use those technologies together, it becomes kind of important to be able to do the sidecar injection well. It's not, you know...
A
Weighing the complexity versus the benefit, people are generally of the view that it's beneficial.
A
And then we can inject the sidecars via API machinery. To say that we expect people to architect the application so that it works well with a sidecar, even though the person who owns the application may not own the infrastructure doing the sidecar injection, whether for storage or network or whatever they're doing, was kind of met with: you can't leave us like that. And to me, that's fair.
A
If you mutate at the workload controller spec level you're going to have a problem, but that's not where the injection is done; the mutation is done on the pod spec, which actually works well and won't break them. We already had this with an admission webhook that broke Deployment, because it was mutating the ReplicaSet, and that breaks things because Deployment does a comparison between the template specs to determine if something has been mutated.
A
DaemonSet and StatefulSet actually just label the pods they create: they label the pod and say, we think it's at this revision, and that takes into account that the pod created from the template may be arbitrarily mutated by admission control before it becomes an actual pod. So we don't do a template-to-pod or pod-spec comparison against the pod itself. I think we're okay with DaemonSet and StatefulSet. And ReplicationController,
A
Sorry
replica
set
actually
should
work
with
admission
control
for
well
for
mutating
web
hooks.
If
you
only
modify
the
pod
spec,
so
that
is
safe
to
do.
But
like
modifying
the
template
itself
is
probably
problematic:
the
API
machinery
guys
Daniel
and
Jordan
based
or
like
yeah.
Just
don't
do
that
so
so.
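A sketch of the distinction being drawn here: a sidecar-injecting webhook registered against pod creation mutates the pods that the controllers create, rather than the controllers' templates. All names and the namespace below are hypothetical:

```yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: sidecar-injector              # hypothetical
webhooks:
  - name: inject.sidecar.example.com  # hypothetical
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]           # mutate pods as they are created,
                                      # not deployments/replicasets/templates
    clientConfig:
      service:
        namespace: injection
        name: sidecar-injector
        path: /mutate
      # caBundle: <CA for the webhook's serving certificate>
    failurePolicy: Fail
```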
A
Right, yeah, let me take a look at that, because there was another issue open for StatefulSet where the same thing was occurring, but the user was basically implementing a mutating webhook that was ill-formed. So that wasn't the fault of the controller or the fault of the API machinery; it was kind of a paradigm cognitive-dissonance thing.
A
You really have to know what you're doing to make it work, and that's on top of setting up the PKI infrastructure necessary to communicate safely with the API server to execute your webhook; it's a lot. But as CRDs go toward GA, the default way of doing defaulting and validation is going to be, well, defaulting will be a mutating webhook and validation will be a validating webhook. So they're going to get more popular; we should probably be prepared for them.
A
Yeah, when we added controller history we got rid of that logic. Now we're basically just making sure the pod is labeled with the correct controller revision, as opposed to doing a direct comparison between the template of the DaemonSet and the spec of the pod. So if that crept back in, I would be surprised, but I wouldn't be shocked.
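For reference, the revision labeling being described looks roughly like this on a pod created by a DaemonSet or StatefulSet (the names and hash are made up). The controller checks the label against the current ControllerRevision rather than diffing the pod against the template, so admission-time mutations of the pod don't register as drift:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fluentd-7xk2p                            # hypothetical DaemonSet pod
  labels:
    app: fluentd
    controller-revision-hash: fluentd-6d5f7c9b8  # names a ControllerRevision
```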
A
I'm not doubting it. The other thing is we have this KEP from the folks at Pinterest, which is interesting, but I'm not sure it's something we necessarily want to do. I think I brought it up last week; it would be great if we had some other people take a look at it. They're basically proposing to start adding, I guess the best way to say it is, lower-disruption updates.
A
The
only
thing
we
can
do
right
now,
you
know
in
all
honesty
is:
do
it
for
an
image
right.
That's
the
only
thing
that's
mutable
on
a
pod
spec,
so
that
would
be
the
only
thing
we
can
touch.
Then
there's
the
interaction
like
we
just
discussed
a
minute
ago
with
mutating
webhooks,
which
may
in
fact
modify
the
container
prior
to
actually
launching
it
in
such
a
way
that
you
can't
actually
tell
if
you're
doing
the
right
thing.
If
you
start
doing
image,
inspection
or
the
fields
of
the
pods
back,
that
could
cause
controller
tight.
A
Looping
then
there's
the
fact
that,
like
because
we
can't
resize
the
container
I,
don't
know
how
valuable
it
actually
is
like.
If
all
you
want
to
do
is
an
image.
The
way
to
do
that
might
be
doing
something
in
the
open
source
like
base
of
kubernetes.
That
did
image
streams,
which
is
kind
of
a
feature
and
open
shift
right
now
and
thinking
about
it
in
terms
of
rolling
out
image.
Streams
might
be
a
better
way
to
look
at
this
problem,
but
they
want
to
do
it.
For
stateful,
set
and
demon
said.
The
like
image.
D
There are actually two types. Well, ImageStream is the object that actually holds the image, and then you have two options. You can have an admission plugin that will actually change the image in the pod that's being created, as an admission step, and the other one, which we settled on, is to link it to your Deployment, and it will inject the image into the Deployment, and nothing wrong will happen since...
B
In short, it depends whether we're talking about resources that are available in OpenShift natively; for those we do support working directly with image streams. For the resources that come from the kube API, we always work through admission, which replaces the image based on whether you want it or not, and you need to annotate or label the image stream with the information about whether it can be used in a workload. That's more or less an explanation of how it works.
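If I'm reading that right, the workload-side linkage is an annotation-driven trigger, roughly like the following; treat the exact annotation format as a sketch from memory rather than a spec, and the names as hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                                  # hypothetical
  annotations:
    # OpenShift resolves the ImageStreamTag and writes the resulting image
    # into the matching container field whenever the tag moves
    image.openshift.io/triggers: |-
      [{"from": {"kind": "ImageStreamTag", "name": "web:latest"},
        "fieldPath": "spec.template.spec.containers[?(@.name==\"web\")].image"}]
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: " "   # left blank; filled in by the trigger controller
```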
A
The volume claims, I'm sorry, the volume expansion KEP, and there's a corresponding KEP for in-place resize, which is basically, if your storage supports online file system resizing, is still not merged, but that's not on us; I think this was approved by SIG Apps and it's just waiting on SIG Storage approval. I reached out to Saad and Michelle.
A
Custom controllers that are workload-specific, for databases, a lot of them are still using StatefulSet under the hood as a primitive; they just write custom orchestration on top of it. So being able to resize disks is a super important feature for those people, I feel, and I'd like to get this in sooner or later. I'll see if the author wants to take a go at the implementation, because what they've proposed looks mostly sane so far.
A
So the current implementation only works with the built-in workload types, basically. If we like the API and we're okay GA'ing it, I mean, I could be convinced that it would be okay to GA the API as is and then add support for the scale subresource later, maybe. But I think what we'd like it to do is act more like the HPA controller and be able to act on anything that has a scale subresource. The scale subresource has been available on CRDs, I think, since 1.13.
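A sketch of what exposing the scale subresource on a CRD looks like (the group and kind here are hypothetical); this is the hook that would let a generic controller, like the disruption controller being discussed, treat custom workloads the same as built-ins:

```yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com        # hypothetical CRD
spec:
  group: example.com
  version: v1
  scope: Namespaced
  names:
    kind: Widget
    plural: widgets
  subresources:
    scale:
      specReplicasPath: .spec.replicas
      statusReplicasPath: .status.replicas
      labelSelectorPath: .status.labelSelector  # lets PDBs count matching pods
```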
A
So that's not a crippling dependency on the API machinery before we say the whole thing is generally available. Part of it is saying the API is stable, but the other thing with GA is that we're usually saying we think the implementation is stable enough that it won't break you and that we're not going to have to change it in backward-incompatible ways. If we supported the scale subresource and then decided later that, okay, we can't support it, or it's broken, or, yeah, this is horrible...
A
There'd
be
no
way
to
back
that
out,
like
once,
we
decide
we're
going
to
support
the
scale
sub
resource
from
an
implementation
perspective,
removing
that
would
be
backward
and
compatible
for
sure
all
of
a
sudden
you'd
turn
up
a
cluster,
and
previously
you
had
PD
B's
associated
with
the
CR
D,
and
they
just
worked
now.
This
new
version
doesn't
work
or
behaves
differently
with
respect
to
CRTs,
and
you
know
the
availability
guarantees
that
you're
trying
to
provide
via
the
disruption
budget.
A
Just
aren't
there
anymore,
so
I
think
we
probably
want
to
at
least
see
the
scale
of
resource
support
it.
The
other
feedback
I
was
wondering,
is
any
one,
so
we
added
maxint
available
instead
of
been
available
because
it
allowed
you
to
use
a
PI
DB
and
not
have
to
mutate
it
frequently.
It
was
just
a
cleaner
expression
of
disruptions
in
terms
of
the
way
people
generally
think
of
them.
The
other
company
I
was
wondering
is
do
we
want
to
keep
men
available,
because
when
an
ER
did
the
initial.
A
Edition
of
the
maxint
available,
he
didn't
add
it
to.
He
wanted
to
just
replace
it
altogether,
but
we
couldn't
do
that
in
a
backward
compatible
way.
So
GA
is
a
chance
to
break
backward
compatibility
I'm
not
to
having
both
I,
don't
see.
It
is
hurting
anybody
so
I'm
not
to
like
gung-ho
that
we
need
to
do
it
one
way,
the
other,
but
if
anyone
else's
feedback
it
would
be
good
time
to
give
it
now.
I.
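For reference, the two ways of expressing the same budget; the maxUnavailable form stays valid as the workload scales, which is the "not having to mutate it frequently" point (the names here are hypothetical):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                    # hypothetical
spec:
  maxUnavailable: 1                # tolerate one voluntary disruption at a time
  # minAvailable: 5                # the older form: must be updated whenever
  #                                # you scale the workload and want the same slack
  selector:
    matchLabels:
      app: web
```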
A
I'm saying I don't either, and I do remember doing the v1beta2 for the workloads APIs, but the primary motivation there was to just put something out into production, let people use it, and make sure we were confident we had the right thing to promote to GA. I don't think the deprecation and support policy was why we did that. But I'm also not...
A
He's just going to follow a regular KEP, but the graduation criteria will be in there; he's going to describe what is going to be done, but not necessarily the details. I think PRs are probably a better way to talk about the actual code, unless you want to see something separate, in which case, you know...
B
So it should be fine. Yeah, I do want to make sure... there was some discussion that we need to graduate CronJob as soon as possible because we graduated everything else, and I'm like, yeah, but everything else is different from CronJob, and we've been dealing with lots and lots of stuff with cron jobs, whether migrating or promoting from one version to the other, or whatever else we've been handling for the CronJob consolidation. Yeah, and in the worst case, well.
A
Around
open
kind
of
CR
DS
for
for
batch
API
is
like
yeah
could
be
flow
as
a
tensorflow
job
for
machine
learning.
I
could
definitely
see
if
you
wanted
to
do
something
a
little
bit
more
high
throughput
how
the
job
API
might
not
be
the
best
thing
for
you,
I
think
cron
job
is
one
of
the
ones
where
it's
like
I,
don't
I,
don't
know
if
there's
anything
to
improve
on.
To
be
honest,
I
mean
most
of
the
people
who
want,
like
I,
want
to
run
this
job.
A
Yeah, it starts reaping the pods before the job is done, and you get jobs that run forever, which is a real thing. You could maybe use finalizers to prevent the GC from kicking in. I was looking at this particular issue not too long ago with a bunch of other people from API Machinery and Node, just trying to figure out why the pod GC settings are what they are today.
A
Why
did
we
add
the
controller
and
do
we
still
need
it
at
this
point,
because
it
I
thought
it
was
probably
a
storage
pressure
issue
on
node,
but
it
actually
isn't.
It
was
a
CPU
pressure
on
a
component
that
wasn't
using
pagination,
which
doesn't
actually
do
anymore
so
like
there.
There
are
things
we
can
do
here
to
make
that
particular
problem
better,
but
is
this
so
job
is
v1
as
it
is?
We
had
a
B
to
alpha.
B
I'm not sure we need a new API to solve this particular problem. I was rather envisioning that we could improve the controller to be more aware of the pods, so that they could be removed, maybe something along the lines of, I don't know, calculating a sha or whatever.
A
But you can also add an intermediate object that basically acts as a counter whose spec can be updated, so that the job can determine: this is how many pods I've actually launched. You can have an intermediate, kind of implementation-detail object that you use as a counter to track the status of the current job as it's running, so you don't care about the pods getting garbage collected. Someone has actually implemented that; it was an interesting way to do it.
B
Yeah, you're not changing the backwards compatibility in any way. You're expanding the current information about the status object, and the current status will still be updated as is; the fact that you're using additional helper objects to hold intermediate data, I don't see that as breaking backwards compatibility in any way.
B
Exactly, okay, yeah, it is a problem. We've known about the problem since literally day one, and that was a conscious decision when we made it back then: yes, we will have to fix it at some point in time. The fact that we never fixed it is, well, not a feature; it's a bug.
A
The other way we could do it is, in theory, to just add the finalizer, assuming we've cleaned up the implementation of the other components, because the thing that was breaking was listers in other system components. If we added finalizers to make sure the pods weren't garbage collected, and we can actually stand to have that many pods lying around to be listed, that fixes the problem too, yeah.
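A minimal sketch of that idea, with a job-tracking finalizer name that is purely hypothetical here: the job's pods carry a finalizer that the job controller would remove only after accounting for the pod, so the pod GC can't delete them out from under the controller:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker-7              # hypothetical pod created by a Job
  labels:
    job-name: big-batch
  finalizers:
    # hypothetical finalizer: the job controller removes it once it has
    # recorded the pod's terminal state, unblocking deletion by pod GC
    - example.k8s.io/job-tracking
spec:
  restartPolicy: Never
  containers:
    - name: worker
      image: busybox
      command: ["sh", "-c", "do-work"]
```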
A
There are definitely implications, though. So you're saying, basically: I have a node with all these pods, and my eviction doesn't work if I'm doing something like a rolling node upgrade, because I have all these pods stuck around in the terminating state, because I have this batch job that's continuously running but not finished. Yeah.
B
Well, it won't be a problem in the average case. From what I've seen, the average case is a couple of pods, that's all. Even if the case is, yes, my job is running for, I don't know, 50 or 60 hours, that's still not a problem if it's creating ten pods; that's not a problem at all. Even if it were fifty, that's okay. But when you're talking about a hundred plus, that kind of sucks, and a hundred plus with a duration of a week or so, that's becoming problematic. So, yeah.
A
Actually, I'm talking about the average case for high-throughput batch, so I'm thinking more along the lines of thousands of pods. And for people who are doing the things you're talking about, they tend not to be troubled, at least from what I've seen. Yeah, that's true, it's like...
A
I have a node pool just for my batch, then a separate node pool for my serving and stateful workloads, with different machine shapes for each of them. If I'm looking to get better utilization, it's pretty easy on most clouds now for me to choose arbitrary machine shapes; I can make my own SKUs that fit whatever I want. I'm more concerned for people on-prem who, you know, buy a SKU at a fixed size and are
A
Trying
to
run
this
on
bare
metal
and
what
I
really
need
to
do
to
get
better
utilization
is
fill
holes
in
the
shape
of
the
SKU,
with
patchwork
right
so
like
for
them.
The
cost
savings
and
the
cost
optimization
is
predicated
on
the
ability
to
run
batch
workloads
well,
along
with
serving
workloads
and
stateful
workloads,
and
that's
a
use
case.
I
haven't
really
seen
people
exploring
very
in
a
great
deal.
A
I don't think I heard about that one. Okay, it's a Red Hat thing, but it's relatively new; they just announced it recently. It seems like that's going to be a thing, so we should make sure that the use case of colocating batch and stateful or stateless workloads is doable, I would think.
C
So one issue that we recently saw with terminated pods was not related to jobs. Basically, we saw that the default threshold for when that GC kicks in is 12,500, and due to some recent changes, when kubelets restart they reset the extended resource sizes, and that leads to pods getting evicted and left in the terminated state. So we end up with a lot of these pods in the terminated state, and the GC never happens until it reaches that threshold.
A
The
actual
problem
that
was
control
playing
him
that
broke
the
control
plane
was
Lister's
and
there
the
number
of
times
that
were
set
up
the
GC
had
to
be
put
in,
because
when
I
forget
what
component
it
was
when
the
pod
listing
was
happening,
it
was
causing
cpu
burn,
like
basically
the
duty
cycle.
The
CPU
got,
eaten,
live
and
was
control,
causing
control,
plane
instability,
so
they
set
the
GC
threshold
and
added
the
garbage
collector
to
prevent
Lister's
across
the
board
from
like
just
breaking
everything,
but
with
pagination
the
the
problem.