From YouTube: 20201119 SIG Arch Community Meeting
A: Hello everybody, this is the Kubernetes SIG Architecture community meeting for, whatever the day it is today, November 19th, 2020, and we've just got attendance problems, so we'll see if we're able to handle these items. Swati? No, we did not hear you, okay. So the first item is alternative platform discussions. I don't see dims here.
A: Yeah, I'm not sure what we wanted to talk about on the agenda, because to me, we made some comments and I think it needs some changes before it merges, as I mentioned there, and Derek, it looks like you're on the same page. As far as this step four, was this really something that we need to put in here?
C: Yeah. So my concern on the PR was that step one seemed burdensome and didn't have a clear tie to the release process, which was the Bazel requirement.
Step four seemed outside the scope of the needed success criteria, which was basically what I thought you captured as well. What was not clear to me, and I was hoping we could get an understanding on this call, is which of the existing architectures met the proposed rules of the game. So I tried to reach out to some individuals. I saw Cindy commented around s390, but I wasn't sure for ARM or POWER if we could get a clear ack on either what the missing steps were, or that type of thing.
A
Okay,
well,
it
doesn't
seem
like
we
have
the
right
folks
on
the
call
either.
So
it
looks
like
recent
comments,
we'll
see
where
that
goes.
I
agree.
I'm
not
sure
that
it's
not
clear
to
me
why
basil
would
be
strictly
necessary,
but
not
knowing
the
build
and
release
process
as
well.
Maybe
I
should
you
know
that
yeah
well,
as
was
a
whole
other
topic,
I
don't
want
to
necessarily
okay,
then
looks
like
you
made
your
comments
and
we'll
just
we'll
just
go.
A: There was... we want to create this API as a CRD for experimentation purposes, but there was a discussion, at least, that that was going to require something in k/k to rely on that CRD for the scheduler plug-in. It sounds like there's an option to do that scheduler plug-in as an external one, or to do an external scheduler altogether. Where do we stand on this? And Derek, I know you had some questions around the status of the KEP as well.
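(For readers following along, here is a rough sketch of the kind of per-node topology CRD being discussed, written as Go API types. The type and field names are illustrative assumptions for this discussion, not the API that eventually merged.)

```go
// Illustrative sketch only: what a NodeResourceTopology-style CRD type
// could look like as Go API types. Names and fields are assumptions for
// illustration, not the merged upstream API.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// NodeResourceTopology is published per node by an agent (for example,
// an exporter reading the kubelet) and consumed by a scheduler plug-in.
type NodeResourceTopology struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// TopologyPolicies mirrors the Topology Manager policy in effect
	// on the node, e.g. "single-numa-node".
	TopologyPolicies []string `json:"topologyPolicies"`

	// Zones lists the NUMA zones and what is still allocatable in each.
	Zones []Zone `json:"zones"`
}

// Zone describes one NUMA node (or other topology zone) on the host.
type Zone struct {
	Name      string         `json:"name"` // e.g. "numa-0"
	Type      string         `json:"type"` // e.g. "Node"
	Resources []ResourceInfo `json:"resources,omitempty"`
}

// ResourceInfo reports per-zone capacity and allocatable for a resource.
type ResourceInfo struct {
	Name        string `json:"name"`        // e.g. "cpu", "vendor.com/gpu"
	Capacity    string `json:"capacity"`    // e.g. "16"
	Allocatable string `json:"allocatable"` // e.g. "14"
}
```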
C: Now that we're recording, for those who weren't here: the Pod Resources API KEP merged and was, I would say, 70% implemented in 1.20, and that was a prereq to kind of do the other items that were discussed here. I wasn't aware of an actual KEP that had merged describing the API, and so I thought requesting the repo might have been getting ahead of the process a bit.
C
But
if
we
want
to
talk
about
the
abstract
on,
should
the
scheduler
support
out
of
tree
plugins,
or
should
these
types
of
optional
things
get
handled
a
certain
way?
I
think
we
can,
but
I
just
want
to
make
sure
that,
like
I
wasn't
missing
something
or
forgetting.
D: Yes, okay. So in the KEP we had two gaps for topology-aware scheduling, and for one of the gaps we had kind of explained that we would have this NodeResourceTopology API, and initially we had mentioned that we would have this in NFD.
D
But
then,
when
we
had
conversations
with
entertainment,
they
said
that
it
would
lead
to
kind
of
circular
dependencies
because
nfd
imports,
kubernetes
and
then
the
scheduler
plug-in
would
need
to
import
nfd
and
eventually
entertain
the
circular
dependencies
so
the
process.
So
we
came
to
the
conclusion
that
the
only
viable
option
for
us
is
to
places
place
it
in
staging.
D
So
that's
where
the
process
of
creating
the
staging
repository
started.
Hearing
from
what
you're
saying
it
seems
that
you're
expecting
another
cap
to
describe
the
topology
api
itself.
C
Right,
like
I,
wouldn't
have
expected
code
to
go
into
the
core
kk
tree
or
korg
repo
for
the
api
without
having
a
clear
cap
describing
it-
and
I
think,
maybe
earlier
discussions
had
this
treated
as
a
a
kind
of
x-case
io
like
extended
api
and
had
some
optionality,
but
I
think,
given
what
the
the
natural
course
of
events
you're
describing
around
circular
dependency
and
stuff.
That
means
like
I
would
have
expected
a
at
least
it
kept
describing
the
api,
and
then
we
could
have
the
crd
or
not
crd
discussion.
E: On the one hand, it sounded like this is experimental, optional, we're not sure if it's going to, you know, proceed and graduate, and we're just trying to get an initial proof of concept and experimental stuff in; but on the other hand, it's wanting to merge APIs into the core repo. And so I understand the circular dependency thing. I actually think that is sort of a gap in the scheduler, if the only way we can experiment with...
D: I believe we still do. We still have the ability to run it as an out-of-tree scheduler plugin, but the reason we were trying to do it as an in-tree plug-in is because Topology Manager itself has introduced a few gaps that are leading to topology-unaware scheduling, and we were trying to address that. So that's the reason we wanted to propose it as an in-tree plug-in. If it makes everyone more comfortable to have it as an out-of-tree scheduler plug-in prior to having it in-tree, we're okay to do that as well. I think I mentioned that as one of the options when we were speaking about the possible ways of moving ahead with this project.
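(For context, running a plug-in out of tree generally means building a second scheduler binary that registers the plug-in with the upstream scheduler framework, with pods opting in via spec.schedulerName. A minimal sketch follows; the import path and plug-in name are placeholders, not a real project.)

```go
// Minimal sketch of out-of-tree scheduler plug-in wiring: a custom
// scheduler binary registering the plug-in with the upstream framework.
package main

import (
	"os"

	"k8s.io/kubernetes/cmd/kube-scheduler/app"

	// Placeholder import for the out-of-tree plug-in implementation.
	"example.com/scheduler-plugins/pkg/nodetopology"
)

func main() {
	// The resulting binary runs as a second scheduler in the cluster;
	// pods opt in to it with spec.schedulerName.
	cmd := app.NewSchedulerCommand(
		app.WithPlugin(nodetopology.Name, nodetopology.New),
	)
	if err := cmd.Execute(); err != nil {
		os.Exit(1)
	}
}
```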
E
Okay,
I
missed
that
that
was
an
option
I
think.
Just
in
terms
of
velocity,
you
would
certainly
be
able
to
experiment
a
lot
faster
running
it
out
of
tree
like
in
the
time
that
it's
taken
to
sort
of
work
through
the
circular
dependency
stuff,
and
now
that
120
is
frozen,
and
I
mean
just
in
terms
of
velocity
you'll
be
able
to
go
more
quickly.
E
If
you
run
something
out
of
tree
and
say,
run
this
scheduler
and
load
this
x,
kids,
crd
and
play
with
it
and
let
us
know-
and
then
we
can
rev
that
rapidly
once
that's
proven
out
like
once,
there's
it's
clear
that
it's
widely
used
enough
and
proven
out
enough
and
performs
well
enough
and
all
of
those
things
to
sort
of
justify
being
brought
in
as
an
entry
thing
like
that.
E
I
think
that's
a
reasonable
conversation
to
have,
but
I
think
it
would
benefit
you
in
terms
of
velocity
and
it
would
just
make
it
a
lot.
There'd
be
a
lot
fewer
people
involved
in
the
discussion.
If
you
just
ran
it
out
a
tree
and
proofed
it
out.
So
that's.
E
To
do
for
like
auth
and
policy
stuff
like
prove
it
out
as
a
web
hook
like
the
shape
of
the
api.
A: I think the issue, though, that I'm hearing is that we've introduced, with Topology Manager, a situation where workloads are not scheduled, or they're scheduled to...
A
Is
intriguing,
we're
saying
so
we're
not
offering
any
solution
to
a
problem
we
created,
which
I
mean
that
may
be
fine,
because
it's
temporary
right,
but
I
think
it
does
make
sense
that
if
we've
got
something
in
tree,
that's
causing
this
gap,
we
would
eventually
want
to
solve
that
entry
as
well.
E
Yeah
I
know
tim
has
been
more
involved
in
the
topology
stuff.
What
is
is
topology
manager,
alpha
beta,
where.
C: So Topology Manager is the idea of the kubelet being able to align a device and a CPU, and hopefully in the future memory, to be on a common NUMA node, aka the intra-node topology.
C
That
feature
is
beta
now
and
so
to
your
comment,
john
on
there's
a
way
to
configure
the
cubelet
where
the
cubelet
tries
to
align
things
and
the
policy
you
can
assign
to
the
keyboard
to
align
things
that
the
clusterwide
scheduler
may
not
be
aware
of
is
true.
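(Concretely, the knob being described is the kubelet's Topology Manager policy. A minimal sketch using the kubelet configuration types, shown in Go for consistency, though in practice it is set in the kubelet's config file.)

```go
// Sketch of the kubelet-side policy knob under discussion. The
// TopologyManagerPolicy field is real; valid values are "none",
// "best-effort", "restricted", and "single-numa-node".
package main

import (
	"fmt"

	kubeletconfig "k8s.io/kubelet/config/v1beta1"
)

func main() {
	cfg := kubeletconfig.KubeletConfiguration{
		// "single-numa-node" admits a pod only if its CPUs and devices
		// can all be aligned on one NUMA node; the cluster-wide
		// scheduler has no visibility into this decision.
		TopologyManagerPolicy: "single-numa-node",
	}
	fmt.Println("topology manager policy:", cfg.TopologyManagerPolicy)
}
```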
The impact of that, you know, and the ways to react to that... there are a lot of ways to make that not a problem, I guess. But for the broader statement of whether we would want the scheduler to be aware of it or not:
that's the thing that I would just expect to have settled in the KEP. Personally, I have no objection to the in-tree scheduler being aware of what the kubelet was doing internally, necessarily, but we just didn't have that clear, merged KEP to point to. Absent that, I'd agree with Jordan that we can continue to innovate, either out of tree with a custom scheduler, or by helping the scheduler component today support these types of plug-ins.
A: When you talk about a KEP, do we need a... I think Swati mentioned a new KEP. Do we need a new KEP, or do we just need to add to an existing one? Is there an existing KEP that this would fall under, that can be enhanced to include this API? And then, I guess, Tim, if you have some things to say, you're welcome to jump in. But Derek, you're more familiar with these KEPs? I don't even...
C
Yeah
yeah,
all
I'm
saying
is
I
I
had
tried
to
look
into
prior
to
coming
to
this
meeting
was:
was
there
a
kep
that
clearly
said?
This
is
the
api.
These
are
the
fields
on
it
and
this
is
where
it
should
go,
because
some
of
the
questions
we're
debating
now,
I
thought,
would
have
been
debated
in
a
cap
and
there
I
found
two,
maybe
open
caps
that
we
could
consolidate,
but
neither
of
which
had
been
merged
and
some
of
the
prerequisite
building
blocks
that
this
feature
said
that
they
would
depend
on.
D
Yeah
one
of
the
key
part
of
the
cap,
we're
capturing
in
the
cap,
is
where
we
want
this
topology
api
to
sit
like
at
this
stage.
One
of
the
caps
says
that
it's
going
to
be
in
staging,
but
it
seems
like
we
need
to
have
things
in
staging
for
the
cap
to
get
accepted,
maybe
just
kind
of
lack
of
knowledge
on
my
part
or
our
part
in
general
to
understand
kind
of
the
process.
D
We
had
discussion
in
signored
as
well
explaining
why
we
want
this
in
staging
because,
like
I
explained
that
nfd
would
lead
to
circular
dependencies,
we
also
considered
external
repo,
which
would
have
similar
similar
issues,
and
then
that
left
us
just
the
only
option
of
having
it
staging
when
we
were
just
assuming
that
the
scheduler
plug-in
would
be
entry.
This
the
out
of
tree
discussions,
kind
of
started
after
that.
So.
E
The
the
idea
we
have
a
thing
in
tree
to
I
don't
know
fix
a
fix.
A
problem
or
sort
of
complete
a
solution
for
around
topology
would
be
stronger
if
the
plan
was
for
that
to
be
enabled
by
default,
but
I
thought
that
this
some
of
the
comments
were
saying
this
would
be
an
optional
thing.
Yes,
we
can
hear
you
tim,
I'm
just
talking
over
you.
I'm
sorry.
H
Okay,
I
I
want
to
I
want
to
express
support
for
the
idea
here
now.
I'm
going
to
assume
that,
like
the
research
project,
part
of
this
is
either
been
done
or
is
being
done
to
prove
that
such
a
thing
is
useful
and
and
and
efficacious,
and
given
that
I
feel
like
understanding
topology
at
this
level
is
a
reasonable
thing
for
kubernetes
to
internalize
and
for
the
default
scheduler
to
understand
and
for
on
node
agents,
whether
that's
cubelet
or
something
else,
to
help
populate
information
about
so
in
general.
H
I
I'm
supportive
of
that.
I
don't
know
if
all
that
homework
has
been
done,
and
I
don't
know
if
the
exploratory
part
of
like
are
there
better
options
with
respect
to
where
the
api
should
live?
That's
when
I,
when
I
opened
a
the
discussion
about
topology
api,
I
was
looking
at
it
purely
from
an
api
point
of
view
and
as
somebody
who's
got
some
background
with
the
hardware
side
of
pneuma,
not
from
a
like,
did
you
do
the
research
point
of
view?
Did
you
do
the
research.
D
Well,
I
would
say
from
the
proof
of
concept
point
of
view
like
in
terms
of
how
the
topology
should
be
exposed
and
having
the
end-to-end
solution
work
like
exposing
the
crds
and
then
the
scheduler
plug-in
using
those
crds
to
make
the
scheduling
decision
we
have.
We
have
it
working
and
like.
Is
there
anything
else
that
you're
trying
to
get
to
in
this
question.
C
Maybe
just
make
sure
that
tim,
obviously
the
capability
in
the
cubelet
to
align
devices
with
their
resources
and
their
memory
with
their
cpus
that
that
is
working
well
and
people
are
successful
with
that
today
and
it
had
no
impact
to
the
end
user
api
right.
It
was
a
policy
block
right.
C
The
research
here
that
had
been
presented
was
basically
still
having
no
end
user
impact
to
the
pod
api,
but
having
an
api
to
advertise
back
to
the
scheduler.
What
the
nodes,
concrete,
no
local
scheduling
decisions
were
and
the
feedback
loop
on
where
and
how
to
make
that
known
is
kind
of
the
tension
point
here
and
but
from
like
the
earliest
discussions
that
recall
you
and
I
had
on
this-
like
there
is
no
impact-
the
end
user
pod
api.
C
This
was
all
internal
finding
the
right
communication
vehicle
between
how
the
cubelet
says,
what
its
concrete
resource
assignments
were
and
making
the
scheduler
then
aware
of
that,
and
the
prerequisite
where
that
was
this
pod
resource
api,
that
advertised
from
the
cubelet
from
a
grpc
endpoint.
That
says
these
are
the
pods
running.
These
are
the
cpus
that's
assigned,
and
this
is
the
devices
it's
consuming,
that
got
done
as
a
part
of
the
gpu
enablement
work,
and
that
was
the
building
block
that
was
being
enriched
in
120
injury.
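(For reference, the Pod Resources endpoint being described is served by the kubelet over a local unix socket. A minimal sketch of a client reading the concrete CPU and device assignments, with error handling trimmed for brevity:)

```go
// Sketch of a client for the kubelet Pod Resources gRPC endpoint, which
// reports the kubelet's concrete node-local CPU and device assignments.
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// Default kubelet pod-resources socket path.
	conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}
	for _, pod := range resp.PodResources {
		for _, c := range pod.Containers {
			// CpuIds and Devices are the kubelet's concrete assignments;
			// this is what a topology exporter would read.
			fmt.Printf("%s/%s/%s cpus=%v\n",
				pod.Namespace, pod.Name, c.Name, c.CpuIds)
			for _, d := range c.Devices {
				fmt.Printf("  %s -> %v\n", d.ResourceName, d.DeviceIds)
			}
		}
	}
}
```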
B: I want to mention another thing about the topology API: if you already took a look, it looks quite generic. Initially it was only about NUMA nodes, but now it can support a wide range of devices of different types, for example hyper-threading siblings and so on. Internally, we have an extension for the CPU manager which supports hyper-threading, and the current NodeResourceTopology API, which was in the PR, also supports it without any modification. So it's generic enough for any future modifications.
H
So
the
just
just
typing
the
question
I
have
is:
do
we
have
sufficient
evidence
that
this
level
of
integration
is
actually
a
good
solution
for
whatever
the
important
problem
statements
are.
C: Sorry, go ahead. Yeah, I just wanted to focus the problem statement. I do not believe the problem statement of "I want my pod to be co-located with my CPU and memory and device as efficiently as possible" is up for debate, because that work has been done and is helping people today in Kube as it is, running all sorts of workloads.
C
The
the
second
order
problem
statement
is,
I
want
to
ensure
that
the
scheduler
schedules,
my
pod
two
nodes,
whose
policy
will
fit
the
desired
semantics.
I
wanted
of
alignment
the
I
think
from
what
I've
seen
in
saudi
and
alexi's
work
that
it
does
minimize
miss
scheduling
problems.
The
impact
of
that
missed
scheduling
problem,
of
course,
is
how
much
users
pre-plan
their
hardware
deployment
and.
H
I
I
believe
that
I
guess
I'm
looking
for
a
definition
of
like
what
is
how
do
we
know
that
this
was
successful?
Is
there
a
is
there
a
metric
or
some
threshold
of
anecdotes
or
or
something
that
says
like
yeah?
This
is
actually
worth
the
the
effort
and
I
guess
to
go
back
derek
to
the
question
you
asked
before.
Like
is:
is
there
a
cap
that
covers
this
overall.
H
Sorry,
I
was
gonna
say
if
somebody
could
show
me:
look:
we
have
dollar
user.
Who
has
these
special
requirements
and
25
percent
of
the
time
we
schedule
to
a
node
that
isn't
at
all
feasible
and
after
this
change
it's
only
two
percent
of
the
time
that
would
be
wonderful,
like
is
am
I
asking
for
something
unreasonable.
C: We see in our PoC pods being scheduled at a rate of X rather than Y. That seems...
D: So I just sent a link to you all. Basically, it ends up in an error, a topology affinity error, and if that pod is part of a deployment or a daemon set, it just keeps getting recreated on the same node, because the scheduler essentially keeps making the same scheduling decision, and it causes a series of runaway pods.
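(The scheduler-side fix being discussed would make this node-local information visible at filter time. A hedged sketch of what such a Filter plug-in could look like against the scheduler framework; the plug-in name and the fitsNUMA helper are illustrative assumptions:)

```go
// Illustrative sketch of a topology-aware Filter plug-in: reject nodes
// whose NUMA zones cannot fit the pod, so the scheduler stops re-picking
// a node where kubelet admission will fail with a topology affinity error.
package nodetopology

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// Name identifies the plug-in (placeholder).
const Name = "NodeTopologyMatch"

// Plugin implements framework.FilterPlugin.
type Plugin struct{}

var _ framework.FilterPlugin = &Plugin{}

func (p *Plugin) Name() string { return Name }

// Filter runs once per candidate node during scheduling.
func (p *Plugin) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	// Hypothetical: consult the node's NodeResourceTopology object and
	// check whether the pod's requests fit within a single NUMA zone.
	if !fitsNUMA(pod, nodeInfo.Node().Name) {
		return framework.NewStatus(framework.Unschedulable,
			"pod resources cannot be aligned on a single NUMA node")
	}
	return nil
}

// fitsNUMA is a stand-in for real per-zone accounting against the CRD.
func fitsNUMA(pod *v1.Pod, nodeName string) bool { return true }
```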
H: So, basically, that's a problem that we've sort of known about for a long time. Is that a problem worth solving on its own? Like, being able to thread two different pods together through, I don't know, a nonce or something, so that the scheduler says: oh hey, this thing failed on node X, I'm not going to put it on node X again. That seems completely independent of NUMA or topology or anything else.
D: Yeah, the reason that happens is because the scheduler doesn't have the granular information about the resources available at a NUMA-node level, and that's how we are trying to look at this problem and address it: by giving the scheduler the visibility and the information to make more topology-aware decisions.
C
Yeah
so
tim,
I
think
the
question
is:
do
you
want
the
pod
to
remain
pending,
or
do
you
want
the
pod
to
loop
around
all
nodes
in
your
cluster
in
some
type
of
cycle,
but
that
that's
a
complete
piece
of
fair
feedback
and,
from
my
experience,
plenty
of
users
pre-plan
their
workloads
to
their
hardware,
for
the
scenarios
where
this
is
tackling,
but
not
all?
And
so.
H
Yeah,
that's
that's
fair.
I
I'm
in
the
back
of
my
mind,
there's
a
voice
that
says
there
are
going
to
be
other
reasons
besides
topology,
not
fitting
that
you
might
want
to
say,
don't
run
this
pod
on
this
note
anymore,
but
there's
no
there's
just
no
way
to
express
it
right.
I
don't
mean
to
sidetrack
and
and
take
this
idea
in
a
different
direction
like
like
I
said
before
I
I
am
supportive
of
the
idea
of
the
understanding
of
topology.
I
just
would
like
to
know
how
to
know
that
it
was
successful.
D
Sure
I
think
that
that
is
certainly
a
fair
question.
Maybe
we
need
to
do
a
bit
more
homework
to
have
those
those
metrics
kind
of
laid
out
in
front
of
you
to
show
the
benefit
of
this
feature
like
we.
We
have
the
prototype
working
and
everything
working,
and
we
know
it
works,
but
probably
that's
not
enough
to
kind
of
take
this
forward.
H
Yeah
I
mean
I
would
accept
even
as
a
as
a
result
like
a
model,
we
took
the
scheduler
code,
we
built
it
into
a
program
that,
had
you
know,
5
000,
virtual
nodes
with
different
topologies,
and
we
threw
what
we
think
is
a
representative
workload
at
the
model,
and
this
was
the
infeasible
scheduling
rate
before
the
change
and
here's
the
change.
Here's
the
rate
after
this
change
right
sure.
A: It tickles me the wrong way that we are having to build all kinds of node policy, like node topology policy information, into the scheduler as well. And what's the next type of use case going to be, for controlling what goes on the node and where, where we'll have to build that into the scheduler too? I guess we have plug-ins for that, you know. It's a level of coupling between the two that sort of rubs me the wrong way, but I can understand the need for it, and there's going to be some of it. So, you know, it's not enough for me to...
A: Yeah, like device manager to handle, and custom resource types to handle...
C: Yeah, it's a little more complicated, John, maybe. So, Tim's right that we've been discussing this since before 1.0, and there were a lot of initial attempts to, say, make the pod spec actually node-topology-aware and give preferences to a bunch of stuff.
C
I
would
say
that
we've
reached
a
a
a
place,
maybe
a
couple
years
in
the
cube's
life,
where
we
said
that
the
cubelet
was
going
to
be
responsible
for
making
node
local
intranode
scheduling
decisions
and
the
scheduler
opaque
to
it,
and
over
the
last
I
don't
know,
10
releases
or
so
of
cube.
We've
evolved
that
right,
so
the
cube
got
smart
on
picking
cpus
or
the
cubelet
guts
assigned
devices,
and
then
topology
came
up
because
you're
like
well.
C
I
want
my
my
gpu
next
to
my
cpu
and
so
that
we
got
to
that
point
and
I
think
we've
reached
enough
of
a
path
where
both
components
were
able
to
evolve
in
ignorance
of
each
other,
that
this
is
the
last
thing
that
tries
to
bring
it
together.
But
I
think
we've
come
a
long
way
from
not
meaning
to
the
pod
spec,
to
your
comment
on
like
dependencies
between
scheduler
and
cubelet.
C
We
we
actually
have
a
few
that
are
somewhat
frustrating
because
the
the
cubelet
re-runs
the
predicates
before
admitting
the
pod
that
the
scheduler.
A: Okay, yeah, I mean, perfect's the enemy of the good, or good enough, or whatever the cliché is. Okay, let's see what Tim is saying here.
A: All right, thank you, everybody. Last chance for comments... Otherwise we get back 20 minutes or so. All right, thanks so much.