From YouTube: Kubernetes SIG Arch - KEP Reading Club 20220905
Description
KEPs discussed:
- Dynamic Resource Allocation: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation
A: Hi, welcome to this month's KEP reading club session. As always, this meeting, like all of the community's meetings, follows the community's and CNCF code of conduct, which boils down to: be excellent to each other. This meeting is also being recorded and will be posted online for future reference, so please act accordingly.
A: That being said, we have one KEP on the agenda today, which is the Dynamic Resource Allocation KEP. We also have the author of the KEP with us on the call, Patrick; thank you for joining. Let's start with a 20-minute read time if possible. It's not a hard limit, so take as much time as you need after that as well.
A: This is a one-hour call and we have this one KEP with us, so let's try to get through it in a nice manner, and hopefully we learn a few new interesting things. I will paste the link to the agenda in the chat, in case anyone wants to follow along and doesn't have the link with them. That's the agenda.
A: Also, just a disclaimer: considering this is a pretty big one, it's fine if we don't get through all of it this session. We can get through however much we can, discuss that, and then take up the remainder async over Slack; that's also a viable and totally okay option.
A: Okay, so I will start a 20-minute timer, and then we can extend as needed depending on how things go.
A: Okay, starting the timer in three, two, one.
A: There are about four minutes remaining. Considering it's a pretty lengthy KEP, why don't we maybe stop at Design Details, discuss up to there, and answer questions while the context is fresh in our minds? Or would folks prefer to go through the entire thing and then discuss it all at once?
A: Okay, that's the 20-minute mark. I think we can start discussing questions now, and then, after five to ten minutes of discussion, proceed with the remainder of the KEP. So, up to this point, does anyone have any questions to get started with?
C: Right, but also a question related to it, because it's a whole-word typo. In the custom parameters implementation definition, in the first paragraph, it says that for ResourceClass the object must be cluster-scoped, and then for ResourceClaim it must be in the same namespace as the ResourceClaim, and thus the pod.
B: Yeah, but even that is useful feedback. Sometimes it's hard, for me at least, being a German speaker with a penchant for endless sentences and long words; it's particularly hard to keep the language simple. It's very tempting to have long sentences in a KEP, but then it just becomes harder to read. So it's very useful to keep it simple, avoid fancy words, and make sure that the language doesn't get in the way of understanding what it's about.
B: So in that sense, this KEP had to be that detailed, and we spent a lot of time going back and forth over exactly these implementation details to make the description clear, to discuss corner cases, and, in the end, have something where we all agree: yes, this can work the way it's specified, with no open questions. Well, "no open questions" is a bit too strong, but with no unknowns left. At that point we knew that this would work. We also knew that there were alternatives that we were still discussing, and that's a bit unusual in this KEP.
C: Well, I'm looking at the index; they have a table of contents there. So: summary, motivation, proposal, and so on.
C: But if it's something smaller, not so complex or complicated, would there be, for example, just a motivation and a proposal? Is that sufficient to start a discussion?
B: What usually happens then is that you go to a SIG. The SIG looks at the problem statement, basically your motivation and what you are trying to achieve, and then decides: yes, this is something that the SIG wants to address. Then perhaps you can get a KEP merged as provisional, with just the motivation and proposal sections filled out, or perhaps partially filled out.
B: If you don't even know how to do it yet, you might not be able to answer or fill in all of these details, like risks and mitigations. That, in my opinion or experience, partly depends on the actual solution before you can answer those parts. But then the SIG might decide: yes, this is worthwhile, let's work on this together, and then you can add more details and write a more complete KEP later on.
B: I myself tend to try to have a more complete understanding of the problem space first, but that's also a risk I'm taking. I did invest quite a bit of time trying to come up with a technical proposal that actually worked. I think I even had a prototype at that point, or I had at least explored some of the technical aspects, and then I wrote down the technical side and started to circulate this KEP a bit more broadly.
A: So I had a question. You mentioned that most of the context needed by the new APIs is present in the object itself, but any additional context needed during pod scheduling may be referenced in a PodScheduling object. So I'm assuming PodScheduling is a new object that is being introduced in this KEP?
A: The part I wasn't able to understand was why a new object is being introduced. Maybe I didn't get to that part yet, but is there a reason for introducing it?
B: Yeah, it wasn't in the original design. Originally, all of the fields for what or where a specific claim could be satisfied were all inside the ResourceClaim status.
B: The other reason was that a PodScheduling object is one level above resource claims, so it can describe multiple resource claims being scheduled or allocated together in a logically consistent way, which wasn't possible with the earlier proposal, where one had to look at all resource claims to figure out whether they were going to be allocated for the same pod. That also opens the door, potentially, for future extensions of the scheduling mechanism where different drivers perhaps even look at the resource claims from other drivers that happen to be needed by the same pod, to make some kind of holistic decision. And that is all easier to do when it's one object that the drivers and the scheduler look at to determine where to do the allocation. There will actually be a KEP update going even further than what is in the current KEP.
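For context, a PodScheduling object as described in the KEP might look roughly like the following. This is a hedged sketch: the API group, version, and field names (`selectedNode`, `potentialNodes`, `unsuitableNodes`) are paraphrased from the KEP draft and may differ from the final API.

```yaml
apiVersion: resource.k8s.io/v1alpha1
kind: PodScheduling
metadata:
  name: my-pod            # same name and namespace as the pod it belongs to
  namespace: team-a
spec:
  selectedNode: worker-2  # node the scheduler is currently trying to use
  potentialNodes:         # candidate nodes the drivers should evaluate
  - worker-1
  - worker-2
status:
  resourceClaims:
  - name: gpu             # one entry per resource claim used by the pod
    unsuitableNodes:      # nodes where this claim cannot be allocated
    - worker-1
```

Because all claims of the pod are negotiated through this single object, the scheduler and the drivers can converge on one node that works for every claim at once.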
A: Okay, any other questions? Otherwise we can probably continue. We have about 15 minutes remaining. If we aren't able to get through the entire KEP, we can always take a second lap.
E: Yeah, so, the resource class, right, I mean the driver. So basically it's a way to specify to the driver what resources we need. To take a very crude example: let's say you have an accelerator, and you want a specific set of, you know, processing units within that accelerator. What's the interface that you use to talk to the driver? I can see in the ResourceClass type...
E: You have something called ResourceClassParametersReference; is that the one? I mean, different drivers will have different architectures. Are you standardizing it, or is it like a void pointer in C? How does that interface work between the driver and...
B: Others have tried that before, and you quickly run into problems trying to identify the common parameters for, say, a GPU. That's just, in my opinion, an impossible task. Perhaps some standard will emerge later on, but right now, at the level of this KEP, these parameters are defined by the resource drivers. The in-tree API just has these parameter references, and what they reference is validated and used only by the resource driver.
B: The expectation is that this will be a CRD, so you create a driver-specific API through a CRD that defines which parameters the driver accepts for a ResourceClaim and for a ResourceClass, and these can be different. It's intentionally separated so that a cluster admin has a way to specify parameters that a normal user can't specify. That's the rationale for having two parameter references: one for the cluster admin and one for the user.
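A rough sketch of what that separation could look like, using a hypothetical vendor CRD. The `gpu.example.com` group, the parameter kinds, and all names here are illustrative assumptions; the `resource.k8s.io/v1alpha1` fields follow the KEP's proposal and may differ in the final API.

```yaml
# Hypothetical vendor CRD instance holding admin-only parameters
# (cluster-scoped, like the ResourceClass that references it)
apiVersion: gpu.example.com/v1
kind: DeviceClassParameters
metadata:
  name: acme-gpu-shared
spec:
  sharing: enabled
---
# ResourceClass created by the cluster admin
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClass
metadata:
  name: acme-gpu
driverName: gpu.example.com    # which resource driver handles claims of this class
parametersRef:                 # cluster-scoped object with admin parameters
  apiGroup: gpu.example.com
  kind: DeviceClassParameters
  name: acme-gpu-shared
---
# ResourceClaim created by a normal user, with their own namespaced parameters
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaim
metadata:
  name: my-gpu
  namespace: team-a
spec:
  resourceClassName: acme-gpu
  parametersRef:               # must live in the same namespace as the claim
    apiGroup: gpu.example.com
    kind: DeviceClaimParameters
    name: my-gpu-params
```

Kubernetes itself never interprets the referenced objects; only the named resource driver validates and consumes them.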
E: I'm not sure whether I got to it yet, but can you provide an example of some sort? If it's already there, you can ignore this. I mean a small example where you actually show: this is how the driver would define it, and this is how you would use it.
B: Well, it's under User Stories, in a way; it starts with...
B: There's no detailed description of what those types are, but it's kind of supposed to be intuitive; just from reading the examples you're supposed to get a good feeling for it. But these examples are always a bit tricky: if you write an example and it raises more questions than it answers, you end up explaining the design and a lot of details very early in the KEP. So I found that part a bit hard to write.
A: So I want to make sure I got the gist of this discussion. If you are a vendor who wants to support this API, your implementation basically has to satisfy the defined gRPC interface.
A: That implementation is specific to the vendor. And then the cluster operator or cluster administrator, whoever wants to make use of this feature, can define a CRD, as you mentioned, or as mentioned in the user stories, that references this device, or basically uses the implementation of the device that satisfies the gRPC interface. Is that right?
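For reference, the driver-facing gRPC interface discussed here is, in the KEP, a service that the kubelet calls on the vendor's node plugin. The sketch below is paraphrased from memory of the KEP draft; the actual service, message, and field names may differ, so treat it as illustrative only.

```proto
// Sketch of the DRA kubelet plugin gRPC service as proposed in the KEP.
// Names are paraphrased and may not match the final API.
service Node {
  // Called before a pod using the claim starts on the node; the driver
  // prepares the device and returns CDI device IDs for the runtime.
  rpc NodePrepareResource(NodePrepareResourceRequest)
      returns (NodePrepareResourceResponse) {}

  // Called after the last pod using the claim has stopped on the node.
  rpc NodeUnprepareResource(NodeUnprepareResourceRequest)
      returns (NodeUnprepareResourceResponse) {}
}

message NodePrepareResourceRequest {
  string namespace = 1;        // namespace of the ResourceClaim
  string claim_uid = 2;
  string claim_name = 3;
  string resource_handle = 4;  // opaque data stored by the driver at allocation time
}

message NodePrepareResourceResponse {
  repeated string cdi_devices = 1;  // CDI device IDs to inject into the container
}

message NodeUnprepareResourceRequest {
  string namespace = 1;
  string claim_uid = 2;
  string claim_name = 3;
  string resource_handle = 4;
}

message NodeUnprepareResourceResponse {}
```

The vendor implements this service plus a control-plane controller; everything about what the parameters mean stays on the vendor's side.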
B: So the expectation is that a resource driver vendor will provide install instructions, a YAML file, or perhaps an operator that installs the resource driver. Then a cluster that doesn't have this driver or this device support can install the driver based on those instructions, and when it runs, there will be a new type for parameters and a new ResourceClass, and users can start taking advantage of that new feature in the cluster.
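To make the user side concrete, a pod consuming such a claim might look like the following. This is a sketch based on the KEP's proposed pod API (`resourceClaims` in the pod spec, `resources.claims` in the container); names like `my-gpu` are assumptions carried over from earlier in the discussion, and exact field names may differ in the final implementation.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference
  namespace: team-a
spec:
  resourceClaims:
  - name: gpu                      # pod-local name, referenced by containers below
    source:
      resourceClaimName: my-gpu    # existing ResourceClaim in the same namespace
  containers:
  - name: main
    image: registry.example.com/inference:latest
    resources:
      claims:
      - name: gpu                  # this container gets access to the claimed device
```

The scheduler then coordinates with the driver to pick a node where the claim can be allocated before binding the pod.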
A: Okay, we have about 10 minutes. We can take five more minutes to resume from wherever we were, and then five minutes for any last questions, and then we can call it. Thanks for sticking around, by the way.
A: I had one question, but before I ask that: Patrick, considering this is a pretty big effort across SIGs, is there a tracking issue in k/k for this yet, or some other place where folks can follow along, offer help, or find any low-hanging fruit that you may need help with, anything like that?
B: Yeah, I'm aware of that one, but that's not actually getting that much discussion in day-to-day business. We have a core team of people who are actively working on this, from Intel and NVIDIA.
B: We are kind of reluctant at this point to pull in more people, because it probably wouldn't help. I'm covering the core work on the scheduler, and Ed Bartosz from Intel is covering the kubelet part, and that's pretty much sufficient to get the implementation done.
B: So I think the current work is pretty well covered by people. We started setting up a new channel on Slack, #dra, for dynamic resource allocation, and then SIG Node pointed out that they think this fragments the discussion too much; they preferred to have all of the discussion around DRA happening on the sig-node channel. So we are refocusing: whenever we have something that we want to share, we are now using the sig-node channel, until people get bored or annoyed by us doubling the volume of that channel.
B: But for now it's getting discussed on sig-node, though you also need to read through all of the other things that are getting discussed there. That's probably the place to stay up to date and where we post announcements, like this prototype: I have a prototype PR pending with this work.
B: It's a pull request against Kubernetes where I already raised some design questions, where I hope to get feedback from core API reviewers on the best way of doing certain implementation details around the API server. That PR will probably see most of the discussion for merging the code in the 1.26 timeframe.
A: Got it, yep, thanks, Patrick. So one question I had was: in case the allocation mode is immediate, and basically either pods aren't getting created that request this resource, or pods are getting created but for some reason are unscheduled...
A: ...what happens to the resource in case it's immediate and no pods are requesting it for a period? Is there a case where the resource is allocated but not being used? How is that handled?
B: It will just stay in allocated mode. So immediate mode basically means the user is in control of the lifecycle.
B: This deallocate thing that you mentioned is also not being done for resource claims with immediate allocation. So the idea really is that the lifecycle is very simple: the resource claim gets created, it gets allocated, and it remains allocated until it gets deleted. Any additional logic, like "this resource claim hasn't been used in a while," would be owned by the user or perhaps some higher-level controller.
B: An operator could create such resource claims, for example, yes. The deallocation happens for delayed allocation, and that is because the scheduler then has the task of getting a resource claim allocated, and not just one, but potentially multiple resource claims, for a certain pod, and it doesn't stop until that pod is ready to run. That means, if it detects a situation where pod scheduling can't continue, the only way out of that situation is to deallocate.
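The two lifecycles being contrasted are selected per claim via the allocation mode. A hedged sketch, with the mode values as proposed in the KEP (spelling may differ in the final API), and `acme-gpu` as a hypothetical class name:

```yaml
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaim
metadata:
  name: always-on-gpu
  namespace: team-a
spec:
  resourceClassName: acme-gpu
  allocationMode: Immediate    # allocated as soon as the claim is created;
                               # stays allocated until the claim is deleted,
                               # even if no pod ever uses it
---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaim
metadata:
  name: on-demand-gpu
  namespace: team-a
spec:
  resourceClassName: acme-gpu
  allocationMode: WaitForFirstConsumer    # allocated only when a pod using the
                                          # claim is scheduled; the scheduler may
                                          # deallocate it if scheduling gets stuck
```

Immediate mode trades possible idle allocation for predictable availability; delayed mode lets the scheduler pick an allocation that fits the pod's node.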
A: Got it, yeah, that makes sense. I think we are out of time, but thank you so much for joining in and answering all of our questions.
A: If anyone has any additional questions, I will start a thread on the sig-architecture channel on the Kubernetes Slack, so please feel free to chime in there, and Patrick can maybe take a look whenever he has the time for it. Thanks again, everyone, for joining, and have a nice day. Bye, see you next month, bye.